mmocr.datasets¶
- class mmocr.datasets.ConcatDataset(datasets, pipeline=[], verify_meta=True, force_apply=False, lazy_init=False)[source]¶
A wrapper of concatenated datasets.
Same as torch.utils.data.dataset.ConcatDataset, and supports lazy_init.

Note
ConcatDataset should not inherit from BaseDataset, since get_subset and get_subset_ could produce ambiguous sub-datasets that conflict with the original dataset. If you want to use a sub-dataset of ConcatDataset, you should set the indices argument on the wrapped datasets, which inherit from BaseDataset.
- Parameters
datasets (Sequence[BaseDataset] or Sequence[dict]) – A list of datasets which will be concatenated.
pipeline (list, optional) – Processing pipeline to be applied to all of the concatenated datasets. Defaults to [].
verify_meta (bool) – Whether to verify the consistency of meta information of the concatenated datasets. Defaults to True.
force_apply (bool) – Whether to force apply the pipeline to all datasets if any of them already has the pipeline configured. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. Defaults to False.
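In MMOCR config files, a ConcatDataset is usually assembled from plain config dicts rather than instantiated directly. The sketch below shows one hypothetical layout; the dataset paths are placeholders, and the per-dataset/shared pipeline split follows the parameter descriptions above.

```python
# A minimal, hypothetical config sketch: concatenating two OCRDataset
# annotation sources so they can be trained on together. Paths are
# placeholders, not real files.
train_dataset = dict(
    type='ConcatDataset',
    datasets=[
        dict(
            type='OCRDataset',
            data_root='data/dataset_a',
            ann_file='textdet_train.json',
            pipeline=[]),  # per-dataset pipeline left empty on purpose
        dict(
            type='OCRDataset',
            data_root='data/dataset_b',
            ann_file='textdet_train.json',
            pipeline=[]),
    ],
    # The shared pipeline is applied to every concatenated dataset.
    pipeline=[dict(type='LoadImageFromFile')],
    verify_meta=True)
```

Because `pipeline` is set at the wrapper level, both wrapped datasets share the same processing steps; `force_apply=True` would be needed only if a wrapped dataset already carried its own non-empty pipeline.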
- class mmocr.datasets.IcdarDataset(*args, proposal_file=None, file_client_args={'backend': 'disk'}, **kwargs)[source]¶
Dataset for text detection, with ann_file in COCO format.
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.
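For reference, a COCO-format annotation file of the kind IcdarDataset consumes looks like the following. The field names come from the COCO detection format; the concrete values here are made up for illustration.

```python
import json

# A minimal COCO-style annotation dict: one image, one "text" category,
# one box with its polygon segmentation. All values are illustrative.
coco_ann = {
    "images": [{"id": 1, "file_name": "img_1.jpg",
                "height": 720, "width": 1280}],
    "categories": [{"id": 1, "name": "text"}],
    "annotations": [{
        "id": 1,
        "image_id": 1,
        "category_id": 1,
        "bbox": [100, 50, 200, 80],  # COCO boxes are (x, y, w, h)
        "segmentation": [[100, 50, 300, 50, 300, 130, 100, 130]],
        "iscrowd": 0,
        "area": 16000,
    }],
}

# The file on disk is simply this dict serialized as JSON.
serialized = json.dumps(coco_ann)
```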
- class mmocr.datasets.OCRDataset(ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
OCRDataset for text detection and text recognition.
The annotation format is shown as follows.

{
    "metainfo":
    {
        "dataset_type": "test_dataset",
        "task_name": "test_task"
    },
    "data_list":
    [
        {
            "img_path": "test_img.jpg",
            "height": 604,
            "width": 640,
            "instances":
            [
                {
                    "bbox": [0, 0, 10, 20],
                    "bbox_label": 1,
                    "mask": [0, 0, 0, 10, 10, 20, 20, 0],
                    "text": "123"
                },
                {
                    "bbox": [10, 10, 110, 120],
                    "bbox_label": 2,
                    "mask": [10, 10, 10, 110, 110, 120, 120, 10],
                    "extra_anns": "456"
                }
            ]
        }
    ]
}
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to None.
data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img=None, ann=None).
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. OCRDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if OCRDataset.prepare_data gets a None image. Defaults to 1000.
Note
OCRDataset collects meta information from the annotation file (the lowest priority), ``OCRDataset.METAINFO`` (medium) and the metainfo parameter (highest) passed to the constructor. Lower-priority meta information will be overwritten by higher-priority one.
Examples
Assume the annotation file is given above.
>>> class CustomDataset(OCRDataset):
>>>     METAINFO: dict = dict(task_name='custom_task',
>>>                           dataset_type='custom_type')
>>> metainfo = dict(task_name='custom_task_name')
>>> custom_dataset = CustomDataset(
>>>     'path/to/ann_file',
>>>     metainfo=metainfo)
>>> # meta information of annotation file will be overwritten by
>>> # CustomDataset.METAINFO. The merged meta information will
>>> # further be overwritten by argument metainfo.
>>> custom_dataset.metainfo
{'task_name': 'custom_task_name', 'dataset_type': 'custom_type'}
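An annotation file in this format is ordinary JSON, so it can be produced and checked with the standard library alone. The sketch below writes a one-image annotation file mirroring the structure shown above (the image path and values are placeholders):

```python
import json
import os
import tempfile

# Build an OCRDataset-style annotation dict; field names mirror the
# documented format, values are illustrative.
ann = {
    "metainfo": {"dataset_type": "test_dataset", "task_name": "test_task"},
    "data_list": [{
        "img_path": "test_img.jpg",
        "height": 604,
        "width": 640,
        "instances": [{
            "bbox": [0, 0, 10, 20],
            "bbox_label": 1,
            "mask": [0, 0, 0, 10, 10, 20, 20, 0],
            "text": "123",
        }],
    }],
}

# Dump to disk the way an annotation converter would.
path = os.path.join(tempfile.mkdtemp(), "ann.json")
with open(path, "w") as f:
    json.dump(ann, f)

# Reading it back recovers both metainfo and data_list.
with open(path) as f:
    loaded = json.load(f)
```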
- class mmocr.datasets.RecogLMDBDataset(ann_file='', parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
RecogLMDBDataset for text recognition.
The annotation should be in LMDB format. We support two LMDB formats: one is the LMDB file with only labels generated by txt2lmdb (deprecated), and the other is the LMDB file generated by recog2lmdb.
The former stores strings in "filename text" format directly in LMDB, while the latter uses image_key and label_key for querying.
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the image path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogLMDBDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogLMDBDataset.prepare_data gets a None image. Defaults to 1000.
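The recog2lmdb layout pairs each image with its label under zero-padded, 1-based index keys. The sketch below mimics that key scheme with a plain dict so it needs no lmdb dependency; the exact key names (`image-…`, `label-…`, `num-samples`) follow the convention implied by LoadImageFromLMDB's img_path format and should be treated as an assumption, and a real database would use `lmdb.open()` and `txn.put()` instead.

```python
# A pure-Python stand-in for the recog2lmdb key/value layout.
def make_keys(index):
    # image_key / label_key pairs used for querying, 1-based and
    # zero-padded to 9 digits.
    image_key = f"image-{index:09d}"
    label_key = f"label-{index:09d}"
    return image_key, label_key

store = {}
samples = [(b"<jpeg bytes 1>", "OpenMMLab"), (b"<jpeg bytes 2>", "MMOCR")]
for i, (img_bytes, text) in enumerate(samples, start=1):
    image_key, label_key = make_keys(i)
    store[image_key] = img_bytes  # encoded image bytes
    store[label_key] = text       # transcription label
store["num-samples"] = str(len(samples))
```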
- class mmocr.datasets.RecogTextDataset(ann_file='', file_client_args=None, parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
RecogTextDataset for text recognition.
The annotation can be in either jsonl or txt format. If the annotation file is in jsonl format, it should be a list of dicts. If the annotation file is in txt format, it should be a list of lines.
The annotation formats are shown as follows.
- txt format

test_img1.jpg OpenMMLab
test_img2.jpg MMOCR

- jsonl format

{"filename": "test_img1.jpg", "text": "OpenMMLab"}
{"filename": "test_img2.jpg", "text": "MMOCR"}

- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to None.
parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the image path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogTextDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogTextDataset.prepare_data gets a None image. Defaults to 1000.
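The jsonl variant is easy to reason about with the standard library. The helper below is a minimal LineJsonParser-like function, a sketch rather than MMOCR's actual implementation: it extracts the configured keys from each jsonl line, in the order the parser config lists them.

```python
import json

# A minimal, hypothetical jsonl line parser: pull the configured keys
# out of each record (first key = image path, second key = text).
def parse_jsonl_line(line, keys=("filename", "text")):
    record = json.loads(line)
    return {k: record[k] for k in keys}

lines = [
    '{"filename": "test_img1.jpg", "text": "OpenMMLab"}',
    '{"filename": "test_img2.jpg", "text": "MMOCR"}',
]
parsed = [parse_jsonl_line(line) for line in lines]
```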
- class mmocr.datasets.WildReceiptDataset(directed=False, ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=Ellipsis, test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
WildReceipt dataset for key information extraction. There are two files to be loaded: metainfo and annotation. The metainfo file contains the mapping between classes and labels. The annotation file contains all the necessary information about the image, such as bounding boxes, texts, and labels.
The metainfo file is a text file with the following format:

0 Ignore
1 Store_name_value
2 Store_name_key

The annotation format is shown as follows.

{
    "file_name": "a.jpeg",
    "height": 348,
    "width": 348,
    "annotations": [
        {
            "box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
            "text": "CHOEUN",
            "label": 1
        },
        {
            "box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
            "text": "KOREANRESTAURANT",
            "label": 2
        }
    ]
}

- Parameters
directed (bool) – Whether to use a directed graph. Defaults to False.
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (str or dict, optional) – Meta information for the dataset, such as class information. If it is a string, it will be treated as a path to the class file from which the class information will be loaded. Defaults to None.
data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.
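When metainfo is given as a path, the class file follows the "index name" line format shown above. A hypothetical helper (not part of MMOCR) that parses this format into a label-to-class mapping:

```python
# Parse a WildReceipt-style class file: each line is
# "<label-index> <class-name>".
def parse_class_file(text):
    mapping = {}
    for line in text.strip().splitlines():
        idx, name = line.split(maxsplit=1)
        mapping[int(idx)] = name
    return mapping

classes = parse_class_file(
    "0 Ignore\n1 Store_name_value\n2 Store_name_key\n")
```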
- load_data_list()[source]¶
Load the data list from the annotation file.
- Returns
A list of annotation dicts.
- Return type
List[dict]
- parse_data_info(raw_data_info)[source]¶
Parse data info from raw data info.
- Parameters
raw_data_info (dict) – Raw data info.
- Returns
Parsed data info with the following keys:
img_path (str): Path to the image.
img_shape (tuple(int, int)): Image shape in (H, W).
instances (list[dict]): A list of instances, each with the keys:
- bbox (ndarray(dtype=np.float32)): Shape (4, ). Bounding box.
- text (str): Annotation text.
- edge_label (int): Edge label.
- bbox_label (int): Bounding box label.
- Return type
dict
Dataset Types¶
- class mmocr.datasets.ocr_dataset.OCRDataset(ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
OCRDataset for text detection and text recognition.
The annotation format is shown as follows.

{
    "metainfo":
    {
        "dataset_type": "test_dataset",
        "task_name": "test_task"
    },
    "data_list":
    [
        {
            "img_path": "test_img.jpg",
            "height": 604,
            "width": 640,
            "instances":
            [
                {
                    "bbox": [0, 0, 10, 20],
                    "bbox_label": 1,
                    "mask": [0, 0, 0, 10, 10, 20, 20, 0],
                    "text": "123"
                },
                {
                    "bbox": [10, 10, 110, 120],
                    "bbox_label": 2,
                    "mask": [10, 10, 10, 110, 110, 120, 120, 10],
                    "extra_anns": "456"
                }
            ]
        }
    ]
}
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to None.
data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img=None, ann=None).
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. OCRDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if OCRDataset.prepare_data gets a None image. Defaults to 1000.
Note
OCRDataset collects meta information from the annotation file (the lowest priority), ``OCRDataset.METAINFO`` (medium) and the metainfo parameter (highest) passed to the constructor. Lower-priority meta information will be overwritten by higher-priority one.
Examples
Assume the annotation file is given above.
>>> class CustomDataset(OCRDataset):
>>>     METAINFO: dict = dict(task_name='custom_task',
>>>                           dataset_type='custom_type')
>>> metainfo = dict(task_name='custom_task_name')
>>> custom_dataset = CustomDataset(
>>>     'path/to/ann_file',
>>>     metainfo=metainfo)
>>> # meta information of annotation file will be overwritten by
>>> # CustomDataset.METAINFO. The merged meta information will
>>> # further be overwritten by argument metainfo.
>>> custom_dataset.metainfo
{'task_name': 'custom_task_name', 'dataset_type': 'custom_type'}
- class mmocr.datasets.icdar_dataset.IcdarDataset(*args, proposal_file=None, file_client_args={'backend': 'disk'}, **kwargs)[source]¶
Dataset for text detection, with ann_file in COCO format.
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.
- class mmocr.datasets.recog_lmdb_dataset.RecogLMDBDataset(ann_file='', parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
RecogLMDBDataset for text recognition.
The annotation should be in LMDB format. We support two LMDB formats: one is the LMDB file with only labels generated by txt2lmdb (deprecated), and the other is the LMDB file generated by recog2lmdb.
The former stores strings in "filename text" format directly in LMDB, while the latter uses image_key and label_key for querying.
- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the image path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogLMDBDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogLMDBDataset.prepare_data gets a None image. Defaults to 1000.
- class mmocr.datasets.recog_text_dataset.RecogTextDataset(ann_file='', file_client_args=None, parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
RecogTextDataset for text recognition.
The annotation can be in either jsonl or txt format. If the annotation file is in jsonl format, it should be a list of dicts. If the annotation file is in txt format, it should be a list of lines.
The annotation formats are shown as follows.
- txt format

test_img1.jpg OpenMMLab
test_img2.jpg MMOCR

- jsonl format

{"filename": "test_img1.jpg", "text": "OpenMMLab"}
{"filename": "test_img2.jpg", "text": "MMOCR"}

- Parameters
ann_file (str) – Annotation file path. Defaults to ''.
file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to None.
parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the image path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).
metainfo (dict, optional) – Meta information for the dataset, such as class information. Defaults to None.
data_root (str) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogTextDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogTextDataset.prepare_data gets a None image. Defaults to 1000.
- class mmocr.datasets.wildreceipt_dataset.WildReceiptDataset(directed=False, ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=Ellipsis, test_mode=False, lazy_init=False, max_refetch=1000)[source]¶
WildReceipt dataset for key information extraction. There are two files to be loaded: metainfo and annotation. The metainfo file contains the mapping between classes and labels. The annotation file contains all the necessary information about the image, such as bounding boxes, texts, and labels.
The metainfo file is a text file with the following format:

0 Ignore
1 Store_name_value
2 Store_name_key

The annotation format is shown as follows.

{
    "file_name": "a.jpeg",
    "height": 348,
    "width": 348,
    "annotations": [
        {
            "box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
            "text": "CHOEUN",
            "label": 1
        },
        {
            "box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
            "text": "KOREANRESTAURANT",
            "label": 2
        }
    ]
}

- Parameters
directed (bool) – Whether to use a directed graph. Defaults to False.
ann_file (str) – Annotation file path. Defaults to ''.
metainfo (str or dict, optional) – Meta information for the dataset, such as class information. If it is a string, it will be treated as a path to the class file from which the class information will be loaded. Defaults to None.
data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to ''.
data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img_path='').
filter_cfg (dict, optional) – Config for filtering data. Defaults to None.
indices (int or Sequence[int], optional) – Support using only the first few data in the annotation file to facilitate training/testing on a smaller dataset. Defaults to None, which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.
lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.
- load_data_list()[source]¶
Load the data list from the annotation file.
- Returns
A list of annotation dicts.
- Return type
List[dict]
- parse_data_info(raw_data_info)[source]¶
Parse data info from raw data info.
- Parameters
raw_data_info (dict) – Raw data info.
- Returns
Parsed data info with the following keys:
img_path (str): Path to the image.
img_shape (tuple(int, int)): Image shape in (H, W).
instances (list[dict]): A list of instances, each with the keys:
- bbox (ndarray(dtype=np.float32)): Shape (4, ). Bounding box.
- text (str): Annotation text.
- edge_label (int): Edge label.
- bbox_label (int): Bounding box label.
- Return type
dict
Transforms¶
- class mmocr.datasets.transforms.BoundedScaleAspectJitter(long_size_bound, short_size_bound, ratio_range=(0.7, 1.3), aspect_ratio_range=(0.9, 1.1), resize_type='Resize', **resize_kwargs)[source]¶
First randomly rescale the image so that its long side and short side are around the bounds; then jitter its aspect ratio.
Required Keys:
img
img_shape
gt_bboxes (optional)
gt_polygons (optional)
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_polygons (optional)
Added Keys:
scale
scale_factor
keep_ratio
- Parameters
long_size_bound (int) – The approximate bound for the long side.
short_size_bound (int) – The approximate bound for the short side.
ratio_range (tuple(float, float)) – Range of the ratio used to jitter the size. Defaults to (0.7, 1.3).
aspect_ratio_range (tuple(float, float)) – Range of the ratio used to jitter the aspect ratio. Defaults to (0.9, 1.1).
resize_type (str) – The type of resize class to use. Defaults to 'Resize'.
**resize_kwargs – Other keyword arguments for the resize_type.
- Return type
None
- transform(results)[source]¶
The transform function. All subclasses of BaseTransform should override this method.
This function takes the result dict as input, and can add new items to the dict or modify existing ones. The result dict is returned in the end, which allows concatenating multiple transforms into a pipeline.
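The scale computation can be illustrated roughly as follows. This is a simplified sketch of the bounded-rescale-then-jitter idea under the parameter names above, not MMOCR's exact formula:

```python
import random

# Rough sketch (not MMOCR's exact implementation): pick a base scale
# that keeps the long/short sides near their bounds, then jitter the
# overall scale and the aspect ratio within the configured ranges.
def jittered_scale(h, w, long_size_bound, short_size_bound,
                   ratio_range=(0.7, 1.3), aspect_ratio_range=(0.9, 1.1)):
    long_side, short_side = max(h, w), min(h, w)
    base = min(long_size_bound / long_side, short_size_bound / short_side)
    scale = base * random.uniform(*ratio_range)
    aspect = random.uniform(*aspect_ratio_range)
    # Apply the aspect jitter to one axis only, so the h/w ratio changes.
    return scale * aspect, scale  # (scale_h, scale_w)

random.seed(0)
sh, sw = jittered_scale(720, 1280, long_size_bound=1280, short_size_bound=640)
```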
- class mmocr.datasets.transforms.FixInvalidPolygon(mode='fix', min_poly_points=3)[source]¶
Fix invalid polygons in the dataset.
Required Keys:
gt_polygons
gt_ignored
Modified Keys:
gt_polygons
gt_ignored
- Parameters
mode (str) – The mode of fixing invalid polygons. Options are 'fix' and 'ignore'. In 'fix' mode, the transform tries to fix an invalid polygon into a valid one by eliminating its self-intersections. In 'ignore' mode, invalid polygons are ignored during training. Defaults to 'fix'.
min_poly_points (int) – Minimum number of coordinate points in a polygon. Defaults to 3.
- Return type
None
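A polygon is "invalid" here chiefly when its boundary crosses itself. The sketch below is a hypothetical self-intersection check for a flat [x0, y0, x1, y1, ...] polygon, illustrating what the transform detects; it is not MMOCR's actual fixing logic:

```python
# Detect self-intersection in a polygon given as a flat coordinate list.
def segments_cross(p1, p2, p3, p4):
    # Standard orientation test: do segments (p1,p2) and (p3,p4)
    # properly cross?
    def orient(a, b, c):
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)
    return (orient(p1, p2, p3) != orient(p1, p2, p4)
            and orient(p3, p4, p1) != orient(p3, p4, p2))

def is_self_intersecting(poly):
    pts = list(zip(poly[::2], poly[1::2]))
    n = len(pts)
    edges = [(pts[i], pts[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # adjacent edges share a vertex; skip them
            if segments_cross(*edges[i], *edges[j]):
                return True
    return False
```

A square is valid, while a "bowtie" ordering of the same corners self-intersects and would need fixing or ignoring.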
- class mmocr.datasets.transforms.ImgAugWrapper(args=None)[source]¶
A wrapper around imgaug https://github.com/aleju/imgaug.
Find available augmenters at https://imgaug.readthedocs.io/en/latest/source/overview_of_augmenters.html.
Required Keys:
img
gt_polygons (optional for text recognition)
gt_bboxes (optional for text recognition)
gt_bboxes_labels (optional for text recognition)
gt_ignored (optional for text recognition)
gt_texts (optional)
Modified Keys:
img
gt_polygons (optional for text recognition)
gt_bboxes (optional for text recognition)
gt_bboxes_labels (optional for text recognition)
gt_ignored (optional for text recognition)
img_shape (optional)
gt_texts (optional)
- Parameters
args (list[list or dict], optional) – The augmentation list. For details, please refer to the imgaug documentation. Take args=[['Fliplr', 0.5], dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]] as an example: these args horizontally flip images with probability 0.5, followed by random rotation with angles in range [-10, 10], and a resize with an independent scale in range [0.5, 3.0] for each side of the image. Defaults to None.
- Return type
None
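Spelled out as plain Python, the example args list from the parameter description looks like this (each entry is either a [name, *params] list or a dict with the augmenter class under 'cls'):

```python
# The example augmentation list from the parameter description above.
args = [
    ['Fliplr', 0.5],                       # horizontal flip, p = 0.5
    dict(cls='Affine', rotate=[-10, 10]),  # rotate within [-10, 10] deg
    ['Resize', [0.5, 3.0]],                # independent per-side scale
]
```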
- class mmocr.datasets.transforms.LoadImageFromFile(to_float32=False, color_type='color', imdecode_backend='cv2', file_client_args={'backend': 'disk'}, min_size=0, ignore_empty=False)[source]¶
Load an image from file.
Required Keys:
img_path
Modified Keys:
img
img_shape
ori_shape
- Parameters
to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is a uint8 array. Defaults to False.
color_type (str) – The flag argument for mmcv.imfrombytes. Defaults to 'color'.
imdecode_backend (str) – The image decoding backend type. The backend argument for mmcv.imfrombytes. See mmcv.imfrombytes for details. Defaults to 'cv2'.
file_client_args (dict) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to dict(backend='disk').
ignore_empty (bool) – Whether to allow loading an empty image or a nonexistent file path. Defaults to False.
min_size (int) – The minimum size of the image to be loaded. If the image is smaller than the minimum size, it will be regarded as a broken image. Defaults to 0.
- Return type
None
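The min_size behavior ties into the dataset's max_refetch retry loop: a too-small image is treated as broken, the transform yields None, and prepare_data fetches another sample. A hypothetical sketch of that check (not MMOCR's actual code):

```python
# Treat an image whose smaller side is below min_size as broken:
# returning None is what triggers prepare_data's max_refetch retries.
def check_min_size(img_shape, min_size):
    h, w = img_shape
    return None if min(h, w) < min_size else img_shape
```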
- class mmocr.datasets.transforms.LoadImageFromLMDB(to_float32=False, color_type='color', imdecode_backend='cv2', file_client_args={}, ignore_empty=False)[source]¶
Load an image from an LMDB file. Only supports LMDB files on disk.
Required Keys:
img_path (In LMDB img_path is a key in the format of “image-{i:09d}”.)
Modified Keys:
img
img_shape
ori_shape
- 参数
to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is an uint8 array. Defaults to False.
color_type (str) – The flag argument for :func:
mmcv.imfrombytes
. Defaults to ‘color’.imdecode_backend (str) – The image decoding backend type. The backend argument for :func:
mmcv.imfrombytes
. See :func:mmcv.imfrombytes
for details. Defaults to ‘cv2’.file_client_args (dict) – Arguments to instantiate a FileClient except for
backend
anddb_path
. Seemmengine.fileio.FileClient
for details. Defaults todict()
.ignore_empty (bool) – Whether to allow loading empty image or file path not existent. Defaults to False.
- 返回类型
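The img_path key convention above can be illustrated with a short sketch (the helper name is ours, not part of the API; it only shows how the 9-digit zero-padded key is formatted):

```python
# Sketch of the LMDB key convention: img_path is not a file path but a
# key of the form "image-{i:09d}" (9-digit, zero-padded sample index).
def lmdb_image_key(index: int) -> str:
    """Format a sample index as an LMDB image key."""
    return f"image-{index:09d}"

print(lmdb_image_key(1))    # image-000000001
print(lmdb_image_key(123))  # image-000000123
```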
- class mmocr.datasets.transforms.LoadImageFromNDArray(to_float32=False, color_type='color', imdecode_backend='cv2', file_client_args={'backend': 'disk'}, min_size=0, ignore_empty=False)[source]¶
Load an image from results['img'].
Similar to LoadImageFromFile, but the image has already been loaded as an np.ndarray in results['img']. Can be used when loading an image from a webcam.
Required Keys:
img
Modified Keys:
img
img_path
img_shape
ori_shape
- Parameters
- Return type
- class mmocr.datasets.transforms.LoadKIEAnnotations(with_bbox=True, with_label=True, with_text=True, directed=False, key_node_idx=None, value_node_idx=None, **kwargs)[source]¶
Load and process the instances annotation provided by the dataset.
The annotation format is as follows:
{
    # A nested list of 4 numbers representing the bounding box of the
    # instance, in (x1, y1, x2, y2) order.
    'bbox': np.array([[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
                     dtype=np.int32),
    # Labels of boxes. Shape is (N,).
    'bbox_labels': np.array([0, 2, ...], dtype=np.int32),
    # Labels of edges. Shape is (N, N).
    'edge_labels': np.array([0, 2, ...], dtype=np.int32),
    # List of texts.
    "texts": ['text1', 'text2', ...],
}
After this module, the annotation has been changed to the format below:
{
    # In (x1, y1, x2, y2) order, float type. N is the number of bboxes,
    # in np.float32.
    'gt_bboxes': np.ndarray(N, 4),
    # In np.int64 type.
    'gt_bboxes_labels': np.ndarray(N, ),
    # In np.int32 type.
    'gt_edges_labels': np.ndarray(N, N),
    # In list[str].
    'gt_texts': list[str],
    # tuple(int)
    'ori_shape': (H, W)
}
Required Keys:
bboxes
bbox_labels
edge_labels
texts
Added Keys:
gt_bboxes (np.float32)
gt_bboxes_labels (np.int64)
gt_edges_labels (np.int64)
gt_texts (list[str])
ori_shape (tuple[int])
- Parameters
with_bbox (bool) – Whether to parse and load the bbox annotation. Defaults to True.
with_label (bool) – Whether to parse and load the label annotation. Defaults to True.
with_text (bool) – Whether to parse and load the text annotation. Defaults to True.
directed (bool) – Whether to build edges as a directed graph. Defaults to False.
key_node_idx (int, optional) – Key node label, used to mask out edges that are not connected from key nodes to value nodes. It has to be specified together with value_node_idx. Defaults to None.
value_node_idx (int, optional) – Value node label, used to mask out edges that are not connected from key nodes to value nodes. It has to be specified together with key_node_idx. Defaults to None.
- Return type
- class mmocr.datasets.transforms.LoadOCRAnnotations(with_bbox=False, with_label=False, with_polygon=False, with_text=False, **kwargs)[source]¶
Load and process the instances annotation provided by the dataset.
The annotation format is as follows:
{
    'instances': [
        {
            # List of 4 numbers representing the bounding box of the
            # instance, in (x1, y1, x2, y2) order.
            # Used in text detection or text spotting tasks.
            'bbox': [x1, y1, x2, y2],
            # Label of the instance, usually 0.
            # Used in text detection or text spotting tasks.
            'bbox_label': 0,
            # List of n numbers representing the polygon of the
            # instance, in (xn, yn) order.
            # Used in text detection / text spotting tasks.
            "polygon": [x1, y1, x2, y2, ..., xn, yn],
            # The flag indicating whether the instance should be ignored.
            # Used in text detection or text spotting tasks.
            "ignore": False,
            # The ground truth of the text.
            # Used in text recognition or text spotting tasks.
            "text": 'tmp',
        }
    ]
}
After this module, the annotation has been changed to the format below:
{
    # In (x1, y1, x2, y2) order, float type. N is the number of bboxes,
    # in np.float32.
    'gt_bboxes': np.ndarray(N, 4)
    # In np.int64 type.
    'gt_bboxes_labels': np.ndarray(N, )
    # In (x1, y1, ..., xk, yk) order, float type,
    # in list[np.float32].
    'gt_polygons': list[np.ndarray(2k, )]
    # In np.bool_ type.
    'gt_ignored': np.ndarray(N, )
    # In list[str].
    'gt_texts': list[str]
}
Required Keys:
instances
bbox (optional)
bbox_label (optional)
polygon (optional)
ignore (optional)
text (optional)
Added Keys:
gt_bboxes (np.float32)
gt_bboxes_labels (np.int64)
gt_polygons (list[np.float32])
gt_ignored (np.bool_)
gt_texts (list[str])
- Parameters
with_bbox (bool) – Whether to parse and load the bbox annotation. Defaults to False.
with_label (bool) – Whether to parse and load the label annotation. Defaults to False.
with_polygon (bool) – Whether to parse and load the polygon annotation. Defaults to False.
with_text (bool) – Whether to parse and load the text annotation. Defaults to False.
- Return type
- class mmocr.datasets.transforms.MMDet2MMOCR[source]¶
Convert the data format from MMDet to MMOCR.
Required Keys:
gt_masks (PolygonMasks | BitmapMasks) (optional)
gt_ignore_flags (np.bool) (optional)
Added Keys:
gt_polygons (list[np.ndarray])
gt_ignored (np.ndarray)
- class mmocr.datasets.transforms.MMOCR2MMDet(poly2mask=False)[source]¶
Convert the data format from MMOCR to MMDet.
Required Keys:
img_shape
gt_polygons (List[ndarray]) (optional)
gt_ignored (np.bool) (optional)
Added Keys:
gt_masks (PolygonMasks | BitmapMasks) (optional)
gt_ignore_flags (np.bool) (optional)
- class mmocr.datasets.transforms.PackKIEInputs(meta_keys=())[source]¶
Pack the input data for key information extraction.
The type of outputs is dict:
inputs: image converted to tensor, whose shape is (C, H, W).
data_samples: Two components of TextDetDataSample will be updated:
gt_instances (InstanceData): Depending on annotations, a subset of the following keys will be updated:
bboxes (torch.Tensor((N, 4), dtype=torch.float32)): The ground truth bounding boxes in the form of [x1, y1, x2, y2]. Renamed from 'gt_bboxes'.
labels (torch.LongTensor(N)): The labels of instances. Renamed from 'gt_bboxes_labels'.
edge_labels (torch.LongTensor(N, N)): The edge labels. Renamed from 'gt_edges_labels'.
texts (list[str]): The ground truth texts. Renamed from 'gt_texts'.
metainfo (dict): 'metainfo' is always populated. Its contents depend on meta_keys. By default it includes:
"img_path": Path to the image file.
"img_shape": Shape of the image input to the network as a tuple (h, w). Note that the image may be zero-padded afterward on the bottom/right if the batch tensor is larger than this shape.
"scale_factor": A tuple indicating the ratio of width and height of the preprocessed image to the original one.
"ori_shape": Shape of the original image as a tuple (h, w).
- Parameters
meta_keys (Sequence[str], optional) – Meta keys to be converted to the metainfo of TextDetDataSample. Defaults to ('img_path', 'ori_shape', 'img_shape', 'scale_factor', 'flip', 'flip_direction').
- class mmocr.datasets.transforms.PackTextDetInputs(meta_keys=('img_path', 'ori_shape', 'img_shape', 'scale_factor', 'flip', 'flip_direction'))[source]¶
Pack the input data for text detection.
The type of outputs is dict:
inputs: image converted to tensor, whose shape is (C, H, W).
data_samples: Two components of TextDetDataSample will be updated:
gt_instances (InstanceData): Depending on annotations, a subset of the following keys will be updated:
bboxes (torch.Tensor((N, 4), dtype=torch.float32)): The ground truth bounding boxes in the form of [x1, y1, x2, y2]. Renamed from 'gt_bboxes'.
labels (torch.LongTensor(N)): The labels of instances. Renamed from 'gt_bboxes_labels'.
polygons (list[np.array((2k,), dtype=np.float32)]): The ground truth polygons in the form of [x1, y1, ..., xk, yk]. Each element in polygons may have a different number of points. Renamed from 'gt_polygons'. Numpy arrays are used instead of tensors because polygons are usually not model outputs and are operated on the CPU.
ignored (torch.BoolTensor((N,))): The flag indicating whether the corresponding instance should be ignored. Renamed from 'gt_ignored'.
texts (list[str]): The ground truth texts. Renamed from 'gt_texts'.
metainfo (dict): 'metainfo' is always populated. Its contents depend on meta_keys. By default it includes:
"img_path": Path to the image file.
"img_shape": Shape of the image input to the network as a tuple (h, w). Note that the image may be zero-padded afterward on the bottom/right if the batch tensor is larger than this shape.
"scale_factor": A tuple indicating the ratio of width and height of the preprocessed image to the original one.
"ori_shape": Shape of the original image as a tuple (h, w).
"pad_shape": Image shape after padding (if any Pad-related transform is involved) as a tuple (h, w).
"flip": A boolean indicating whether the image has been flipped.
"flip_direction": The flipping direction.
- Parameters
meta_keys (Sequence[str], optional) – Meta keys to be converted to the metainfo of TextDetDataSample. Defaults to ('img_path', 'ori_shape', 'img_shape', 'scale_factor', 'flip', 'flip_direction').
- class mmocr.datasets.transforms.PackTextRecogInputs(meta_keys=('img_path', 'ori_shape', 'img_shape', 'pad_shape', 'valid_ratio'))[source]¶
Pack the input data for text recognition.
The type of outputs is dict:
inputs: Image as a tensor, whose shape is (C, H, W).
data_samples: Two components of TextRecogDataSample will be updated:
gt_text (LabelData):
item (str): The ground truth text. Renamed from 'gt_texts'.
metainfo (dict): 'metainfo' is always populated. Its contents depend on meta_keys. By default it includes:
"img_path": Path to the image file.
"ori_shape": Shape of the original image as a tuple (h, w).
"img_shape": Shape of the image input to the network as a tuple (h, w). Note that the image may be zero-padded afterward on the bottom/right if the batch tensor is larger than this shape.
"valid_ratio": The proportion of valid (unpadded) content of the image on the x-axis. It defaults to 1 if not set in the pipeline.
- Parameters
meta_keys (Sequence[str], optional) – Meta keys to be converted to the metainfo of TextRecogDataSample. Defaults to ('img_path', 'ori_shape', 'img_shape', 'pad_shape', 'valid_ratio').
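As a rough sketch (the helper name is ours, and the clamp to 1 is an assumption), the valid_ratio meta key can be thought of as:

```python
def valid_ratio(unpadded_width: float, padded_width: float) -> float:
    # Fraction of the padded image's width that holds real content,
    # clamped to at most 1.
    return min(1.0, unpadded_width / padded_width)

print(valid_ratio(100, 160))  # 0.625
```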
- class mmocr.datasets.transforms.PadToWidth(width, pad_cfg={'type': 'Pad'})[source]¶
Only pad the image’s width.
Required Keys:
img
Modified Keys:
img
img_shape
Added Keys:
pad_shape
pad_fixed_size
pad_size_divisor
valid_ratio
- Parameters
- Return type
- class mmocr.datasets.transforms.PyramidRescale(factor=4, base_shape=(128, 512), randomize_factor=True)[source]¶
Resize the image to the base shape, downsample it with a Gaussian pyramid, and rescale it back to the original size.
Adapted from https://github.com/FangShancheng/ABINet.
Required Keys:
img (ndarray)
Modified Keys:
img (ndarray)
- Parameters
- Return type
- class mmocr.datasets.transforms.RandomCrop(min_side_ratio=0.4)[source]¶
Randomly crop images and make sure to contain at least one intact instance.
Required Keys:
img
gt_polygons
gt_bboxes
gt_bboxes_labels
gt_ignored
gt_texts (optional)
Modified Keys:
img
img_shape
gt_polygons
gt_bboxes
gt_bboxes_labels
gt_ignored
gt_texts (optional)
- Parameters
min_side_ratio (float) – The ratio of the shortest edge of the cropped image to the original image size.
- Return type
- class mmocr.datasets.transforms.RandomFlip(prob=None, direction='horizontal')[source]¶
Flip the image & bboxes & polygons.
There are 3 flip modes:
prob is a float, direction is a string: the image will be flipped in the given direction with probability prob. E.g., with prob=0.5 and direction='horizontal', the image will be horizontally flipped with probability 0.5.
prob is a float, direction is a list of strings: the image will be flipped in direction[i] with probability prob/len(direction). E.g., with prob=0.5 and direction=['horizontal', 'vertical'], the image will be horizontally flipped with probability 0.25 and vertically flipped with probability 0.25.
prob is a list of floats, direction is a list of strings: given len(prob) == len(direction), the image will be flipped in direction[i] with probability prob[i]. E.g., with prob=[0.3, 0.5] and direction=['horizontal', 'vertical'], the image will be horizontally flipped with probability 0.3 and vertically flipped with probability 0.5.
- Required Keys:
img
gt_bboxes (optional)
gt_polygons (optional)
- Modified Keys:
img
gt_bboxes (optional)
gt_polygons (optional)
- Added Keys:
flip
flip_direction
- Parameters
- Return type
- flip_polygons(polygons, img_shape, direction)[source]¶
Flip polygons horizontally, vertically or diagonally.
- Parameters
polygons (list[np.ndarray]) – The polygons to flip.
img_shape (tuple[int, int]) – The image shape (h, w).
direction (str) – Flip direction. Options are 'horizontal', 'vertical' and 'diagonal'.
- Returns
Flipped polygons.
- Return type
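The coordinate arithmetic behind this method can be sketched as follows (a simplified stand-in using plain lists; the real method operates on numpy arrays):

```python
def flip_polygon(polygon, img_shape, direction="horizontal"):
    # polygon is a flat sequence [x1, y1, x2, y2, ...]; img_shape is (h, w).
    # Horizontal flip maps x -> w - x, vertical flip maps y -> h - y,
    # and diagonal does both.
    h, w = img_shape
    flipped = list(polygon)
    if direction in ("horizontal", "diagonal"):
        flipped[0::2] = [w - x for x in flipped[0::2]]
    if direction in ("vertical", "diagonal"):
        flipped[1::2] = [h - y for y in flipped[1::2]]
    return flipped

print(flip_polygon([1, 2, 3, 4], (10, 10)))  # [9, 2, 7, 4]
```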
- class mmocr.datasets.transforms.RandomRotate(max_angle=10, pad_with_fixed_color=False, pad_value=(0, 0, 0), use_canvas=False)[source]¶
Randomly rotate the image, boxes, and polygons. For recognition tasks, only the image is rotated. If use_canvas is set to True, the shape of the rotated image may be modified based on the rotation angle; otherwise, the image keeps its shape from before the rotation.
Required Keys:
img
img_shape
gt_bboxes (optional)
gt_polygons (optional)
Modified Keys:
img
img_shape (optional)
gt_bboxes (optional)
gt_polygons (optional)
Added Keys:
rotated_angle
- Parameters
max_angle (int) – The maximum rotation angle; it can be greater than 180 or negative. Defaults to 10.
pad_with_fixed_color (bool) – Whether to pad the rotated image with a fixed color. Defaults to False.
pad_value (tuple[int, int, int]) – The color value for padding the rotated image. Defaults to (0, 0, 0).
use_canvas (bool) – Whether to create a canvas for the rotated image. Defaults to False. If set to True, the image shape may be modified.
- Return type
- class mmocr.datasets.transforms.RescaleToHeight(height, min_width=None, max_width=None, width_divisor=1, resize_type='Resize', **resize_kwargs)[source]¶
Rescale the image to the given height and keep the aspect ratio unchanged if possible. However, if any of min_width, max_width or width_divisor is specified, the aspect ratio may still be changed to make the width meet these constraints.
Required Keys:
img
Modified Keys:
img
img_shape
Added Keys:
scale
scale_factor
keep_ratio
- Parameters
height (int) – Height of the rescaled image.
min_width (int, optional) – Minimum width of the rescaled image. Defaults to None.
max_width (int, optional) – Maximum width of the rescaled image. Defaults to None.
width_divisor (int) – The divisor of the width. Defaults to 1.
resize_type (str) – The type of resize class to use. Defaults to "Resize".
**resize_kwargs – Other keyword arguments for the resize_type.
- Return type
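The target-size computation can be approximated as follows (a sketch; the helper name and the rounding-up to width_divisor are our assumptions, and the actual implementation may round differently):

```python
import math

def rescale_to_height_size(ori_h, ori_w, height, min_width=None,
                           max_width=None, width_divisor=1):
    # Keep the aspect ratio, then apply the width constraints in order.
    new_w = round(ori_w * height / ori_h)
    if min_width is not None:
        new_w = max(new_w, min_width)
    if max_width is not None:
        new_w = min(new_w, max_width)
    # Make the width divisible by width_divisor.
    new_w = math.ceil(new_w / width_divisor) * width_divisor
    return height, new_w

print(rescale_to_height_size(64, 200, 32))                    # (32, 100)
print(rescale_to_height_size(64, 200, 32, width_divisor=16))  # (32, 112)
```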
- class mmocr.datasets.transforms.Resize(scale=None, scale_factor=None, keep_ratio=False, clip_object_border=True, backend='cv2', interpolation='bilinear')[source]¶
Resize the image & bboxes & polygons.
This transform resizes the input image according to scale or scale_factor. Bboxes and polygons are then resized with the same scale factor. If scale and scale_factor are both set, scale will be used.
Required Keys:
img
img_shape
gt_bboxes
gt_polygons
Modified Keys:
img
img_shape
gt_bboxes
gt_polygons
Added Keys:
scale
scale_factor
keep_ratio
- Parameters
scale (int or tuple) – Image scale for resizing. Defaults to None.
scale_factor (float or tuple[float, float]) – Scale factors for resizing. It is either a factor applicable to both dimensions or in the form of (scale_w, scale_h). Defaults to None.
keep_ratio (bool) – Whether to keep the aspect ratio when resizing the image. Defaults to False.
clip_object_border (bool) – Whether to clip objects outside the border of the image. Defaults to True.
backend (str) – Image resize backend; choices are 'cv2' and 'pillow'. These two backends generate slightly different results. Defaults to 'cv2'.
interpolation (str) – Interpolation method. Accepted values are "nearest", "bilinear", "bicubic", "area" and "lanczos" for the 'cv2' backend, and "nearest" and "bilinear" for the 'pillow' backend. Defaults to 'bilinear'.
- Return type
- class mmocr.datasets.transforms.ShortScaleAspectJitter(short_size=736, ratio_range=(0.7, 1.3), aspect_ratio_range=(0.9, 1.1), scale_divisor=1, resize_type='Resize', **resize_kwargs)[source]¶
First rescale the image so that its shorter side reaches short_size, then jitter its aspect ratio, and finally rescale it so that its shape is divisible by scale_divisor.
Required Keys:
img
img_shape
gt_bboxes (optional)
gt_polygons (optional)
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_polygons (optional)
Added Keys:
scale
scale_factor
keep_ratio
- Parameters
short_size (int) – Target shorter size before jittering the aspect ratio. Defaults to 736.
ratio_range (tuple(float, float)) – Range of the ratio used to jitter the target shorter size. Defaults to (0.7, 1.3).
aspect_ratio_range (tuple(float, float)) – Range of the ratio used to jitter the aspect ratio. Defaults to (0.9, 1.1).
scale_divisor (int) – The scale divisor. Defaults to 1.
resize_type (str) – The type of resize class to use. Defaults to "Resize".
**resize_kwargs – Other keyword arguments for the resize_type.
- Return type
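A rough sketch of the three steps described above (our own approximation: how the jittered aspect ratio is split between height and width is an assumption, not the exact implementation):

```python
import math
import random

def short_scale_jitter(h, w, short_size=736, ratio_range=(0.7, 1.3),
                       aspect_ratio_range=(0.9, 1.1), scale_divisor=1,
                       rng=None):
    rng = rng or random.Random(0)
    # 1. Scale so the shorter side reaches short_size, jittered by ratio_range.
    scale = short_size / min(h, w) * rng.uniform(*ratio_range)
    # 2. Jitter the aspect ratio of the rescaled image (split symmetrically).
    aspect = rng.uniform(*aspect_ratio_range)
    new_h = round(h * scale * math.sqrt(aspect))
    new_w = round(w * scale / math.sqrt(aspect))
    # 3. Make both sides divisible by scale_divisor.
    new_h = math.ceil(new_h / scale_divisor) * scale_divisor
    new_w = math.ceil(new_w / scale_divisor) * scale_divisor
    return new_h, new_w

# With degenerate jitter ranges the result is deterministic:
print(short_scale_jitter(100, 200, short_size=100,
                         ratio_range=(1.0, 1.0),
                         aspect_ratio_range=(1.0, 1.0)))  # (100, 200)
```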
- class mmocr.datasets.transforms.SourceImagePad(target_scale, crop_ratio=0.1111111111111111)[source]¶
Pad the image to a target size. It will randomly crop an area from the original image, resize it to the target size, and then paste the original image onto its top-left corner.
Required Keys:
img
Modified Keys:
img
img_shape
Added Keys:
pad_shape
pad_fixed_size
- Parameters
target_scale (int or tuple[int, int]) – The target size of the padded image. If it's an integer, the padding size will be (target_scale, target_scale). If it's a tuple, target_scale[0] is the width and target_scale[1] is the height. The size of the padded image will be (target_scale[1], target_scale[0]).
crop_ratio (float or tuple[float, float]) – Relative size of the crop region. If crop_ratio is a float, the initial crop size will be (crop_ratio * img.shape[0], crop_ratio * img.shape[1]). If crop_ratio is a tuple, crop_ratio[0] is for the width and crop_ratio[1] is for the height, and the initial crop size will be (crop_ratio[1] * img.shape[0], crop_ratio[0] * img.shape[1]). Defaults to 1./9.
- Return type
- transform(results)[source]¶
Pad the image to the target size. It will randomly select a small area from the original image, resize it to the target size, and then paste the original image onto its top-left corner.
- Parameters
results (Dict) – Result dict containing the data to transform.
- Returns
The transformed data.
- Return type
(Dict)
- class mmocr.datasets.transforms.TextDetRandomCrop(target_size, positive_sample_ratio=0.625)[source]¶
Randomly select a region and crop images to a target size, making sure to contain a text region. This transform may break up text instances; for broken instances, their bbox and polygon coordinates are cropped accordingly. This transform is recommended for use with segmentation-based networks.
Required Keys:
img
gt_polygons
gt_bboxes
gt_bboxes_labels
gt_ignored
Modified Keys:
img
img_shape
gt_polygons
gt_bboxes
gt_bboxes_labels
gt_ignored
- Parameters
target_size (tuple(int, int) or int) – Target size for the cropped image. If it's a tuple, the target width and target height will be target_size[0] and target_size[1], respectively. If it's an integer, both the target width and target height will be target_size.
positive_sample_ratio (float) – The probability of sampling regions that go through text regions. Defaults to 5. / 8.
- Return type
- class mmocr.datasets.transforms.TextDetRandomCropFlip(pad_ratio=0.1, crop_ratio=0.5, iter_num=1, min_area_ratio=0.2, epsilon=0.01)[source]¶
Randomly crop and flip a patch in the image. Only used in the text detection task.
Required Keys:
img
gt_bboxes
gt_polygons
Modified Keys:
img
gt_bboxes
gt_polygons
- Parameters
pad_ratio (float) – The ratio of padding. Defaults to 0.1.
crop_ratio (float) – The ratio of cropping. Defaults to 0.5.
iter_num (int) – Number of operations. Defaults to 1.
min_area_ratio (float) – Minimal area ratio between the cropped patch and the original image. Defaults to 0.2.
epsilon (float) – The threshold of polygon IoU between the cropped area and the polygon, used to avoid cropping text instances. Defaults to 0.01.
- Return type
- class mmocr.datasets.transforms.TorchVisionWrapper(op, **kwargs)[source]¶
A wrapper around torchvision transforms. It applies a specific transform to img and updates the image's height and width accordingly.
Required Keys:
img (ndarray): The input image.
Modified Keys:
img (ndarray): The modified image.
img_shape (tuple(int, int)): The shape of the image in (height, width).
Warning
This transform only affects the image but not its associated annotations, such as word bounding boxes and polygons. Therefore, it may only be applicable to text recognition tasks.
- Parameters
op (str) – The name of any transform class in torchvision.transforms.
**kwargs – Arguments that will be passed to the initializer of the torchvision transform.
- Return type
mmocr.engine¶
Hooks¶
- class mmocr.engine.hooks.VisualizationHook(enable=False, interval=50, score_thr=0.3, show=False, draw_pred=False, draw_gt=False, wait_time=0.0, file_client_args={'backend': 'disk'})[source]¶
Detection Visualization Hook. Used to visualize prediction results during validation and testing.
- Parameters
enable (bool) – Whether to enable this hook. Defaults to False.
interval (int) – The interval of visualization. Defaults to 50.
score_thr (float) – The threshold to visualize the bboxes and masks. Only useful for text detection. Defaults to 0.3.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (float) – The display time for each visualization, in seconds. Defaults to 0.
file_client_args (dict) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to dict(backend='disk').
draw_pred (bool) – Whether to draw the predicted results. Defaults to False.
draw_gt (bool) – Whether to draw the ground truth. Defaults to False.
- Return type
- after_test_iter(runner, batch_idx, data_batch, outputs)[source]¶
Run after every testing iteration.
- Parameters
runner (Runner) – The runner of the testing process.
batch_idx (int) – The index of the current batch in the test loop.
data_batch (Sequence[dict]) – Data from the dataloader.
outputs (Sequence[TextDetDataSample or TextRecogDataSample]) – Outputs from the model.
- Return type
- after_val_iter(runner, batch_idx, data_batch, outputs)[source]¶
Run after every self.interval validation iterations.
- Parameters
runner (Runner) – The runner of the validation process.
batch_idx (int) – The index of the current batch in the val loop.
data_batch (Sequence[dict]) – Data from the dataloader.
outputs (Sequence[TextDetDataSample or TextRecogDataSample]) – Outputs from the model.
- Return type
mmocr.evaluation¶
Evaluator¶
- class mmocr.evaluation.evaluator.MultiDatasetsEvaluator(metrics, dataset_prefixes)[source]¶
Wrapper class to compose a ConcatDataset and multiple BaseMetric instances. The metrics will be evaluated on each dataset slice separately. The name of each metric is the concatenation of the dataset prefix, the metric prefix and the metric key, e.g. dataset_prefix/metric_prefix/accuracy.
- Parameters
- Return type
- evaluate(size)[source]¶
Invoke the evaluate method of each metric and collect the metrics dictionary.
- Parameters
size (int) – Length of the entire validation dataset. When batch size > 1, the dataloader may pad some data samples to make sure all ranks have the same length of dataset slice. The collect_results function will drop the padded data based on this size.
- Returns
Evaluation results of all metrics. The keys are the names of the metrics, and the values are the corresponding results.
- Return type
Functional¶
Metric¶
- class mmocr.evaluation.metrics.CharMetric(valid_symbol='[^A-Z^a-z^0-9^一-龥]', collect_device='cpu', prefix=None)[source]¶
Character metrics for the text recognition task.
- Parameters
valid_symbol (str) – Valid characters. Defaults to '[^A-Z^a-z^0-9^一-龥]'.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be 'cpu' or 'gpu'. Defaults to 'cpu'.
prefix (str, optional) – The prefix that will be added to the metric names to disambiguate homonymous metrics of different evaluators. If a prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- Return type
- compute_metrics(results)[source]¶
Compute the metrics from processed results.
- Parameters
results (list[Dict]) – The processed results of each batch.
- Returns
The computed metrics. The keys are the names of the metrics, and the values are the corresponding results.
- Return type
Dict
- process(data_batch, data_samples)[source]¶
Process one batch of data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters
data_batch (Sequence[Dict]) – A batch of ground truths.
data_samples (Sequence[Dict]) – A batch of outputs from the model.
- Return type
- class mmocr.evaluation.metrics.F1Metric(num_classes, key='labels', mode='micro', cared_classes=[], ignored_classes=[], collect_device='cpu', prefix=None)[source]¶
Compute F1 scores.
- Parameters
num_classes (int) – Number of labels.
key (str) – The key name of the predicted and ground truth labels. Defaults to 'labels'.
mode (str or list[str]) – Options are:
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean.
If mode is a list, then the metrics in mode will be calculated separately. Defaults to 'micro'.
cared_classes (list[int]) – The indices of the labels participating in the metric computation. If both cared_classes and ignored_classes are empty, all classes will be taken into account. Defaults to []. Note: cared_classes and ignored_classes cannot be specified together.
ignored_classes (list[int]) – The index set of labels that are ignored when computing metrics. If both cared_classes and ignored_classes are empty, all classes will be taken into account. Defaults to []. Note: cared_classes and ignored_classes cannot be specified together.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be 'cpu' or 'gpu'. Defaults to 'cpu'.
prefix (str, optional) – The prefix that will be added to the metric names to disambiguate homonymous metrics of different evaluators. If a prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- Return type
Warning
Only non-negative integer labels are involved in computing. All negative ground truth labels will be ignored.
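The 'micro' mode described above pools counts over all classes before computing F1; a minimal sketch (our own helper, not part of the metric's API):

```python
def micro_f1(tp: int, fp: int, fn: int) -> float:
    # Micro F1: pool true positives, false positives and false negatives
    # over all classes, then compute F1 once.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

print(micro_f1(8, 2, 2))  # 0.8
```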
- process(data_batch, data_samples)[source]¶
Process one batch of data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters
data_batch (Sequence[Dict]) – A batch of ground truths.
data_samples (Sequence[Dict]) – A batch of outputs from the model.
- Return type
- class mmocr.evaluation.metrics.HmeanIOUMetric(match_iou_thr=0.5, ignore_precision_thr=0.5, pred_score_thrs={'start': 0.3, 'step': 0.1, 'stop': 0.9}, strategy='vanilla', collect_device='cpu', prefix=None)[source]¶
HmeanIOU metric.
This method computes the hmean-iou metric in the following steps:
Filter out prediction polygons whose:
score is smaller than the minimum prediction score threshold, or
proportion of area intersecting with a gt ignored polygon is greater than ignore_precision_thr.
Compute an M x N IoU matrix, where each element E_mn represents the IoU between the m-th valid GT and the n-th valid prediction.
For each prediction score threshold:
Obtain the ignored predictions according to the prediction score. The filtered predictions will not be involved in the later metric computations.
Based on the IoU matrix, get the matches according to match_iou_thr.
Based on the chosen strategy, accumulate the number of matches.
Calculate the H-mean under each prediction score threshold.
- Parameters
match_iou_thr (float) – IoU threshold for a match. Defaults to 0.5.
ignore_precision_thr (float) – Precision threshold when prediction and gt ignored polygons are matched. Defaults to 0.5.
pred_score_thrs (dict) – Search space for the best prediction score threshold. Defaults to dict(start=0.3, stop=0.9, step=0.1).
strategy (str) – Polygon matching strategy. Options are 'max_matching' and 'vanilla'. 'max_matching' refers to the optimal strategy that maximizes the number of matches; 'vanilla' matches a gt and a pred polygon if neither of them has been matched before. The 'vanilla' strategy was used in MMOCR 0.x and in academia. Defaults to 'vanilla'.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be 'cpu' or 'gpu'. Defaults to 'cpu'.
prefix (str, optional) – The prefix that will be added to the metric names to disambiguate homonymous metrics of different evaluators. If a prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- Return type
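The H-mean in the final step is the harmonic mean of precision and recall; a minimal sketch with hypothetical match counts:

```python
def hmean_iou_score(num_matched: int, num_gt: int, num_pred: int):
    # Precision and recall from the match counts, then their harmonic mean.
    precision = num_matched / num_pred if num_pred else 0.0
    recall = num_matched / num_gt if num_gt else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    hmean = 2 * precision * recall / (precision + recall)
    return precision, recall, hmean

# E.g. 8 matches over 10 gts and 10 preds -> precision = recall = hmean = 0.8
```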
- process(data_batch, data_samples)[source]¶
Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters
data_batch (Sequence[Dict]) – A batch of data from the dataloader.
data_samples (Sequence[Dict]) – A batch of outputs from the model.
- Return type
- class mmocr.evaluation.metrics.OneMinusNEDMetric(valid_symbol='[^A-Z^a-z^0-9^一-龥]', collect_device='cpu', prefix=None)[source]¶
One-minus-NED metric for the text recognition task.
- Parameters
valid_symbol (str) – Valid characters. Defaults to '[^A-Z^a-z^0-9^一-龥]'.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be 'cpu' or 'gpu'. Defaults to 'cpu'.
prefix (str, optional) – The prefix that will be added to the metric names to disambiguate homonymous metrics of different evaluators. If a prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- Return type
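One-minus-NED is 1 minus the normalized edit (Levenshtein) distance between the prediction and the ground truth. A self-contained sketch (the normalization by the longer string's length is an assumption about this metric's convention):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def one_minus_ned(pred: str, gt: str) -> float:
    if not pred and not gt:
        return 1.0
    return 1 - levenshtein(pred, gt) / max(len(pred), len(gt))

print(levenshtein("kitten", "sitting"))  # 3
print(one_minus_ned("hello", "hallo"))   # 0.8
```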
- compute_metrics(results)[source]¶
Compute the metrics from processed results.
- Parameters
results (list[Dict]) – The processed results of each batch.
- Returns
The computed metrics. The keys are the names of the metrics, and the values are the corresponding results.
- Return type
Dict
- process(data_batch, data_samples)[source]¶
Process one batch of data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.
- Parameters
data_batch (Sequence[Dict]) – A batch of ground truths.
data_samples (Sequence[Dict]) – A batch of outputs from the model.
- Return type
- class mmocr.evaluation.metrics.WordMetric(mode='ignore_case_symbol', valid_symbol='[^A-Z^a-z^0-9^一-龥]', collect_device='cpu', prefix=None)[source]¶
Word metrics for the text recognition task.
- Parameters
mode (str or list[str]) – Options are:
'exact': Accuracy at the word level.
'ignore_case': Accuracy at the word level, ignoring letter case.
'ignore_case_symbol': Accuracy at the word level, ignoring letter case and symbols. (Default metric for academic evaluation.)
If mode is a list, then the metrics in mode will be calculated separately. Defaults to 'ignore_case_symbol'.
valid_symbol (str) – Valid characters. Defaults to '[^A-Z^a-z^0-9^一-龥]'.
collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be 'cpu' or 'gpu'. Defaults to 'cpu'.
prefix (str, optional) – The prefix that will be added to the metric names to disambiguate homonymous metrics of different evaluators. If a prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.
- Return type
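The normalization implied by the modes above can be sketched as follows (the helper is ours; it uses the default valid_symbol pattern, written with unicode escapes for the CJK range):

```python
import re

# Default valid_symbol pattern from the docstring: characters NOT matching
# A-Z, a-z, 0-9 or CJK ideographs are treated as symbols and stripped.
VALID_SYMBOL = '[^A-Z^a-z^0-9^\u4e00-\u9fa5]'

def normalize(text: str, mode: str = 'ignore_case_symbol') -> str:
    if mode in ('ignore_case', 'ignore_case_symbol'):
        text = text.lower()
    if mode == 'ignore_case_symbol':
        text = re.sub(VALID_SYMBOL, '', text)
    return text

print(normalize("Hello, World!"))  # helloworld
```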
- compute_metrics(results)[source]¶
Compute the metrics from processed results.
- Parameters
results (list[Dict]) – The processed results of each batch.
- Returns
The computed metrics. The keys are the names of the metrics, and the values are the corresponding results.
- Return type
Dict
- process(data_batch, data_samples)[源代码]¶
Process one batch of data_samples. The processed results should be stored in
self.results
, which will be used to compute the metrics when all batches have been processed.- 参数
data_batch (Sequence[Dict]) – A batch of gts.
data_samples (Sequence[Dict]) – A batch of outputs from the model.
- Return type
mmocr.utils¶
Point utils¶
Bbox utils¶
- mmocr.utils.bbox_utils.bbox2poly(bbox)[source]¶
Convert a bounding box to a polygon.
- Parameters
bbox (ArrayLike) – A bbox, in any form that can be accessed by 1-D indices, e.g. list[float], np.ndarray, or torch.Tensor. The bbox is written as [x1, y1, x2, y2].
- Returns
The converted polygon [x1, y1, x2, y1, x2, y2, x1, y2].
- Return type
np.array
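As a sketch of the conversion (an illustrative reimplementation, not the actual mmocr code):

```python
import numpy as np

def bbox2poly(bbox):
    # Expand an axis-aligned box [x1, y1, x2, y2] into its four corner
    # points, clockwise from the top-left.
    x1, y1, x2, y2 = bbox
    return np.array([x1, y1, x2, y1, x2, y2, x1, y2], dtype=np.float32)

poly = bbox2poly([0, 0, 10, 5])  # -> [0, 0, 10, 0, 10, 5, 0, 5]
```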
- mmocr.utils.bbox_utils.bbox_center_distance(box1, box2)[source]¶
Calculate the distance between the center points of two bounding boxes.
- Parameters
box1 (ArrayLike) – The first bounding box, represented as [x1, y1, x2, y2].
box2 (ArrayLike) – The second bounding box, represented as [x1, y1, x2, y2].
- Returns
The distance between the center points of the two bounding boxes.
- Return type
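The computation is the plain Euclidean distance between box centers; a minimal sketch (not the library's implementation):

```python
import numpy as np

def bbox_center_distance(box1, box2):
    # Euclidean distance between the centers of two [x1, y1, x2, y2] boxes.
    cx1, cy1 = (box1[0] + box1[2]) / 2, (box1[1] + box1[3]) / 2
    cx2, cy2 = (box2[0] + box2[2]) / 2, (box2[1] + box2[3]) / 2
    return float(np.hypot(cx2 - cx1, cy2 - cy1))

d = bbox_center_distance([0, 0, 2, 2], [3, 4, 5, 6])  # centers (1, 1) and (4, 5)
```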
- mmocr.utils.bbox_utils.bbox_diag_distance(box)[source]¶
Calculate the diagonal length of a bounding box (the distance between the top-left and bottom-right corners).
- Parameters
box (ArrayLike) – The bounding box, represented as [x1, y1, x2, y2, x3, y3, x4, y4] or [x1, y1, x2, y2].
- Returns
The diagonal length of the bounding box.
- Return type
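A sketch handling both input layouts (an illustrative reimplementation; the diagonal is taken from the first point to the point opposite it):

```python
import numpy as np

def bbox_diag_distance(box):
    # Accepts the 8-value quadrangle [x1, y1, ..., x4, y4] or the
    # 4-value rectangle [x1, y1, x2, y2].
    box = np.asarray(box, dtype=np.float32)
    if box.size == 8:
        x1, y1, x3, y3 = box[0], box[1], box[4], box[5]
    else:
        x1, y1, x3, y3 = box
    return float(np.hypot(x3 - x1, y3 - y1))

d4 = bbox_diag_distance([0, 0, 3, 4])
d8 = bbox_diag_distance([0, 0, 3, 0, 3, 4, 0, 4])
```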
- mmocr.utils.bbox_utils.bbox_jitter(points_x, points_y, jitter_ratio_x=0.5, jitter_ratio_y=0.1)[source]¶
Jitter the coordinates of a bounding box.
- mmocr.utils.bbox_utils.bezier2polygon(bezier_points, num_sample=20)[source]¶
Sample points from the boundary of a polygon enclosed by two Bezier curves, which are controlled by bezier_points.
- Parameters
bezier_points (ndarray) – A \((2, 4, 2)\) array of 8 Bezier points or its equivalent. The first 4 points control the curve on one side and the last 4 control the other side.
num_sample (int) – The number of sample points on each Bezier curve. Defaults to 20.
- Returns
A list of 2*num_sample points representing the polygon extracted from the Bezier curves.
- Return type
list[ndarray]
Warning
The points are not guaranteed to be ordered. Please use mmocr.utils.sort_points() to sort the points if necessary.
- mmocr.utils.bbox_utils.is_on_same_line(box_a, box_b, min_y_overlap_ratio=0.8)[source]¶
Check if two boxes are on the same line by their y-axis coordinates.
Two boxes are on the same line if they overlap vertically and the length of the overlapping segment is greater than min_y_overlap_ratio times the height of either box.
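A minimal sketch of this test, assuming "height of either box" means the shorter of the two heights (one possible reading; not the library's exact implementation):

```python
def is_on_same_line(box_a, box_b, min_y_overlap_ratio=0.8):
    # Boxes are flat quadrangles [x1, y1, ..., x4, y4]; only the
    # y extents matter for the same-line test.
    a_ys, b_ys = box_a[1::2], box_b[1::2]
    a_top, a_bottom = min(a_ys), max(a_ys)
    b_top, b_bottom = min(b_ys), max(b_ys)
    overlap = min(a_bottom, b_bottom) - max(a_top, b_top)
    if overlap <= 0:
        return False
    min_height = min(a_bottom - a_top, b_bottom - b_top)
    return overlap > min_y_overlap_ratio * min_height

same = is_on_same_line([0, 0, 10, 0, 10, 10, 0, 10],
                       [12, 1, 20, 1, 20, 9, 12, 9])
apart = is_on_same_line([0, 0, 10, 0, 10, 10, 0, 10],
                        [0, 20, 10, 20, 10, 30, 0, 30])
```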
- mmocr.utils.bbox_utils.rescale_bbox(bbox, scale_factor, mode='mul')[source]¶
Rescale a bounding box according to scale_factor.
The behavior differs depending on the mode. When mode is 'mul', the coordinates are multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). When mode is 'div', the coordinates are divided by scale_factor, which can be used in postprocessors to recover the bbox in the original image size.
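The mul/div round trip can be sketched as follows (illustrative only; the assumption here is that scale_factor is a (w_scale, h_scale) pair applied alternately to x and y coordinates):

```python
import numpy as np

def rescale_bbox(bbox, scale_factor, mode='mul'):
    # 'mul' matches preprocessing (e.g. after Resize); 'div' maps
    # predictions back to the original image size.
    assert mode in ('mul', 'div')
    bbox = np.asarray(bbox, dtype=np.float32)
    scale = np.tile(np.asarray(scale_factor, dtype=np.float32), 2)
    return bbox * scale if mode == 'mul' else bbox / scale

scaled = rescale_bbox([10, 20, 30, 40], (2, 0.5), mode='mul')
restored = rescale_bbox(scaled, (2, 0.5), mode='div')
```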
- mmocr.utils.bbox_utils.rescale_bboxes(bboxes, scale_factor, mode='mul')[source]¶
Rescale bboxes according to scale_factor.
The behavior differs depending on the mode. When mode is 'mul', the coordinates are multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). When mode is 'div', the coordinates are divided by scale_factor, which can be used in postprocessors to recover the bboxes in the original image size.
- mmocr.utils.bbox_utils.sort_points(points)[source]¶
Sort arbitrary points in clockwise order. Reference: https://stackoverflow.com/a/6989383.
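A sketch of the referenced approach, sorting by atan2 angle around the centroid (illustrative; with image coordinates, where y points down, ascending angle corresponds to clockwise order):

```python
import numpy as np

def sort_points(points):
    # Sort 2-D points around their centroid by polar angle.
    points = np.asarray(points, dtype=np.float32)
    center = points.mean(axis=0)
    angles = np.arctan2(points[:, 1] - center[1], points[:, 0] - center[0])
    return points[np.argsort(angles)]

pts = sort_points([[1, 1], [0, 1], [0, 0], [1, 0]])
```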
- mmocr.utils.bbox_utils.sort_vertex(points_x, points_y)[source]¶
Sort box vertices in clockwise order, starting from the left-top corner.
- mmocr.utils.bbox_utils.sort_vertex8(points)[source]¶
Sort vertices given as 8 points [x1 y1 x2 y2 x3 y3 x4 y4].
- mmocr.utils.bbox_utils.stitch_boxes_into_lines(boxes, max_x_dist=10, min_y_overlap_ratio=0.8)[source]¶
Stitch fragmented boxes of words into lines.
Note: part of the logic is inspired by @Johndirr (https://github.com/faustomorales/keras-ocr/issues/22).
- Parameters
- Returns
A list of merged boxes and texts.
- Return type
Polygon utils¶
- mmocr.utils.polygon_utils.boundary_iou(src, target, zero_division=0)[source]¶
Calculate the IoU between two boundaries.
- mmocr.utils.polygon_utils.crop_polygon(polygon, crop_box)[source]¶
Crop a polygon to be within a box region.
- Parameters
polygon (ndarray) – Polygon in shape (N, ).
crop_box (ndarray) – Target box region in shape (4, ).
- Returns
The cropped polygon. If the polygon is not within the crop box, None is returned.
- Return type
np.array or None
- mmocr.utils.polygon_utils.is_poly_inside_rect(poly, rect)[source]¶
Check if the polygon is inside the target region.
- Parameters
poly (ArrayLike) – Polygon in shape (N, ).
rect (ndarray) – Target region [x1, y1, x2, y2].
- Returns
Whether the polygon is inside the cropping region.
- Return type
- mmocr.utils.polygon_utils.offset_polygon(poly, distance)[source]¶
Offset (expand/shrink) the polygon by the target distance. This is a wrapper around pyclipper, based on the Vatti clipping algorithm.
Warning
Polygon coordinates will be cast to int in PyClipper. Mind the potential precision loss caused by the cast.
- Parameters
poly (ArrayLike) – A polygon, in any form that can be converted to a 1-D numpy array, e.g. list[float], np.ndarray, or torch.Tensor. The polygon is written as [x1, y1, x2, y2, …].
distance (float) – The offset distance. A positive value means expanding and a negative value means shrinking.
- Returns
The offset polygon as a 1-D ndarray of float32 type. If the result polygon is invalid or has been split into several parts, an empty array is returned.
- Return type
np.array
- mmocr.utils.polygon_utils.poly2bbox(polygon)[source]¶
Convert a polygon to a bounding box.
- Parameters
polygon – A polygon, in any form that can be converted to a 1-D numpy array, e.g. list[float], np.ndarray, or torch.Tensor. The polygon is written as [x1, y1, x2, y2, …].
- Return type
numpy.array
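A sketch of the conversion (an illustrative reimplementation returning the tight axis-aligned box):

```python
import numpy as np

def poly2bbox(polygon):
    # Tight axis-aligned bounding box of a flat polygon [x1, y1, x2, y2, ...].
    polygon = np.asarray(polygon, dtype=np.float32)
    xs, ys = polygon[0::2], polygon[1::2]
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

bbox = poly2bbox([0, 0, 10, 0, 10, 5, 0, 5])
```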
- mmocr.utils.polygon_utils.poly2shapely(polygon)[source]¶
Convert a polygon to a shapely.geometry.Polygon.
- Parameters
polygon (ArrayLike) – A set of points with shape (2k, ).
- Returns
A polygon object.
- Return type
polygon (Polygon)
- mmocr.utils.polygon_utils.poly_intersection(poly_a, poly_b, invalid_ret=None, return_poly=False)[source]¶
Calculate the intersection area between two polygons.
- Parameters
poly_a (Polygon) – Polygon a.
poly_b (Polygon) – Polygon b.
invalid_ret (float or int, optional) – The return value when an invalid polygon exists. If it is not specified, the function allows the computation to proceed with invalid polygons by cleaning their self-touching or self-crossing parts. Defaults to None.
return_poly (bool) – Whether to return the polygon of the intersection. Defaults to False.
- Returns
The intersection area, or a tuple (area, Optional[poly_obj]) where area is the intersection area between the two polygons and poly_obj is the Polygon object of the intersection area (set to None if the input is invalid). poly_obj will be returned only if return_poly is True.
- Return type
- mmocr.utils.polygon_utils.poly_iou(poly_a, poly_b, zero_division=0.0)[source]¶
Calculate the IoU between two polygons.
- mmocr.utils.polygon_utils.poly_make_valid(poly)[source]¶
Convert a potentially invalid polygon to a valid one by eliminating self-crossing or self-touching parts.
- Parameters
poly (Polygon) – The polygon to be converted.
- Returns
A valid polygon.
- Return type
Polygon
- mmocr.utils.polygon_utils.poly_union(poly_a, poly_b, invalid_ret=None, return_poly=False)[source]¶
Calculate the union area between two polygons.
- Parameters
poly_a (Polygon) – Polygon a.
poly_b (Polygon) – Polygon b.
invalid_ret (float or int, optional) – The return value when an invalid polygon exists. If it is not specified, the function allows the computation to proceed with invalid polygons by cleaning their self-touching or self-crossing parts. Defaults to None.
return_poly (bool) – Whether to return the polygon of the union. Defaults to False.
- Returns
A tuple (area, Optional[poly_obj]) where area is the union of the two polygons and poly_obj is the Polygon or MultiPolygon object of the union of the inputs; the type of object depends on whether they intersect or not. It is set to None if the input is invalid. poly_obj will be returned only if return_poly is True.
- Return type
- mmocr.utils.polygon_utils.polys2shapely(polygons)[source]¶
Convert a nested list of boundaries to a list of Polygons.
- mmocr.utils.polygon_utils.rescale_polygon(polygon, scale_factor, mode='mul')[source]¶
Rescale a polygon according to scale_factor.
The behavior differs depending on the mode. When mode is 'mul', the coordinates are multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). When mode is 'div', the coordinates are divided by scale_factor, which can be used in postprocessors to recover the polygon in the original image size.
- mmocr.utils.polygon_utils.rescale_polygons(polygons, scale_factor, mode='mul')[source]¶
Rescale polygons according to scale_factor.
The behavior differs depending on the mode. When mode is 'mul', the coordinates are multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). When mode is 'div', the coordinates are divided by scale_factor, which can be used in postprocessors to recover the polygons in the original image size.
- Parameters
- Returns
Rescaled polygons.
- Return type
list[np.ndarray]
- mmocr.utils.polygon_utils.shapely2poly(polygon)[source]¶
Convert a shapely.geometry.Polygon to a flat numpy array of boundary points.
- Parameters
polygon (Polygon) – A polygon represented by shapely.Polygon.
- Returns
The converted numpy array.
- Return type
np.array
- mmocr.utils.polygon_utils.sort_points(points)[source]¶
Sort arbitrary points in clockwise order. Reference: https://stackoverflow.com/a/6989383.
Mask utils¶
- mmocr.utils.mask_utils.fill_hole(input_mask)[source]¶
Fill holes in a matrix.
Input:
[[0, 0, 0, 0, 0, 0, 0],
 [0, 1, 1, 1, 1, 1, 0],
 [0, 1, 0, 0, 0, 1, 0],
 [0, 1, 1, 1, 1, 1, 0],
 [0, 0, 0, 0, 0, 0, 0]]
Output:
[[0, 0, 0, 0, 0, 0, 0],
 [0, 1, 1, 1, 1, 1, 0],
 [0, 1, 1, 1, 1, 1, 0],
 [0, 1, 1, 1, 1, 1, 0],
 [0, 0, 0, 0, 0, 0, 0]]
- Parameters
input_mask (ArrayLike) – The input mask.
- Returns
The output mask with the holes filled.
- Return type
np.array
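One common way to implement hole filling is to flood-fill the background from the border: any background pixel the fill cannot reach is enclosed by foreground and becomes 1. This is an illustrative sketch, not the library's implementation:

```python
import numpy as np

def fill_hole(input_mask):
    # Flood-fill zeros reachable from the border; everything else is
    # either foreground or an enclosed hole, and ends up as 1.
    mask = np.asarray(input_mask, dtype=bool)
    h, w = mask.shape
    outside = np.zeros((h, w), dtype=bool)
    stack = [(r, c) for r in range(h) for c in range(w)
             if (r in (0, h - 1) or c in (0, w - 1)) and not mask[r, c]]
    for r, c in stack:
        outside[r, c] = True
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] \
                    and not outside[nr, nc]:
                outside[nr, nc] = True
                stack.append((nr, nc))
    return (~outside).astype(np.uint8)

mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
filled = fill_hole(mask)
```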
String utils¶
- class mmocr.utils.string_utils.StringStripper(strip=True, strip_pos='both', strip_str=None)[source]¶
Remove the leading and/or trailing characters based on the string argument passed.
- Parameters
strip (bool) – Whether to remove characters from both the left and right of the string. Default: True.
strip_pos (str) – Which position to remove from; one of ('both', 'left', 'right'). Default: 'both'.
strip_str (str | None) – A string specifying the set of characters to be removed from the left and right parts of the string. If None, all leading and trailing whitespace is removed. Default: None.
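The semantics map directly onto Python's built-in strip family; a functional sketch (the helper name string_strip is hypothetical, for illustration only):

```python
def string_strip(s, strip=True, strip_pos='both', strip_str=None):
    # Mirrors StringStripper's options: strip_pos picks the side(s);
    # strip_str=None falls back to whitespace, matching str.strip.
    if not strip:
        return s
    if strip_pos == 'left':
        return s.lstrip(strip_str)
    if strip_pos == 'right':
        return s.rstrip(strip_str)
    return s.strip(strip_str)

right_only = string_strip('  hello  ', strip_pos='right')
custom = string_strip('--hello--', strip_str='-')
```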
Image utils¶
- mmocr.utils.img_utils.crop_img(src_img, box, long_edge_pad_ratio=0.4, short_edge_pad_ratio=0.2)[source]¶
Crop the text region given a bounding box, which may be slightly padded. The bounding box is assumed to be a quadrangle that tightly bounds the text region.
- Parameters
src_img (np.array) – The original image.
box (list[float | int]) – Points of the quadrangle.
long_edge_pad_ratio (float) – The padding ratio of the long edge. The padding will be the length of the short edge * long_edge_pad_ratio. Defaults to 0.4.
short_edge_pad_ratio (float) – The padding ratio of the short edge. The padding will be the length of the long edge * short_edge_pad_ratio. Defaults to 0.2.
- Returns
The cropped image.
- Return type
np.array
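A rough sketch of the padded crop, under the assumption that the crop is the quadrangle's axis-aligned extent padded per the ratios above and clipped to the image (the exact padding rule in mmocr may differ):

```python
import numpy as np

def crop_img(src_img, box, long_edge_pad_ratio=0.4, short_edge_pad_ratio=0.2):
    # box is a flat quadrangle [x1, y1, ..., x4, y4].
    h, w = src_img.shape[:2]
    xs, ys = box[0::2], box[1::2]
    x1, x2 = min(xs), max(xs)
    y1, y2 = min(ys), max(ys)
    box_w, box_h = x2 - x1, y2 - y1
    long_edge, short_edge = max(box_w, box_h), min(box_w, box_h)
    pad_long = int(short_edge * long_edge_pad_ratio)   # along the long edge
    pad_short = int(long_edge * short_edge_pad_ratio)  # along the short edge
    if box_w >= box_h:
        pad_x, pad_y = pad_long, pad_short
    else:
        pad_x, pad_y = pad_short, pad_long
    x1, x2 = max(0, int(x1) - pad_x), min(w, int(x2) + pad_x)
    y1, y2 = max(0, int(y1) - pad_y), min(h, int(y2) + pad_y)
    return src_img[y1:y2, x1:x2]

img = np.zeros((100, 200, 3), dtype=np.uint8)
patch = crop_img(img, [50, 40, 150, 40, 150, 60, 50, 60])
```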
File IO utils¶
Others¶
- mmocr.utils.data_converter_utils.dump_ocr_data(image_infos, out_json_name, task_name, **kwargs)[source]¶
Dump the annotation in OpenMMLab style.
- Parameters
- Return type
Dict
Examples
Here is the general structure of image_infos for textdet/textspotter tasks:

[  # A list of dicts. Each dict stands for a single image.
    {
        "file_name": "1.jpg",
        "height": 100,
        "width": 200,
        "segm_file": "seg.txt",  # (optional) path to segmap
        "anno_info": [  # A list of dicts. Each dict stands
                        # for a single text instance.
            {
                "iscrowd": 0,      # 0: don't ignore this instance
                                   # 1: ignore
                "category_id": 0,  # Instance class id. Must be 0
                                   # for OCR tasks to permanently
                                   # be mapped to the 'text' category
                "bbox": [x, y, w, h],
                "segmentation": [x1, y1, x2, y2, ...],
                "text": "demo_text"  # for textspotter only
            }
        ]
    },
]

The input for the textrecog task is much simpler:

[  # A list of dicts. Each dict stands for a single image.
    {
        "file_name": "1.jpg",
        "anno_info": [  # A list of dicts. Each dict stands for a
                        # single text instance. In textrecog, each
                        # image usually has only one text instance.
            {
                "text": "demo_text"
            }
        ]
    },
]
- mmocr.utils.data_converter_utils.recog_anno_to_imginfo(file_paths, labels)[source]¶
Convert a list of file_paths and labels for recognition tasks into the image_infos format accepted by dump_ocr_data(). It is meant to maintain compatibility with the legacy annotation format in MMOCR 0.x.
In MMOCR 0.x, data converters for recognition usually convert the annotations into a list of file paths and a list of labels, which look like the following:

file_paths = ['1.jpg', '2.jpg', ...]
labels = ['aaa', 'bbb', ...]

This utility merges them into a list of dictionaries parsable by dump_ocr_data():

[  # A list of dicts. Each dict stands for a single image.
    {"file_name": "1.jpg", "anno_info": [{"text": "aaa"}]},
    {"file_name": "2.jpg", "anno_info": [{"text": "bbb"}]},
    ...
]
- class mmocr.utils.parsers.LineJsonParser(keys=['filename', 'text'])[source]¶
Parse a JSON string of one line in an annotation file into dict format.
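A functional sketch of the parsing step (the helper name line_json_parse is hypothetical; the real class wraps this in a callable object):

```python
import json

def line_json_parse(line, keys=('filename', 'text')):
    # Parse one JSON line of an annotation file and keep only the
    # requested keys, as LineJsonParser does.
    data = json.loads(line)
    return {k: data[k] for k in keys}

info = line_json_parse('{"filename": "1.jpg", "text": "hi", "extra": 0}')
```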
mmocr.models¶
Common¶
- class mmocr.models.common.backbones.UNet(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': ['_BatchNorm', 'GroupNorm'], 'val': 1}])[source]¶
UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf
- Parameters
in_channels (int) – Number of input image channels. Default: 3.
base_channels (int) – Number of base channels of each stage. The output channels of the first stage. Default: 64.
num_stages (int) – Number of stages in encoder, normally 5. Default: 5.
strides (Sequence[int 1 | 2]) – Strides of each stage in the encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in the encoder is 1. If strides[i]=2, the corresponding encoder stage downsamples with a stride convolution. Default: (1, 1, 1, 1, 1).
enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding encoder stage. Default: (2, 2, 2, 2, 2).
dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding decoder stage. Default: (2, 2, 2, 2).
downsamples (Sequence[int]) – Whether to use MaxPool to downsample the feature map after the first stage of the encoder (stages: [1, num_stages)). If the corresponding encoder stage uses stride convolution (strides[i]=2), it will never use MaxPool to downsample, even if downsamples[i-1]=True. Default: (True, True, True, True).
enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).
dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict | None) – Config dict for convolution layer. Default: None.
norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type=’BN’).
act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).
upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type=’InterpConv’).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.
plugins (dict) – plugins for convolutional layers. Default: None.
- Notice:
The input image size should be divisible by the whole downsample rate of the encoder. More detail on the whole downsample rate can be found in UNet._check_input_divisible.
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.common.losses.CrossEntropyLoss(weight=None, size_average=None, ignore_index=- 100, reduce=None, reduction='mean', label_smoothing=0.0)[source]¶
Cross entropy loss.
- Parameters
weight (Optional[torch.Tensor]) –
ignore_index (int) –
reduction (str) –
label_smoothing (float) –
- Return type
- class mmocr.models.common.losses.MaskedBCELoss(eps=1e-06)[source]¶
Masked BCE loss.
- forward(pred, gt, mask=None)[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – The prediction, in any shape.
gt (torch.Tensor) – The learning target of the prediction, in the same shape as pred.
mask (torch.Tensor, optional) – Binary mask in the same shape as pred, indicating positive regions to calculate the loss. The whole region is taken into account if not provided. Defaults to None.
- Returns
The loss value.
- Return type
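The masking idea can be sketched in plain numpy (an illustrative reimplementation of masked BCE with mean reduction over the masked region, not the torch module itself):

```python
import numpy as np

def masked_bce_loss(pred, gt, mask=None, eps=1e-6):
    # Plain BCE averaged only over positions where mask == 1;
    # with mask=None the whole region contributes, as documented.
    pred = np.clip(pred, eps, 1 - eps)
    loss = -(gt * np.log(pred) + (1 - gt) * np.log(1 - pred))
    if mask is None:
        mask = np.ones_like(loss)
    return (loss * mask).sum() / (mask.sum() + eps)

loss = masked_bce_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0]))
# Masking out the second position removes its (large) contribution.
loss_masked = masked_bce_loss(np.array([0.9, 0.1]), np.array([1.0, 1.0]),
                              mask=np.array([1.0, 0.0]))
```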
- class mmocr.models.common.losses.MaskedBCEWithLogitsLoss(eps=1e-06)[source]¶
This loss combines a Sigmoid layer and a masked BCE loss in one single class. It is AMP-eligible.
- forward(pred, gt, mask=None)[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – The prediction, in any shape.
gt (torch.Tensor) – The learning target of the prediction, in the same shape as pred.
mask (torch.Tensor, optional) – Binary mask in the same shape as pred, indicating positive regions to calculate the loss. The whole region is taken into account if not provided. Defaults to None.
- Returns
The loss value.
- Return type
- class mmocr.models.common.losses.MaskedBalancedBCELoss(reduction='none', negative_ratio=3, fallback_negative_num=0, eps=1e-06)[source]¶
Masked balanced BCE loss.
- Parameters
reduction (str, optional) – The method to reduce the loss. Options are 'none', 'mean' and 'sum'. Defaults to 'none'.
negative_ratio (float or int) – Maximum ratio of negative samples to positive ones. Defaults to 3.
fallback_negative_num (int) – When the mask contains no positive samples, the number of negative samples to be sampled. Defaults to 0.
eps (float) – Eps to avoid the zero-division error. Defaults to 1e-6.
- Return type
- forward(pred, gt, mask=None)[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – The prediction, in any shape.
gt (torch.Tensor) – The learning target of the prediction, in the same shape as pred.
mask (torch.Tensor, optional) – Binary mask in the same shape as pred, indicating positive regions to calculate the loss. The whole region is taken into account if not provided. Defaults to None.
- Returns
The loss value.
- Return type
- class mmocr.models.common.losses.MaskedBalancedBCEWithLogitsLoss(reduction='none', negative_ratio=3, fallback_negative_num=0, eps=1e-06)[source]¶
This loss combines a Sigmoid layer and a masked balanced BCE loss in one single class. It is AMP-eligible.
- Parameters
reduction (str, optional) – The method to reduce the loss. Options are 'none', 'mean' and 'sum'. Defaults to 'none'.
negative_ratio (float or int, optional) – Maximum ratio of negative samples to positive ones. Defaults to 3.
fallback_negative_num (int, optional) – When the mask contains no positive samples, the number of negative samples to be sampled. Defaults to 0.
eps (float, optional) – Eps to avoid the zero-division error. Defaults to 1e-6.
- Return type
- forward(pred, gt, mask=None)[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – The prediction, in any shape.
gt (torch.Tensor) – The learning target of the prediction, in the same shape as pred.
mask (torch.Tensor, optional) – Binary mask in the same shape as pred, indicating positive regions to calculate the loss. The whole region is taken into account if not provided. Defaults to None.
- Returns
The loss value.
- Return type
- class mmocr.models.common.losses.MaskedDiceLoss(eps=1e-06)[source]¶
Masked dice loss.
- forward(pred, gt, mask=None)[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – The prediction, in any shape.
gt (torch.Tensor) – The learning target of the prediction, in the same shape as pred.
mask (torch.Tensor, optional) – Binary mask in the same shape as pred, indicating positive regions to calculate the loss. The whole region is taken into account if not provided. Defaults to None.
- Returns
The loss value.
- Return type
- class mmocr.models.common.losses.MaskedSmoothL1Loss(beta=1, eps=1e-06)[source]¶
Masked smooth L1 loss.
- Parameters
- Return type
- forward(pred, gt, mask=None)[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – The prediction, in any shape.
gt (torch.Tensor) – The learning target of the prediction, in the same shape as pred.
mask (torch.Tensor, optional) – Binary mask in the same shape as pred, indicating positive regions to calculate the loss. The whole region is taken into account if not provided. Defaults to None.
- Returns
The loss value.
- Return type
- class mmocr.models.common.losses.MaskedSquareDiceLoss(eps=0.001)[source]¶
Masked square dice loss.
- forward(pred, gt, mask=None)[source]¶
Forward function.
- Parameters
pred (torch.Tensor) – The prediction, in any shape.
gt (torch.Tensor) – The learning target of the prediction, in the same shape as pred.
mask (torch.Tensor, optional) – Binary mask in the same shape as pred, indicating positive regions to calculate the loss. The whole region is taken into account if not provided. Defaults to None.
- Returns
The loss value.
- Return type
- class mmocr.models.common.losses.SmoothL1Loss(size_average=None, reduce=None, reduction='mean', beta=1.0)[source]¶
Smooth L1 loss.
- class mmocr.models.common.dictionary.Dictionary(dict_file, with_start=False, with_end=False, same_start_end=False, with_padding=False, with_unknown=False, start_token='<BOS>', end_token='<EOS>', start_end_token='<BOS/EOS>', padding_token='<PAD>', unknown_token='<UKN>')[source]¶
This class generates a dictionary for recognition. It pre-defines four special tokens: start_token, end_token, pad_token, and unknown_token, which will be sequentially placed at the end of the dictionary when their corresponding flags are True.
- Parameters
dict_file (str) – The path of the character dict file, in which each character occupies a line.
with_start (bool) – The flag to control whether to include the start token. Defaults to False.
with_end (bool) – The flag to control whether to include the end token. Defaults to False.
same_start_end (bool) – The flag to control whether the start token and end token are the same. It only works when both with_start and with_end are True. Defaults to False.
with_padding (bool) – The padding token may represent more than padding. It can also represent tokens like the blank token in CTC or the background token in SegOCR. Defaults to False.
with_unknown (bool) – The flag to control whether to include the unknown token. Defaults to False.
start_token (str) – The start token as a string. Defaults to '<BOS>'.
end_token (str) – The end token as a string. Defaults to '<EOS>'.
start_end_token (str) – The start/end token as a string, used if the start and end tokens are the same. Defaults to '<BOS/EOS>'.
padding_token (str) – The padding token as a string. Defaults to '<PAD>'.
unknown_token (str, optional) – The unknown token as a string. If it is set to None and with_unknown is True, the unknown token will be skipped when converting a string to indices. Defaults to '<UKN>'.
- Return type
- property dict: list¶
Returns a list of characters to recognize, where special tokens are counted.
- Type
- class mmocr.models.common.layers.TFDecoderLayer(d_model=512, d_inner=256, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, act_cfg={'type': 'mmengine.GELU'}, operation_order=None)[source]¶
Transformer decoder layer.
- Parameters
d_model (int) – The number of expected features in the decoder inputs (default=512).
d_inner (int) – The dimension of the feedforward network model (default=256).
n_head (int) – The number of heads in the multi-head attention models (default=8).
d_k (int) – Total number of features in key.
d_v (int) – Total number of features in value.
dropout (float) – Dropout layer on attn_output_weights.
qkv_bias (bool) – Add bias in the projection layer. Default: False.
act_cfg (dict) – Activation cfg for the feedforward module.
operation_order (tuple[str]) – The execution order of operations in the transformer, such as ('self_attn', 'norm', 'enc_dec_attn', 'norm', 'ffn', 'norm') or ('norm', 'self_attn', 'norm', 'enc_dec_attn', 'norm', 'ffn'). Default: None.
- forward(dec_input, enc_output, self_attn_mask=None, dec_enc_attn_mask=None)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.common.layers.TFEncoderLayer(d_model=512, d_inner=256, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, act_cfg={'type': 'mmengine.GELU'}, operation_order=None)[source]¶
Transformer encoder layer.
- Parameters
d_model (int) – The number of expected features in the decoder inputs (default=512).
d_inner (int) – The dimension of the feedforward network model (default=256).
n_head (int) – The number of heads in the multi-head attention models (default=8).
d_k (int) – Total number of features in key.
d_v (int) – Total number of features in value.
dropout (float) – Dropout layer on attn_output_weights.
qkv_bias (bool) – Add bias in the projection layer. Default: False.
act_cfg (dict) – Activation cfg for the feedforward module.
operation_order (tuple[str]) – The execution order of operations in the transformer, such as ('self_attn', 'norm', 'ffn', 'norm') or ('norm', 'self_attn', 'norm', 'ffn'). Default: None.
- forward(x, mask=None)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.common.modules.MultiHeadAttention(n_head=8, d_model=512, d_k=64, d_v=64, dropout=0.1, qkv_bias=False)[source]¶
Multi-head attention module.
- Parameters
n_head (int) – The number of heads in the multi-head attention models (default=8).
d_model (int) – The number of expected features in the decoder inputs (default=512).
d_k (int) – Total number of features in key.
d_v (int) – Total number of features in value.
dropout (float) – Dropout layer on attn_output_weights.
qkv_bias (bool) – Add bias in the projection layer. Default: False.
- forward(q, k, v, mask=None)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.common.modules.PositionalEncoding(d_hid=512, n_position=200, dropout=0)[source]¶
Fixed positional encoding with sine and cosine functions.
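The sine/cosine table can be sketched in numpy (an illustrative version of the classic encoding; the actual module registers this as a buffer and adds dropout):

```python
import numpy as np

def positional_encoding(n_position, d_hid):
    # Even channels use sin, odd channels use cos, with wavelengths
    # forming a geometric progression from 2*pi to 10000*2*pi.
    position = np.arange(n_position)[:, None]
    div = np.power(10000, 2 * (np.arange(d_hid) // 2) / d_hid)
    table = position / div
    table[:, 0::2] = np.sin(table[:, 0::2])
    table[:, 1::2] = np.cos(table[:, 1::2])
    return table

pe = positional_encoding(200, 512)
```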
- class mmocr.models.common.modules.PositionwiseFeedForward(d_in, d_hid, dropout=0.1, act_cfg={'type': 'Relu'})[source]¶
Two-layer feed-forward module.
- Parameters
- forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.common.modules.ScaledDotProductAttention(temperature, attn_dropout=0.1)[source]¶
Scaled dot-product attention module. This code is adapted from https://github.com/jadore801120/attention-is-all-you-need-pytorch.
- Parameters
- forward(q, k, v, mask=None)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
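The core computation, attn = softmax(q·kᵀ / temperature) applied to v, sketched in numpy (illustrative only; dropout and batching details of the real module are omitted):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, temperature, mask=None):
    # Masked positions are pushed to -inf before the softmax.
    scores = q @ k.transpose(0, 2, 1) / temperature
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

q = np.ones((1, 2, 4))
k = np.ones((1, 3, 4))
v = np.arange(12.).reshape(1, 3, 4)
out, attn = scaled_dot_product_attention(q, k, v, temperature=2.0)
```

With identical keys the attention weights are uniform, so the output is simply the mean of the value rows.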
Text Detection Detectors¶
- class mmocr.models.textdet.detectors.DBNet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[source]¶
The class implementing the DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.
[https://arxiv.org/abs/1911.08947]
- Parameters
backbone (Dict) –
det_head (Dict) –
neck (Optional[Dict]) –
data_preprocessor (Optional[Dict]) –
init_cfg (Optional[Dict]) –
- Return type
- class mmocr.models.textdet.detectors.DRRG(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[source]¶
The class implementing the DRRG text detector: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.
[https://arxiv.org/abs/2003.07493]
- Parameters
backbone (Dict) –
det_head (Dict) –
neck (Optional[Dict]) –
data_preprocessor (Optional[Dict]) –
init_cfg (Optional[Dict]) –
- Return type
- class mmocr.models.textdet.detectors.FCENet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[source]¶
The class implementing the FCENet text detector. FCENet (CVPR 2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection.
[https://arxiv.org/abs/2104.10442]
- Parameters
backbone (Dict) –
det_head (Dict) –
neck (Optional[Dict]) –
data_preprocessor (Optional[Dict]) –
init_cfg (Optional[Dict]) –
- Return type
- class mmocr.models.textdet.detectors.MMDetWrapper(cfg, text_repr_type='poly')[source]¶
A wrapper of MMDet's model.
- Parameters
- Return type
- adapt_predictions(data, data_samples)[source]¶
Convert instance data from MMDet into MMOCR's format.
- Parameters
data (list[DetDataSample]) – Detection results of the input images. Each DetDataSample usually contains 'pred_instances', and pred_instances usually contains the following keys:
- scores (Tensor): Classification scores, with shape (num_instances, ).
- labels (Tensor): Labels of bboxes, with shape (num_instances, ).
- bboxes (Tensor): With shape (num_instances, 4), the last dimension arranged as (x1, y1, x2, y2).
- masks (Tensor, optional): With shape (num_instances, H, W).
data_samples (list[TextDetDataSample]) – The annotation data of every sample.
- Returns
A list of N data samples containing ground truth and prediction results. The polygon results are saved in TextDetDataSample.pred_instances.polygons and the confidence scores are saved in TextDetDataSample.pred_instances.scores.
- Return type
- forward(inputs, data_samples=None, mode='tensor', **kwargs)[source]¶
The unified entry for a forward process in both training and test.
The method works in three modes: "tensor", "predict" and "loss":
- "tensor": Forward the whole network and return a tensor or tuple of tensors without any post-processing, same as a common nn.Module.
- "predict": Forward and return the predictions, which are fully processed to a list of DetDataSample.
- "loss": Forward and return a dict of losses according to the given inputs and data samples.
Note that this method doesn't handle either back propagation or parameter update, which are supposed to be done in train_step().
- Parameters
inputs (torch.Tensor) – The input tensor with shape (N, C, ...) in general.
data_samples (list[DetDataSample] or list[TextDetDataSample], optional) – The annotation data of every sample. When in "predict" mode, it should be a list of TextDetDataSample. Otherwise they are DetDataSample. Defaults to None.
mode (str) – Running mode. Defaults to 'tensor'.
- Returns
The return type depends on mode.
- If mode="tensor", return a tensor or a tuple of tensors.
- If mode="predict", return a list of TextDetDataSample.
- If mode="loss", return a dict of tensors.
- Return type
Union[Dict[str, torch.Tensor], List[mmdet.structures.det_data_sample.DetDataSample], Tuple[torch.Tensor], torch.Tensor]
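The three-mode dispatch described above can be sketched as plain Python. The `raw`, `postprocess` and `compute_losses` callables here are hypothetical stand-ins for the network body, the post-processor and the loss computation, not mmocr APIs:

```python
# Hypothetical stand-ins for the network body, post-processor and loss.
raw = lambda inputs: [v * 2 for v in inputs]
postprocess = lambda feats, ds: [{"pred": f} for f in feats]
compute_losses = lambda feats, ds: {"loss": sum(feats)}

def forward(inputs, data_samples=None, mode="tensor"):
    """Sketch of the unified tensor/predict/loss entry point."""
    feats = raw(inputs)                      # the whole network runs once
    if mode == "tensor":
        return feats                         # raw features, no post-processing
    if mode == "predict":
        return postprocess(feats, data_samples)
    if mode == "loss":
        return compute_losses(feats, data_samples)
    raise RuntimeError(f"Invalid mode {mode!r}")
```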
- class mmocr.models.textdet.detectors.PANet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[source]¶
The class for implementing PANet text detector:
Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900].
- Parameters
backbone (Dict) –
det_head (Dict) –
neck (Optional[Dict]) –
data_preprocessor (Optional[Dict]) –
init_cfg (Optional[Dict]) –
- Return type
- class mmocr.models.textdet.detectors.PSENet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[source]¶
The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.
[https://arxiv.org/abs/1806.02559].
- Parameters
backbone (Dict) –
det_head (Dict) –
neck (Optional[Dict]) –
data_preprocessor (Optional[Dict]) –
init_cfg (Optional[Dict]) –
- Return type
- class mmocr.models.textdet.detectors.SingleStageTextDetector(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[source]¶
The class for implementing single stage text detector.
Single-stage text detectors directly and densely predict bounding boxes or polygons on the output features of the backbone + neck (optional).
- Parameters
backbone (dict) – Backbone config.
neck (dict, optional) – Neck config. If None, the output from backbone will be directly fed into det_head.
det_head (dict) – Head config.
data_preprocessor (dict, optional) – Model preprocessing config for processing the input image data. Keys allowed are to_rgb (bool), pad_size_divisor (int), pad_value (int or float), mean (int or float) and std (int or float). Preprocessing order: 1. to rgb; 2. normalization; 3. pad. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- Return type
- extract_feat(inputs)[source]¶
Extract features.
- Parameters
inputs (Tensor) – Image tensor with shape (N, C, H, W).
- Returns
Multi-level features that may have different resolutions.
- Return type
Tensor or tuple[Tensor]
- loss(inputs, data_samples)[source]¶
Calculate losses from a batch of inputs and data samples.
- Parameters
inputs (torch.Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
data_samples (list[TextDetDataSample]) – A list of N datasamples, containing meta information and gold annotations for each image.
- Returns
A dictionary of loss components.
- Return type
- predict(inputs, data_samples)[source]¶
Predict results from a batch of inputs and data samples with post-processing.
- Parameters
inputs (torch.Tensor) – Images of shape (N, C, H, W).
data_samples (list[TextDetDataSample]) – A list of N datasamples, containing meta information and gold annotations for each image.
- Returns
A list of N datasamples of prediction results. Each DetDataSample usually contains 'pred_instances', and the pred_instances usually contains the following keys:
- scores (Tensor): Classification scores, of shape (num_instances, ).
- labels (Tensor): Labels of bboxes, of shape (num_instances, ).
- bboxes (Tensor): Of shape (num_instances, 4), with the last dimension arranged as (x1, y1, x2, y2).
- polygons (list[np.ndarray]): The length is num_instances. Each element represents the polygon of the instance, in (xn, yn) order.
- Return type
- class mmocr.models.textdet.detectors.TextSnake(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[source]¶
The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
[https://arxiv.org/abs/1807.01544]
- Parameters
backbone (Dict) –
det_head (Dict) –
neck (Optional[Dict]) –
data_preprocessor (Optional[Dict]) –
init_cfg (Optional[Dict]) –
- Return type
Text Detection Heads¶
- class mmocr.models.textdet.heads.BaseTextDetHead(module_loss, postprocessor, init_cfg=None)[source]¶
Base head for text detection, building the loss and postprocessor.
1. The init_weights method is used to initialize the head's model parameters. After detector initialization, init_weights is triggered when detector.init_weights() is called externally.
2. The loss method is used to calculate the loss of the head, which includes two steps: (1) the head model performs forward propagation to obtain the feature maps; (2) the module_loss method is called based on the feature maps to calculate the loss.
loss(): forward() -> module_loss()
3. The predict method is used to predict detection results, which includes two steps: (1) the head model performs forward propagation to obtain the feature maps; (2) the postprocessor method is called based on the feature maps to predict detection results including post-processing.
predict(): forward() -> postprocessor()
4. The loss_and_predict method is used to return loss and detection results at the same time. It calls the head's forward, module_loss and postprocessor methods in order.
loss_and_predict(): forward() -> module_loss() -> postprocessor()
- Parameters
- Return type
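The three flows above are simple compositions of the same forward pass. A minimal sketch, with `forward_fn`, `module_loss` and `postprocessor` as hypothetical callables supplied by a concrete head:

```python
class SketchTextDetHead:
    """Minimal sketch of the loss/predict/loss_and_predict flows."""

    def __init__(self, forward_fn, module_loss, postprocessor):
        self.forward = forward_fn
        self.module_loss = module_loss
        self.postprocessor = postprocessor

    def loss(self, x, data_samples):
        # loss(): forward() -> module_loss()
        return self.module_loss(self.forward(x), data_samples)

    def predict(self, x, data_samples):
        # predict(): forward() -> postprocessor()
        return self.postprocessor(self.forward(x), data_samples)

    def loss_and_predict(self, x, data_samples):
        # forward runs only once; both consumers share the feature maps
        feats = self.forward(x)
        return (self.module_loss(feats, data_samples),
                self.postprocessor(feats, data_samples))

head = SketchTextDetHead(lambda x: x + 1,
                         lambda f, ds: {"loss": f},
                         lambda f, ds: [f])
```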
- loss(x, data_samples)[source]¶
Perform forward propagation and loss calculation of the detection head on the features of the upstream network.
- loss_and_predict(x, data_samples)[source]¶
Perform forward propagation of the head, then calculate loss and predictions from the features and data samples.
- Parameters
x (tuple[Tensor]) – Features from FPN.
data_samples (list[DetDataSample]) – Each item contains the meta information of each image and corresponding annotations.
- Returns
The return value is a tuple containing:
- losses (dict[str, Tensor]): A dictionary of loss components.
- predictions (list[InstanceData]): Detection results of each image after the post process.
- Return type
- predict(x, data_samples)[source]¶
Perform forward propagation of the detection head and predict detection results on the features of the upstream network.
- Parameters
x (tuple[Tensor]) – Multi-level features from the upstream network, each is a 4D-tensor.
data_samples (List[DetDataSample]) – The data samples. It usually includes information such as gt_instance, gt_panoptic_seg and gt_sem_seg.
- Returns
Detection results of each image after the post process.
- Return type
SampleList
- class mmocr.models.textdet.heads.DBHead(in_channels, with_bias=False, module_loss={'type': 'DBModuleLoss'}, postprocessor={'text_repr_type': 'quad', 'type': 'DBPostprocessor'}, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv'}, {'type': 'Constant', 'layer': 'BatchNorm', 'val': 1.0, 'bias': 0.0001}])[source]¶
The class for the DBNet head.
This was partially adapted from https://github.com/MhLiao/DB
- Parameters
in_channels (int) – The number of input channels.
with_bias (bool) – Whether to add bias in the Conv2d layers. Defaults to False.
module_loss (dict) – Config of loss for DBNet. Defaults to dict(type='DBModuleLoss').
postprocessor (dict) – Config of postprocessor for DBNet.
init_cfg (dict or list[dict], optional) – Initialization configs.
- Return type
- forward(img, data_samples=None, mode='predict')[source]¶
- Parameters
img (Tensor) – Shape \((N, C, H, W)\).
data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.
mode (str) – Forward mode. It affects the return values. Options are "loss", "predict" and "both". Defaults to "predict".
- loss: Run the full network and return the prob logits, threshold map and binary map.
- predict: Run the binarization part and return the prob map only.
- both: Run the full network and return the prob logits, threshold map, binary map and prob map.
- Returns
Its type depends on mode; read its docstring for details. Each has the shape of \((N, 4H, 4W)\).
- Return type
Tensor or tuple(Tensor)
- loss(x, batch_data_samples)[source]¶
Perform forward propagation and loss calculation of the detection head on the features of the upstream network.
- loss_and_predict(x, batch_data_samples)[source]¶
Perform forward propagation of the head, then calculate loss and predictions from the features and data samples.
- Parameters
x (tuple[Tensor]) – Features from FPN.
batch_data_samples (list[DetDataSample]) – Each item contains the meta information of each image and corresponding annotations.
- Returns
The return value is a tuple containing:
- losses (dict[str, Tensor]): A dictionary of loss components.
- predictions (list[InstanceData]): Detection results of each image after the post process.
- Return type
- predict(x, batch_data_samples)[source]¶
Perform forward propagation of the detection head and predict detection results on the features of the upstream network.
- Parameters
x (tuple[Tensor]) – Multi-level features from the upstream network, each is a 4D-tensor.
batch_data_samples (List[DetDataSample]) – The data samples. It usually includes information such as gt_instance, gt_panoptic_seg and gt_sem_seg.
- Returns
Detection results of each image after the post process.
- Return type
SampleList
- class mmocr.models.textdet.heads.DRRGHead(in_channels, k_at_hops=(8, 4), num_adjacent_linkages=3, node_geo_feat_len=120, pooling_scale=1.0, pooling_output_size=(4, 3), nms_thr=0.3, min_width=8.0, max_width=24.0, comp_shrink_ratio=1.03, comp_ratio=0.4, comp_score_thr=0.3, text_region_thr=0.2, center_region_thr=0.2, center_region_area_thr=50, local_graph_thr=0.7, module_loss={'type': 'DRRGModuleLoss'}, postprocessor={'link_thr': 0.85, 'type': 'DRRGPostprocessor'}, init_cfg={'mean': 0, 'override': {'name': 'out_conv'}, 'std': 0.01, 'type': 'Normal'})[source]¶
The class for DRRG head: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.
- Parameters
in_channels (int) – The number of input channels.
k_at_hops (tuple(int)) – The number of i-hop neighbors, i = 1, 2. Defaults to (8, 4).
num_adjacent_linkages (int) – The number of linkages when constructing adjacent matrix. Defaults to 3.
node_geo_feat_len (int) – The length of embedded geometric feature vector of a component. Defaults to 120.
pooling_scale (float) – The spatial scale of rotated RoI-Align. Defaults to 1.0.
pooling_output_size (tuple(int)) – The output size of RRoI-Aligning. Defaults to (4, 3).
nms_thr (float) – The locality-aware NMS threshold of text components. Defaults to 0.3.
min_width (float) – The minimum width of text components. Defaults to 8.0.
max_width (float) – The maximum width of text components. Defaults to 24.0.
comp_shrink_ratio (float) – The shrink ratio of text components. Defaults to 1.03.
comp_ratio (float) – The reciprocal of aspect ratio of text components. Defaults to 0.4.
comp_score_thr (float) – The score threshold of text components. Defaults to 0.3.
text_region_thr (float) – The threshold for text region probability map. Defaults to 0.2.
center_region_thr (float) – The threshold for text center region probability map. Defaults to 0.2.
center_region_area_thr (int) – The threshold for filtering small-sized text center region. Defaults to 50.
local_graph_thr (float) – The threshold to filter identical local graphs. Defaults to 0.7.
module_loss (dict) – The config of loss that DRRGHead uses. Defaults to dict(type='DRRGModuleLoss').
postprocessor (dict) – Config of postprocessor for DRRG. Defaults to dict(type='DRRGPostprocessor', link_thr=0.85).
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type='Normal', override=dict(name='out_conv'), mean=0, std=0.01).
- Return type
- forward(inputs, data_samples=None)[源代码]¶
Run DRRG head in prediction mode, and return the raw tensors only. :param inputs: Shape of \((1, C, H, W)\). :type inputs: Tensor :param data_samples: A list of data
samples. Defaults to None.
- 返回
Returns (edge, score, text_comps).
edge (ndarray): The edge array of shape \((N_{edges}, 2)\) where each row is a pair of text component indices that makes up an edge in graph.
score (ndarray): The score array of shape \((N_{edges},)\), corresponding to the edge above.
text_comps (ndarray): The text components of shape \((M, 9)\) where each row corresponds to one box and its score: (x1, y1, x2, y2, x3, y3, x4, y4, score).
- 返回类型
- 参数
inputs (torch.Tensor) –
data_samples (list[TextDetDataSample], optional) –
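The (edge, score) outputs are turned into text instances by linking components whose edge score passes a threshold. A hedged sketch of what a DRRG-style post-processor does with them, using a union-find over the kept edges (the real DRRGPostprocessor also merges component boxes into polygons):

```python
def link_components(num_comps, edges, scores, link_thr=0.85):
    """Group text components into instances by thresholding edge scores."""
    parent = list(range(num_comps))

    def find(a):
        # Path-halving union-find lookup.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for (u, v), s in zip(edges, scores):
        if s > link_thr:            # keep only confident links
            parent[find(u)] = find(v)

    groups = {}
    for i in range(num_comps):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Components 0 and 1 are strongly linked; component 2 stays on its own.
instances = link_components(3, [(0, 1), (1, 2)], [0.95, 0.4])
```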
- loss(inputs, data_samples)[source]¶
Loss function.
- Parameters
inputs (Tensor) – Shape of \((N, C, H, W)\).
data_samples (List[TextDetDataSample]) – List of data samples.
- Returns
A tuple of (pred_maps, gcn_pred, gt_labels):
- pred_maps (Tensor): Prediction map with shape \((N, 6, H, W)\).
- gcn_pred (Tensor): Prediction from the GCN module, with shape \((N, 2)\).
- gt_labels (Tensor): Ground-truth labels of shape \((m, n)\) where \(m * n = N\).
- Return type
tuple(pred_maps, gcn_pred, gt_labels)
- class mmocr.models.textdet.heads.FCEHead(in_channels, fourier_degree=5, module_loss={'num_sample': 50, 'type': 'FCEModuleLoss'}, postprocessor={'alpha': 1.0, 'beta': 2.0, 'num_reconstr_points': 50, 'score_thr': 0.3, 'text_repr_type': 'poly', 'type': 'FCEPostprocessor'}, init_cfg={'mean': 0, 'override': [{'name': 'out_conv_cls'}, {'name': 'out_conv_reg'}], 'std': 0.01, 'type': 'Normal'})[source]¶
The class for implementing FCENet head.
FCENet (CVPR 2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection
- Parameters
in_channels (int) – The number of input channels.
fourier_degree (int) – The maximum Fourier transform degree k. Defaults to 5.
module_loss (dict) – Config of loss for FCENet. Defaults to dict(type='FCEModuleLoss', num_sample=50).
postprocessor (dict) – Config of postprocessor for FCENet.
init_cfg (dict, optional) – Initialization configs.
- Return type
- forward(inputs, data_samples=None)[source]¶
- Parameters
inputs (List[Tensor]) – Each tensor has the shape of \((N, C_i, H_i, W_i)\).
data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.
- Returns
A list of dicts with keys cls_res and reg_res, corresponding to the classification result and regression result computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).
- Return type
- class mmocr.models.textdet.heads.PANHead(in_channels, hidden_dim, out_channel, module_loss={'type': 'PANModuleLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'PANPostprocessor'}, init_cfg=[{'type': 'Normal', 'mean': 0, 'std': 0.01, 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'bias': 0, 'layer': 'BN'}])[source]¶
The class for the PANet head.
- Parameters
in_channels (list[int]) – A list of 4 numbers of input channels.
hidden_dim (int) – The hidden dimension of the first convolutional layer.
out_channel (int) – Number of output channels.
module_loss (dict) – Configuration dictionary for loss type. Defaults to dict(type='PANModuleLoss').
postprocessor (dict) – Config of postprocessor for PANet. Defaults to dict(type='PANPostprocessor', text_repr_type='poly').
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to [dict(type='Normal', mean=0, std=0.01, layer='Conv2d'), dict(type='Constant', val=1, bias=0, layer='BN')].
- Return type
- forward(inputs, data_samples=None)[source]¶
PAN head forward.
- Parameters
inputs (list[Tensor] | Tensor) – Each tensor has the shape of \((N, C_i, W, H)\), where \(\sum_iC_i=C_{in}\) and \(C_{in}\) is input_channels.
data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.
- Returns
A tensor of shape \((N, C_{out}, W, H)\) where \(C_{out}\) is output_channels.
- Return type
Tensor
- class mmocr.models.textdet.heads.PSEHead(in_channels, hidden_dim, out_channel, module_loss={'type': 'PSEModuleLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'PSEPostprocessor'}, init_cfg=None)[source]¶
The class for the PSENet head.
- Parameters
in_channels (list[int]) – A list of numbers of input channels.
hidden_dim (int) – The hidden dimension of the first convolutional layer.
out_channel (int) – Number of output channels.
module_loss (dict) – Configuration dictionary for loss type. Supported loss types are "PANModuleLoss" and "PSEModuleLoss". Defaults to PSEModuleLoss.
postprocessor (dict) – Config of postprocessor for PSENet.
init_cfg (dict or list[dict], optional) – Initialization configs.
- Return type
- class mmocr.models.textdet.heads.TextSnakeHead(in_channels, out_channels=5, downsample_ratio=1.0, module_loss={'type': 'TextSnakeModuleLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'TextSnakePostprocessor'}, init_cfg={'mean': 0, 'override': {'name': 'out_conv'}, 'std': 0.01, 'type': 'Normal'})[source]¶
The class for the TextSnake head: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
- Parameters
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
downsample_ratio (float) – Downsample ratio.
module_loss (dict) – Configuration dictionary for loss type. Defaults to dict(type='TextSnakeModuleLoss').
postprocessor (dict) – Config of postprocessor for TextSnake.
init_cfg (dict or list[dict], optional) – Initialization configs.
- Return type
- forward(inputs, data_samples=None)[source]¶
- Parameters
inputs (torch.Tensor) – Shape \((N, C_{in}, H, W)\), where \(C_{in}\) is in_channels. \(H\) and \(W\) should be the same as the input of the backbone.
data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.
- Returns
A tensor of shape \((N, 5, H, W)\), where the five channels represent [0]: text score, [1]: center score, [2]: sin, [3]: cos and [4]: radius, respectively.
- Return type
Tensor
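The sin, cos and radius channels encode the local geometry of the text region at each pixel. A hedged sketch of how they can be interpreted (the actual postprocessor walks the center line and strokes overlapping disks, which this does not attempt):

```python
import math

def local_geometry(sin_val, cos_val, radius):
    """Recover the text direction angle and local stroke width
    from per-pixel sin/cos/radius values."""
    theta = math.atan2(sin_val, cos_val)  # orientation of the text at this pixel
    width = 2.0 * radius                  # a disk of radius r spans 2r
    return theta, width

# A pixel with sin=0, cos=1 lies on horizontal text.
theta, width = local_geometry(0.0, 1.0, 5.0)
```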
Text Detection Necks¶
- class mmocr.models.textdet.necks.FPEM_FFM(in_channels, conv_out=128, fpem_repeat=2, align_corners=False, init_cfg={'distribution': 'uniform', 'layer': 'Conv2d', 'type': 'Xavier'})[source]¶
This code is from https://github.com/WenmuZhou/PAN.pytorch.
- Parameters
in_channels (list[int]) – A list of 4 numbers of input channels.
conv_out (int) – Number of output channels.
fpem_repeat (int) – Number of FPEM layers before FFM operations.
align_corners (bool) – The interpolation behaviour in FFM operations, used in torch.nn.functional.interpolate().
init_cfg (dict or list[dict], optional) – Initialization configs.
- class mmocr.models.textdet.necks.FPNC(in_channels, lateral_channels=256, out_channels=64, bias_on_lateral=False, bn_re_on_lateral=False, bias_on_smooth=False, bn_re_on_smooth=False, asf_cfg=None, conv_after_concat=False, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv'}, {'type': 'Constant', 'layer': 'BatchNorm', 'val': 1.0, 'bias': 0.0001}])[source]¶
FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.
This was partially adapted from https://github.com/MhLiao/DB and https://github.com/WenmuZhou/DBNet.pytorch.
- Parameters
in_channels (list[int]) – A list of numbers of input channels.
lateral_channels (int) – Number of channels for lateral layers.
out_channels (int) – Number of output channels.
bias_on_lateral (bool) – Whether to use bias on lateral convolutional layers.
bn_re_on_lateral (bool) – Whether to use BatchNorm and ReLU on lateral convolutional layers.
bias_on_smooth (bool) – Whether to use bias on smoothing layer.
bn_re_on_smooth (bool) – Whether to use BatchNorm and ReLU on smoothing layer.
asf_cfg (dict, optional) – Adaptive Scale Fusion module configs. The attention_type can be ‘ScaleChannelSpatial’.
conv_after_concat (bool) – Whether to add a convolution layer after the concatenation of predictions.
init_cfg (dict or list[dict], optional) – Initialization configs.
- Return type
- class mmocr.models.textdet.necks.FPNF(in_channels=[256, 512, 1024, 2048], out_channels=256, fusion_type='concat', init_cfg={'distribution': 'uniform', 'layer': 'Conv2d', 'type': 'Xavier'})[source]¶
FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.
- Parameters
in_channels (list[int]) – A list of number of input channels. Defaults to [256, 512, 1024, 2048].
out_channels (int) – The number of output channels. Defaults to 256.
fusion_type (str) – Type of the final feature fusion layer. Available options are “concat” and “add”. Defaults to “concat”.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type=’Xavier’, layer=’Conv2d’, distribution=’uniform’)
- Return type
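The difference between the two fusion_type options is worth spelling out: "concat" stacks feature maps along the channel dimension, while "add" sums them element-wise and so requires matching channel counts. A minimal sketch with channels as plain lists (not the tensor implementation):

```python
def fuse(features, fusion_type="concat"):
    """Sketch of the final feature fusion step."""
    if fusion_type == "concat":
        # Output channel count is the sum of the inputs' channel counts.
        return [c for f in features for c in f]
    if fusion_type == "add":
        # Element-wise sum keeps the channel count unchanged.
        assert len({len(f) for f in features}) == 1, "channel mismatch"
        return [sum(chans) for chans in zip(*features)]
    raise ValueError(fusion_type)

concat = fuse([[1, 2], [3, 4]], "concat")  # 4 channels out
added = fuse([[1, 2], [3, 4]], "add")      # 2 channels out
```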
- class mmocr.models.textdet.necks.FPN_UNet(in_channels, out_channels, init_cfg={'distribution': 'uniform', 'layer': ['Conv2d', 'ConvTranspose2d'], 'type': 'Xavier'})[source]¶
The class for implementing the U-Net-like FPN used by DRRG and TextSnake.
DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
- Parameters
- Return type
Text Detection Module Losses¶
- class mmocr.models.textdet.module_losses.DBModuleLoss(loss_prob={'type': 'MaskedBalancedBCEWithLogitsLoss'}, loss_thr={'beta': 0, 'type': 'MaskedSmoothL1Loss'}, loss_db={'type': 'MaskedDiceLoss'}, weight_prob=5.0, weight_thr=10.0, shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_sidelength=8)[source]¶
The class for implementing DBNet loss.
This is partially adapted from https://github.com/MhLiao/DB.
- Parameters
loss_prob (dict) – The loss config for the probability map. Defaults to dict(type='MaskedBalancedBCEWithLogitsLoss').
loss_thr (dict) – The loss config for the threshold map. Defaults to dict(type='MaskedSmoothL1Loss', beta=0).
loss_db (dict) – The loss config for the binary map. Defaults to dict(type='MaskedDiceLoss').
weight_prob (float) – The weight of the probability map loss. Denoted as \(\alpha\) in the paper. Defaults to 5.
weight_thr (float) – The weight of the threshold map loss. Denoted as \(\beta\) in the paper. Defaults to 10.
shrink_ratio (float) – The ratio of the shrunk text region. Defaults to 0.4.
thr_min (float) – The minimum threshold map value. Defaults to 0.3.
thr_max (float) – The maximum threshold map value. Defaults to 0.7.
min_sidelength (int or float) – The minimum sidelength of the minimum rotated rectangle around any text region. Defaults to 8.
- Return type
- forward(preds, data_samples)[source]¶
Compute DBNet loss.
- Parameters
preds (tuple(tensor)) – Raw predictions from the model, containing prob_logits, thr_map and binary_map. Each is a tensor of shape \((N, H, W)\).
data_samples (list[TextDetDataSample]) – The data samples.
- Returns
The dict of DBNet losses with loss_prob, loss_db and loss_thr.
- Return type
dict
- get_targets(data_samples)[source]¶
Generate loss targets from data samples.
- Parameters
data_samples (list(TextDetDataSample)) – Ground truth data samples.
- Returns
A tuple of four tensors as DBNet targets.
- Return type
- class mmocr.models.textdet.module_losses.DRRGModuleLoss(ohem_ratio=3.0, downsample_ratio=1.0, orientation_thr=2.0, resample_step=8.0, num_min_comps=9, num_max_comps=600, min_width=8.0, max_width=24.0, center_region_shrink_ratio=0.3, comp_shrink_ratio=1.0, comp_w_h_ratio=0.3, text_comp_nms_thr=0.25, min_rand_half_height=8.0, max_rand_half_height=24.0, jitter_level=0.2, loss_text={'eps': 1e-05, 'fallback_negative_num': 100, 'type': 'MaskedBalancedBCEWithLogitsLoss'}, loss_center={'type': 'MaskedBCEWithLogitsLoss'}, loss_top={'reduction': 'none', 'type': 'SmoothL1Loss'}, loss_btm={'reduction': 'none', 'type': 'SmoothL1Loss'}, loss_sin={'type': 'MaskedSmoothL1Loss'}, loss_cos={'type': 'MaskedSmoothL1Loss'}, loss_gcn={'type': 'CrossEntropyLoss'})[source]¶
The class for implementing DRRG loss. This is partially adapted from https://github.com/GXYM/DRRG licensed under the MIT license.
DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.
- Parameters
ohem_ratio (float) – The negative/positive ratio in ohem. Defaults to 3.0.
downsample_ratio (float) – Downsample ratio. Defaults to 1.0. TODO: remove it.
orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle. Defaults to 2.0.
resample_step (float) – The step size for resampling the text center line. Defaults to 8.0.
num_min_comps (int) – The minimum number of text components, which should be larger than k_hop1 mentioned in paper. Defaults to 9.
num_max_comps (int) – The maximum number of text components. Defaults to 600.
min_width (float) – The minimum width of text components. Defaults to 8.0.
max_width (float) – The maximum width of text components. Defaults to 24.0.
center_region_shrink_ratio (float) – The shrink ratio of text center regions. Defaults to 0.3.
comp_shrink_ratio (float) – The shrink ratio of text components. Defaults to 1.0.
comp_w_h_ratio (float) – The width to height ratio of text components. Defaults to 0.3.
min_rand_half_height (float) – The minimum half-height of random text components. Defaults to 8.0.
max_rand_half_height (float) – The maximum half-height of random text components. Defaults to 24.0.
jitter_level (float) – The jitter level of text component geometric features. Defaults to 0.2.
loss_text (dict) – The loss config used to calculate the text loss. Defaults to dict(type='MaskedBalancedBCEWithLogitsLoss', fallback_negative_num=100, eps=1e-5).
loss_center (dict) – The loss config used to calculate the center loss. Defaults to dict(type='MaskedBCEWithLogitsLoss').
loss_top (dict) – The loss config used to calculate the top loss, which is part of the height loss. Defaults to dict(type='SmoothL1Loss', reduction='none').
loss_btm (dict) – The loss config used to calculate the bottom loss, which is part of the height loss. Defaults to dict(type='SmoothL1Loss', reduction='none').
loss_sin (dict) – The loss config used to calculate the sin loss. Defaults to dict(type='MaskedSmoothL1Loss').
loss_cos (dict) – The loss config used to calculate the cos loss. Defaults to dict(type='MaskedSmoothL1Loss').
loss_gcn (dict) – The loss config used to calculate the GCN loss. Defaults to dict(type='CrossEntropyLoss').
text_comp_nms_thr (float) –
- Return type
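The ohem_ratio parameter caps how many negative samples contribute to the loss relative to the positives. A hedged sketch of the sampling rule (keep all positives, then only the hardest negatives up to ratio × num_positives; the real implementation works on masked tensors, not lists):

```python
def ohem_select(pos_losses, neg_losses, ohem_ratio=3.0):
    """Online hard example mining: keep all positive losses and the
    hardest negatives, capped at ohem_ratio negatives per positive."""
    num_neg = min(len(neg_losses), int(ohem_ratio * len(pos_losses)))
    hardest = sorted(neg_losses, reverse=True)[:num_neg]  # largest losses first
    return pos_losses + hardest

# One positive and ohem_ratio=3.0 keeps at most 3 of the 4 negatives.
kept = ohem_select([1.0], [0.9, 0.1, 0.5, 0.7], ohem_ratio=3.0)
```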
- forward(preds, data_samples)[source]¶
Compute DRRG loss.
- Parameters
preds (tuple) – The prediction tuple (pred_maps, gcn_pred, gt_labels), of shapes \((N, 6, H, W)\), \((N, 2)\) and \((m, n)\), where \(m * n = N\).
data_samples (list[TextDetDataSample]) – The data samples.
- Returns
A loss dict with loss_text, loss_center, loss_height, loss_sin, loss_cos and loss_gcn.
- Return type
- get_targets(data_samples)[source]¶
Generate loss targets from data samples.
- Parameters
data_samples (list(TextDetDataSample)) – Ground truth data samples.
- Returns
A tuple of 8 lists of tensors as DRRG targets. Read the docstring of _get_target_single for more details.
- Return type
- class mmocr.models.textdet.module_losses.FCEModuleLoss(fourier_degree, num_sample, negative_ratio=3.0, resample_step=4.0, center_region_shrink_ratio=0.3, level_size_divisors=(8, 16, 32), level_proportion_range=((0, 0.4), (0.3, 0.7), (0.6, 1.0)), loss_tr={'type': 'MaskedBalancedBCELoss'}, loss_tcl={'type': 'MaskedBCELoss'}, loss_reg_x={'reduction': 'none', 'type': 'SmoothL1Loss'}, loss_reg_y={'reduction': 'none', 'type': 'SmoothL1Loss'})[source]¶
The class for implementing FCENet loss.
FCENet (CVPR 2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection
- Parameters
fourier_degree (int) – The maximum Fourier transform degree k.
num_sample (int) – The number of sampling points for the regression loss. If it is too small, FCENet tends to overfit.
negative_ratio (float or int) – Maximum ratio of negative samples to positive ones in OHEM. Defaults to 3.
resample_step (float) – The step size for resampling the text center line (TCL). It’s better not to exceed half of the minimum width.
center_region_shrink_ratio (float) – The shrink ratio of text center region.
level_size_divisors (tuple(int)) – The downsample ratio on each level.
level_proportion_range (tuple(tuple(int))) – The range of text sizes assigned to each level.
loss_tr (dict) – The loss config used to calculate the text region loss. Defaults to dict(type=’MaskedBalancedBCELoss’).
loss_tcl (dict) – The loss config used to calculate the text center line loss. Defaults to dict(type=’MaskedBCELoss’).
loss_reg_x (dict) – The loss config used to calculate the regression loss on the x axis. Defaults to dict(type='SmoothL1Loss', reduction='none').
loss_reg_y (dict) – The loss config used to calculate the regression loss on the y axis. Defaults to dict(type='SmoothL1Loss', reduction='none').
- Return type
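The level_proportion_range parameter assigns each text instance to the feature levels whose size range contains it; the ranges deliberately overlap so border-sized instances supervise two levels. A small sketch of that assignment rule (a simplification of the target-generation code):

```python
def assign_levels(size_ratio, ranges=((0, 0.4), (0.3, 0.7), (0.6, 1.0))):
    """Return the indices of the feature levels whose proportion range
    contains a text instance of the given relative size."""
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= size_ratio <= hi]

# A medium-sized instance falls in the overlap of levels 0 and 1,
# while a large one is handled by level 2 only.
medium = assign_levels(0.35)
large = assign_levels(0.9)
```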
- forward(preds, data_samples)[source]¶
Compute FCENet loss.
- Parameters
preds (list[dict]) – A list of dicts with keys cls_res and reg_res, corresponding to the classification result and regression result computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).
data_samples (list[TextDetDataSample]) – The data samples.
- Returns
The dict of FCENet losses with loss_text, loss_center, loss_reg_x and loss_reg_y.
- Return type
- forward_single(pred, gt)[source]¶
Compute loss for one feature level.
- Parameters
pred (dict) – A dict with keys cls_res and reg_res, corresponding to the classification result and regression result from one feature level.
gt (Tensor) – Ground truth for one feature level. Cls and reg targets are concatenated along the channel dimension.
- Returns
A list of losses for the feature level.
- Return type
list[Tensor]
- get_targets(data_samples)[source]¶
Generate loss targets for FCENet from data samples.
- Parameters
data_samples (list(TextDetDataSample)) – Ground truth data samples.
- Returns
A tuple of three tensors from three different feature levels as FCENet targets.
- Return type
tuple[Tensor]
- class mmocr.models.textdet.module_losses.PANModuleLoss(loss_text={'type': 'MaskedSquareDiceLoss'}, loss_kernel={'type': 'MaskedSquareDiceLoss'}, loss_embedding={'type': 'PANEmbLossV1'}, weight_text=1.0, weight_kernel=0.5, weight_embedding=0.25, ohem_ratio=3, shrink_ratio=(1.0, 0.5), max_shrink_dist=20, reduction='mean')[source]¶
The class for implementing PANet loss. This was partially adapted from https://github.com/whai362/pan_pp.pytorch and https://github.com/WenmuZhou/PAN.pytorch.
PANet: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network.
- Parameters
loss_text (dict) – The loss config for text. Defaults to dict(type='MaskedSquareDiceLoss').
loss_kernel (dict) – The loss config for kernels. Defaults to dict(type='MaskedSquareDiceLoss').
loss_embedding (dict) – The loss config for embeddings. Defaults to dict(type='PANEmbLossV1').
weight_text (float) – The weight of the text loss. Defaults to 1.
weight_kernel (float) – The weight of the kernel loss. Defaults to 0.5.
weight_embedding (float) – The weight of the embedding loss. Defaults to 0.25.
ohem_ratio (float) – The negative/positive ratio in OHEM. Defaults to 3.
shrink_ratio (tuple[float]) – The ratios for shrinking kernels. Defaults to (1.0, 0.5).
max_shrink_dist (int or float) – The maximum shrinking distance. Defaults to 20.
reduction (str) – The way to reduce the loss. Available options are "mean" and "sum". Defaults to 'mean'.
- Return type
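The three weight parameters combine the loss terms linearly. A one-function sketch of that combination, using the documented defaults:

```python
def pan_total_loss(loss_text, loss_kernel, loss_emb,
                   weight_text=1.0, weight_kernel=0.5, weight_embedding=0.25):
    """Combine the three PANet loss terms with their weights:
    L = w_text * L_text + w_kernel * L_kernel + w_embedding * L_embedding."""
    return (weight_text * loss_text
            + weight_kernel * loss_kernel
            + weight_embedding * loss_emb)

total = pan_total_loss(0.8, 0.4, 0.2)
```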
- forward(preds, data_samples)[源代码]¶
Compute PAN loss.
- 参数
preds (dict) – Raw predictions from model with shape \((N, C, H, W)\).
data_samples (list[TextDetDataSample]) – The data samples.
- 返回
The dict of PAN losses with loss_text, loss_kernel, loss_aggregation and loss_discrimination.
- 返回类型
- get_targets(data_samples)[源代码]¶
Generate the gt targets for PANet.
- 参数
data_samples (Sequence[mmocr.structures.textdet_data_sample.TextDetDataSample]) –
- 返回
The output result dictionary.
- 返回类型
dict
- class mmocr.models.textdet.module_losses.PSEModuleLoss(weight_text=0.7, weight_kernel=0.3, loss_text={'type': 'MaskedSquareDiceLoss'}, loss_kernel={'type': 'MaskedSquareDiceLoss'}, ohem_ratio=3, reduction='mean', kernel_sample_type='adaptive', shrink_ratio=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4), max_shrink_dist=20)[源代码]¶
The class for implementing PSENet loss. This is partially adapted from https://github.com/whai362/PSENet.
PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network.
- 参数
weight_text (float) – The weight of text loss. Defaults to 0.7.
weight_kernel (float) – The weight of text kernel. Defaults to 0.3.
loss_text (dict) – Loss type for text. Defaults to dict(type=’MaskedSquareDiceLoss’).
loss_kernel (dict) – Loss type for kernel. Defaults to dict(type=’MaskedSquareDiceLoss’).
ohem_ratio (int or float) – The negative/positive ratio in ohem. Defaults to 3.
reduction (str) – The way to reduce the loss. Defaults to ‘mean’. Options are ‘mean’ and ‘sum’.
kernel_sample_type (str) – The way to sample kernel. Defaults to adaptive. Options are ‘adaptive’ and ‘hard’.
shrink_ratio (tuple) – The ratio for shrinking text instances. Defaults to (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4).
max_shrink_dist (int or float) – The maximum shrinking distance. Defaults to 20.
- 返回类型
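Each value in shrink_ratio maps to a per-scale shrink offset. A minimal sketch, assuming the polygon-offset formula d = A(1 − r²)/L commonly used by PSE-style shrinking (the helper name is hypothetical), with max_shrink_dist capping the offset:

```python
def shrink_offset(area, perimeter, ratio, max_shrink_dist=20):
    # Offset distance for one kernel scale: d = A * (1 - r^2) / L, capped.
    d = area * (1.0 - ratio ** 2) / (perimeter + 1e-6)
    return min(d, max_shrink_dist)

# Offsets for each scale of a 100x100 box (area 10000, perimeter 400).
offsets = [shrink_offset(10000, 400, r)
           for r in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4)]
```

Ratio 1.0 yields no shrinking, and smaller ratios produce progressively larger offsets, which is what generates the nested kernels PSENet expands from.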
- forward(preds, data_samples)[源代码]¶
Compute PSENet loss.
- 参数
preds (torch.Tensor) – Raw predictions from model with shape \((N, C, H, W)\).
data_samples (list[TextDetDataSample]) – The data samples.
- 返回
The dict of PSENet losses with loss_text and loss_kernel.
- 返回类型
- class mmocr.models.textdet.module_losses.SegBasedModuleLoss[源代码]¶
Base class for the module loss of segmentation-based text detection algorithms with some handy utilities.
- 返回类型
- class mmocr.models.textdet.module_losses.TextSnakeModuleLoss(ohem_ratio=3.0, downsample_ratio=1.0, orientation_thr=2.0, resample_step=4.0, center_region_shrink_ratio=0.3, loss_text={'eps': 1e-05, 'fallback_negative_num': 100, 'type': 'MaskedBalancedBCEWithLogitsLoss'}, loss_center={'type': 'MaskedBCEWithLogitsLoss'}, loss_radius={'type': 'MaskedSmoothL1Loss'}, loss_sin={'type': 'MaskedSmoothL1Loss'}, loss_cos={'type': 'MaskedSmoothL1Loss'})[源代码]¶
The class for implementing TextSnake loss. This is partially adapted from https://github.com/princewang1994/TextSnake.pytorch.
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
- 参数
ohem_ratio (float) – The negative/positive ratio in ohem.
downsample_ratio (float) – Downsample ratio. Defaults to 1.0. TODO: remove it.
orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.
resample_step (float) – The step of resampling.
center_region_shrink_ratio (float) – The shrink ratio of text center.
loss_text (dict) – The loss config used to calculate the text loss.
loss_center (dict) – The loss config used to calculate the center loss.
loss_radius (dict) – The loss config used to calculate the radius loss.
loss_sin (dict) – The loss config used to calculate the sin loss.
loss_cos (dict) – The loss config used to calculate the cos loss.
- 返回类型
- forward(preds, data_samples)[源代码]¶
- 参数
preds (Tensor) – The prediction map of shape \((N, 5, H, W)\), where each dimension is the map of “text_region”, “center_region”, “sin_map”, “cos_map”, and “radius_map” respectively.
data_samples (list[TextDetDataSample]) – The data samples.
- 返回
A loss dict with loss_text, loss_center, loss_radius, loss_sin and loss_cos.
- 返回类型
- get_targets(data_samples)[源代码]¶
Generate loss targets from data samples.
- 参数
data_samples (list(TextDetDataSample)) – Ground truth data samples.
- 返回
tuple(gt_text_masks, gt_masks, gt_center_region_masks, gt_radius_maps, gt_sin_maps, gt_cos_maps): A tuple of six lists of ndarrays as the targets.
- 返回类型
Tuple
- vector_angle(vec1, vec2)[源代码]¶
Compute the angle between two vectors.
- 参数
vec1 (numpy.ndarray) –
vec2 (numpy.ndarray) –
- 返回类型
- vector_cos(vec)[源代码]¶
Compute the cos of the angle between vector and x-axis.
- 参数
vec (numpy.ndarray) –
- 返回类型
- vector_sin(vec)[源代码]¶
Compute the sin of the angle between vector and x-axis.
- 参数
vec (numpy.ndarray) –
- 返回类型
- vector_slope(vec)[源代码]¶
Compute the slope of a vector.
- 参数
vec (numpy.ndarray) –
- 返回类型
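The geometric helpers above admit short, NumPy-free sketches; the eps guards against zero-length vectors are illustrative, not necessarily MMOCR's exact values.

```python
import math

def vector_sin(vec):
    # Sine of the angle between vec and the x-axis.
    x, y = vec
    return y / (math.hypot(x, y) + 1e-8)

def vector_cos(vec):
    # Cosine of the angle between vec and the x-axis.
    x, y = vec
    return x / (math.hypot(x, y) + 1e-8)

def vector_slope(vec):
    # Absolute slope |y / x| of a vector.
    x, y = vec
    return abs(y / (x + 1e-8))

def vector_angle(vec1, vec2):
    # Angle between two vectors in radians, via the dot product of unit vectors.
    c = vector_cos(vec1) * vector_cos(vec2) + vector_sin(vec1) * vector_sin(vec2)
    return math.acos(max(-1.0, min(1.0, c)))
```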
Text Detection Data Preprocessors¶
- class mmocr.models.textdet.data_preprocessors.TextDetDataPreprocessor(mean=None, std=None, pad_size_divisor=1, pad_value=0, bgr_to_rgb=False, rgb_to_bgr=False, batch_augments=None)[源代码]¶
Image pre-processor for detection tasks.
Compared with mmengine.ImgDataPreprocessor, it:
1. Supports batch augmentations.
2. Additionally appends batch_input_shape and pad_shape to data_samples, considering the object detection task.
It provides the data pre-processing as follows:
Collate and move data to the target device.
Pad inputs to the maximum size of the current batch with the defined pad_value. The padding size can be made divisible by a defined pad_size_divisor.
Stack inputs to batch_inputs.
Convert inputs from BGR to RGB if the shape of input is (3, H, W).
Normalize image with defined std and mean.
Do batch augmentations during training.
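The padding step above can be sketched as follows; pad_to_divisor is an illustrative helper (not part of the API) that computes the padded spatial size.

```python
def pad_to_divisor(h, w, divisor):
    # Smallest (pad_h, pad_w) >= (h, w) with both sides divisible by `divisor`.
    pad_h = (h + divisor - 1) // divisor * divisor
    pad_w = (w + divisor - 1) // divisor * divisor
    return pad_h, pad_w
```

For example, a 30x33 image with pad_size_divisor=32 is padded to 32x64 before batching.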
- 参数
mean (Sequence[Number], optional) – The pixel mean of R, G, B channels. Defaults to None.
std (Sequence[Number], optional) – The pixel standard deviation of R, G, B channels. Defaults to None.
pad_size_divisor (int) – The size of the padded image should be divisible by pad_size_divisor. Defaults to 1.
pad_value (Number) – The padded pixel value. Defaults to 0.
pad_mask (bool) – Whether to pad instance masks. Defaults to False.
mask_pad_value (int) – The padded pixel value for instance masks. Defaults to 0.
pad_seg (bool) – Whether to pad semantic segmentation maps. Defaults to False.
seg_pad_value (int) – The padded pixel value for semantic segmentation maps. Defaults to 255.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.
rgb_to_bgr (bool) – Whether to convert image from RGB to BGR. Defaults to False.
batch_augments (list[dict], optional) – Batch-level augmentations. Defaults to None.
- 返回类型
Text Detection Postprocessors¶
- class mmocr.models.textdet.postprocessors.BaseTextDetPostProcessor(text_repr_type='poly', rescale_fields=None, train_cfg=None, test_cfg=None)[源代码]¶
Base postprocessor for text detection models.
- 参数
text_repr_type (str) – The boundary encoding type, ‘poly’ or ‘quad’. Defaults to ‘poly’.
rescale_fields (list[str], optional) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed.
train_cfg (dict, optional) – The parameters to be passed to self.get_text_instances in training. Defaults to None.
test_cfg (dict, optional) – The parameters to be passed to self.get_text_instances in testing. Defaults to None.
- 返回类型
- get_text_instances(pred_results, data_sample, **kwargs)[源代码]¶
Get text instance predictions of one image.
- 参数
pred_result (tuple(Tensor)) – Prediction results of an image.
data_sample (TextDetDataSample) – Datasample of an image.
**kwargs – Other parameters. Configurable via __init__.train_cfg and __init__.test_cfg.
pred_results (Union[torch.Tensor, List[torch.Tensor]]) –
- 返回
A new DataSample with predictions filled in. The polygon/bbox results are usually saved in TextDetDataSample.pred_instances.polygons or TextDetDataSample.pred_instances.bboxes. The confidence scores are saved in TextDetDataSample.pred_instances.scores.
- 返回类型
- rescale(results, scale_factor)[源代码]¶
Rescale results in results.pred_instances according to scale_factor, whose keys are defined in self.rescale_fields. Usually used to rescale bboxes and/or polygons.
- 参数
results (TextDetDataSample) – The post-processed prediction results.
- 返回
Prediction results with rescaled results.
- 返回类型
- class mmocr.models.textdet.postprocessors.DBPostprocessor(text_repr_type='poly', rescale_fields=['polygons'], mask_thr=0.3, min_text_score=0.3, min_text_width=5, unclip_ratio=1.5, epsilon_ratio=0.01, max_candidates=3000, **kwargs)[源代码]¶
Decoding predictions of DbNet to instances. This is partially adapted from https://github.com/MhLiao/DB.
- 参数
text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.
rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].
mask_thr (float) – The mask threshold value for binarization. Defaults to 0.3.
min_text_score (float) – The threshold value for converting binary map to shrink text regions. Defaults to 0.3.
min_text_width (int) – The minimum width of boundary polygon/box predicted. Defaults to 5.
unclip_ratio (float) – The unclip ratio for text regions dilation. Defaults to 1.5.
epsilon_ratio (float) – The epsilon ratio for approximation accuracy. Defaults to 0.01.
max_candidates (int) – The maximum candidate number. Defaults to 3000.
- 返回类型
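The role of unclip_ratio can be illustrated with the offset formula used by DB-style post-processing, d = A × r / L (area, unclip ratio, perimeter of the shrunk region). A sketch under that assumption; the helper name is hypothetical:

```python
def unclip_offset(area, perimeter, unclip_ratio=1.5):
    # Dilation distance to expand a shrunk text region back: d = A * r / L.
    return area * unclip_ratio / (perimeter + 1e-6)
```

A larger unclip_ratio dilates the detected region further, trading tighter boundaries for better coverage of the full text instance.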
- get_text_instances(prob_map, data_sample)[源代码]¶
Get text instance predictions of one image.
- 参数
prob_map (Tensor) – DBNet’s output prob_map of shape \((H, W)\).
data_sample (TextDetDataSample) – Datasample of an image.
- 返回
A new DataSample with predictions filled in. Polygons and results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.
- 返回类型
- class mmocr.models.textdet.postprocessors.DRRGPostprocessor(link_thr=0.8, edge_len_thr=50.0, rescale_fields=['polygons'], **kwargs)[源代码]¶
Merge text components and construct boundaries of text instances.
- 参数
- 返回类型
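How link_thr and edge_len_thr interact can be sketched as below; filter_edges is a hypothetical helper illustrating the role of the two thresholds, not DRRG's actual graph-merging code.

```python
def filter_edges(edges, scores, lengths, link_thr=0.8, edge_len_thr=50.0):
    # Keep only edges whose link score passes link_thr and whose length
    # stays under edge_len_thr; surviving edges connect text components.
    return [edge for edge, score, length in zip(edges, scores, lengths)
            if score >= link_thr and length < edge_len_thr]
```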
- get_text_instances(pred_results, data_sample)[源代码]¶
Get text instance predictions of one image.
- 参数
pred_result (tuple(ndarray, ndarray, ndarray)) – Prediction results edge, score and text_comps. Each of shape \((N_{edges}, 2)\), \((N_{edges},)\) and \((M, 9)\), respectively.
data_sample (TextDetDataSample) – Datasample of an image.
pred_results (Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]) –
- 返回
The original DataSample with predictions filled in. Polygons and results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.
- 返回类型
- split_results(pred_results)[源代码]¶
Split batched elements in pred_results along the first dimension into batch_num sub-elements and regather them into a list of dicts. However, DRRG only outputs one batch at inference time, so this function is a no-op.
- 参数
pred_results (Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]) –
- 返回类型
List[Tuple]
- class mmocr.models.textdet.postprocessors.FCEPostprocessor(fourier_degree, num_reconstr_points, rescale_fields=['polygons'], scales=[8, 16, 32], text_repr_type='poly', alpha=1.0, beta=2.0, score_thr=0.3, nms_thr=0.1, **kwargs)[源代码]¶
Decoding predictions of FCENet to instances.
- 参数
fourier_degree (int) – The maximum Fourier transform degree k.
num_reconstr_points (int) – The points number of the polygon reconstructed from predicted Fourier coefficients.
rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].
scales (list[int]) – The down-sample scale of each layer. Defaults to [8, 16, 32].
text_repr_type (str) – Boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.
alpha (float) – The parameter to calculate final scores: \(Score_{final} = Score_{text\ region}^{alpha} \cdot Score_{text\ center\ region}^{beta}\). Defaults to 1.0.
beta (float) – The parameter to calculate final score. Defaults to 2.0.
score_thr (float) – The threshold used to filter out the final candidates. Defaults to 0.3.
nms_thr (float) – The threshold of nms. Defaults to 0.1.
- 返回类型
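Polygon reconstruction from Fourier coefficients follows the inverse transform \(p(t) = \sum_k c_k e^{2\pi i k t}\). A minimal stdlib sketch (the function name and coefficient layout are assumptions for illustration):

```python
import cmath

def reconstruct_polygon(coeffs, fourier_degree, num_reconstr_points):
    # coeffs holds complex coefficients c_{-k}, ..., c_0, ..., c_{+k};
    # sampling t in [0, 1) traces the closed text boundary.
    ks = range(-fourier_degree, fourier_degree + 1)
    points = []
    for n in range(num_reconstr_points):
        t = n / num_reconstr_points
        z = sum(c * cmath.exp(2j * cmath.pi * k * t) for k, c in zip(ks, coeffs))
        points.append((z.real, z.imag))
    return points
```

With only the constant coefficient c_0 set, every reconstructed point collapses to that center, which is why higher-degree coefficients are what encode the polygon's shape.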
- get_text_instances(pred_results, data_sample)[源代码]¶
Get text instance predictions of one image.
- 参数
pred_results (List[dict]) – A list of dicts with keys cls_res and reg_res, corresponding to the classification and regression results computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).
data_sample (TextDetDataSample) – Datasample of an image.
- 返回
A new DataSample with predictions filled in. Polygons and results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.
- 返回类型
- split_results(pred_results)[源代码]¶
Split batched elements in pred_results along the first dimension into batch_num sub-elements and regather them into a list of dicts.
- 参数
pred_results (list[dict]) – A list of dicts with keys cls_res and reg_res, corresponding to the classification and regression results computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).
- 返回
N lists. Each list contains three dicts from different feature levels.
- 返回类型
- class mmocr.models.textdet.postprocessors.PANPostprocessor(text_repr_type='poly', score_threshold=0.3, rescale_fields=['polygons'], min_text_confidence=0.5, min_kernel_confidence=0.5, distance_threshold=3.0, min_text_area=16, downsample_ratio=0.25)[源代码]¶
Convert scores to quadrangles via post processing in PANet. This is partially adapted from https://github.com/WenmuZhou/PAN.pytorch.
- 参数
text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.
score_threshold (float) – The minimal text score. Defaults to 0.3.
rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].
min_text_confidence (float) – The minimal text confidence. Defaults to 0.5.
min_kernel_confidence (float) – The minimal kernel confidence. Defaults to 0.5.
distance_threshold (float) – The minimal distance between the point to mean of text kernel. Defaults to 3.0.
min_text_area (int) – The minimal text instance region area. Defaults to 16.
downsample_ratio (float) – Downsample ratio. Defaults to 0.25.
- 返回类型
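The pixel-aggregation idea behind distance_threshold can be sketched as follows; assign_pixels is a hypothetical helper showing how each text pixel's embedding is matched to the nearest kernel mean, not MMOCR's implementation.

```python
def assign_pixels(pixel_embs, kernel_means, distance_threshold=3.0):
    # Assign each pixel embedding to the closest kernel mean, or -1 when
    # the best distance still exceeds distance_threshold.
    labels = []
    for emb in pixel_embs:
        best, best_dist = -1, float("inf")
        for idx, mean in enumerate(kernel_means):
            dist = sum((a - b) ** 2 for a, b in zip(emb, mean)) ** 0.5
            if dist < best_dist:
                best, best_dist = idx, dist
        labels.append(best if best_dist < distance_threshold else -1)
    return labels
```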
- get_text_instances(pred_results, data_sample, **kwargs)[源代码]¶
Get text instance predictions of one image.
- 参数
pred_results (torch.Tensor) – Prediction results of an image, a tensor of shape \((N, H, W)\).
data_sample (TextDetDataSample) – Datasample of an image.
- 返回
A new DataSample with predictions filled in. Polygons and results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.
- 返回类型
- split_results(pred_results)[源代码]¶
Split the prediction results into text score and kernel score.
- 参数
pred_results (torch.Tensor) – The prediction results.
- 返回
The text score and kernel score.
- 返回类型
List[torch.Tensor]
- class mmocr.models.textdet.postprocessors.PSEPostprocessor(text_repr_type='poly', rescale_fields=['polygons'], min_kernel_confidence=0.5, score_threshold=0.3, min_kernel_area=0, min_text_area=16, downsample_ratio=0.25)[源代码]¶
Decoding predictions of PSENet to instances. This is partially adapted from https://github.com/whai362/PSENet.
- 参数
text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.
rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].
min_kernel_confidence (float) – The minimal kernel confidence. Defaults to 0.5.
score_threshold (float) – The minimal text average confidence. Defaults to 0.3.
min_kernel_area (int) – The minimal text kernel area. Defaults to 0.
min_text_area (int) – The minimal text instance region area. Defaults to 16.
downsample_ratio (float) – Downsample ratio. Defaults to 0.25.
- 返回类型
- get_text_instances(pred_results, data_sample, **kwargs)[源代码]¶
- 参数
pred_results (torch.Tensor) – Prediction results of an image, a tensor of shape \((N, H, W)\).
data_sample (TextDetDataSample) – Datasample of an image.
- 返回
A new DataSample with predictions filled in. Polygons and results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.
- 返回类型
- class mmocr.models.textdet.postprocessors.TextSnakePostprocessor(text_repr_type='poly', min_text_region_confidence=0.6, min_center_region_confidence=0.2, min_center_area=30, disk_overlap_thr=0.03, radius_shrink_ratio=1.03, rescale_fields=['polygons'], **kwargs)[源代码]¶
Decoding predictions of TextSnake to instances. This was partially adapted from https://github.com/princewang1994/TextSnake.pytorch.
- 参数
text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’.
min_text_region_confidence (float) – The confidence threshold of text region in TextSnake.
min_center_region_confidence (float) – The confidence threshold of text center region in TextSnake.
min_center_area (int) – The minimal text center region area.
disk_overlap_thr (float) – The radius overlap threshold for merging disks.
radius_shrink_ratio (float) – The shrink ratio of ordered disks radii.
rescale_fields (list[str], optional) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed.
- 返回类型
- get_text_instances(pred_results, data_sample)[源代码]¶
- 参数
pred_results (torch.Tensor) – Prediction map with shape \((C, H, W)\).
data_sample (TextDetDataSample) – Datasample of an image.
- 返回
The instance boundary and its confidence.
- 返回类型
- split_results(pred_results)[源代码]¶
Split the prediction results into text score and kernel score.
- 参数
pred_results (torch.Tensor) – The prediction results.
- 返回
The text score and kernel score.
- 返回类型
List[torch.Tensor]
Text Recognition Recognizer¶
- class mmocr.models.textrecog.recognizers.ABINet(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
Implementation of `Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition <https://arxiv.org/pdf/2103.06495.pdf>`_.
- 参数
preprocessor (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
backbone (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
encoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
decoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
data_preprocessor (Union[mmengine.config.config.ConfigDict, Dict]) –
init_cfg (Union[Dict, List[Dict]]) –
- 返回类型
- class mmocr.models.textrecog.recognizers.BaseRecognizer(data_preprocessor=None, init_cfg=None)[源代码]¶
Base class for recognizer.
- 参数
- abstract extract_feat(inputs)[源代码]¶
Extract features from images.
- 参数
inputs (torch.Tensor) –
- 返回类型
- forward(inputs, data_samples=None, mode='tensor', **kwargs)[源代码]¶
The unified entry for a forward process in both training and test.
The method should accept three modes: “tensor”, “predict” and “loss”:
“tensor”: Forward the whole network and return a tensor or tuple of tensors without any post-processing, same as a common nn.Module.
“predict”: Forward and return the predictions, which are fully processed into a list of DetDataSample.
“loss”: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back propagation nor optimizer updating; those are done in train_step().
- 参数
inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (list[DetDataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – Return what kind of value. Defaults to ‘tensor’.
- 返回
The return type depends on mode.
If mode="tensor", return a tensor or a tuple of tensors.
If mode="predict", return a list of DetDataSample.
If mode="loss", return a dict of tensors.
- 返回类型
Union[Dict[str, torch.Tensor], List[mmocr.structures.textrecog_data_sample.TextRecogDataSample], Tuple[torch.Tensor], torch.Tensor]
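The three-mode dispatch can be sketched as below. TinyRecognizer is a toy stand-in: its loss/predict/_forward bodies are placeholders, and only the dispatch pattern reflects the documented behavior.

```python
class TinyRecognizer:
    """Minimal sketch of the mode dispatch in a recognizer's forward()."""

    def loss(self, inputs, data_samples):
        # Placeholder: a real model returns its computed loss terms here.
        return {"loss_ce": 0.0}

    def predict(self, inputs, data_samples):
        # Placeholder: a real model returns post-processed data samples.
        return list(data_samples or [])

    def _forward(self, inputs, data_samples=None):
        # Placeholder: raw network output without post-processing.
        return inputs

    def forward(self, inputs, data_samples=None, mode="tensor"):
        # Unified entry: route to loss/predict/raw-tensor paths by `mode`.
        if mode == "loss":
            return self.loss(inputs, data_samples)
        if mode == "predict":
            return self.predict(inputs, data_samples)
        if mode == "tensor":
            return self._forward(inputs, data_samples)
        raise RuntimeError(f"Invalid mode {mode!r}")
```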
- abstract loss(inputs, data_samples, **kwargs)[源代码]¶
Calculate losses from a batch of inputs and data samples.
- 参数
inputs (torch.Tensor) –
data_samples (List[mmocr.structures.textrecog_data_sample.TextRecogDataSample]) –
- 返回类型
- abstract predict(inputs, data_samples, **kwargs)[源代码]¶
Predict results from a batch of inputs and data samples with post-processing.
- 参数
inputs (torch.Tensor) –
data_samples (List[mmocr.structures.textrecog_data_sample.TextRecogDataSample]) –
- 返回类型
List[mmocr.structures.textrecog_data_sample.TextRecogDataSample]
- class mmocr.models.textrecog.recognizers.CRNN(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
CTC-loss based recognizer.
- 参数
preprocessor (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
backbone (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
encoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
decoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
data_preprocessor (Union[mmengine.config.config.ConfigDict, Dict]) –
init_cfg (Union[Dict, List[Dict]]) –
- 返回类型
- class mmocr.models.textrecog.recognizers.EncoderDecoderRecognizer(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
Base class for encode-decode recognizer.
- 参数
preprocessor (dict, optional) – Config dict for preprocessor. Defaults to None.
backbone (dict, optional) – Backbone config. Defaults to None.
encoder (dict, optional) – Encoder config. If None, the output from backbone will be directly fed into decoder. Defaults to None.
decoder (dict, optional) – Decoder config. Defaults to None.
data_preprocessor (dict, optional) – Model preprocessing config for processing the input image data. Keys allowed are to_rgb (bool), pad_size_divisor (int), pad_value (int or float), mean (int or float) and std (int or float). Preprocessing order: 1. to RGB; 2. normalization; 3. pad. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- 返回类型
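The documented behavior of the optional encoder (backbone output fed directly into the decoder when the encoder is None) can be sketched as a plain function; the components here are just callables standing in for real modules.

```python
def encode_decode(inputs, backbone, encoder=None, decoder=None):
    # Backbone always runs; encoder and decoder are optional stages,
    # and a missing encoder passes the backbone feature straight through.
    feat = backbone(inputs)
    out_enc = encoder(feat) if encoder is not None else feat
    return decoder(out_enc) if decoder is not None else out_enc
```

For instance, with a backbone `x + 1`, no encoder, and a decoder `x * 10`, an input of 2 yields 30.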
- extract_feat(inputs)[源代码]¶
Directly extract features from the backbone.
- 参数
inputs (torch.Tensor) –
- 返回类型
- loss(inputs, data_samples, **kwargs)[源代码]¶
Calculate losses from a batch of inputs and data samples.
- 参数
inputs (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.
- 返回
A dictionary of loss components.
- 返回类型
- predict(inputs, data_samples, **kwargs)[源代码]¶
Predict results from a batch of inputs and data samples with post-processing.
- 参数
inputs (torch.Tensor) – Image input tensor.
data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.
- 返回
A list of N datasamples of prediction results. Results are stored in pred_text.
- 返回类型
- class mmocr.models.textrecog.recognizers.MASTER(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
Implementation of MASTER
- 参数
preprocessor (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
backbone (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
encoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
decoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
data_preprocessor (Union[mmengine.config.config.ConfigDict, Dict]) –
init_cfg (Union[Dict, List[Dict]]) –
- 返回类型
- class mmocr.models.textrecog.recognizers.NRTR(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
Implementation of NRTR
- 参数
preprocessor (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
backbone (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
encoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
decoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
data_preprocessor (Union[mmengine.config.config.ConfigDict, Dict]) –
init_cfg (Union[Dict, List[Dict]]) –
- 返回类型
- class mmocr.models.textrecog.recognizers.RobustScanner(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
Implementation of `RobustScanner <https://arxiv.org/pdf/2007.07542.pdf>`_.
- 参数
preprocessor (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
backbone (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
encoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
decoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
data_preprocessor (Union[mmengine.config.config.ConfigDict, Dict]) –
init_cfg (Union[Dict, List[Dict]]) –
- 返回类型
- class mmocr.models.textrecog.recognizers.SARNet(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
Implementation of SAR
- 参数
preprocessor (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
backbone (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
encoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
decoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
data_preprocessor (Union[mmengine.config.config.ConfigDict, Dict]) –
init_cfg (Union[Dict, List[Dict]]) –
- 返回类型
- class mmocr.models.textrecog.recognizers.SATRN(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[源代码]¶
Implementation of SATRN
- 参数
preprocessor (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
backbone (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
encoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
decoder (Optional[Union[mmengine.config.config.ConfigDict, Dict]]) –
data_preprocessor (Union[mmengine.config.config.ConfigDict, Dict]) –
init_cfg (Union[Dict, List[Dict]]) –
- 返回类型
Text Recognition Backbones¶
- class mmocr.models.textrecog.backbones.MiniVGG(leaky_relu=True, input_channels=3, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[源代码]¶
A mini VGG backbone for text recognition, modified from `VGG-VeryDeep <https://arxiv.org/pdf/1409.1556.pdf>`_.
- 参数
- class mmocr.models.textrecog.backbones.MobileNetV2(pooling_layers=[3, 4, 5], init_cfg=None)[源代码]¶
See mmdet.models.backbones.MobileNetV2 for details.
- 参数
pooling_layers (list) – List of indices of pooling layers.
init_cfg (InitConfigType, optional) – Initialization config dict.
- 返回类型
- forward(x)[源代码]¶
Forward function.
- 参数
x (torch.Tensor) –
- 返回类型
- class mmocr.models.textrecog.backbones.NRTRModalityTransform(in_channels=3, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[源代码]¶
Modality transform in NRTR.
- 参数
- 返回类型
- forward(x)[源代码]¶
Backbone forward.
- 参数
x (torch.Tensor) – Image tensor of shape \((N, C, W, H)\), where W and H are the width and height of the image.
- 返回
Output tensor.
- 返回类型
Tensor
- class mmocr.models.textrecog.backbones.ResNet(in_channels, stem_channels, block_cfgs, arch_layers, arch_channels, strides, out_indices=None, plugins=None, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': 'BatchNorm2d'}])[源代码]¶
- 参数
in_channels (int) – Number of channels of input image tensor.
stem_channels (list[int]) – List of channels in each stem layer. E.g., [64, 128] stands for 64 and 128 channels in the first and second stem layers.
block_cfgs (dict) – Configs of block
arch_layers (list[int]) – List of Block number for each stage.
arch_channels (list[int]) – List of channels for each stage.
strides (Sequence[int] or Sequence[tuple]) – Strides of the first block of each stage.
out_indices (Sequence[int], optional) – Indices of output stages. If not specified, only the last stage will be returned.
plugins (dict, optional) – Configs of stage plugins
init_cfg (dict or list[dict], optional) – Initialization config dict.
- forward(x)[源代码]¶
- 参数
x (Tensor) – Image tensor of shape \((N, 3, H, W)\).
- 返回
Feature tensor. It can be a list of feature outputs at specific layers if out_indices is specified.
- 返回类型
Tensor or list[Tensor]
- forward_plugin(x, plugin_name)[源代码]¶
Forward tensor through plugin.
- 参数
x (torch.Tensor) – Input tensor.
- 返回
Output tensor.
- 返回类型
- class mmocr.models.textrecog.backbones.ResNet31OCR(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[源代码]¶
- Implement ResNet backbone for text recognition, modified from
- 参数
base_channels (int) – Number of channels of input image tensor.
layers (list[int]) – List of BasicBlock number for each stage.
channels (list[int]) – List of out_channels of Conv2d layer.
out_indices (None | Sequence[int]) – Indices of output stages.
stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.
last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.textrecog.backbones.ResNetABI(in_channels=3, stem_channels=32, base_channels=32, arch_settings=[3, 4, 6, 6, 3], strides=[2, 1, 2, 1, 1], out_indices=None, last_stage_pool=False, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': 'BatchNorm2d'}])[源代码]¶
Implement ResNet backbone for text recognition, modified from `ResNet <https://arxiv.org/pdf/1512.03385.pdf>`_ and https://github.com/FangShancheng/ABINet.
- 参数
in_channels (int) – Number of channels of input image tensor.
stem_channels (int) – Number of stem channels.
base_channels (int) – Number of base channels.
arch_settings (list[int]) – List of BasicBlock number for each stage.
strides (Sequence[int]) – Strides of the first block of each stage.
out_indices (None | Sequence[int]) – Indices of output stages. If not specified, only the last stage will be returned.
last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.
- class mmocr.models.textrecog.backbones.ShallowCNN(input_channels=1, hidden_dim=512, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[源代码]¶
Implement Shallow CNN block for SATRN.
SATRN: On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention.
- 参数
- 返回类型
Text Recognition Data Preprocessors¶
- class mmocr.models.textrecog.data_preprocessors.TextRecogDataPreprocessor(mean=None, std=None, pad_size_divisor=1, pad_value=0, bgr_to_rgb=False, rgb_to_bgr=False, batch_augments=None)[源代码]¶
Image pre-processor for recognition tasks.
Compared with mmengine.ImgDataPreprocessor, it:
1. Supports batch augmentations.
2. Additionally appends batch_input_shape and valid_ratio to data_samples, considering the text recognition task.
It provides the data pre-processing as follows:
Collate and move data to the target device.
Pad inputs to the maximum size of the current batch with the defined pad_value. The padding size can be made divisible by a defined pad_size_divisor.
Stack inputs into a batch.
Convert inputs from BGR to RGB if the shape of input is (3, H, W).
Normalize image with defined std and mean.
Do batch augmentations during training.
- 参数
mean (Sequence[Number], optional) – The pixel mean of R, G, B channels. Defaults to None.
std (Sequence[Number], optional) – The pixel standard deviation of R, G, B channels. Defaults to None.
pad_size_divisor (int) – The size of the padded image should be divisible by pad_size_divisor. Defaults to 1.
pad_value (Number) – The padded pixel value. Defaults to 0.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.
rgb_to_bgr (bool) – Whether to convert image from RGB to BGR. Defaults to False.
batch_augments (list[dict], optional) – Batch-level augmentations
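The padding rule above can be sketched in plain Python. batch_pad_shape below is a hypothetical helper (not part of mmocr's API) that computes the common shape a batch is padded to:

```python
import math

def batch_pad_shape(shapes, pad_size_divisor=1):
    """Compute the common (H, W) a batch is padded to.

    shapes: list of (H, W) tuples, one per image in the batch.
    Every image is padded up to the batch maximum, rounded up so
    both dimensions are divisible by pad_size_divisor.
    """
    max_h = max(h for h, _ in shapes)
    max_w = max(w for _, w in shapes)
    pad_h = math.ceil(max_h / pad_size_divisor) * pad_size_divisor
    pad_w = math.ceil(max_w / pad_size_divisor) * pad_size_divisor
    return pad_h, pad_w
```

For example, with pad_size_divisor=16 a batch of shapes (30, 98) and (32, 100) is padded to (32, 112).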
Text Recognition Layers¶
- class mmocr.models.textrecog.layers.Adaptive2DPositionalEncoding(d_hid=512, n_height=100, n_width=100, dropout=0.1, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}])[源代码]¶
Implement the Adaptive 2D positional encoder for SATRN, see SATRN <https://arxiv.org/abs/1910.04396>. Modified from https://github.com/Media-Smart/vedastr, licensed under the Apache License, Version 2.0.
- 参数
d_hid (int) – Dimensions of hidden layer. Defaults to 512.
n_height (int) – Max height of the 2D feature output. Defaults to 100.
n_width (int) – Max width of the 2D feature output. Defaults to 100.
dropout (float) – Dropout rate. Defaults to 0.1.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to [dict(type=’Xavier’, layer=’Conv2d’)]
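For illustration, the fixed sinusoidal table that underlies this encoder can be sketched in plain Python. sinusoid_table is a hypothetical helper, not mmocr API; the real module builds one such table per axis (height and width) and additionally learns per-sample scale factors to mix them:

```python
import math

def sinusoid_table(n_position, d_hid):
    """Fixed sinusoidal encoding table of shape (n_position, d_hid).

    Even channels use sin, odd channels cos, with the usual
    10000^(2i/d_hid) frequency schedule.
    """
    table = []
    for pos in range(n_position):
        row = []
        for i in range(d_hid):
            angle = pos / (10000 ** (2 * (i // 2) / d_hid))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```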
- class mmocr.models.textrecog.layers.BasicBlock(inplanes, planes, stride=1, downsample=None, use_conv1x1=False, plugins=None)[源代码]¶
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.textrecog.layers.BidirectionalLSTM(nIn, nHidden, nOut)[源代码]¶
- forward(input)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.textrecog.layers.Bottleneck(inplanes, planes, stride=1, downsample=False)[源代码]¶
- forward(x)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.textrecog.layers.DotProductAttentionLayer(dim_model=None)[源代码]¶
- forward(query, key, value, mask=None)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
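The layer computes standard masked scaled dot-product attention. A minimal single-query sketch in plain Python (hypothetical helper, not mmocr API; the real layer operates on batched tensors):

```python
import math

def dot_product_attention(query, key, value, mask=None):
    """Scaled dot-product attention for a single query vector.

    query: list[float] of length d; key/value: lists of such vectors.
    mask: optional list[bool]; False positions are excluded.
    Returns the attention-weighted sum of the value vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, kv)) / math.sqrt(d)
              for kv in key]
    if mask is not None:
        scores = [s if m else float('-inf') for s, m in zip(scores, mask)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, value))
            for i in range(len(value[0]))]
```

With the second key masked out, the output collapses to the first value vector; with a zero query, the weights are uniform.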
- class mmocr.models.textrecog.layers.PositionAwareLayer(dim_model, rnn_layers=2)[源代码]¶
- forward(img_feature)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.textrecog.layers.RobustScannerFusionLayer(dim_model, dim=- 1, init_cfg=None)[源代码]¶
- forward(x0, x1)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.textrecog.layers.SATRNEncoderLayer(d_model=512, d_inner=512, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, init_cfg=None)[源代码]¶
Implement encoder layer for SATRN, see SATRN <https://arxiv.org/abs/1910.04396>.
- 参数
d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.
d_inner (int) – Hidden dimension of feedforward layers. Defaults to 512.
n_head (int) – Number of parallel attention heads. Defaults to 8.
d_k (int) – Dimension of the key vector. Defaults to 64.
d_v (int) – Dimension of the value vector. Defaults to 64.
dropout (float) – Dropout rate. Defaults to 0.1.
qkv_bias (bool) – Whether to use bias. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
Text Recognition Plugins¶
- class mmocr.models.textrecog.plugins.GCAModule(in_channels, ratio, n_head, pooling_type='att', scale_attn=False, fusion_type='channel_add', **kwargs)[源代码]¶
GCAModule in MASTER.
- 参数
in_channels (int) – Channels of input tensor.
ratio (float) – Scale ratio of in_channels.
n_head (int) – Numbers of attention head.
pooling_type (str) – Spatial pooling type. Options are 'avg' and 'att'. Defaults to 'att'.
scale_attn (bool) – Whether to scale the attention map. Defaults to False.
fusion_type (str) – Fusion type of input and context. Options are 'channel_add', 'channel_mul' and 'channel_concat'. Defaults to 'channel_add'.
Text Recognition Encoders¶
- class mmocr.models.textrecog.encoders.ABIEncoder(n_layers=2, n_head=8, d_model=512, d_inner=2048, dropout=0.1, max_len=256, init_cfg=None)[源代码]¶
Implement transformer encoder for text recognition, modified from <https://github.com/FangShancheng/ABINet>.
- 参数
n_layers (int) – Number of attention layers. Defaults to 2.
n_head (int) – Number of parallel attention heads. Defaults to 8.
d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.
d_inner (int) – Hidden dimension of feedforward layers. Defaults to 2048.
dropout (float) – Dropout rate. Defaults to 0.1.
max_len (int) – Maximum output sequence length \(T\). Defaults to 8 * 32.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward(feature, data_samples)[源代码]¶
- 参数
feature (Tensor) – Feature tensor of shape \((N, D_m, H, W)\).
data_samples (List[TextRecogDataSample]) – List of data samples.
- 返回
Features of shape \((N, D_m, H, W)\).
- 返回类型
Tensor
- class mmocr.models.textrecog.encoders.BaseEncoder(init_cfg=None)[源代码]¶
Base Encoder class for text recognition.
- forward(feat, **kwargs)[源代码]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
注解
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class mmocr.models.textrecog.encoders.ChannelReductionEncoder(in_channels, out_channels, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'})[源代码]¶
Change the channel number with a 1x1 convolutional layer.
- 参数
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(layer='Conv2d', type='Xavier').
- forward(feat, data_samples=None)[源代码]¶
- 参数
feat (Tensor) – Image features with the shape of \((N, C_{in}, H, W)\).
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.
- 返回
A tensor of shape \((N, C_{out}, H, W)\).
- 返回类型
Tensor
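A 1x1 convolution is a per-position linear map over channels, which is why it can change the channel count without touching the spatial size. A plain-Python sketch (conv1x1 is a hypothetical helper, not mmocr API):

```python
def conv1x1(feat, weight):
    """Apply a 1x1 convolution to feat of shape (C_in, H, W).

    weight has shape (C_out, C_in); a 1x1 conv applies this linear
    map independently at every spatial position (y, x).
    """
    c_in = len(feat)
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][i] * feat[i][y][x] for i in range(c_in))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weight))]
```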
- class mmocr.models.textrecog.encoders.NRTREncoder(n_layers=6, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, dropout=0.1, init_cfg=None)[源代码]¶
Transformer Encoder block with self attention mechanism.
- 参数
n_layers (int) – The number of sub-encoder-layers in the encoder. Defaults to 6.
n_head (int) – The number of heads in the multi-head attention models. Defaults to 8.
d_k (int) – Total number of features in key. Defaults to 64.
d_v (int) – Total number of features in value. Defaults to 64.
d_model (int) – The number of expected features in the decoder inputs. Defaults to 512.
d_inner (int) – The dimension of the feedforward network model. Defaults to 256.
dropout (float) – Dropout rate for MHSA and FFN. Defaults to 0.1.
init_cfg (dict or list[dict], optional) – Initialization configs.
- forward(feat, data_samples=None)[源代码]¶
- 参数
feat (Tensor) – Backbone output of shape \((N, C, H, W)\).
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.
- 返回
The encoder output tensor. Shape \((N, T, C)\).
- 返回类型
Tensor
- class mmocr.models.textrecog.encoders.SAREncoder(enc_bi_rnn=False, rnn_dropout=0.0, enc_gru=False, d_model=512, d_enc=512, mask=True, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}], **kwargs)[源代码]¶
Implementation of encoder module in SAR <https://arxiv.org/abs/1811.00751>.
- 参数
enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder. Defaults to False.
rnn_dropout (float) – Dropout probability of RNN layer in encoder. Defaults to 0.0.
enc_gru (bool) – If True, use GRU, else LSTM in encoder. Defaults to False.
d_model (int) – Dim \(D_i\) of channels from backbone. Defaults to 512.
d_enc (int) – Dim \(D_m\) of encoder RNN layer. Defaults to 512.
mask (bool) – If True, mask padding in RNN sequence. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to [dict(type=’Xavier’, layer=’Conv2d’), dict(type=’Uniform’, layer=’BatchNorm2d’)].
- forward(feat, data_samples=None)[源代码]¶
- 参数
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.
- 返回
A tensor of shape \((N, D_m)\).
- 返回类型
Tensor
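Here valid_ratio marks the unpadded fraction of the image width, so that padding does not leak into the holistic sentence embedding; the encoder reads the feature at the last valid time step. A sketch of that index computation (last_valid_step is a hypothetical helper, not mmocr API):

```python
import math

def last_valid_step(valid_ratio, width):
    """Index of the last valid time step in a width-`width` feature map.

    valid_ratio is the unpadded fraction of the image width; the
    returned index is clamped so it never exceeds the feature width.
    """
    return min(width, math.ceil(width * valid_ratio)) - 1
```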
- class mmocr.models.textrecog.encoders.SATRNEncoder(n_layers=12, n_head=8, d_k=64, d_v=64, d_model=512, n_position=100, d_inner=256, dropout=0.1, init_cfg=None)[源代码]¶
Implement encoder for SATRN, see SATRN <https://arxiv.org/abs/1910.04396>.
- 参数
n_layers (int) – Number of attention layers. Defaults to 12.
n_head (int) – Number of parallel attention heads. Defaults to 8.
d_k (int) – Dimension of the key vector. Defaults to 64.
d_v (int) – Dimension of the value vector. Defaults to 64.
d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.
n_position (int) – Length of the positional encoding vector. Must be greater than max_seq_len. Defaults to 100.
d_inner (int) – Hidden dimension of feedforward layers. Defaults to 256.
dropout (float) – Dropout rate. Defaults to 0.1.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward(feat, data_samples=None)[源代码]¶
Forward propagation of encoder.
- 参数
feat (Tensor) – Feature tensor of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.
- 返回
A tensor of shape \((N, T, D_m)\).
- 返回类型
Tensor
Text Recognition Decoders¶
- class mmocr.models.textrecog.decoders.ABIFuser(dictionary, vision_decoder, language_decoder=None, d_model=512, num_iters=1, max_seq_len=40, module_loss=None, postprocessor=None, init_cfg=None, **kwargs)[源代码]¶
A special decoder responsible for mixing and aligning visual feature and linguistic feature, proposed in ABINet.
- 参数
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary. The dictionary must have an end token.
vision_decoder (dict) – The config for vision decoder.
language_decoder (dict, optional) – The config for language decoder.
num_iters (int) – Rounds of iterative correction. Defaults to 1.
d_model (int) – Hidden size \(E\) of model. Defaults to 512.
max_seq_len (int) – Maximum sequence length \(T\). The sequence is usually generated from decoder. Defaults to 40.
module_loss (dict, optional) – Config to build loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat, logits, data_samples=None)[源代码]¶
- 参数
feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.
logits (Tensor) – Raw language logits. Shape \((N, T, C)\).
data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.
- 返回
Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- forward_train(feat=None, out_enc=None, data_samples=None)[源代码]¶
- 参数
feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.
out_enc (Tensor) – Raw language logits. Shape \((N, T, C)\). Defaults to None.
data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.
- 返回
A dict with keys out_enc, out_decs and out_fusers.
out_vis (dict): Dict from self.vision_decoder with keys feature, logits and attn_scores.
out_langs (dict or list): Dict from self.language_decoder with keys feature and logits if applicable, or an empty list otherwise.
out_fusers (dict or list): Dict of fused visual and language features with keys feature and logits if applicable, or an empty list otherwise.
- 返回类型
Dict
- fuse(l_feature, v_feature)[源代码]¶
Mix and align visual feature and linguistic feature.
- 参数
l_feature (torch.Tensor) – (N, T, E) where T is length, N is batch size and E is dim of model.
v_feature (torch.Tensor) – (N, T, E) shape the same as l_feature.
- 返回
A dict with key logits of shape \((N, T, C)\), where N is batch size, T is length and C is the number of characters.
- class mmocr.models.textrecog.decoders.ABILanguageDecoder(dictionary, d_model=512, n_head=8, d_inner=2048, n_layers=4, dropout=0.1, detach_tokens=True, use_self_attn=False, max_seq_len=40, module_loss=None, postprocessor=None, init_cfg=None, **kwargs)[源代码]¶
Transformer-based language model responsible for spell correction. Implementation of the language model of ABINet.
- 参数
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary. The dictionary must have an end token.
d_model (int) – Hidden size \(E\) of model. Defaults to 512.
n_head (int) – Number of multi-attention heads.
d_inner (int) – Hidden size of feedforward network model.
n_layers (int) – The number of similar decoding layers.
dropout (float) – Dropout rate.
detach_tokens (bool) – Whether to block the gradient flow at input tokens.
use_self_attn (bool) – If True, use self attention in decoder layers, otherwise cross attention will be used.
max_seq_len (int) – Maximum sequence length \(T\). The sequence is usually generated from decoder. Defaults to 40.
module_loss (dict, optional) – Config to build loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat=None, logits=None, data_samples=None)[源代码]¶
- 参数
feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.
logits (Tensor) – Raw language logits. Shape \((N, T, C)\). Defaults to None.
data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.
- 返回
A dict with keys feature and logits.
feature (Tensor): Shape \((N, T, E)\). Raw textual features for vision language aligner.
logits (Tensor): Shape \((N, T, C)\). The raw logits for characters after spell correction.
- 返回类型
Dict
- forward_train(feat=None, out_enc=None, data_samples=None)[源代码]¶
- 参数
feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.
out_enc (torch.Tensor) – Logits with shape \((N, T, C)\). Defaults to None.
data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.
- 返回
A dict with keys feature and logits.
feature (Tensor): Shape \((N, T, E)\). Raw textual features for vision language aligner.
logits (Tensor): Shape \((N, T, C)\). The raw logits for characters after spell correction.
- 返回类型
Dict
- class mmocr.models.textrecog.decoders.ABIVisionDecoder(dictionary, in_channels=512, num_channels=64, attn_height=8, attn_width=32, attn_mode='nearest', module_loss=None, postprocessor=None, max_seq_len=40, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'}, **kwargs)[源代码]¶
Converts visual features into text characters.
Implementation of VisionEncoder in ABINet.
- 参数
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
in_channels (int) – Number of channels \(E\) of input vector. Defaults to 512.
num_channels (int) – Number of channels of hidden vectors in mini U-Net. Defaults to 64.
attn_height (int) – Height \(H\) of input image features. Defaults to 8.
attn_width (int) – Width \(W\) of input image features. Defaults to 32.
attn_mode (str) – Upsampling mode for torch.nn.Upsample in mini U-Net. Defaults to 'nearest'.
module_loss (dict, optional) – Config to build loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 40.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type=’Xavier’, layer=’Conv2d’).
- forward_test(feat=None, out_enc=None, data_samples=None)[源代码]¶
- 参数
feat (torch.Tensor, optional) – Image features of shape (N, E, H, W). Defaults to None.
out_enc (torch.Tensor) – Encoder output. Defaults to None.
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- 返回
A dict with keys feature, logits and attn_scores.
feature (Tensor): Shape (N, T, E). Raw visual features for language decoder.
logits (Tensor): Shape (N, T, C). The raw logits for characters.
attn_scores (Tensor): Shape (N, T, H, W). Intermediate result for vision-language aligner.
- forward_train(feat=None, out_enc=None, data_samples=None)[源代码]¶
- 参数
feat (Tensor, optional) – Image features of shape (N, E, H, W). Defaults to None.
out_enc (torch.Tensor) – Encoder output. Defaults to None.
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- 返回
A dict with keys feature, logits and attn_scores.
feature (Tensor): Shape (N, T, E). Raw visual features for language decoder.
logits (Tensor): Shape (N, T, C). The raw logits for characters.
attn_scores (Tensor): Shape (N, T, H, W). Intermediate result for vision-language aligner.
- class mmocr.models.textrecog.decoders.BaseDecoder(dictionary, module_loss=None, postprocessor=None, max_seq_len=40, init_cfg=None)[源代码]¶
Base decoder for text recognition, build the loss and postprocessor.
- 参数
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
module_loss (dict, optional) – Config to build loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 40.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward(feat=None, out_enc=None, data_samples=None)[源代码]¶
Decoder forward.
- 参数
feat (Tensor, optional) – Features from the backbone. Defaults to None.
out_enc (Tensor, optional) – Features from the encoder. Defaults to None.
data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images. Defaults to None.
- 返回
Features from decoder forward.
- 返回类型
Tensor
- forward_test(feat=None, out_enc=None, data_samples=None)[源代码]¶
Forward for testing.
- 参数
feat (torch.Tensor, optional) – The feature map from backbone of shape \((N, E, H, W)\). Defaults to None.
out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.
data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- forward_train(feat=None, out_enc=None, data_samples=None)[源代码]¶
Forward for training.
- 参数
feat (torch.Tensor, optional) – The feature map from backbone of shape \((N, E, H, W)\). Defaults to None.
out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.
data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- loss(feat=None, out_enc=None, data_samples=None)[源代码]¶
Calculate losses from a batch of inputs and data samples.
- 参数
feat (Tensor, optional) – Features from the backbone. Defaults to None.
out_enc (Tensor, optional) – Features from the encoder. Defaults to None.
data_samples (list[TextRecogDataSample], optional) – A list of N datasamples, containing meta information and gold annotations for each of the images. Defaults to None.
- 返回
A dictionary of loss components.
- predict(feat=None, out_enc=None, data_samples=None)[源代码]¶
Perform forward propagation of the decoder and postprocessor.
- 参数
feat (Tensor, optional) – Features from the backbone. Defaults to None.
out_enc (Tensor, optional) – Features from the encoder. Defaults to None.
data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images. Defaults to None.
- 返回
A list of N datasamples of prediction results. Results are stored in pred_text.
- class mmocr.models.textrecog.decoders.CRNNDecoder(in_channels, dictionary, rnn_flag=False, module_loss=None, postprocessor=None, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'}, **kwargs)[源代码]¶
Decoder for CRNN.
- 参数
in_channels (int) – Number of input channels.
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
rnn_flag (bool) – Use RNN or CNN as the decoder. Defaults to False.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat=None, out_enc=None, data_samples=None)[源代码]¶
- 参数
feat (Tensor) – A Tensor of shape \((N, C, 1, W)\).
out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- 返回
Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- forward_train(feat, out_enc=None, data_samples=None)[源代码]¶
- 参数
feat (Tensor) – A Tensor of shape \((N, C, 1, W)\).
out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- 返回
The raw logit tensor. Shape \((N, W, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- class mmocr.models.textrecog.decoders.MasterDecoder(n_layers=3, n_head=8, d_model=512, feat_size=240, d_inner=2048, attn_drop=0.0, ffn_drop=0.0, feat_pe_drop=0.2, module_loss=None, postprocessor=None, dictionary=None, max_seq_len=30, init_cfg=None)[源代码]¶
Decoder module in MASTER.
Code is partially modified from https://github.com/wenwenyu/MASTER-pytorch.
- 参数
n_layers (int) – Number of attention layers. Defaults to 3.
n_head (int) – Number of parallel attention heads. Defaults to 8.
d_model (int) – Dimension \(E\) of the input from previous model. Defaults to 512.
feat_size (int) – The size of the input feature from previous model, usually \(H * W\). Defaults to 6 * 40.
d_inner (int) – Hidden dimension of feedforward layers. Defaults to 2048.
attn_drop (float) – Dropout rate of the attention layer. Defaults to 0.
ffn_drop (float) – Dropout rate of the feedforward layer. Defaults to 0.
feat_pe_drop (float) – Dropout rate of the feature positional encoding layer. Defaults to 0.2.
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary. Defaults to None.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 30.
init_cfg (dict or list[dict], optional) – Initialization configs.
- decode(tgt_seq, feature, src_mask, tgt_mask)[源代码]¶
Decode the input sequence.
- 参数
tgt_seq (Tensor) – Target sequence of shape \((N, T, C)\).
feature (Tensor) – Input feature map from encoder of shape \((N, C, H, W)\).
src_mask (BoolTensor) – The source mask of shape \((N, H*W)\).
tgt_mask (BoolTensor) – The target mask of shape \((N, T, T)\).
- 返回
The decoded sequence.
- 返回类型
Tensor
- forward_test(feat=None, out_enc=None, data_samples=None)[源代码]¶
Forward for testing.
- 参数
feat (Tensor, optional) – Input feature map from backbone.
out_enc (Tensor) – Unused.
data_samples (list[TextRecogDataSample]) – Unused.
- 返回
Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- forward_train(feat=None, out_enc=None, data_samples=None)[源代码]¶
Forward for training. Source mask will not be used here.
- 参数
feat (Tensor, optional) – Input feature map from backbone.
out_enc (Tensor) – Unused.
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information.
- 返回
The raw logit tensor. Shape \((N, T, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- make_target_mask(tgt, device)[源代码]¶
Make target mask for self attention.
- 参数
tgt (Tensor) – Shape [N, l_tgt]
device (torch.device) – Mask device.
- 返回
Mask of shape [N * self.n_head, l_tgt, l_tgt]
- 返回类型
Tensor
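The target mask is the usual causal (lower-triangular) self-attention mask. A plain-Python sketch (causal_mask is a hypothetical helper, not mmocr API; the real method also broadcasts the mask over attention heads):

```python
def causal_mask(length):
    """Lower-triangular self-attention mask of shape (length, length).

    mask[i][j] is True when step i may attend to step j, i.e. j <= i,
    so each position only sees earlier (already decoded) tokens.
    """
    return [[j <= i for j in range(length)] for i in range(length)]
```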
- class mmocr.models.textrecog.decoders.NRTRDecoder(n_layers=6, d_embedding=512, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, n_position=200, dropout=0.1, module_loss=None, postprocessor=None, dictionary=None, max_seq_len=30, init_cfg=None)[源代码]¶
Transformer Decoder block with self attention mechanism.
- 参数
n_layers (int) – Number of attention layers. Defaults to 6.
d_embedding (int) – Language embedding dimension. Defaults to 512.
n_head (int) – Number of parallel attention heads. Defaults to 8.
d_k (int) – Dimension of the key vector. Defaults to 64.
d_v (int) – Dimension of the value vector. Defaults to 64.
d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.
d_inner (int) – Hidden dimension of feedforward layers. Defaults to 256.
n_position (int) – Length of the positional encoding vector. Must be greater than max_seq_len. Defaults to 200.
dropout (float) – Dropout rate for text embedding, MHSA, FFN. Defaults to 0.1.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 30.
init_cfg (dict or list[dict], optional) – Initialization configs.
- forward_test(feat=None, out_enc=None, data_samples=None)[源代码]¶
Forward for testing.
- 参数
feat (Tensor, optional) – Unused.
out_enc (Tensor) – Encoder output of shape \((N, T, D_m)\) where \(D_m\) is d_model. Defaults to None.
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information. Defaults to None.
- 返回
Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- forward_train(feat=None, out_enc=None, data_samples=None)[源代码]¶
Forward for training. Source mask will be used here.
- 参数
feat (Tensor, optional) – Unused.
out_enc (Tensor) – Encoder output of shape \((N, T, D_m)\) where \(D_m\) is d_model. Defaults to None.
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information. Defaults to None.
- 返回
The raw logit tensor. Shape \((N, T, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- class mmocr.models.textrecog.decoders.ParallelSARDecoder(dictionary, module_loss=None, postprocessor=None, enc_bi_rnn=False, dec_bi_rnn=False, dec_rnn_dropout=0.0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=30, mask=True, pred_concat=False, init_cfg=None, **kwargs)[源代码]¶
Implementation of the Parallel Decoder module in SAR <https://arxiv.org/abs/1811.00751>.
- 参数
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder. Defaults to False.
dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder. Defaults to False.
dec_rnn_dropout (float) – Dropout of RNN layer in decoder. Defaults to 0.0.
dec_gru (bool) – If True, use GRU, else LSTM in decoder. Defaults to False.
d_model (int) – Dim of channels from backbone \(D_i\). Defaults to 512.
d_enc (int) – Dim of encoder RNN layer \(D_m\). Defaults to 512.
d_k (int) – Dim of channels of attention module. Defaults to 64.
pred_dropout (float) – Dropout probability of prediction layer. Defaults to 0.0.
max_seq_len (int) – Maximum sequence length for decoding. Defaults to 30.
mask (bool) – If True, mask padding in feature map. Defaults to True.
pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat, out_enc, data_samples=None)[源代码]¶
- 参数
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.
- 返回
Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
- forward_train(feat, out_enc, data_samples)[源代码]¶
- 参数
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information.
- 返回
A raw logit tensor of shape \((N, T, C)\).
- 返回类型
Tensor
- class mmocr.models.textrecog.decoders.ParallelSARDecoderWithBS(beam_width=5, num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=0, pred_concat=False, init_cfg=None, **kwargs)[源代码]¶
Parallel Decoder module with beam-search in SAR.
- 参数
beam_width (int) – Width for beam search.
- forward_test(feat, out_enc, img_metas)[源代码]¶
- 参数
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.
- 返回
Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.
- 返回类型
Tensor
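Beam search keeps the beam_width highest-scoring prefixes at each decoding step. A toy sketch over precomputed, step-independent log-probabilities (beam_search is a hypothetical helper; the real decoder conditions each step's distribution on the decoded prefix):

```python
import math

def beam_search(step_log_probs, beam_width):
    """Toy beam search over a grid of log-probabilities.

    step_log_probs[t][c] is the log-probability of emitting class c
    at step t. Returns (best_sequence, total_log_prob).
    """
    beams = [([], 0.0)]  # (prefix, cumulative log-prob)
    for log_probs in step_log_probs:
        # Expand every beam by every class, then keep the top beams.
        candidates = [(seq + [c], score + lp)
                      for seq, score in beams
                      for c, lp in enumerate(log_probs)]
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```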
- class mmocr.models.textrecog.decoders.PositionAttentionDecoder(dictionary, module_loss=None, postprocessor=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, mask=True, return_feature=True, encode_value=False, init_cfg=None)[source]¶
Position attention decoder for RobustScanner.
RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
rnn_layers (int) – Number of RNN layers. Defaults to 2.
dim_input (int) – Dimension \(D_i\) of input vector feat. Defaults to 512.
dim_model (int) – Dimension \(D_m\) of the model. Should also match the dimension of the encoder output vector out_enc. Defaults to 128.
max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 40.
mask (bool) – Whether to mask input features according to img_meta['valid_ratio']. Defaults to True.
return_feature (bool) – Whether to return the feature instead of logits as the result. Defaults to True.
encode_value (bool) – Whether to use the encoder output out_enc as the value of the attention layer. If False, the original feature feat will be used. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat, out_enc, img_metas)[source]¶
- Parameters
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
img_metas (Sequence[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- Returns
Character probabilities of shape \((N, T, C)\) if return_feature=False. Otherwise it is the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).
- Return type
Tensor
- forward_train(feat, out_enc, data_samples)[source]¶
- Parameters
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- Returns
A raw logit tensor of shape \((N, T, C)\) if return_feature=False. Otherwise it is the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).
- Return type
Tensor
- class mmocr.models.textrecog.decoders.RobustScannerFuser(dictionary, module_loss=None, postprocessor=None, hybrid_decoder={'type': 'SequenceAttentionDecoder'}, position_decoder={'type': 'PositionAttentionDecoder'}, max_seq_len=30, in_channels=[512, 512], dim=-1, init_cfg=None)[source]¶
Decoder for RobustScanner.
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
hybrid_decoder (dict) – Config to build hybrid_decoder. Defaults to dict(type='SequenceAttentionDecoder').
position_decoder (dict) – Config to build position_decoder. Defaults to dict(type='PositionAttentionDecoder').
fuser (dict) – Config to build fuser. Defaults to dict(type='RobustScannerFuser').
max_seq_len (int) – Maximum sequence length. The sequence is usually generated by the decoder. Defaults to 30.
in_channels (list[int]) – List of input channels. Defaults to [512, 512].
dim (int) – The dimension along which to split the input. Defaults to -1.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat=None, out_enc=None, data_samples=None)[source]¶
Forward for testing.
- Parameters
feat (torch.Tensor, optional) – The feature map from the backbone, of shape \((N, E, H, W)\). Defaults to None.
out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.
data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.
- Returns
Character probabilities of shape \((N, self.max_seq_len, C)\), where \(C\) is num_classes.
- Return type
Tensor
- forward_train(feat=None, out_enc=None, data_samples=None)[source]¶
Forward for training.
- Parameters
feat (torch.Tensor, optional) – The feature map from the backbone, of shape \((N, E, H, W)\). Defaults to None.
out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.
data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
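For illustration only, a config fragment in the usual MMEngine dict style that wires the two sub-decoders together with the defaults listed above. The `'RobustScannerDecoder'` registry name and the `dict_file` path are assumptions, not taken from this page; the sub-decoder `type` values follow the documented defaults:

```python
# Hypothetical config fragment mirroring the defaults documented above.
decoder_cfg = dict(
    type='RobustScannerDecoder',  # assumed registry name, for illustration
    dictionary=dict(
        type='Dictionary',
        dict_file='dicts/lower_english_digits.txt',  # hypothetical path
    ),
    hybrid_decoder=dict(type='SequenceAttentionDecoder'),
    position_decoder=dict(type='PositionAttentionDecoder'),
    max_seq_len=30,
    in_channels=[512, 512],
    dim=-1,
)
```

In MMEngine-style configs such a dict would be passed to a registry builder; here it only documents how the pieces relate.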
- class mmocr.models.textrecog.decoders.SequenceAttentionDecoder(dictionary, module_loss=None, postprocessor=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, mask=True, dropout=0, return_feature=True, encode_value=False, init_cfg=None)[source]¶
Sequence attention decoder for RobustScanner.
RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
rnn_layers (int) – Number of RNN layers. Defaults to 2.
dim_input (int) – Dimension \(D_i\) of input vector feat. Defaults to 512.
dim_model (int) – Dimension \(D_m\) of the model. Should also match the dimension of the encoder output vector out_enc. Defaults to 128.
max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 40.
mask (bool) – Whether to mask input features according to data_sample.valid_ratio. Defaults to True.
dropout (float) – Dropout rate of the LSTM layer. Defaults to 0.
return_feature (bool) – Whether to return the feature instead of logits as the result. Defaults to True.
encode_value (bool) – Whether to use the encoder output out_enc as the value of the attention layer. If False, the original feature feat will be used. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat, out_enc, data_samples)[source]¶
- Parameters
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- Returns
Character probabilities of shape \((N, self.max_seq_len, C)\), where \(C\) is num_classes.
- Return type
Tensor
- forward_test_step(feat, out_enc, decode_sequence, current_step, data_samples)[source]¶
- Parameters
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
decode_sequence (Tensor) – Shape \((N, T)\). The tensor that stores the decoding history.
current_step (int) – Current decoding step.
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- Returns
Shape \((N, C)\). The logit tensor of predicted tokens at the current time step.
- Return type
Tensor
- forward_train(feat, out_enc, data_samples=None)[source]¶
- Parameters
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.
- Returns
A raw logit tensor of shape \((N, T, C)\) if return_feature=False. Otherwise it is the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).
- Return type
Tensor
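The relationship between forward_test and forward_test_step is essentially a feedback loop: decode one position at a time and append the most likely token to the history. A framework-free sketch, with a stub standing in for the real per-step function (all names are illustrative):

```python
def greedy_decode(step_fn, max_seq_len, start_idx, end_idx):
    """Greedy counterpart of the step-wise decoding loop: at each step the
    most likely token is appended to the history and fed back in."""
    decode_sequence = [start_idx]
    for current_step in range(max_seq_len):
        logits = step_fn(decode_sequence, current_step)  # length-C list stand-in
        token = max(range(len(logits)), key=logits.__getitem__)
        decode_sequence.append(token)
        if token == end_idx:
            break
    return decode_sequence[1:]  # strip the start token

# Stub: always favours token 1 until step 2, then the end token (3).
def stub_step(seq, step):
    return [0.0, 2.0, 1.0, 3.0] if step >= 2 else [0.0, 2.0, 1.0, 0.5]

result = greedy_decode(stub_step, max_seq_len=5, start_idx=0, end_idx=3)
```

The real decoder operates on batched tensors and stops all samples at max_seq_len; the control flow is the same.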
- class mmocr.models.textrecog.decoders.SequentialSARDecoder(dictionary=None, module_loss=None, postprocessor=None, enc_bi_rnn=False, dec_bi_rnn=False, dec_gru=False, d_k=64, d_model=512, d_enc=512, pred_dropout=0.0, mask=True, max_seq_len=40, pred_concat=False, init_cfg=None, **kwargs)[source]¶
Implementation of the sequential decoder module in SAR (https://arxiv.org/abs/1811.00751).
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
module_loss (dict, optional) – Config to build module_loss. Defaults to None.
postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.
enc_bi_rnn (bool) – If True, use a bidirectional RNN in the encoder. Defaults to False.
dec_bi_rnn (bool) – If True, use a bidirectional RNN in the decoder. Defaults to False.
dec_gru (bool) – If True, use GRU instead of LSTM in the decoder. Defaults to False.
d_k (int) – Dim of conv layers in the attention module. Defaults to 64.
d_model (int) – Dim of channels from the backbone \(D_i\). Defaults to 512.
d_enc (int) – Dim of the encoder RNN layer \(D_m\). Defaults to 512.
pred_dropout (float) – Dropout probability of the prediction layer. Defaults to 0.
mask (bool) – If True, mask padding in the feature map. Defaults to True.
max_seq_len (int) – Maximum sequence length during decoding. Defaults to 40.
pred_concat (bool) – If True, concat the glimpse feature from the attention module with the holistic feature and hidden state. Defaults to False.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- forward_test(feat, out_enc, data_samples=None)[source]¶
- Parameters
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information.
- Returns
Character probabilities of shape \((N, self.max_seq_len, C)\), where \(C\) is num_classes.
- Return type
Tensor
- forward_train(feat, out_enc, data_samples=None)[source]¶
- Parameters
feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).
out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).
data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information.
- Returns
A raw logit tensor of shape \((N, T, C)\).
- Return type
Tensor
Text Recognition Module Losses¶
- class mmocr.models.textrecog.module_losses.ABIModuleLoss(dictionary, max_seq_len=40, letter_case='unchanged', weight_vis=1.0, weight_lang=1.0, weight_fusion=1.0, **kwargs)[source]¶
Implementation of the ABINet multi-loss, which mixes different types of losses with weights.
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
max_seq_len (int) – Maximum sequence length. The sequence is usually generated by the decoder. Defaults to 40.
letter_case (str) – There are three options to alter the letter case of gt texts: 'unchanged' keeps gt texts as-is; 'upper' converts them to uppercase; 'lower' converts them to lowercase. Usually this only works for English characters. Defaults to 'unchanged'.
weight_vis (float or int) – The weight of the vision decoder loss. Defaults to 1.0.
weight_lang (float or int) – The weight of the language decoder loss. Defaults to 1.0.
weight_fusion (float or int) – The weight of the fuser (aligner) loss. Defaults to 1.0.
- forward(outputs, data_samples)[source]¶
- Parameters
outputs (dict) – The output dictionary with at least one of out_vis, out_langs and out_fusers specified.
data_samples (list[TextRecogDataSample]) – List of TextRecogDataSample which are processed by get_target.
- Returns
A loss dictionary with loss_visual, loss_lang and loss_fusion. Each entry is either the loss tensor or None if the output of its corresponding module is not given.
- Return type
dict
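How the three weights combine can be sketched with plain floats. This is not the MMOCR implementation: the real module operates on tensors, and this sketch assumes (as an illustration) that iteration outputs of the language and fusion branches are averaged:

```python
def combine_abi_losses(out_vis, out_langs, out_fusers,
                       weight_vis=1.0, weight_lang=1.0, weight_fusion=1.0):
    """Sketch of the weighted ABINet multi-loss: each term is present only
    when its module produced an output (losses here are plain floats)."""
    losses = dict(loss_visual=None, loss_lang=None, loss_fusion=None)
    if out_vis is not None:
        losses['loss_visual'] = weight_vis * out_vis
    if out_langs:  # the language decoder may run several iterations
        losses['loss_lang'] = weight_lang * sum(out_langs) / len(out_langs)
    if out_fusers:
        losses['loss_fusion'] = weight_fusion * sum(out_fusers) / len(out_fusers)
    return losses

# No fuser output: loss_fusion stays None, matching the contract above.
losses = combine_abi_losses(0.9, [0.6, 0.4], None, weight_lang=2.0)
```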
- class mmocr.models.textrecog.module_losses.BaseTextRecogModuleLoss(dictionary, max_seq_len=40, letter_case='unchanged', pad_with='auto', **kwargs)[source]¶
Base recognition loss.
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
max_seq_len (int) – Maximum sequence length. The sequence is usually generated by the decoder. Defaults to 40.
letter_case (str) – There are three options to alter the letter case of gt texts: 'unchanged' keeps gt texts as-is; 'upper' converts them to uppercase; 'lower' converts them to lowercase. Usually this only works for English characters. Defaults to 'unchanged'.
pad_with (str) – The padding strategy for gt_text.padded_indexes. Defaults to 'auto'. Options are:
'auto': Use dictionary.padding_idx to pad gt texts, or dictionary.end_idx if dictionary.padding_idx is None.
'padding': Always use dictionary.padding_idx to pad gt texts.
'end': Always use dictionary.end_idx to pad gt texts.
'none': Do not pad gt texts.
- get_targets(data_samples)[source]¶
Target generator.
- Parameters
data_samples (list[TextRecogDataSample]) – It usually includes gt_text information.
- Returns
Updated data_samples. Two keys will be added to each data_sample:
indexes (torch.LongTensor): Character indexes representing gt texts. All special tokens are excluded, except for UKN.
padded_indexes (torch.LongTensor): Character indexes representing gt texts with BOS and EOS if applicable, followed by padding indexes until the length reaches max_seq_len. In particular, if pad_with='none', no padding will be applied.
- Return type
list[TextRecogDataSample]
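A minimal sketch of the pad_with strategies described above, using plain lists instead of tensors. This is not the actual MMOCR implementation; names and truncation behaviour are illustrative only:

```python
def pad_indexes(indexes, max_seq_len, pad_with, padding_idx=None, end_idx=None):
    """Illustrative version of the `pad_with` strategies documented above."""
    if pad_with == 'none':
        return list(indexes)  # no padding at all
    if pad_with == 'auto':
        # padding_idx when available, otherwise fall back to end_idx
        pad_idx = padding_idx if padding_idx is not None else end_idx
    elif pad_with == 'padding':
        pad_idx = padding_idx
    elif pad_with == 'end':
        pad_idx = end_idx
    else:
        raise ValueError(f'Unknown pad_with: {pad_with}')
    padded = list(indexes)[:max_seq_len]
    padded += [pad_idx] * (max_seq_len - len(padded))
    return padded

# 'auto' with no padding_idx falls back to the end index (36 here).
padded = pad_indexes([5, 2, 7], max_seq_len=6, pad_with='auto',
                     padding_idx=None, end_idx=36)
```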
- class mmocr.models.textrecog.module_losses.CEModuleLoss(dictionary, max_seq_len=40, letter_case='unchanged', pad_with='auto', ignore_char='padding', flatten=False, reduction='none', ignore_first_char=False)[source]¶
Implementation of the loss module for encoder-decoder based text recognition methods with CrossEntropy loss.
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
max_seq_len (int) – Maximum sequence length. The sequence is usually generated by the decoder. Defaults to 40.
letter_case (str) – There are three options to alter the letter case of gt texts: 'unchanged' keeps gt texts as-is; 'upper' converts them to uppercase; 'lower' converts them to lowercase. Usually this only works for English characters. Defaults to 'unchanged'.
pad_with (str) – The padding strategy for gt_text.padded_indexes. Defaults to 'auto'. Options are:
'auto': Use dictionary.padding_idx to pad gt texts, or dictionary.end_idx if dictionary.padding_idx is None.
'padding': Always use dictionary.padding_idx to pad gt texts.
'end': Always use dictionary.end_idx to pad gt texts.
'none': Do not pad gt texts.
ignore_char (int or str) – Specifies a target value that is ignored and does not contribute to the input gradient. If int, it is the index of the ignored char. If str, it is the character to ignore. Apart from single characters, each item can be one of the following reserved keywords: 'padding', 'start', 'end', and 'unknown', which refer to their corresponding special tokens in the dictionary. No special token is ignored when ignore_char == -1 or 'none'. Defaults to 'padding'.
flatten (bool) – Whether to flatten the output and target before computing CE loss. Defaults to False.
reduction (str) – Specifies the reduction to apply to the output; one of 'none', 'mean' or 'sum'. Defaults to 'none'.
ignore_first_char (bool) – Whether to ignore the first token in the target (usually the start token). If True, the last token of the output sequence will also be removed to align with the target length. Defaults to False.
- forward(outputs, data_samples)[source]¶
- Parameters
outputs (Tensor) – A raw logit tensor of shape \((N, T, C)\).
data_samples (list[TextRecogDataSample]) – List of TextRecogDataSample which are processed by get_target.
- Returns
A loss dict with the key loss_ce.
- Return type
dict
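The ignore_first_char alignment can be shown with plain lists: the leading start token is dropped from the target and the last output step is removed so the lengths match. Purely illustrative; the real module works on batched tensors:

```python
def align_for_loss(outputs, targets, ignore_first_char):
    """outputs: (T, C) list of per-step logits; targets: length-T indices."""
    if ignore_first_char:
        outputs = outputs[:-1]   # drop the last prediction step
        targets = targets[1:]    # drop the leading start token
    return outputs, targets

outputs = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]  # T=3, C=2
targets = [0, 1, 0]                              # leading 0 plays the start token
out, tgt = align_for_loss(outputs, targets, ignore_first_char=True)
```

After alignment both sequences have length T-1 and can be fed to a per-step cross-entropy.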
- class mmocr.models.textrecog.module_losses.CTCModuleLoss(dictionary, letter_case='unchanged', flatten=True, reduction='mean', zero_infinity=False, **kwargs)[source]¶
Implementation of the loss module for CTC-loss based text recognition.
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
letter_case (str) – There are three options to alter the letter case of gt texts: 'unchanged' keeps gt texts as-is; 'upper' converts them to uppercase; 'lower' converts them to lowercase. Usually this only works for English characters. Defaults to 'unchanged'.
flatten (bool) – If True, use flattened targets; otherwise use padded targets.
reduction (str) – Specifies the reduction to apply to the output; one of 'none', 'mean' or 'sum'.
zero_infinity (bool) – Whether to zero infinite losses and the associated gradients. Defaults to False. Infinite losses mainly occur when the inputs are too short to be aligned to the targets.
- forward(outputs, data_samples)[source]¶
- Parameters
outputs (Tensor) – A raw logit tensor of shape \((N, T, C)\).
data_samples (list[TextRecogDataSample]) – List of TextRecogDataSample which are processed by get_target.
- Returns
The loss dict with the key loss_ctc.
- Return type
dict
- get_targets(data_samples)[source]¶
Target generator.
- Parameters
data_samples (list[TextRecogDataSample]) – It usually includes gt_text information.
- Returns
Updated data_samples. The following key will be added to each data_sample:
indexes (torch.LongTensor): The character indexes corresponding to the gt text.
- Return type
list[TextRecogDataSample]
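The difference between flattened and padded CTC targets can be sketched in plain Python. torch.nn.CTCLoss accepts a 1-D concatenated target tensor together with per-sample target lengths, which is what flatten=True corresponds to. All names below are illustrative, not the MMOCR implementation:

```python
def make_ctc_targets(texts, char2idx, flatten):
    """Sketch of flattened vs. padded CTC targets (illustrative only)."""
    index_seqs = [[char2idx[c] for c in t] for t in texts]
    target_lengths = [len(s) for s in index_seqs]
    if flatten:
        # One 1-D sequence of all targets back to back; the per-sample
        # lengths tell the loss where each sample's targets end.
        targets = [i for seq in index_seqs for i in seq]
    else:
        pad = 0  # padding value chosen purely for illustration
        width = max(target_lengths)
        targets = [seq + [pad] * (width - len(seq)) for seq in index_seqs]
    return targets, target_lengths

targets, lengths = make_ctc_targets(['ab', 'abc'], {'a': 1, 'b': 2, 'c': 3},
                                    flatten=True)
```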
KIE Extractors¶
- class mmocr.models.kie.extractors.SDMGR(backbone=None, roi_extractor=None, neck=None, kie_head=None, dictionary=None, data_preprocessor=None, init_cfg=None)[source]¶
The implementation of the paper Spatial Dual-Modality Graph Reasoning for Key Information Extraction (https://arxiv.org/abs/2103.14470).
- Parameters
backbone (dict, optional) – Config of the backbone. If None, None will be passed to kie_head during training and testing. Defaults to None.
roi_extractor (dict, optional) – Config of the RoI extractor. Only applicable when backbone is not None. Defaults to None.
neck (dict, optional) – Config of the neck. Defaults to None.
kie_head (dict) – Config of the KIE head. Defaults to None.
dictionary (dict, optional) – Config of the dictionary. Defaults to None.
data_preprocessor (dict or ConfigDict, optional) – The pre-process config of BaseDataPreprocessor. It usually includes pad_size_divisor, pad_value, mean and std. It has to be None when working in non-visual mode. Defaults to None.
init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.
- extract_feat(img, gt_bboxes)[source]¶
Extract features from images if self.backbone is not None; returns None otherwise.
- Parameters
img (torch.Tensor) – The input image with shape (N, C, H, W).
gt_bboxes (list[torch.Tensor]) – A list of ground truth bounding boxes, each of shape \((N_i, 4)\).
- Returns
The extracted features with shape (N, E).
- Return type
torch.Tensor
- forward(inputs, data_samples=None, mode='tensor', **kwargs)[source]¶
The unified entry for the forward process in both training and testing.
The method accepts three modes, “tensor”, “predict” and “loss”:
“tensor”: Forward the whole network and return a tensor or tuple of tensors without any post-processing, like a common nn.Module.
“predict”: Forward and return the predictions, which are fully processed into a list of DetDataSample.
“loss”: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
- Parameters
inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (list[DetDataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – Which kind of value to return. Defaults to ‘tensor’.
- Returns
The return type depends on mode:
If mode="tensor", return a tensor or a tuple of tensors.
If mode="predict", return a list of DetDataSample.
If mode="loss", return a dict of tensors.
- loss(inputs, data_samples, **kwargs)[source]¶
Calculate losses from a batch of inputs and data samples.
- Parameters
inputs (torch.Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
data_samples (list[KIEDataSample]) – A list of N data samples, containing meta information and gold annotations for each image.
- Returns
A dictionary of loss components.
- Return type
dict
- predict(inputs, data_samples, **kwargs)[source]¶
Predict results from a batch of inputs and data samples with post-processing.
- Parameters
inputs (torch.Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
data_samples (list[KIEDataSample]) – A list of N data samples, containing meta information and gold annotations for each image.
- Returns
A list of data samples of prediction results. Results are stored in pred_instances.labels and pred_instances.edge_labels.
- Return type
List[KIEDataSample]
KIE Heads¶
- class mmocr.models.kie.heads.SDMGRHead(dictionary, num_classes=26, visual_dim=64, fusion_dim=1024, node_input=32, node_embed=256, edge_input=5, edge_embed=256, num_gnn=2, bidirectional=False, relation_norm=10.0, module_loss={'type': 'SDMGRModuleLoss'}, postprocessor={'type': 'SDMGRPostProcessor'}, init_cfg={'mean': 0, 'override': {'name': 'edge_embed'}, 'std': 0.01, 'type': 'Normal'})[source]¶
SDMGR Head.
- Parameters
dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.
num_classes (int) – Number of class labels. Defaults to 26.
visual_dim (int) – Dimension of visual features \(E\). Defaults to 64.
fusion_dim (int) – Dimension of the fusion layer. Defaults to 1024.
node_input (int) – Dimension of the raw node embedding. Defaults to 32.
node_embed (int) – Dimension of the node embedding. Defaults to 256.
edge_input (int) – Dimension of the raw edge embedding. Defaults to 5.
edge_embed (int) – Dimension of the edge embedding. Defaults to 256.
num_gnn (int) – Number of GNN layers. Defaults to 2.
bidirectional (bool) – Whether to use a bidirectional RNN to embed nodes. Defaults to False.
relation_norm (float) – Norm used to map values from one range to another. Defaults to 10.0.
module_loss (dict) – Module loss config. Defaults to dict(type='SDMGRModuleLoss').
postprocessor (dict) – Postprocessor config. Defaults to dict(type='SDMGRPostProcessor').
init_cfg (dict or list[dict], optional) – Initialization configs.
- compute_relations(data_samples)[source]¶
Compute the relations between every two boxes for each data sample, then return the concatenated relations.
- Parameters
data_samples (List[KIEDataSample]) – List of data samples.
- Return type
torch.Tensor
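A rough sketch of the pairwise structure that compute_relations produces: one relation vector for every ordered pair of boxes, normalised by relation_norm. The exact feature set used by SDMGR differs; this only illustrates the (N, N, ·) layout and the role of the norm:

```python
def box_relations(boxes, relation_norm=10.0):
    """Pairwise box relations in the spirit of compute_relations:
    offsets between every ordered pair of boxes, normalised against the
    source box size scaled by `relation_norm` (illustrative only)."""
    relations = []
    for x1a, y1a, x2a, y2a in boxes:
        wa, ha = x2a - x1a, y2a - y1a
        row = []
        for x1b, y1b, x2b, y2b in boxes:
            dx = (x1b - x1a) / (wa / relation_norm)
            dy = (y1b - y1a) / (ha / relation_norm)
            row.append((dx, dy))
        relations.append(row)
    return relations  # nested (N, N, 2) structure

rel = box_relations([(0, 0, 10, 10), (20, 0, 30, 10)])
```

Each node thus gets a relation vector to every other node, which is what the GNN layers consume.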
- convert_texts(data_samples)[source]¶
Extract texts in data samples and pack them into a batch.
- Parameters
data_samples (List[KIEDataSample]) – List of data samples.
- Returns
node_nums (List[int]): A list of node numbers for each sample.
char_nums (List[Tensor]): A list of character numbers for each sample.
nodes (Tensor): A tensor of shape \((N, C)\) where \(C\) is the maximum number of characters in a sample.
- Return type
tuple
- forward(inputs, data_samples)[source]¶
- Parameters
inputs (torch.Tensor) – Shape \((N, E)\).
data_samples (List[KIEDataSample]) – List of data samples.
- Returns
node_cls (Tensor): Raw logit scores for nodes. Shape \((N, C_{l})\) where \(C_{l}\) is the number of classes.
edge_cls (Tensor): Raw logit scores for edges. Shape \((N * N, 2)\).
- 返回类型
tuple(Tensor, Tensor)
- loss(inputs, data_samples)[source]¶
Calculate losses from a batch of inputs and data samples.
- Parameters
inputs (torch.Tensor) – Shape \((N, E)\).
data_samples (List[KIEDataSample]) – List of data samples.
- Returns
A dictionary of loss components.
- Return type
dict
- predict(inputs, data_samples)[source]¶
Predict results from a batch of inputs and data samples with post-processing.
- Parameters
inputs (torch.Tensor) – Shape \((N, E)\).
data_samples (List[KIEDataSample]) – List of data samples.
- Returns
A list of data samples of prediction results. Results are stored in pred_instances.labels, pred_instances.scores, pred_instances.edge_labels and pred_instances.edge_scores.
labels (Tensor): An integer tensor of shape (N, ) indicating bbox labels for each image.
scores (Tensor): A float tensor of shape (N, ) indicating the confidence scores for node label predictions.
edge_labels (Tensor): An integer tensor of shape (N, N) indicating the connections between nodes. Options are 0, 1.
edge_scores (Tensor): A float tensor of shape (N, ) indicating the confidence scores for edge predictions.
- Return type
List[KIEDataSample]
KIE Module Losses¶
- class mmocr.models.kie.module_losses.SDMGRModuleLoss(weight_node=1.0, weight_edge=1.0, ignore_idx=-100)[source]¶
The implementation of the loss of key information extraction proposed in the paper Spatial Dual-Modality Graph Reasoning for Key Information Extraction.
- Parameters
weight_node (float) – Weight of the node classification loss. Defaults to 1.0.
weight_edge (float) – Weight of the edge classification loss. Defaults to 1.0.
ignore_idx (int) – Label index ignored in the loss. Defaults to -100.
- forward(preds, data_samples)[source]¶
Forward function.
- Parameters
preds (tuple(Tensor, Tensor)) – Node and edge predictions.
data_samples (list[KIEDataSample]) – A list of data samples containing gt_instances.labels and gt_instances.edge_labels.
- Returns
Loss dict, containing loss_node, loss_edge, acc_node and acc_edge.
- Return type
dict
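A minimal sketch of how weight_node and weight_edge scale the two loss terms, using plain floats instead of tensors (the real module also reports acc_node and acc_edge; names here are illustrative):

```python
def combine_sdmgr_losses(loss_node, loss_edge, weight_node=1.0, weight_edge=1.0):
    """Weight and collect the node and edge classification losses."""
    return dict(
        loss_node=weight_node * loss_node,
        loss_edge=weight_edge * loss_edge,
    )

# Doubling the edge weight emphasises link prediction over node labels.
losses = combine_sdmgr_losses(0.8, 0.5, weight_node=1.0, weight_edge=2.0)
```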
mmocr.structures¶
Text Detection Data Sample¶
- class mmocr.structures.textdet_data_sample.TextDetDataSample(*, metainfo=None, **kwargs)[source]¶
A data structure interface of MMOCR, used as an interface between different components.
The attributes in TextDetDataSample are divided into two parts: gt_instances and pred_instances.
Examples
>>> import torch
>>> import numpy as np
>>> from mmengine.structures import InstanceData
>>> from mmocr.data import TextDetDataSample
>>> # gt_instances
>>> data_sample = TextDetDataSample()
>>> img_meta = dict(img_shape=(800, 1196, 3),
...                 pad_shape=(800, 1216, 3))
>>> gt_instances = InstanceData(metainfo=img_meta)
>>> gt_instances.bboxes = torch.rand((5, 4))
>>> gt_instances.labels = torch.rand((5,))
>>> data_sample.gt_instances = gt_instances
>>> assert 'img_shape' in data_sample.gt_instances.metainfo_keys()
>>> len(data_sample.gt_instances)
5
>>> print(data_sample)
<TextDetDataSample(
    META INFORMATION
    DATA FIELDS
    gt_instances: <InstanceData(
            META INFORMATION
            pad_shape: (800, 1216, 3)
            img_shape: (800, 1196, 3)
            DATA FIELDS
            labels: tensor([0.8533, 0.1550, 0.5433, 0.7294, 0.5098])
            bboxes: tensor([[9.7725e-01, 5.8417e-01, 1.7269e-01, 6.5694e-01],
                    [1.7894e-01, 5.1780e-01, 7.0590e-01, 4.8589e-01],
                    [7.0392e-01, 6.6770e-01, 1.7520e-01, 1.4267e-01],
                    [2.2411e-01, 5.1962e-01, 9.6953e-01, 6.6994e-01],
                    [4.1338e-01, 2.1165e-01, 2.7239e-04, 6.8477e-01]])
        ) at 0x7f21fb1b9190>
) at 0x7f21fb1b9880>
>>> # pred_instances
>>> pred_instances = InstanceData(metainfo=img_meta)
>>> pred_instances.bboxes = torch.rand((5, 4))
>>> pred_instances.scores = torch.rand((5,))
>>> data_sample = TextDetDataSample(pred_instances=pred_instances)
>>> assert 'pred_instances' in data_sample
>>> data_sample = TextDetDataSample()
>>> gt_instances_data = dict(
...     bboxes=torch.rand(2, 4),
...     labels=torch.rand(2),
...     masks=np.random.rand(2, 2, 2))
>>> gt_instances = InstanceData(**gt_instances_data)
>>> data_sample.gt_instances = gt_instances
>>> assert 'gt_instances' in data_sample
>>> assert 'masks' in data_sample.gt_instances
- property gt_instances: mmengine.structures.instance_data.InstanceData¶
Ground truth instances.
- Type
InstanceData
- property pred_instances: mmengine.structures.instance_data.InstanceData¶
Prediction instances.
- Type
InstanceData
Text Recognition Data Sample¶
- class mmocr.structures.textrecog_data_sample.TextRecogDataSample(*, metainfo=None, **kwargs)[source]¶
A data structure interface of MMOCR for text recognition, used as an interface between different components.
The attributes in TextRecogDataSample are divided into two parts: gt_text and pred_text.
Examples
>>> import torch
>>> import numpy as np
>>> from mmengine.structures import LabelData
>>> from mmocr.data import TextRecogDataSample
>>> # gt_text
>>> data_sample = TextRecogDataSample()
>>> img_meta = dict(img_shape=(800, 1196, 3),
...                 pad_shape=(800, 1216, 3))
>>> gt_text = LabelData(metainfo=img_meta)
>>> gt_text.item = 'mmocr'
>>> data_sample.gt_text = gt_text
>>> assert 'img_shape' in data_sample.gt_text.metainfo_keys()
>>> print(data_sample)
<TextRecogDataSample(
    META INFORMATION
    DATA FIELDS
    gt_text: <LabelData(
            META INFORMATION
            pad_shape: (800, 1216, 3)
            img_shape: (800, 1196, 3)
            DATA FIELDS
            item: 'mmocr'
        ) at 0x7f21fb1b9190>
) at 0x7f21fb1b9880>
>>> # pred_text
>>> pred_text = LabelData(metainfo=img_meta)
>>> pred_text.item = 'mmocr'
>>> data_sample = TextRecogDataSample(pred_text=pred_text)
>>> assert 'pred_text' in data_sample
>>> data_sample = TextRecogDataSample()
>>> gt_text_data = dict(item='mmocr')
>>> gt_text = LabelData(**gt_text_data)
>>> data_sample.gt_text = gt_text
>>> assert 'gt_text' in data_sample
>>> assert 'item' in data_sample.gt_text
- property gt_text: mmengine.structures.label_data.LabelData¶
Ground truth text.
- Type
LabelData
- property pred_text: mmengine.structures.label_data.LabelData¶
Prediction text.
- Type
LabelData
KIE Data Sample¶
- class mmocr.structures.kie_data_sample.KIEDataSample(*, metainfo=None, **kwargs)[source]¶
A data structure interface of MMOCR, used as an interface between different components.
The attributes in KIEDataSample are divided into two parts: gt_instances and pred_instances.
Examples
>>> import torch
>>> import numpy as np
>>> from mmengine.structures import InstanceData
>>> from mmocr.data import KIEDataSample
>>> # gt_instances
>>> data_sample = KIEDataSample()
>>> img_meta = dict(img_shape=(800, 1196, 3),
...                 pad_shape=(800, 1216, 3))
>>> gt_instances = InstanceData(metainfo=img_meta)
>>> gt_instances.bboxes = torch.rand((5, 4))
>>> gt_instances.labels = torch.rand((5,))
>>> data_sample.gt_instances = gt_instances
>>> assert 'img_shape' in data_sample.gt_instances.metainfo_keys()
>>> len(data_sample.gt_instances)
5
>>> print(data_sample)
<KIEDataSample(
    META INFORMATION
    DATA FIELDS
    gt_instances: <InstanceData(
            META INFORMATION
            pad_shape: (800, 1216, 3)
            img_shape: (800, 1196, 3)
            DATA FIELDS
            labels: tensor([0.8533, 0.1550, 0.5433, 0.7294, 0.5098])
            bboxes: tensor([[9.7725e-01, 5.8417e-01, 1.7269e-01, 6.5694e-01],
                    [1.7894e-01, 5.1780e-01, 7.0590e-01, 4.8589e-01],
                    [7.0392e-01, 6.6770e-01, 1.7520e-01, 1.4267e-01],
                    [2.2411e-01, 5.1962e-01, 9.6953e-01, 6.6994e-01],
                    [4.1338e-01, 2.1165e-01, 2.7239e-04, 6.8477e-01]])
        ) at 0x7f21fb1b9190>
) at 0x7f21fb1b9880>
>>> # pred_instances
>>> pred_instances = InstanceData(metainfo=img_meta)
>>> pred_instances.bboxes = torch.rand((5, 4))
>>> pred_instances.scores = torch.rand((5,))
>>> data_sample = KIEDataSample(pred_instances=pred_instances)
>>> assert 'pred_instances' in data_sample
>>> data_sample = KIEDataSample()
>>> gt_instances_data = dict(
...     bboxes=torch.rand(2, 4),
...     labels=torch.rand(2))
>>> gt_instances = InstanceData(**gt_instances_data)
>>> data_sample.gt_instances = gt_instances
>>> assert 'gt_instances' in data_sample
- property gt_instances: mmengine.structures.instance_data.InstanceData¶
Ground truth instances.
- Type
InstanceData
- property pred_instances: mmengine.structures.instance_data.InstanceData¶
Prediction instances.
- Type
InstanceData
mmocr.visualization¶
Text Detection Visualizer¶
- class mmocr.visualization.textdet_visualizer.TextDetLocalVisualizer(name='visualizer', image=None, with_poly=True, with_bbox=False, vis_backends=None, save_dir=None, gt_color='g', pred_color='r', line_width=2, alpha=0.8)[源代码]¶
The MMOCR Text Detection Local Visualizer.
- 参数
name (str) – Name of the instance. Defaults to ‘visualizer’.
image (np.ndarray, optional) – The origin image to draw. The format should be RGB. Defaults to None.
with_poly (bool) – Whether to draw polygons. Defaults to True.
with_bbox (bool) – Whether to draw bboxes. Defaults to False.
vis_backends (list, optional) – Visual backend config list. Defaults to None.
save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.
gt_color (Union[str, tuple, list[str], list[tuple]]) – The colors of GT polygons and bboxes.
colors
can have the same length with lines or just single value. Ifcolors
is single value, all the lines will have the same colors. Refer to matplotlib.colors for full list of formats that are accepted. Defaults to ‘g’.pred_color (Union[str, tuple, list[str], list[tuple]]) – The colors of pred polygons and bboxes.
colors
can have the same length with lines or just single value. Ifcolors
is single value, all the lines will have the same colors. Refer to matplotlib.colors for full list of formats that are accepted. Defaults to ‘r’.line_width (int, float) – The linewidth of lines. Defaults to 2.
alpha (float) – The transparency of bboxes or polygons. Defaults to 0.8.
- Return type
- add_datasample(name, image, data_sample=None, draw_gt=True, draw_pred=True, show=False, wait_time=0, out_file=None, pred_score_thr=0.3, step=0)[source]¶
Draw datasample and save to all backends.
If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.
- If `show` is True, all storage backends are ignored, and the images will be displayed in a local window.
- If `out_file` is specified, the drawn image will be saved to `out_file`. This is usually used when the display is not available.
- Parameters
name (str) – The image identifier.
image (np.ndarray) – The image to draw.
data_sample (`TextDetDataSample`, optional) – TextDetDataSample which contains gt and prediction. Defaults to None.
draw_gt (bool) – Whether to draw GT TextDetDataSample. Defaults to True.
draw_pred (bool) – Whether to draw Predicted TextDetDataSample. Defaults to True.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (float) – The interval of show (s). Defaults to 0.
out_file (str) – Path to output file. Defaults to None.
pred_score_thr (float) – The threshold to visualize the bboxes and masks. Defaults to 0.3.
step (int) – Global step value to record. Defaults to 0.
- Return type
Text Recognition Visualizer¶
- class mmocr.visualization.textrecog_visualizer.TextRecogLocalVisualizer(name='visualizer', image=None, vis_backends=None, save_dir=None, gt_color='g', pred_color='r')[source]¶
The MMOCR Text Recognition Local Visualizer.
- Parameters
name (str) – Name of the instance. Defaults to ‘visualizer’.
image (np.ndarray, optional) – The origin image to draw. The format should be RGB. Defaults to None.
vis_backends (list, optional) – Visual backend config list. Defaults to None.
save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.
gt_color (str or tuple[int, int, int]) – Colors of GT text. The tuple of color should be in RGB order. Or using an abbreviation of color, such as ‘g’ for ‘green’. Defaults to ‘g’.
pred_color (str or tuple[int, int, int]) – Colors of Predicted text. The tuple of color should be in RGB order. Or using an abbreviation of color, such as ‘r’ for ‘red’. Defaults to ‘r’.
- Return type
- add_datasample(name, image, data_sample=None, draw_gt=True, draw_pred=True, show=False, wait_time=0, pred_score_thr=None, out_file=None, step=0)[source]¶
Visualize datasample and save to all backends.
If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.
- If `show` is True, all storage backends are ignored, and the images will be displayed in a local window.
- If `out_file` is specified, the drawn image will be saved to `out_file`. This is usually used when the display is not available.
- Parameters
name (str) – The image title. Defaults to ‘image’.
image (np.ndarray) – The image to draw.
data_sample (`TextRecogDataSample`, optional) – TextRecogDataSample which contains gt and prediction. Defaults to None.
draw_gt (bool) – Whether to draw GT TextRecogDataSample. Defaults to True.
draw_pred (bool) – Whether to draw Predicted TextRecogDataSample. Defaults to True.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (float) – The interval of show (s). Defaults to 0.
out_file (str) – Path to output file. Defaults to None.
step (int) – Global step value to record. Defaults to 0.
pred_score_thr (float) – Threshold of prediction score. It’s not used in this function. Defaults to None.
- Return type
Text Spotting Visualizer¶
- class mmocr.visualization.textspotting_visualizer.TextSpottingLocalVisualizer(name='visualizer', image=None, vis_backends=None, save_dir=None, fig_save_cfg={'frameon': False}, fig_show_cfg={'frameon': False})[source]¶
- Parameters
image (Optional[numpy.ndarray]) –
vis_backends (Optional[List[Dict]]) –
save_dir (Optional[str]) –
- Return type
KIE Visualizer¶
- class mmocr.visualization.kie_visualizer.KIELocalVisualizer(name='kie_visualizer', is_openset=False, **kwargs)[source]¶
The MMOCR Key Information Extraction (KIE) Local Visualizer.
- Parameters
name (str) – Name of the instance. Defaults to ‘visualizer’.
image (np.ndarray, optional) – The origin image to draw. The format should be RGB. Defaults to None.
vis_backends (list, optional) – Visual backend config list. Defaults to None.
save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.
fig_save_cfg (dict) – Keyword parameters of figure for saving. Defaults to empty dict.
fig_show_cfg (dict) – Keyword parameters of figure for showing. Defaults to empty dict.
is_openset (bool, optional) – Whether the visualizer is used in OpenSet. Defaults to False.
- Return type
- add_datasample(name, image, data_sample=None, draw_gt=True, draw_pred=True, show=False, wait_time=0, pred_score_thr=None, out_file=None, step=0)[source]¶
Draw datasample and save to all backends.
If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.
- If `show` is True, all storage backends are ignored, and the images will be displayed in a local window.
- If `out_file` is specified, the drawn image will be saved to `out_file`. This is usually used when the display is not available.
- Parameters
name (str) – The image identifier.
image (np.ndarray) – The image to draw.
data_sample (`KIEDataSample`, optional) – KIEDataSample which contains gt and prediction. Defaults to None.
draw_gt (bool) – Whether to draw GT KIEDataSample. Defaults to True.
draw_pred (bool) – Whether to draw Predicted KIEDataSample. Defaults to True.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (float) – The interval of show (s). Defaults to 0.
pred_score_thr (float) – The threshold to visualize the bboxes and masks. Defaults to None.
out_file (str) – Path to output file. Defaults to None.
step (int) – Global step value to record. Defaults to 0.
- Return type
- draw_arrows(x_data, y_data, colors='C1', line_widths=1, line_styles='-', arrow_tail_widths=0.001, arrow_head_widths=None, arrow_head_lengths=None, arrow_shapes='full', overhangs=0)[source]¶
Draw single or multiple arrows.
- Parameters
x_data (np.ndarray or torch.Tensor) – The x coordinates of each line's start and end points.
y_data (np.ndarray or torch.Tensor) – The y coordinates of each line's start and end points.
colors (str or tuple or list[str or tuple]) – The colors of lines. `colors` can have the same length as the lines or be a single value, in which case all lines share the same color. See https://matplotlib.org/stable/gallery/color/named_colors.html for more details. Defaults to 'C1'.
line_widths (int or float or list[int or float]) – The linewidth of lines. `line_widths` can have the same length as the lines or be a single value, in which case all lines share the same linewidth. Defaults to 1.
line_styles (str or list[str]) – The linestyle of lines. `line_styles` can have the same length as the lines or be a single value, in which case all lines share the same linestyle. Defaults to '-'.
arrow_tail_widths (int or float or list[int or float]) – The width of arrow tails. `arrow_tail_widths` can have the same length as the lines or be a single value, in which case all arrows share the same tail width. Defaults to 0.001.
arrow_head_widths (int or float or list[int or float]) – The width of arrow heads. `arrow_head_widths` can have the same length as the lines or be a single value, in which case all arrows share the same head width. Defaults to None.
arrow_head_lengths (int or float or list[int or float]) – The length of arrow heads. `arrow_head_lengths` can have the same length as the lines or be a single value, in which case all arrows share the same head length. Defaults to None.
arrow_shapes (str or list[str]) – The shapes of arrow heads. `arrow_shapes` can have the same length as the lines or be a single value, in which case all arrows share the same shape. Defaults to 'full'.
overhangs (int or list[int]) – The overhang of arrow heads. `overhangs` can have the same length as the lines or be a single value, in which case all arrows share the same overhang. Defaults to 0.
- Return type