mmocr.datasets

class mmocr.datasets.ConcatDataset(datasets, pipeline=[], verify_meta=True, force_apply=False, lazy_init=False)[source]

A wrapper of concatenated dataset.

Same as torch.utils.data.dataset.ConcatDataset, and supports lazy_init.

Note

ConcatDataset should not inherit from BaseDataset, since get_subset and get_subset_ could produce a sub-dataset with ambiguous meaning that conflicts with the original dataset. To use a sub-dataset of ConcatDataset, set the indices argument on the wrapped datasets, which inherit from BaseDataset.

Parameters
  • datasets (Sequence[BaseDataset] or Sequence[dict]) – A list of datasets which will be concatenated.

  • pipeline (list, optional) – Processing pipeline to be applied to all of the concatenated datasets. Defaults to [].

  • verify_meta (bool) – Whether to verify the consistency of meta information of the concatenated datasets. Defaults to True.

  • force_apply (bool) – Whether to force apply pipeline to all datasets if any of them already has the pipeline configured. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotation during instantiation. Defaults to False.
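As an illustrative sketch, two datasets could be concatenated through config dicts like the following. The dataset roots, annotation files, and pipeline entries are hypothetical examples, not values from this documentation:

```python
# A minimal ConcatDataset config sketch. The wrapped dataset paths,
# annotation files, and pipeline contents below are hypothetical.
train_dataset = dict(
    type='ConcatDataset',
    datasets=[
        dict(
            type='OCRDataset',
            data_root='data/dataset_a',        # hypothetical path
            ann_file='textrecog_train.json'),  # hypothetical file
        dict(
            type='OCRDataset',
            data_root='data/dataset_b',        # hypothetical path
            ann_file='textrecog_train.json'),  # hypothetical file
    ],
    # The shared pipeline is applied to every wrapped dataset.
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadOCRAnnotations', with_text=True),
    ],
    # Check that the wrapped datasets carry consistent meta information.
    verify_meta=True)
```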

class mmocr.datasets.IcdarDataset(*args, proposal_file=None, file_client_args={'backend': 'disk'}, **kwargs)[source]

Dataset for text detection with ann_file in COCO format.

Parameters
  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict) – Prefix for training data. Defaults to dict(img_path=’’).

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.

parse_data_info(raw_data_info)[source]

Parse raw annotation to target format.

Parameters

raw_data_info (dict) – Raw data information loaded from ann_file.

Returns

Parsed annotation.

Return type

Union[dict, List[dict]]

class mmocr.datasets.OCRDataset(ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]

OCRDataset for text detection and text recognition.

The annotation format is shown as follows.

{
    "metainfo":
    {
      "dataset_type": "test_dataset",
      "task_name": "test_task"
    },
    "data_list":
    [
      {
        "img_path": "test_img.jpg",
        "height": 604,
        "width": 640,
        "instances":
        [
          {
            "bbox": [0, 0, 10, 20],
            "bbox_label": 1,
            "mask": [0,0,0,10,10,20,20,0],
            "text": "123"
          },
          {
            "bbox": [10, 10, 110, 120],
            "bbox_label": 2,
            "mask": [10,10,10,110,110,120,120,10],
            "extra_anns": "456"
          }
        ]
      }
    ]
}
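Such an annotation file can be produced with the standard json module. A minimal sketch, assuming a single hypothetical image and using only the keys shown above:

```python
import json

# Build an annotation dict in the format shown above. The image path
# and sizes below are hypothetical examples.
ann = {
    "metainfo": {"dataset_type": "test_dataset", "task_name": "test_task"},
    "data_list": [{
        "img_path": "test_img.jpg",
        "height": 604,
        "width": 640,
        "instances": [{
            "bbox": [0, 0, 10, 20],  # x1, y1, x2, y2
            "bbox_label": 1,
            "text": "123",
        }],
    }],
}
serialized = json.dumps(ann)
# Round-trip to confirm the structure survives JSON serialization.
assert json.loads(serialized)["data_list"][0]["img_path"] == "test_img.jpg"
```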
Parameters
  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to ''.

  • data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img_path='').

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. OCRDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if OCRDataset.prepare_data gets a None image. Defaults to 1000.

Note

OCRDataset collects meta information from the annotation file (lowest priority), ``OCRDataset.METAINFO`` (medium) and the metainfo parameter passed to the constructor (highest). Lower priority meta information will be overwritten by higher priority meta information.

Examples

Assume the annotation file given above.

>>> class CustomDataset(OCRDataset):
>>>     METAINFO: dict = dict(task_name='custom_task',
>>>                           dataset_type='custom_type')
>>> metainfo = dict(task_name='custom_task_name')
>>> custom_dataset = CustomDataset(
>>>     'path/to/ann_file',
>>>     metainfo=metainfo)
>>> # meta information of annotation file will be overwritten by
>>> # CustomDataset.METAINFO. The merged meta information will
>>> # further be overwritten by argument metainfo.
>>> custom_dataset.metainfo
{'task_name': 'custom_task_name', 'dataset_type': 'custom_type'}

class mmocr.datasets.RecogLMDBDataset(ann_file='', parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]

RecogLMDBDataset for text recognition.

The annotations should be in lmdb format. We support two lmdb formats: one is the lmdb file with only labels, generated by txt2lmdb (deprecated), and the other is the lmdb file generated by recog2lmdb.

The former stores strings directly in lmdb in the filename text format, while the latter uses image_key and label_key for querying.

Parameters
  • ann_file (str) – Annotation file path. Defaults to ''.

  • parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogLMDBDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogLMDBDataset.prepare_data gets a None image. Defaults to 1000.
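A minimal config sketch for this dataset follows. The lmdb path is a hypothetical example, and the image-loading transform choice is an assumption (lmdb stores image buffers rather than files on disk), so verify both against your MMOCR version:

```python
# A minimal RecogLMDBDataset config sketch. The lmdb path and the
# pipeline entries are hypothetical examples.
train_dataset = dict(
    type='RecogLMDBDataset',
    ann_file='data/rec_train.lmdb',  # hypothetical lmdb from recog2lmdb
    data_prefix=dict(img_path=''),
    pipeline=[
        # Assumption: images come back as decoded arrays rather than
        # file paths, so a from-array loader is used instead of
        # LoadImageFromFile.
        dict(type='LoadImageFromNDArray'),
        dict(type='LoadOCRAnnotations', with_text=True),
    ])
```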

close()[source]

Close the lmdb environment.

load_data_list()[source]

Load annotations from the annotation file named self.ann_file.

Returns

A list of annotations.

Return type

List[dict]

parse_data_info(raw_anno_info)[source]

Parse raw annotation to target format.

Parameters

raw_anno_info (str) – A single raw data record loaded from ann_file.

Returns

Parsed annotation.

Return type

dict

class mmocr.datasets.RecogTextDataset(ann_file='', file_client_args=None, parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]

RecogTextDataset for text recognition.

The annotation format can be either jsonl or txt. A jsonl annotation file is a list of dicts, one per line, while a txt annotation file is a list of plain-text lines.

The annotation formats are shown as follows.

  • txt format

test_img1.jpg OpenMMLab
test_img2.jpg MMOCR

  • jsonl format

``{"filename": "test_img1.jpg", "text": "OpenMMLab"}``
``{"filename": "test_img2.jpg", "text": "MMOCR"}``
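The parsing logic behind the two formats can be sketched in plain Python. This is an illustrative approximation only, not MMOCR's actual LineStrParser/LineJsonParser implementations:

```python
import json

def parse_txt_line(line):
    """Parse a 'filename text' line from a txt annotation file."""
    filename, text = line.strip().split(maxsplit=1)
    return {'filename': filename, 'text': text}

def parse_jsonl_line(line):
    """Parse one JSON dict per line from a jsonl annotation file."""
    return json.loads(line)

print(parse_txt_line('test_img1.jpg OpenMMLab'))
# → {'filename': 'test_img1.jpg', 'text': 'OpenMMLab'}
print(parse_jsonl_line('{"filename": "test_img2.jpg", "text": "MMOCR"}'))
# → {'filename': 'test_img2.jpg', 'text': 'MMOCR'}
```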
Parameters
  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Default: None.

  • parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogTextDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogTextDataset.prepare_data gets a None image. Defaults to 1000.

load_data_list()[source]

Load annotations from the annotation file named self.ann_file.

Returns

A list of annotations.

Return type

List[dict]

parse_data_info(raw_anno_info)[source]

Parse raw annotation to target format.

Parameters

raw_anno_info (str) – A single raw data record loaded from ann_file.

Returns

Parsed annotation.

Return type

dict

class mmocr.datasets.WildReceiptDataset(directed=False, ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=Ellipsis, test_mode=False, lazy_init=False, max_refetch=1000)[source]

WildReceipt Dataset for key information extraction. There are two files to be loaded: metainfo and annotation. The metainfo file contains the mapping between classes and labels. The annotation file contains all the necessary information about the image, such as bounding boxes, texts, and labels.

The metainfo file is a text file with the following format:

0 Ignore
1 Store_name_value
2 Store_name_key
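A sketch of how such a class file maps to a label-to-name dict, in plain Python (illustrative only, not the dataset's actual metainfo loader):

```python
def load_class_map(lines):
    """Parse 'label class_name' lines into an id -> name mapping."""
    mapping = {}
    for line in lines:
        label, name = line.strip().split(maxsplit=1)
        mapping[int(label)] = name
    return mapping

classes = load_class_map([
    '0 Ignore',
    '1 Store_name_value',
    '2 Store_name_key',
])
print(classes[1])  # → Store_name_value
```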

The annotation format is shown as follows.

{
    "file_name": "a.jpeg",
    "height": 348,
    "width": 348,
    "annotations": [
        {
            "box": [
                114.0,
                19.0,
                230.0,
                19.0,
                230.0,
                1.0,
                114.0,
                1.0
            ],
            "text": "CHOEUN",
            "label": 1
        },
        {
            "box": [
                97.0,
                35.0,
                236.0,
                35.0,
                236.0,
                19.0,
                97.0,
                19.0
            ],
            "text": "KOREANRESTAURANT",
            "label": 2
        }
    ]
}
Parameters
  • directed (bool) – Whether to use directed graph. Defaults to False.

  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • metainfo (str or dict, optional) – Meta information for dataset, such as class information. If it’s a string, it will be treated as a path to the class file from which the class information will be loaded. Defaults to None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img_path=’’).

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.

load_data_list()[source]

Load the data list from the annotation file.

Returns

A list of annotation dicts.

Return type

List[dict]

parse_data_info(raw_data_info)[source]

Parse data info from raw data info.

Parameters

raw_data_info (dict) – Raw data info.

Returns

Parsed data info.

  • img_path (str): Path to the image.

  • img_shape (tuple(int, int)): Image shape in (H, W).

  • instances (list[dict]): A list of instances.

    - bbox (ndarray(dtype=np.float32)): Shape (4, ). Bounding box.

    - text (str): Annotation text.

    - edge_label (int): Edge label.

    - bbox_label (int): Bounding box label.

Return type

dict

Dataset Types

class mmocr.datasets.ocr_dataset.OCRDataset(ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]

OCRDataset for text detection and text recognition.

The annotation format is shown as follows.

{
    "metainfo":
    {
      "dataset_type": "test_dataset",
      "task_name": "test_task"
    },
    "data_list":
    [
      {
        "img_path": "test_img.jpg",
        "height": 604,
        "width": 640,
        "instances":
        [
          {
            "bbox": [0, 0, 10, 20],
            "bbox_label": 1,
            "mask": [0,0,0,10,10,20,20,0],
            "text": "123"
          },
          {
            "bbox": [10, 10, 110, 120],
            "bbox_label": 2,
            "mask": [10,10,10,110,110,120,120,10],
            "extra_anns": "456"
          }
        ]
      }
    ]
}
Parameters
  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to ''.

  • data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img_path='').

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. OCRDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if OCRDataset.prepare_data gets a None image. Defaults to 1000.

Note

OCRDataset collects meta information from the annotation file (lowest priority), ``OCRDataset.METAINFO`` (medium) and the metainfo parameter passed to the constructor (highest). Lower priority meta information will be overwritten by higher priority meta information.

Examples

Assume the annotation file given above.

>>> class CustomDataset(OCRDataset):
>>>     METAINFO: dict = dict(task_name='custom_task',
>>>                           dataset_type='custom_type')
>>> metainfo = dict(task_name='custom_task_name')
>>> custom_dataset = CustomDataset(
>>>     'path/to/ann_file',
>>>     metainfo=metainfo)
>>> # meta information of annotation file will be overwritten by
>>> # CustomDataset.METAINFO. The merged meta information will
>>> # further be overwritten by argument metainfo.
>>> custom_dataset.metainfo
{'task_name': 'custom_task_name', 'dataset_type': 'custom_type'}

class mmocr.datasets.icdar_dataset.IcdarDataset(*args, proposal_file=None, file_client_args={'backend': 'disk'}, **kwargs)[source]

Dataset for text detection with ann_file in COCO format.

Parameters
  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict) – Prefix for training data. Defaults to dict(img_path=’’).

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.

parse_data_info(raw_data_info)[source]

Parse raw annotation to target format.

Parameters

raw_data_info (dict) – Raw data information loaded from ann_file.

Returns

Parsed annotation.

Return type

Union[dict, List[dict]]

class mmocr.datasets.recog_lmdb_dataset.RecogLMDBDataset(ann_file='', parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]

RecogLMDBDataset for text recognition.

The annotations should be in lmdb format. We support two lmdb formats: one is the lmdb file with only labels, generated by txt2lmdb (deprecated), and the other is the lmdb file generated by recog2lmdb.

The former stores strings directly in lmdb in the filename text format, while the latter uses image_key and label_key for querying.

Parameters
  • ann_file (str) – Annotation file path. Defaults to ''.

  • parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogLMDBDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogLMDBDataset.prepare_data gets a None image. Defaults to 1000.

close()[source]

Close the lmdb environment.

load_data_list()[source]

Load annotations from the annotation file named self.ann_file.

Returns

A list of annotations.

Return type

List[dict]

parse_data_info(raw_anno_info)[source]

Parse raw annotation to target format.

Parameters

raw_anno_info (str) – A single raw data record loaded from ann_file.

Returns

Parsed annotation.

Return type

dict

class mmocr.datasets.recog_text_dataset.RecogTextDataset(ann_file='', file_client_args=None, parser_cfg={'keys': ['filename', 'text'], 'type': 'LineJsonParser'}, metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=[], test_mode=False, lazy_init=False, max_refetch=1000)[source]

RecogTextDataset for text recognition.

The annotation format can be either jsonl or txt. A jsonl annotation file is a list of dicts, one per line, while a txt annotation file is a list of plain-text lines.

The annotation formats are shown as follows.

  • txt format

test_img1.jpg OpenMMLab
test_img2.jpg MMOCR

  • jsonl format

``{"filename": "test_img1.jpg", "text": "OpenMMLab"}``
``{"filename": "test_img2.jpg", "text": "MMOCR"}``
Parameters
  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Default: None.

  • parser_cfg (dict, optional) – Config of the parser for parsing annotations. Use LineJsonParser when the annotation file is in jsonl format with keys of filename and text. The keys in parser_cfg should be consistent with the keys in the jsonl annotations: the first key should be the key of the path, and the second key should be the key of the text. Use LineStrParser when the annotation file is in txt format. Defaults to dict(type='LineJsonParser', keys=['filename', 'text']).

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Defaults to None.

  • data_root (str) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict) – Prefix for training data. Defaults to dict(img_path='').

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. RecogTextDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if RecogTextDataset.prepare_data gets a None image. Defaults to 1000.

load_data_list()[source]

Load annotations from the annotation file named self.ann_file.

Returns

A list of annotations.

Return type

List[dict]

parse_data_info(raw_anno_info)[source]

Parse raw annotation to target format.

Parameters

raw_anno_info (str) – A single raw data record loaded from ann_file.

Returns

Parsed annotation.

Return type

dict

class mmocr.datasets.wildreceipt_dataset.WildReceiptDataset(directed=False, ann_file='', metainfo=None, data_root='', data_prefix={'img_path': ''}, filter_cfg=None, indices=None, serialize_data=True, pipeline=Ellipsis, test_mode=False, lazy_init=False, max_refetch=1000)[source]

WildReceipt Dataset for key information extraction. There are two files to be loaded: metainfo and annotation. The metainfo file contains the mapping between classes and labels. The annotation file contains all the necessary information about the image, such as bounding boxes, texts, and labels.

The metainfo file is a text file with the following format:

0 Ignore
1 Store_name_value
2 Store_name_key

The annotation format is shown as follows.

{
    "file_name": "a.jpeg",
    "height": 348,
    "width": 348,
    "annotations": [
        {
            "box": [
                114.0,
                19.0,
                230.0,
                19.0,
                230.0,
                1.0,
                114.0,
                1.0
            ],
            "text": "CHOEUN",
            "label": 1
        },
        {
            "box": [
                97.0,
                35.0,
                236.0,
                35.0,
                236.0,
                19.0,
                97.0,
                19.0
            ],
            "text": "KOREANRESTAURANT",
            "label": 2
        }
    ]
}
Parameters
  • directed (bool) – Whether to use directed graph. Defaults to False.

  • ann_file (str) – Annotation file path. Defaults to ‘’.

  • metainfo (str or dict, optional) – Meta information for dataset, such as class information. If it’s a string, it will be treated as a path to the class file from which the class information will be loaded. Defaults to None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Defaults to ‘’.

  • data_prefix (dict, optional) – Prefix for training data. Defaults to dict(img_path=’’).

  • filter_cfg (dict, optional) – Config for filter data. Defaults to None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Defaults to None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects; when enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Defaults to False.

  • lazy_init (bool, optional) – Whether to load annotations during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so it is not necessary to load the annotation file. BaseDataset can skip loading annotations to save time by setting lazy_init=True. Defaults to False.

  • max_refetch (int, optional) – The maximum number of extra cycles to fetch a valid image if BaseDataset.prepare_data gets a None image. Defaults to 1000.

load_data_list()[source]

Load the data list from the annotation file.

Returns

A list of annotation dicts.

Return type

List[dict]

parse_data_info(raw_data_info)[source]

Parse data info from raw data info.

Parameters

raw_data_info (dict) – Raw data info.

Returns

Parsed data info.

  • img_path (str): Path to the image.

  • img_shape (tuple(int, int)): Image shape in (H, W).

  • instances (list[dict]): A list of instances.

    - bbox (ndarray(dtype=np.float32)): Shape (4, ). Bounding box.

    - text (str): Annotation text.

    - edge_label (int): Edge label.

    - bbox_label (int): Bounding box label.

Return type

dict

Transforms

class mmocr.datasets.transforms.BoundedScaleAspectJitter(long_size_bound, short_size_bound, ratio_range=(0.7, 1.3), aspect_ratio_range=(0.9, 1.1), resize_type='Resize', **resize_kwargs)[source]

First randomly rescale the image so that its long side and short side are around the given bounds; then jitter its aspect ratio.

Required Keys:

  • img

  • img_shape

  • gt_bboxes (optional)

  • gt_polygons (optional)

Modified Keys:

  • img

  • img_shape

  • gt_bboxes (optional)

  • gt_polygons (optional)

Added Keys:

  • scale

  • scale_factor

  • keep_ratio

Parameters
  • long_size_bound (int) – The approximate bound for long size.

  • short_size_bound (int) – The approximate bound for short size.

  • ratio_range (tuple(float, float)) – Range of the ratio used to jitter the size. Defaults to (0.7, 1.3).

  • aspect_ratio_range (tuple(float, float)) – Range of the ratio used to jitter the aspect ratio. Defaults to (0.9, 1.1).

  • resize_type (str) – The type of resize class to use. Defaults to “Resize”.

  • **resize_kwargs – Other keyword arguments for the resize_type.

返回类型

None

transform(results)[source]

The transform function. All subclasses of BaseTransform should override this method.

This function takes the result dict as input, and can add new items to the dict or modify existing ones. The result dict is returned at the end, which allows multiple transforms to be concatenated into a pipeline.

Parameters

results (dict) – The result dict.

Returns

The result dict.

Return type

dict
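The scale sampling described above can be sketched as follows. This is a simplified illustration of the documented behavior, not MMOCR's actual implementation; the function name and the exact fallback for a too-small short side are assumptions:

```python
import random

def sample_scale(h, w, long_size_bound, short_size_bound,
                 ratio_range=(0.7, 1.3), aspect_ratio_range=(0.9, 1.1)):
    # Cap the long side at its bound.
    scale = long_size_bound / max(h, w) if max(h, w) > long_size_bound else 1.0
    # Jitter the overall size.
    scale *= random.uniform(*ratio_range)
    # Keep the short side around its bound.
    if min(h, w) * scale <= short_size_bound:
        scale = short_size_bound / min(h, w)
    # Jitter the aspect ratio symmetrically between height and width.
    aspect = random.uniform(*aspect_ratio_range)
    return int(h * scale * aspect ** 0.5), int(w * scale / aspect ** 0.5)

random.seed(0)
new_h, new_w = sample_scale(720, 1280, long_size_bound=1280,
                            short_size_bound=640)
```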

class mmocr.datasets.transforms.FixInvalidPolygon(mode='fix', min_poly_points=3)[source]

Fix invalid polygons in the dataset.

Required Keys:

  • gt_polygons

  • gt_ignored

Modified Keys:

  • gt_polygons

  • gt_ignored

Parameters

  • mode (str) – The mode of fixing invalid polygons. Options are ‘fix’ and ‘ignore’. In ‘fix’ mode, the transform tries to fix an invalid polygon into a valid one by eliminating its self-intersections. In ‘ignore’ mode, invalid polygons are ignored during training. Defaults to ‘fix’.

  • min_poly_points (int) – Minimum number of coordinate points in a polygon. Defaults to 3.

Return type

None

transform(results)[source]

Fix invalid polygons.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict
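The ‘ignore’ mode amounts to flagging bad polygons instead of repairing them. A minimal pure-Python sketch, assuming a polygon is a flat [x1, y1, x2, y2, ...] sequence and using only the point-count check (real validity checks in MMOCR are more involved):

```python
def ignore_invalid(gt_polygons, gt_ignored, min_poly_points=3):
    """Mark polygons with too few points as ignored ('ignore' mode sketch)."""
    out = []
    for poly, ignored in zip(gt_polygons, gt_ignored):
        num_points = len(poly) // 2   # flat [x1, y1, x2, y2, ...] layout
        out.append(ignored or num_points < min_poly_points)
    return out

polys = [[0, 0, 10, 0, 10, 10], [0, 0, 5, 5]]  # a triangle, a 2-point line
flags = ignore_invalid(polys, [False, False])
# → [False, True]
```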

class mmocr.datasets.transforms.ImgAugWrapper(args=None)[source]

A wrapper around imgaug: https://github.com/aleju/imgaug.

Find available augmenters at https://imgaug.readthedocs.io/en/latest/source/overview_of_augmenters.html.

Required Keys:

  • img

  • gt_polygons (optional for text recognition)

  • gt_bboxes (optional for text recognition)

  • gt_bboxes_labels (optional for text recognition)

  • gt_ignored (optional for text recognition)

  • gt_texts (optional)

Modified Keys:

  • img

  • gt_polygons (optional for text recognition)

  • gt_bboxes (optional for text recognition)

  • gt_bboxes_labels (optional for text recognition)

  • gt_ignored (optional for text recognition)

  • img_shape (optional)

  • gt_texts (optional)

Parameters

args (list[list or dict], optional) – The augmentation list. For details, please refer to the imgaug documentation. Take args=[['Fliplr', 0.5], dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]] as an example: it horizontally flips images with probability 0.5, followed by random rotation with angles in [-10, 10], and resizes with an independent scale in [0.5, 3.0] for each side of the image. Defaults to None.

Return type

None

transform(results)[source]

Transform the image and annotation data.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict
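In a pipeline config, this wrapper is typically used as a dict entry; the example below mirrors the args documented above:

```python
# Illustrative pipeline entry: flip, rotate, then resize via imgaug.
imgaug_transform = dict(
    type='ImgAugWrapper',
    args=[
        ['Fliplr', 0.5],                       # horizontal flip, p = 0.5
        dict(cls='Affine', rotate=[-10, 10]),  # random rotation in degrees
        ['Resize', [0.5, 3.0]],                # random per-side scale
    ])
```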

class mmocr.datasets.transforms.LoadImageFromFile(to_float32=False, color_type='color', imdecode_backend='cv2', file_client_args={'backend': 'disk'}, min_size=0, ignore_empty=False)[source]

Load an image from file.

Required Keys:

  • img_path

Modified Keys:

  • img

  • img_shape

  • ori_shape

Parameters

  • to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is a uint8 array. Defaults to False.

  • color_type (str) – The flag argument for mmcv.imfrombytes. Defaults to ‘color’.

  • imdecode_backend (str) – The image decoding backend type. The backend argument for mmcv.imfrombytes. See mmcv.imfrombytes for details. Defaults to ‘cv2’.

  • file_client_args (dict) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to dict(backend='disk').

  • ignore_empty (bool) – Whether to allow loading an empty image or a nonexistent file path. Defaults to False.

  • min_size (int) – The minimum size of the image to be loaded. If the image is smaller than the minimum size, it will be regarded as a broken image. Defaults to 0.

Return type

None

transform(results)[source]

Function to load an image.

Parameters

results (dict) – Result dict from mmcv.BaseDataset.

Return type

Optional[dict]

class mmocr.datasets.transforms.LoadImageFromLMDB(to_float32=False, color_type='color', imdecode_backend='cv2', file_client_args={}, ignore_empty=False)[source]

Load an image from an LMDB file. Only LMDB files on disk are supported.

An LMDB file is organized with the following structure:

lmdb
|__ data.mdb
|__ lock.mdb

Required Keys:

  • img_path (in LMDB, img_path is a key in the format of “image-{i:09d}”)

Modified Keys:

  • img

  • img_shape

  • ori_shape

Parameters

  • to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is a uint8 array. Defaults to False.

  • color_type (str) – The flag argument for mmcv.imfrombytes. Defaults to ‘color’.

  • imdecode_backend (str) – The image decoding backend type. The backend argument for mmcv.imfrombytes. See mmcv.imfrombytes for details. Defaults to ‘cv2’.

  • file_client_args (dict) – Arguments to instantiate a FileClient except for backend and db_path. See mmengine.fileio.FileClient for details. Defaults to dict().

  • ignore_empty (bool) – Whether to allow loading an empty image or a nonexistent file path. Defaults to False.

Return type

None

transform(results)[source]

Function to load an image from an LMDB file.

Parameters

results (dict) – Result dict from mmcv.BaseDataset.

Returns

The dict containing the loaded image and meta information.

Return type

dict

class mmocr.datasets.transforms.LoadImageFromNDArray(to_float32=False, color_type='color', imdecode_backend='cv2', file_client_args={'backend': 'disk'}, min_size=0, ignore_empty=False)[source]

Load an image from results['img'].

Similar to LoadImageFromFile, but the image has already been loaded as an np.ndarray in results['img']. Can be used when loading images from a webcam.

Required Keys:

  • img

Modified Keys:

  • img

  • img_path

  • img_shape

  • ori_shape

Parameters

  • to_float32 (bool) – Whether to convert the loaded image to a float32 numpy array. If set to False, the loaded image is a uint8 array. Defaults to False.

  • color_type (str) – The flag argument for mmcv.imfrombytes. Defaults to ‘color’.

  • imdecode_backend (str) – The image decoding backend type. Defaults to ‘cv2’.

  • file_client_args (dict) – Arguments to instantiate a FileClient. Defaults to dict(backend='disk').

  • min_size (int) – The minimum size of the image to be loaded. Defaults to 0.

  • ignore_empty (bool) – Whether to allow loading an empty image or a nonexistent file path. Defaults to False.

Return type

None

transform(results)[source]

Transform function to add image meta information.

Parameters

results (dict) – Result dict with the webcam-read image in results['img'].

Returns

The dict containing the loaded image and meta information.

Return type

dict

class mmocr.datasets.transforms.LoadKIEAnnotations(with_bbox=True, with_label=True, with_text=True, directed=False, key_node_idx=None, value_node_idx=None, **kwargs)[source]

Load and process the instances annotation provided by the dataset.

The annotation format is as follows:

{
    # A nested list of 4 numbers representing the bounding box of the
    # instance, in (x1, y1, x2, y2) order.
    'bbox': np.array([[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
                     dtype=np.int32),

    # Labels of boxes. Shape is (N,).
    'bbox_labels': np.array([0, 2, ...], dtype=np.int32),

    # Labels of edges. Shape (N, N).
    'edge_labels': np.array([0, 2, ...], dtype=np.int32),

    # List of texts.
    "texts": ['text1', 'text2', ...],
}

After this module, the annotation has been changed to the format below:

{
    # In (x1, y1, x2, y2) order, float type. N is the number of bboxes
    # in np.float32
    'gt_bboxes': np.ndarray(N, 4),
    # In np.int64 type.
    'gt_bboxes_labels': np.ndarray(N, ),
    # In np.int32 type.
    'gt_edges_labels': np.ndarray(N, N),
    # In list[str]
    'gt_texts': list[str],
    # tuple(int)
    'ori_shape': (H, W)
}

Required Keys:

  • bboxes

  • bbox_labels

  • edge_labels

  • texts

Added Keys:

  • gt_bboxes (np.float32)

  • gt_bboxes_labels (np.int64)

  • gt_edges_labels (np.int64)

  • gt_texts (list[str])

  • ori_shape (tuple[int])

Parameters

  • with_bbox (bool) – Whether to parse and load the bbox annotation. Defaults to True.

  • with_label (bool) – Whether to parse and load the label annotation. Defaults to True.

  • with_text (bool) – Whether to parse and load the text annotation. Defaults to True.

  • directed (bool) – Whether to build edges as a directed graph. Defaults to False.

  • key_node_idx (int, optional) – Key node label, used to mask out edges that are not connected from key nodes to value nodes. It has to be specified together with value_node_idx. Defaults to None.

  • value_node_idx (int, optional) – Value node label, used to mask out edges that are not connected from key nodes to value nodes. It has to be specified together with key_node_idx. Defaults to None.

Return type

None

transform(results)[source]

Function to load multiple types of annotations.

Parameters

results (dict) – Result dict from OCRDataset.

Returns

The dict containing loaded bounding box, label, polygon and text annotations.

Return type

dict

class mmocr.datasets.transforms.LoadOCRAnnotations(with_bbox=False, with_label=False, with_polygon=False, with_text=False, **kwargs)[source]

Load and process the instances annotation provided by the dataset.

The annotation format is as follows:

{
    'instances':
    [
        {
        # List of 4 numbers representing the bounding box of the
        # instance, in (x1, y1, x2, y2) order.
        # used in text detection or text spotting tasks.
        'bbox': [x1, y1, x2, y2],

        # Label of instance, usually it's 0.
        # used in text detection or text spotting tasks.
        'bbox_label': 0,

        # List of n numbers representing the polygon of the
        # instance, in (xn, yn) order.
        # used in text detection/ textspotter.
        "polygon": [x1, y1, x2, y2, ... xn, yn],

        # The flag indicating whether the instance should be ignored.
        # used in text detection or text spotting tasks.
        "ignore": False,

        # The groundtruth of text.
        # used in text recognition or text spotting tasks.
        "text": 'tmp',
        }
    ]
}

After this module, the annotation has been changed to the format below:

{
    # In (x1, y1, x2, y2) order, float type. N is the number of bboxes
    # in np.float32
    'gt_bboxes': np.ndarray(N, 4)
     # In np.int64 type.
    'gt_bboxes_labels': np.ndarray(N, )
    # In (x1, y1,..., xk, yk) order, float type.
    # in list[np.float32]
    'gt_polygons': list[np.ndarray(2k, )]
     # In np.bool_ type.
    'gt_ignored': np.ndarray(N, )
     # In list[str]
    'gt_texts': list[str]
}

Required Keys:

  • instances

    • bbox (optional)

    • bbox_label (optional)

    • polygon (optional)

    • ignore (optional)

    • text (optional)

Added Keys:

  • gt_bboxes (np.float32)

  • gt_bboxes_labels (np.int64)

  • gt_polygons (list[np.float32])

  • gt_ignored (np.bool_)

  • gt_texts (list[str])

Parameters

  • with_bbox (bool) – Whether to parse and load the bbox annotation. Defaults to False.

  • with_label (bool) – Whether to parse and load the label annotation. Defaults to False.

  • with_polygon (bool) – Whether to parse and load the polygon annotation. Defaults to False.

  • with_text (bool) – Whether to parse and load the text annotation. Defaults to False.

Return type

None

transform(results)[source]

Function to load multiple types of annotations.

Parameters

results (dict) – Result dict from OCRDataset.

Returns

The dict containing loaded bounding box, label, polygon and text annotations.

Return type

dict
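The conversion sketched below shows how the raw instances list maps onto the loaded gt_* format described above. It is a simplified re-implementation for illustration only (the helper name is hypothetical, and it assumes every annotation key is present):

```python
import numpy as np

def load_ocr_annotations(instances):
    """Sketch of converting raw 'instances' to the loaded gt_* format."""
    gt_bboxes = np.array([i['bbox'] for i in instances], dtype=np.float32)
    gt_bboxes_labels = np.array([i['bbox_label'] for i in instances],
                                dtype=np.int64)
    gt_polygons = [np.array(i['polygon'], dtype=np.float32)
                   for i in instances]
    gt_ignored = np.array([i['ignore'] for i in instances], dtype=np.bool_)
    gt_texts = [i['text'] for i in instances]
    return dict(gt_bboxes=gt_bboxes, gt_bboxes_labels=gt_bboxes_labels,
                gt_polygons=gt_polygons, gt_ignored=gt_ignored,
                gt_texts=gt_texts)

anns = load_ocr_annotations([
    dict(bbox=[0, 0, 10, 10], bbox_label=0,
         polygon=[0, 0, 10, 0, 10, 10, 0, 10], ignore=False, text='tmp'),
])
```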

class mmocr.datasets.transforms.MMDet2MMOCR[source]

Convert the data format from MMDet to MMOCR.

Required Keys:

  • gt_masks (PolygonMasks | BitmapMasks) (optional)

  • gt_ignore_flags (np.bool) (optional)

Added Keys:

  • gt_polygons (list[np.ndarray])

  • gt_ignored (np.ndarray)

transform(results)[source]

Convert MMDet’s data format to MMOCR’s data format.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.MMOCR2MMDet(poly2mask=False)[source]

Convert the data format from MMOCR to MMDet.

Required Keys:

  • img_shape

  • gt_polygons (List[ndarray]) (optional)

  • gt_ignored (np.bool) (optional)

Added Keys:

  • gt_masks (PolygonMasks | BitmapMasks) (optional)

  • gt_ignore_flags (np.bool) (optional)

Parameters

poly2mask (bool) – Whether to convert masks to bitmaps. Defaults to False.

Return type

None

transform(results)[source]

Convert MMOCR’s data format to MMDet’s data format.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.PackKIEInputs(meta_keys=())[source]

Pack the inputs data for key information extraction.

The type of outputs is dict:

  • inputs: image converted to tensor, whose shape is (C, H, W).

  • data_samples: Two components of TextDetDataSample will be updated:

    • gt_instances (InstanceData): Depending on annotations, a subset of the following keys will be updated:

      • bboxes (torch.Tensor((N, 4), dtype=torch.float32)): The groundtruth of bounding boxes in the form of [x1, y1, x2, y2]. Renamed from ‘gt_bboxes’.

      • labels (torch.LongTensor(N)): The labels of instances. Renamed from ‘gt_bboxes_labels’.

      • edge_labels (torch.LongTensor(N, N)): The edge labels. Renamed from ‘gt_edges_labels’.

      • texts (list[str]): The groundtruth texts. Renamed from ‘gt_texts’.

    • metainfo (dict): ‘metainfo’ is always populated. The contents of ‘metainfo’ depend on meta_keys. By default it includes:

      • “img_path”: Path to the image file.

      • “img_shape”: Shape of the image input to the network as a tuple (h, w). Note that the image may be zero-padded afterward on the bottom/right if the batch tensor is larger than this shape.

      • “scale_factor”: A tuple indicating the ratio of width and height of the preprocessed image to the original one.

      • “ori_shape”: Shape of the original image as a tuple (h, w).

Parameters

meta_keys (Sequence[str], optional) – Meta keys to be converted to the metainfo of TextDetDataSample. Defaults to ().

transform(results)[source]

Method to pack the input data.

Parameters

results (dict) – Result dict from the data pipeline.

Returns

  • ‘inputs’ (torch.Tensor): Data for model forwarding.

  • ‘data_samples’ (DetDataSample): The annotation info of the sample.

Return type

dict

class mmocr.datasets.transforms.PackTextDetInputs(meta_keys=('img_path', 'ori_shape', 'img_shape', 'scale_factor', 'flip', 'flip_direction'))[source]

Pack the inputs data for text detection.

The type of outputs is dict:

  • inputs: image converted to tensor, whose shape is (C, H, W).

  • data_samples: Two components of TextDetDataSample will be updated:

    • gt_instances (InstanceData): Depending on annotations, a subset of the following keys will be updated:

      • bboxes (torch.Tensor((N, 4), dtype=torch.float32)): The groundtruth of bounding boxes in the form of [x1, y1, x2, y2]. Renamed from ‘gt_bboxes’.

      • labels (torch.LongTensor(N)): The labels of instances. Renamed from ‘gt_bboxes_labels’.

      • polygons (list[np.array((2k,), dtype=np.float32)]): The groundtruth of polygons in the form of [x1, y1, …, xk, yk]. Each element in polygons may have a different number of points. Renamed from ‘gt_polygons’. Polygons are kept as numpy arrays rather than tensors because they are usually not model outputs and are processed on the CPU.

      • ignored (torch.BoolTensor((N,))): The flag indicating whether the corresponding instance should be ignored. Renamed from ‘gt_ignored’.

      • texts (list[str]): The groundtruth texts. Renamed from ‘gt_texts’.

    • metainfo (dict): ‘metainfo’ is always populated. The contents of ‘metainfo’ depend on meta_keys. By default it includes:

      • “img_path”: Path to the image file.

      • “img_shape”: Shape of the image input to the network as a tuple (h, w). Note that the image may be zero-padded afterward on the bottom/right if the batch tensor is larger than this shape.

      • “scale_factor”: A tuple indicating the ratio of width and height of the preprocessed image to the original one.

      • “ori_shape”: Shape of the original image as a tuple (h, w).

      • “pad_shape”: Image shape after padding (if any Pad-related transform involved) as a tuple (h, w).

      • “flip”: A boolean indicating if the image has been flipped.

      • “flip_direction”: The flipping direction.

Parameters

meta_keys (Sequence[str], optional) – Meta keys to be converted to the metainfo of TextDetDataSample. Defaults to ('img_path', 'ori_shape', 'img_shape', 'scale_factor', 'flip', 'flip_direction').

transform(results)[source]

Method to pack the input data.

Parameters

results (dict) – Result dict from the data pipeline.

Returns

  • ‘inputs’ (torch.Tensor): Data for model forwarding.

  • ‘data_samples’ (DetDataSample): The annotation info of the sample.

Return type

dict

class mmocr.datasets.transforms.PackTextRecogInputs(meta_keys=('img_path', 'ori_shape', 'img_shape', 'pad_shape', 'valid_ratio'))[source]

Pack the inputs data for text recognition.

The type of outputs is dict:

  • inputs: Image as a tensor, whose shape is (C, H, W).

  • data_samples: Two components of TextRecogDataSample will be updated:

    • gt_text (LabelData):

      • item (str): The groundtruth of text. Renamed from ‘gt_texts’.

    • metainfo (dict): ‘metainfo’ is always populated. The contents of ‘metainfo’ depend on meta_keys. By default it includes:

      • “img_path”: Path to the image file.

      • “ori_shape”: Shape of the original image as a tuple (h, w).

      • “img_shape”: Shape of the image input to the network as a tuple (h, w). Note that the image may be zero-padded afterward on the bottom/right if the batch tensor is larger than this shape.

      • “valid_ratio”: The proportion of valid (unpadded) content of image on the x-axis. It defaults to 1 if not set in pipeline.

Parameters

meta_keys (Sequence[str], optional) – Meta keys to be converted to the metainfo of TextRecogDataSample. Defaults to ('img_path', 'ori_shape', 'img_shape', 'pad_shape', 'valid_ratio').

transform(results)[source]

Method to pack the input data.

Parameters

results (dict) – Result dict from the data pipeline.

Returns

  • ‘inputs’ (torch.Tensor): Data for model forwarding.

  • ‘data_samples’ (TextRecogDataSample): The annotation info of the sample.

Return type

dict
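The valid_ratio meta key above is simply the unpadded fraction of the image width, typically produced by a padding transform such as PadToWidth; conceptually:

```python
def compute_valid_ratio(unpadded_width, padded_width):
    # Proportion of valid (unpadded) content along the x-axis, capped at 1.
    return min(1.0, unpadded_width / padded_width)

ratio = compute_valid_ratio(100, 160)
# → 0.625
```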

class mmocr.datasets.transforms.PadToWidth(width, pad_cfg={'type': 'Pad'})[source]

Only pad the image’s width.

Required Keys:

  • img

Modified Keys:

  • img

  • img_shape

Added Keys:

  • pad_shape

  • pad_fixed_size

  • pad_size_divisor

  • valid_ratio

Parameters

  • width (int) – Target width of the padded image.

  • pad_cfg (dict) – Config to construct the Pad transform. Refer to Pad for details. Defaults to dict(type='Pad').

Return type

None

transform(results)[source]

Call function to pad images.

Parameters

results (dict) – Result dict from the loading pipeline.

Returns

Updated result dict.

Return type

dict
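A numpy sketch of the padding behavior, assuming an (H, W, C) image and zero padding on the right (the helper name is illustrative; the real transform delegates to the configured Pad transform):

```python
import numpy as np

def pad_to_width(img, width, pad_val=0):
    """Sketch: right-pad an (H, W, C) image to the target width."""
    h, w = img.shape[:2]
    if w >= width:
        return img, 1.0
    padded = np.full((h, width) + img.shape[2:], pad_val, dtype=img.dtype)
    padded[:, :w] = img
    valid_ratio = w / width   # unpadded proportion along the x-axis
    return padded, valid_ratio

img = np.ones((32, 100, 3), dtype=np.uint8)
padded, valid_ratio = pad_to_width(img, 128)
# padded.shape → (32, 128, 3); valid_ratio → 0.78125
```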

class mmocr.datasets.transforms.PyramidRescale(factor=4, base_shape=(128, 512), randomize_factor=True)[source]

Resize the image to the base shape, downsample it with a Gaussian pyramid, and rescale it back to the original size.

Adapted from https://github.com/FangShancheng/ABINet.

Required Keys:

  • img (ndarray)

Modified Keys:

  • img (ndarray)

Parameters

  • factor (int) – The decay factor from the base size, i.e. the number of downsampling operations from the base layer. Defaults to 4.

  • base_shape (tuple[int, int]) – The shape (width, height) of the base layer of the pyramid. Defaults to (128, 512).

  • randomize_factor (bool) – If True, the final factor is a random integer in [0, factor]. Defaults to True.

Return type

None

transform(results)[source]

Applying pyramid rescale on results.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.RandomCrop(min_side_ratio=0.4)[source]

Randomly crop images and make sure to contain at least one intact instance.

Required Keys:

  • img

  • gt_polygons

  • gt_bboxes

  • gt_bboxes_labels

  • gt_ignored

  • gt_texts (optional)

Modified Keys:

  • img

  • img_shape

  • gt_polygons

  • gt_bboxes

  • gt_bboxes_labels

  • gt_ignored

  • gt_texts (optional)

Parameters

min_side_ratio (float) – The ratio of the shortest edge of the cropped image to the original image size. Defaults to 0.4.

Return type

None

transform(results)[source]

Applying random crop on results.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.RandomFlip(prob=None, direction='horizontal')[source]

Flip the image & bbox polygon.

There are 3 flip modes:

  • prob is a float, direction is a string: the image will be flipped in direction with probability prob. E.g., with prob=0.5 and direction='horizontal', the image will be horizontally flipped with probability 0.5.

  • prob is a float, direction is a list of strings: the image will be flipped in direction[i] with probability prob/len(direction). E.g., with prob=0.5 and direction=['horizontal', 'vertical'], the image will be horizontally flipped with probability 0.25 and vertically flipped with probability 0.25.

  • prob is a list of floats, direction is a list of strings: given len(prob) == len(direction), the image will be flipped in direction[i] with probability prob[i]. E.g., with prob=[0.3, 0.5] and direction=['horizontal', 'vertical'], the image will be horizontally flipped with probability 0.3 and vertically flipped with probability 0.5.

Required Keys:
  • img

  • gt_bboxes (optional)

  • gt_polygons (optional)

Modified Keys:
  • img

  • gt_bboxes (optional)

  • gt_polygons (optional)

Added Keys:
  • flip

  • flip_direction

Parameters

  • prob (float | list[float], optional) – The flipping probability. Defaults to None.

  • direction (str | list[str]) – The flipping direction. Options are ‘horizontal’, ‘vertical’ and ‘diagonal’. If the input is a list, its length must equal that of prob; each element of prob indicates the flip probability of the corresponding direction. Defaults to ‘horizontal’.

Return type

None

flip_polygons(polygons, img_shape, direction)[source]

Flip polygons horizontally, vertically or diagonally.

Parameters

  • polygons (list[numpy.ndarray]) – Polygons.

  • img_shape (tuple[int]) – Image shape (height, width).

  • direction (str) – Flip direction. Options are ‘horizontal’, ‘vertical’ and ‘diagonal’.

Returns

Flipped polygons.

Return type

list[numpy.ndarray]
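The coordinate arithmetic behind polygon flipping is straightforward mirroring; a numpy sketch under the assumption that polygons are flat [x1, y1, x2, y2, ...] arrays:

```python
import numpy as np

def flip_polygon(polygon, img_shape, direction):
    """Sketch of flipping a flat [x1, y1, ...] polygon within an image."""
    h, w = img_shape
    poly = polygon.copy().astype(np.float32)
    if direction in ('horizontal', 'diagonal'):
        poly[0::2] = w - poly[0::2]   # mirror x coordinates
    if direction in ('vertical', 'diagonal'):
        poly[1::2] = h - poly[1::2]   # mirror y coordinates
    return poly

poly = np.array([10, 20, 30, 20, 30, 40], dtype=np.float32)
flipped = flip_polygon(poly, (100, 200), 'horizontal')
# → [190., 20., 170., 20., 170., 40.]
```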

class mmocr.datasets.transforms.RandomRotate(max_angle=10, pad_with_fixed_color=False, pad_value=(0, 0, 0), use_canvas=False)[source]

Randomly rotate the image, boxes, and polygons. For the recognition task, only the image will be rotated. If use_canvas is set to True, the shape of the rotated image may be modified based on the rotation angle; otherwise, the image keeps its shape before rotation.

Required Keys:

  • img

  • img_shape

  • gt_bboxes (optional)

  • gt_polygons (optional)

Modified Keys:

  • img

  • img_shape (optional)

  • gt_bboxes (optional)

  • gt_polygons (optional)

Added Keys:

  • rotated_angle

Parameters

  • max_angle (int) – The maximum rotation angle (can be larger than 180 or negative). Defaults to 10.

  • pad_with_fixed_color (bool) – Whether to pad the rotated image with a fixed color. Defaults to False.

  • pad_value (tuple[int, int, int]) – The color value for padding the rotated image. Defaults to (0, 0, 0).

  • use_canvas (bool) – Whether to create a canvas for the rotated image. If set to True, the image shape may be modified. Defaults to False.

Return type

None

transform(results)[source]

Applying random rotate on results.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.RescaleToHeight(height, min_width=None, max_width=None, width_divisor=1, resize_type='Resize', **resize_kwargs)[source]

Rescale the image to the given height and keep the aspect ratio unchanged if possible. However, if any of min_width, max_width or width_divisor are specified, the aspect ratio may still be changed to ensure the width meets these constraints.

Required Keys:

  • img

Modified Keys:

  • img

  • img_shape

Added Keys:

  • scale

  • scale_factor

  • keep_ratio

Parameters

  • height (int) – Height of the rescaled image.

  • min_width (int, optional) – Minimum width of the rescaled image. Defaults to None.

  • max_width (int, optional) – Maximum width of the rescaled image. Defaults to None.

  • width_divisor (int) – The divisor of the width. Defaults to 1.

  • resize_type (str) – The type of resize class to use. Defaults to “Resize”.

  • **resize_kwargs – Other keyword arguments for the resize_type.

Return type

None

transform(results)[source]

Transform function to resize images, bounding boxes and polygons.

Parameters

results (dict) – Result dict from the loading pipeline.

Returns

Resized results.

Return type

dict
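The target width computation can be sketched as follows (an illustration of the documented constraints, not the exact implementation; in particular, how rounding interacts with width_divisor may differ):

```python
def rescale_to_height(h, w, height, min_width=None, max_width=None,
                      width_divisor=1):
    # Keep the aspect ratio, then apply the width constraints in order.
    new_w = round(w * height / h)
    if min_width is not None:
        new_w = max(new_w, min_width)
    if max_width is not None:
        new_w = min(new_w, max_width)
    # Round to a multiple of width_divisor.
    new_w = round(new_w / width_divisor) * width_divisor
    return height, new_w

scale = rescale_to_height(64, 256, height=32, max_width=100)
# → (32, 100)
```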

class mmocr.datasets.transforms.Resize(scale=None, scale_factor=None, keep_ratio=False, clip_object_border=True, backend='cv2', interpolation='bilinear')[source]

Resize image & bboxes & polygons.

This transform resizes the input image according to scale or scale_factor. Bboxes and polygons are then resized with the same scale factor. If scale and scale_factor are both set, scale is used for resizing.

Required Keys:

  • img

  • img_shape

  • gt_bboxes

  • gt_polygons

Modified Keys:

  • img

  • img_shape

  • gt_bboxes

  • gt_polygons

Added Keys:

  • scale

  • scale_factor

  • keep_ratio

Parameters

  • scale (int or tuple) – Image scales for resizing. Defaults to None.

  • scale_factor (float or tuple[float, float]) – Scale factors for resizing. It’s either a factor applicable to both dimensions or in the form of (scale_w, scale_h). Defaults to None.

  • keep_ratio (bool) – Whether to keep the aspect ratio when resizing the image. Defaults to False.

  • clip_object_border (bool) – Whether to clip objects outside the border of the image. Defaults to True.

  • backend (str) – Image resize backend; choices are ‘cv2’ and ‘pillow’. These two backends generate slightly different results. Defaults to ‘cv2’.

  • interpolation (str) – Interpolation method. Accepted values are “nearest”, “bilinear”, “bicubic”, “area” and “lanczos” for the ‘cv2’ backend, and “nearest” and “bilinear” for the ‘pillow’ backend. Defaults to ‘bilinear’.

Return type

None

transform(results)[source]

Transform function to resize images, bounding boxes and polygons.

Parameters

results (dict) – Result dict from the loading pipeline.

Returns

Resized results. The ‘img’, ‘gt_bboxes’, ‘gt_polygons’, ‘scale’, ‘scale_factor’, ‘height’, ‘width’ and ‘keep_ratio’ keys are updated in the result dict.

Return type

dict

class mmocr.datasets.transforms.ShortScaleAspectJitter(short_size=736, ratio_range=(0.7, 1.3), aspect_ratio_range=(0.9, 1.1), scale_divisor=1, resize_type='Resize', **resize_kwargs)[source]

First rescale the image so that its shorter side reaches short_size, then jitter its aspect ratio, and finally rescale it so that its shape is divisible by scale_divisor.

Required Keys:

  • img

  • img_shape

  • gt_bboxes (optional)

  • gt_polygons (optional)

Modified Keys:

  • img

  • img_shape

  • gt_bboxes (optional)

  • gt_polygons (optional)

Added Keys:

  • scale

  • scale_factor

  • keep_ratio

Parameters

  • short_size (int) – Target shorter size before jittering the aspect ratio. Defaults to 736.

  • ratio_range (tuple(float, float)) – Range of the ratio used to jitter the target shorter size. Defaults to (0.7, 1.3).

  • aspect_ratio_range (tuple(float, float)) – Range of the ratio used to jitter the aspect ratio. Defaults to (0.9, 1.1).

  • scale_divisor (int) – The scale divisor. Defaults to 1.

  • resize_type (str) – The type of resize class to use. Defaults to “Resize”.

  • **resize_kwargs – Other keyword arguments for the resize_type.

Return type

None

transform(results)[source]

Short scale aspect jitter.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.SourceImagePad(target_scale, crop_ratio=0.1111111111111111)[source]

Pad the image to the target size. It will randomly crop an area from the original image, resize it to the target size, and then paste the original image onto its top-left corner.

Required Keys:

  • img

Modified Keys:

  • img

  • img_shape

Added Keys:

  • pad_shape

  • pad_fixed_size

Parameters

  • target_scale (int or tuple[int, int]) – The target size of the padded image. If it’s an integer, the padding size will be (target_scale, target_scale). If it’s a tuple, target_scale[0] is the width and target_scale[1] is the height. The size of the padded image will be (target_scale[1], target_scale[0]).

  • crop_ratio (float or tuple[float, float]) – Relative size of the crop region. If crop_ratio is a float, the initial crop size is (crop_ratio * img.shape[0], crop_ratio * img.shape[1]). If crop_ratio is a tuple, crop_ratio[0] is for the width and crop_ratio[1] is for the height, and the initial crop size is (crop_ratio[1] * img.shape[0], crop_ratio[0] * img.shape[1]). Defaults to 1./9.

Return type

None

transform(results)[source]

Pad the image to the target size. It will randomly select a small area from the original image, resize it to the target size, and then paste the original image onto its top-left corner.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict
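A numpy sketch of the padding layout: the original image ends up at the top-left of a target-size canvas. The real transform fills the canvas by resizing a random crop of the image; here a constant fill stands in for that background, so the crop_ratio logic is omitted:

```python
import numpy as np

def source_image_pad(img, target_scale):
    h, w = img.shape[:2]
    tw, th = (target_scale, target_scale) if isinstance(target_scale, int) \
        else target_scale
    # Constant background standing in for the resized random crop.
    canvas = np.full((th, tw) + img.shape[2:], int(img.mean()),
                     dtype=img.dtype)
    canvas[:h, :w] = img  # paste the original image at the top-left corner
    return canvas

img = np.zeros((50, 80, 3), dtype=np.uint8)
padded = source_image_pad(img, (160, 100))
# padded.shape → (100, 160, 3)
```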

class mmocr.datasets.transforms.TextDetRandomCrop(target_size, positive_sample_ratio=0.625)[source]

Randomly select a region and crop the image to a target size, making sure it contains at least one text region. This transform may break up text instances; for broken instances, their bbox and polygon coordinates are cropped accordingly. This transform is recommended for segmentation-based networks.

Required Keys:

  • img

  • gt_polygons

  • gt_bboxes

  • gt_bboxes_labels

  • gt_ignored

Modified Keys:

  • img

  • img_shape

  • gt_polygons

  • gt_bboxes

  • gt_bboxes_labels

  • gt_ignored

Parameters

  • target_size (tuple(int, int) or int) – Target size of the cropped image. If it’s a tuple, the target width and height are target_size[0] and target_size[1], respectively. If it’s an integer, both the target width and height are target_size.

  • positive_sample_ratio (float) – The probability of sampling regions that go through text regions. Defaults to 5. / 8.

Return type

None

transform(results)[source]

Applying random crop on results.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.TextDetRandomCropFlip(pad_ratio=0.1, crop_ratio=0.5, iter_num=1, min_area_ratio=0.2, epsilon=0.01)[source]

Randomly crop and flip a patch in the image. Only used in the text detection task.

Required Keys:

  • img

  • gt_bboxes

  • gt_polygons

Modified Keys:

  • img

  • gt_bboxes

  • gt_polygons

Parameters

  • pad_ratio (float) – The ratio of padding. Defaults to 0.1.

  • crop_ratio (float) – The ratio of cropping. Defaults to 0.5.

  • iter_num (int) – Number of operations. Defaults to 1.

  • min_area_ratio (float) – Minimal area ratio between the cropped patch and the original image. Defaults to 0.2.

  • epsilon (float) – The threshold of polygon IoU between the cropped area and a polygon, used to avoid cropping text instances. Defaults to 0.01.

Return type

None

transform(results)[source]

Applying random crop flip on results.

Parameters

results (dict) – Result dict containing the data to transform.

Returns

The transformed data.

Return type

dict

class mmocr.datasets.transforms.TorchVisionWrapper(op, **kwargs)[source]

A wrapper around torchvision transforms. It applies a specific transform to img and updates the image height and width accordingly.

Required Keys:

  • img (ndarray): The input image.

Modified Keys:

  • img (ndarray): The modified image.

  • img_shape (tuple(int, int)): The shape of the image in (height, width).

Warning

This transform only affects the image but not its associated annotations, such as word bounding boxes and polygons. Therefore, it may only be applicable to text recognition tasks.

Parameters
  • op (str) – The name of any transform class in torchvision.transforms.

  • **kwargs – Arguments that will be passed to initializer of torchvision transform.

Return type

None

transform(results)[源代码]

Transform the image.

Parameters

results (dict) – Result dict from the data loader.

Returns

Transformed results.

Return type

dict

mmocr.engine

Hooks

class mmocr.engine.hooks.VisualizationHook(enable=False, interval=50, score_thr=0.3, show=False, draw_pred=False, draw_gt=False, wait_time=0.0, file_client_args={'backend': 'disk'})[源代码]

Detection Visualization Hook. Used to visualize validation and testing process prediction results.

Parameters
  • enable (bool) – Whether to enable this hook. Defaults to False.

  • interval (int) – The interval of visualization. Defaults to 50.

  • score_thr (float) – The threshold to visualize the bboxes and masks. It’s only useful for text detection. Defaults to 0.3.

  • show (bool) – Whether to display the drawn image. Defaults to False.

  • wait_time (float) – The interval of show in seconds. Defaults to 0.

  • file_client_args (dict) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to dict(backend='disk').

  • draw_pred (bool) – Whether to draw the predicted results. Defaults to False.

  • draw_gt (bool) – Whether to draw the ground truths. Defaults to False.

Return type

None

after_test_iter(runner, batch_idx, data_batch, outputs)[源代码]

Run after every testing iteration.

Parameters

outputs (Sequence[TextDetDataSample or TextRecogDataSample]) – Outputs from the model.

Return type

None

after_val_iter(runner, batch_idx, data_batch, outputs)[源代码]

Run after every self.interval validation iterations.

Parameters

outputs (Sequence[TextDetDataSample or TextRecogDataSample]) – Outputs from the model.

Return type

None

mmocr.evaluation

Evaluator

class mmocr.evaluation.evaluator.MultiDatasetsEvaluator(metrics, dataset_prefixes)[源代码]

A wrapper class to compose a ConcatDataset and multiple BaseMetric instances. The metrics will be evaluated on each dataset slice separately. The name of each metric is the concatenation of the dataset prefix, the metric prefix and the key of the metric, e.g. dataset_prefix/metric_prefix/accuracy.

Parameters
  • metrics (dict or BaseMetric or Sequence) – The config of metrics.

  • dataset_prefixes (Sequence[str]) – The prefix of each dataset. The length of this sequence should be the same as the length of the datasets.

Return type

None

evaluate(size)[源代码]

Invoke evaluate method of each metric and collect the metrics dictionary.

Parameters

size (int) – Length of the entire validation dataset. When batch size > 1, the dataloader may pad some data samples to make sure all ranks have the same length of dataset slice. The collect_results function will drop the padded data based on this size.

Returns

Evaluation results of all metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type

dict

Functional

mmocr.evaluation.functional.compute_hmean(accum_hit_recall, accum_hit_prec, gt_num, pred_num)[源代码]

Compute hmean given hit number, ground truth number and prediction number.

Parameters
  • accum_hit_recall (int|float) – Accumulated hits for computing recall.

  • accum_hit_prec (int|float) – Accumulated hits for computing precision.

  • gt_num (int) – Ground truth number.

  • pred_num (int) – Prediction number.

Returns

A tuple of three values:

  • recall (float): The recall value.

  • precision (float): The precision value.

  • hmean (float): The hmean value.

Return type

tuple(float, float, float)
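To make the relationship between the three returned values concrete, here is a hypothetical pure-Python sketch of the arithmetic (illustrative only, not the actual mmocr implementation):

```python
# Hypothetical sketch of the hmean arithmetic documented above;
# not the actual mmocr.evaluation.functional.compute_hmean source.
def compute_hmean(accum_hit_recall, accum_hit_prec, gt_num, pred_num):
    recall = 0.0 if gt_num == 0 else accum_hit_recall / gt_num
    precision = 0.0 if pred_num == 0 else accum_hit_prec / pred_num
    denom = recall + precision
    # Harmonic mean of recall and precision.
    hmean = 0.0 if denom == 0 else 2.0 * recall * precision / denom
    return recall, precision, hmean

recall, precision, hmean = compute_hmean(8, 8, 10, 12)
```

With 8 hits over 10 ground truths and 12 predictions this yields recall 0.8, precision 2/3 and hmean 8/11.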

Metric

class mmocr.evaluation.metrics.CharMetric(valid_symbol='[^A-Z^a-z^0-9^一-龥]', collect_device='cpu', prefix=None)[源代码]

Character metrics for text recognition task.

Parameters
  • valid_symbol (str) – Valid characters. Defaults to ‘[^A-Z^a-z^0-9^一-龥]’

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

Return type

None

compute_metrics(results)[源代码]

Compute the metrics from processed results.

Parameters

results (list[Dict]) – The processed results of each batch.

Returns

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type

Dict

process(data_batch, data_samples)[源代码]

Process one batch of data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters
  • data_batch (Sequence[Dict]) – A batch of gts.

  • data_samples (Sequence[Dict]) – A batch of outputs from the model.

Return type

None

class mmocr.evaluation.metrics.F1Metric(num_classes, key='labels', mode='micro', cared_classes=[], ignored_classes=[], collect_device='cpu', prefix=None)[源代码]

Compute F1 scores.

Parameters
  • num_classes (int) – Number of labels.

  • key (str) – The key name of the predicted and ground truth labels. Defaults to ‘labels’.

  • mode (str or list[str]) –

    Options are:

    • ‘micro’: Calculate metrics globally by counting the total true positives, false negatives and false positives.

    • ‘macro’: Calculate metrics for each label, and find their unweighted mean.

    If mode is a list, then metrics in mode will be calculated separately. Defaults to ‘micro’.

  • cared_classes (list[int]) – The indices of the labels participating in the metric computation. If both cared_classes and ignored_classes are empty, all classes will be taken into account. Defaults to []. Note: cared_classes and ignored_classes cannot be specified together.

  • ignored_classes (list[int]) – The index set of labels that are ignored when computing metrics. If both cared_classes and ignored_classes are empty, all classes will be taken into account. Defaults to []. Note: cared_classes and ignored_classes cannot be specified together.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

Return type

None

Warning

Only non-negative integer labels are involved in computing. All negative ground truth labels will be ignored.
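The difference between the ‘micro’ and ‘macro’ modes can be illustrated with a small self-contained sketch (hypothetical code; f1_scores and its label-list inputs are made up for illustration and are not the metric’s actual implementation):

```python
# Hypothetical illustration of micro vs. macro F1 over integer labels.
def f1_scores(preds, gts, num_classes):
    tp = [0] * num_classes  # true positives per class
    fp = [0] * num_classes  # false positives per class
    fn = [0] * num_classes  # false negatives per class
    for p, g in zip(preds, gts):
        if p == g:
            tp[p] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(t, f, n):
        denom = 2 * t + f + n
        return 0.0 if denom == 0 else 2 * t / denom

    # 'micro': pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    # 'macro': compute per-class F1, then take the unweighted mean.
    macro = sum(f1(*c) for c in zip(tp, fp, fn)) / num_classes
    return {'micro_f1': micro, 'macro_f1': macro}

scores = f1_scores(preds=[0, 1, 1, 2], gts=[0, 1, 2, 2], num_classes=3)
```

Here the micro score pools the one mistake into global counts, while the macro score averages the three per-class F1 values, so the two generally differ under class imbalance.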

compute_metrics(results)[源代码]

Compute the metrics from processed results.

Parameters

results (list[Dict]) – The processed results of each batch.

Returns

The f1 scores. The keys are the names of the metrics, and the values are corresponding results. Possible keys are ‘micro_f1’ and ‘macro_f1’.

Return type

dict[str, float]

process(data_batch, data_samples)[源代码]

Process one batch of data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters
  • data_batch (Sequence[Dict]) – A batch of gts.

  • data_samples (Sequence[Dict]) – A batch of outputs from the model.

Return type

None

class mmocr.evaluation.metrics.HmeanIOUMetric(match_iou_thr=0.5, ignore_precision_thr=0.5, pred_score_thrs={'start': 0.3, 'step': 0.1, 'stop': 0.9}, strategy='vanilla', collect_device='cpu', prefix=None)[源代码]

HmeanIOU metric.

This method computes the hmean iou metric, which is done in the following steps:

  • Filter the prediction polygons if:

    • their scores are smaller than the minimum prediction score threshold, or

    • the proportion of their area intersecting with gt ignored polygons is greater than ignore_precision_thr.

  • Compute an M x N IoU matrix, where each element E_mn represents the IoU between the m-th valid GT and the n-th valid prediction.

  • For each prediction score threshold:

    • Obtain the ignored predictions according to the prediction score. The filtered predictions will not be involved in the later metric computations.

    • Based on the IoU matrix, get the matches according to match_iou_thr.

    • Based on the chosen strategy, accumulate the match number.

  • Calculate the h-mean under each prediction score threshold.

Parameters
  • match_iou_thr (float) – IoU threshold for a match. Defaults to 0.5.

  • ignore_precision_thr (float) – Precision threshold when prediction and gt ignored polygons are matched. Defaults to 0.5.

  • pred_score_thrs (dict) – Best prediction score threshold searching space. Defaults to dict(start=0.3, stop=0.9, step=0.1).

  • strategy (str) – Polygon matching strategy. Options are ‘max_matching’ and ‘vanilla’. ‘max_matching’ refers to the optimum strategy that maximizes the number of matches. The vanilla strategy matches gt and pred polygons if both of them have never been matched before. It was used in MMOCR 0.x and is widely used in academia. Defaults to ‘vanilla’.

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None

Return type

None

compute_metrics(results)[源代码]

Compute the metrics from processed results.

Parameters

results (list[dict]) – The processed results of each batch.

Returns

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type

dict

process(data_batch, data_samples)[源代码]

Process one batch of data samples and predictions. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters
  • data_batch (Sequence[Dict]) – A batch of data from dataloader.

  • data_samples (Sequence[Dict]) – A batch of outputs from the model.

Return type

None

class mmocr.evaluation.metrics.OneMinusNEDMetric(valid_symbol='[^A-Z^a-z^0-9^一-龥]', collect_device='cpu', prefix=None)[源代码]

One minus NED metric for text recognition task.

Parameters
  • valid_symbol (str) – Valid characters. Defaults to ‘[^A-Z^a-z^0-9^一-龥]’

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None

Return type

None

compute_metrics(results)[源代码]

Compute the metrics from processed results.

Parameters

results (list[Dict]) – The processed results of each batch.

Returns

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type

Dict

process(data_batch, data_samples)[源代码]

Process one batch of data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters
  • data_batch (Sequence[Dict]) – A batch of gts.

  • data_samples (Sequence[Dict]) – A batch of outputs from the model.

Return type

None

class mmocr.evaluation.metrics.WordMetric(mode='ignore_case_symbol', valid_symbol='[^A-Z^a-z^0-9^一-龥]', collect_device='cpu', prefix=None)[源代码]

Word metrics for text recognition task.

Parameters
  • mode (str or list[str]) –

    Options are:

    • ‘exact’: Accuracy at word level.

    • ‘ignore_case’: Accuracy at word level, ignoring letter case.

    • ‘ignore_case_symbol’: Accuracy at word level, ignoring letter case and symbols. (Default metric for academic evaluation)

    If mode is a list, then metrics in mode will be calculated separately. Defaults to ‘ignore_case_symbol’.

  • valid_symbol (str) – Valid characters. Defaults to ‘[^A-Z^a-z^0-9^一-龥]’

  • collect_device (str) – Device name used for collecting results from different ranks during distributed training. Must be ‘cpu’ or ‘gpu’. Defaults to ‘cpu’.

  • prefix (str, optional) – The prefix that will be added in the metric names to disambiguate homonymous metrics of different evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults to None.

Return type

None

compute_metrics(results)[源代码]

Compute the metrics from processed results.

Parameters

results (list[Dict]) – The processed results of each batch.

Returns

The computed metrics. The keys are the names of the metrics, and the values are corresponding results.

Return type

Dict

process(data_batch, data_samples)[源代码]

Process one batch of data_samples. The processed results should be stored in self.results, which will be used to compute the metrics when all batches have been processed.

Parameters
  • data_batch (Sequence[Dict]) – A batch of gts.

  • data_samples (Sequence[Dict]) – A batch of outputs from the model.

Return type

None

mmocr.utils

Point utils

mmocr.utils.point_utils.point_distance(pt1, pt2)[源代码]

Calculate the distance between two points.

Parameters
  • pt1 (ArrayLike) – The first point.

  • pt2 (ArrayLike) – The second point.

Returns

The distance between two points.

Return type

float

mmocr.utils.point_utils.points_center(points)[源代码]

Calculate the center of a set of points.

Parameters

points (ArrayLike) – A set of points.

Returns

The coordinate of the center point.

Return type

np.ndarray
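Both helpers are simple numpy computations; a hypothetical sketch equivalent to the documented behavior might look like this (not the actual mmocr source):

```python
import numpy as np

def point_distance(pt1, pt2):
    # Euclidean distance between two points.
    return float(np.linalg.norm(
        np.asarray(pt1, dtype=float) - np.asarray(pt2, dtype=float)))

def points_center(points):
    # Mean of the (N, 2) point array along the first axis.
    return np.asarray(points, dtype=float).reshape(-1, 2).mean(axis=0)

d = point_distance([0, 0], [3, 4])
c = points_center([[0, 0], [2, 4]])
```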

Bbox utils

mmocr.utils.bbox_utils.bbox2poly(bbox)[源代码]

Converting a bounding box to a polygon.

Parameters

bbox (ArrayLike) – A bbox in any form that can be accessed by 1-D indices, e.g. list[float], np.ndarray, or torch.Tensor. The bbox is written as [x1, y1, x2, y2].

Returns

The converted polygon [x1, y1, x2, y1, x2, y2, x1, y2].

Return type

np.array
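The conversion simply enumerates the four corners clockwise from the top-left; a hypothetical one-function sketch (not the mmocr source, which returns a numpy array):

```python
def bbox2poly(bbox):
    # [x1, y1, x2, y2] -> [x1, y1, x2, y1, x2, y2, x1, y2]
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2, y1, x2, y2, x1, y2]

poly = bbox2poly([0, 0, 2, 3])
```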

mmocr.utils.bbox_utils.bbox_center_distance(box1, box2)[源代码]

Calculate the distance between the center points of two bounding boxes.

Parameters
  • box1 (ArrayLike) – The first bounding box represented in [x1, y1, x2, y2].

  • box2 (ArrayLike) – The second bounding box represented in [x1, y1, x2, y2].

Returns

The distance between the center points of two bounding boxes.

Return type

float

mmocr.utils.bbox_utils.bbox_diag_distance(box)[源代码]

Calculate the diagonal length of a bounding box (distance between the top-left and bottom-right).

Parameters

box (ArrayLike) – The bounding box represented in [x1, y1, x2, y2, x3, y3, x4, y4] or [x1, y1, x2, y2].

Returns

The diagonal length of the bounding box.

Return type

float

mmocr.utils.bbox_utils.bbox_jitter(points_x, points_y, jitter_ratio_x=0.5, jitter_ratio_y=0.1)[源代码]

Jitter on the coordinates of bounding box.

Parameters
  • points_x (list[float | int]) – List of x for four vertices.

  • points_y (list[float | int]) – List of y for four vertices.

  • jitter_ratio_x (float) – Horizontal jitter ratio relative to the height.

  • jitter_ratio_y (float) – Vertical jitter ratio relative to the height.

mmocr.utils.bbox_utils.bezier2polygon(bezier_points, num_sample=20)[源代码]

Sample points from the boundary of a polygon enclosed by two Bezier curves, which are controlled by bezier_points.

Parameters
  • bezier_points (ndarray) – A \((2, 4, 2)\) array of 8 Bezier points or its equivalent. The first 4 points control the curve at one side and the last four control the other side.

  • num_sample (int) – The number of sample points on each Bezier curve. Defaults to 20.

Returns

A list of 2 * num_sample points representing the polygon extracted from Bezier curves.

Return type

list[ndarray]

Warning

The points are not guaranteed to be ordered. Please use mmocr.utils.sort_points() to sort points if necessary.
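Each side of the text instance is a cubic Bezier curve defined by 4 control points. A hedged sketch of sampling one such curve with the standard Bernstein form (illustrative only; the real function samples both curves and stitches the two point lists into a polygon, and `sample_cubic_bezier` is a made-up helper name):

```python
import numpy as np

def sample_cubic_bezier(ctrl, num_sample=20):
    """Evaluate a cubic Bezier curve at num_sample evenly spaced t values.

    ctrl: a (4, 2) array of control points.
    """
    ctrl = np.asarray(ctrl, dtype=float)
    t = np.linspace(0, 1, num_sample)[:, None]
    # Standard Bernstein-polynomial form of a cubic Bezier curve.
    return ((1 - t) ** 3 * ctrl[0]
            + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2]
            + t ** 3 * ctrl[3])

pts = sample_cubic_bezier([[0, 0], [1, 1], [2, 1], [3, 0]], num_sample=5)
```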

mmocr.utils.bbox_utils.is_on_same_line(box_a, box_b, min_y_overlap_ratio=0.8)[源代码]

Check if two boxes are on the same line by their y-axis coordinates.

Two boxes are on the same line if they overlap vertically, and the length of the overlapping line segment is greater than min_y_overlap_ratio * the height of either of the boxes.

Parameters
  • box_a (list) – The first bounding box to be checked.

  • box_b (list) – The second bounding box to be checked.

  • min_y_overlap_ratio (float) – The minimum vertical overlapping ratio allowed for boxes in the same line.

Returns

The bool flag indicating if they are on the same line.
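For axis-aligned boxes given as [x1, y1, x2, y2], the vertical-overlap rule reads roughly as follows. This is a simplified sketch under that assumption; the actual function accepts more general box formats:

```python
def is_on_same_line(box_a, box_b, min_y_overlap_ratio=0.8):
    # Boxes are assumed axis-aligned: [x1, y1, x2, y2].
    a_top, a_bottom = box_a[1], box_a[3]
    b_top, b_bottom = box_b[1], box_b[3]
    overlap = min(a_bottom, b_bottom) - max(a_top, b_top)
    if overlap <= 0:
        return False  # no vertical overlap at all
    # The overlap must cover enough of either box's height.
    return (overlap >= min_y_overlap_ratio * (a_bottom - a_top)
            or overlap >= min_y_overlap_ratio * (b_bottom - b_top))
```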

mmocr.utils.bbox_utils.rescale_bbox(bbox, scale_factor, mode='mul')[源代码]

Rescale a bounding box according to scale_factor.

The behavior is different depending on the mode. When mode is ‘mul’, the coordinates will be multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). The coordinates will be divided by scale_factor if mode is ‘div’. It can be used in postprocessors to recover the bbox in the original image size.

Parameters
  • bbox (ndarray) – A bounding box [x1, y1, x2, y2].

  • scale_factor (tuple(int, int)) – (w_scale, h_scale).

  • mode (str) – Rescale mode. Can be ‘mul’ or ‘div’. Defaults to ‘mul’.

Returns

Rescaled bbox.

Return type

np.ndarray
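The ‘mul’/‘div’ duality means a bbox rescaled in preprocessing can be recovered exactly in postprocessing. A hypothetical sketch of that behavior (not the mmocr source, which operates on numpy arrays):

```python
def rescale_bbox(bbox, scale_factor, mode='mul'):
    # scale_factor is (w_scale, h_scale); 'div' inverts it.
    w, h = scale_factor
    if mode == 'div':
        w, h = 1.0 / w, 1.0 / h
    x1, y1, x2, y2 = bbox
    return [x1 * w, y1 * h, x2 * w, y2 * h]

scaled = rescale_bbox([10, 20, 30, 40], (2, 0.5))        # preprocessing
restored = rescale_bbox(scaled, (2, 0.5), mode='div')    # postprocessing
```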

mmocr.utils.bbox_utils.rescale_bboxes(bboxes, scale_factor, mode='mul')[源代码]

Rescale bboxes according to scale_factor.

The behavior is different depending on the mode. When mode is ‘mul’, the coordinates will be multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). The coordinates will be divided by scale_factor if mode is ‘div’. It can be used in postprocessors to recover the bboxes in the original image size.

Parameters
  • bboxes (np.ndarray) – Bounding boxes in shape (N, 4).

  • scale_factor (tuple(int, int)) – (w_scale, h_scale).

  • mode (str) – Rescale mode. Can be ‘mul’ or ‘div’. Defaults to ‘mul’.

Returns

Rescaled bboxes.

Return type

list[np.ndarray]

mmocr.utils.bbox_utils.sort_points(points)[源代码]

Sort arbitrary points in clockwise order. Reference: https://stackoverflow.com/a/6989383.

Parameters

points (list[ndarray] or ndarray or list[list]) – A list of unsorted boundary points.

Returns

A list of points sorted in clockwise order.

Return type

list[ndarray]

mmocr.utils.bbox_utils.sort_vertex(points_x, points_y)[源代码]

Sort box vertices in clockwise order from left-top first.

Parameters
  • points_x (list[float]) – x of four vertices.

  • points_y (list[float]) – y of four vertices.

Returns

  • sorted_points_x (list[float]): x of sorted four vertices.

  • sorted_points_y (list[float]): y of sorted four vertices.

Return type

tuple[list[float], list[float]]

mmocr.utils.bbox_utils.sort_vertex8(points)[源代码]

Sort vertex with 8 points [x1 y1 x2 y2 x3 y3 x4 y4]

mmocr.utils.bbox_utils.stitch_boxes_into_lines(boxes, max_x_dist=10, min_y_overlap_ratio=0.8)[源代码]

Stitch fragmented boxes of words into lines.

Note: part of its logic is inspired by @Johndirr (https://github.com/faustomorales/keras-ocr/issues/22)

Parameters
  • boxes (list) – List of ocr results to be stitched

  • max_x_dist (int) – The maximum horizontal distance between the closest edges of neighboring boxes in the same line

  • min_y_overlap_ratio (float) – The minimum vertical overlapping ratio allowed for any pairs of neighboring boxes in the same line

Returns

List of merged boxes and texts.

Return type

merged_boxes (list[dict])

Polygon utils

mmocr.utils.polygon_utils.boundary_iou(src, target, zero_division=0)[源代码]

Calculate the IOU between two boundaries.

Parameters
  • src (list) – Source boundary.

  • target (list) – Target boundary.

  • zero_division (int or float) – The return value when invalid boundary exists.

Returns

The IoU between two boundaries.

Return type

float

mmocr.utils.polygon_utils.crop_polygon(polygon, crop_box)[源代码]

Crop polygon to be within a box region.

Parameters
  • polygon (ndarray) – polygon in shape (N, ).

  • crop_box (ndarray) – target box region in shape (4, ).

Returns

Cropped polygon. If the polygon is not within the crop box, return None.

Return type

np.array or None

mmocr.utils.polygon_utils.is_poly_inside_rect(poly, rect)[源代码]

Check if the polygon is inside the target region.

Parameters
  • poly (ArrayLike) – Polygon in shape (N, ).

  • rect (ndarray) – Target region [x1, y1, x2, y2].

Returns

Whether the polygon is inside the cropping region.

Return type

bool
mmocr.utils.polygon_utils.offset_polygon(poly, distance)[源代码]

Offset (expand/shrink) the polygon by the target distance. It’s a wrapper around pyclipper based on Vatti clipping algorithm.

Warning

Polygon coordinates will be cast to int type in PyClipper. Mind the potential precision loss caused by the casting.

Parameters
  • poly (ArrayLike) – A polygon in any form that can be converted to a 1-D numpy array, e.g. list[float], np.ndarray, or torch.Tensor. The polygon is written as [x1, y1, x2, y2, …].

  • distance (float) – The offset distance. Positive value means expanding, negative value means shrinking.

Returns

1-D offset polygon ndarray in float32 type. If the result polygon is invalid or has been split into several parts, return an empty array.

Return type

np.array

mmocr.utils.polygon_utils.poly2bbox(polygon)[源代码]

Converting a polygon to a bounding box.

Parameters

polygon (ArrayLike) – A polygon in any form that can be converted to a 1-D numpy array, e.g. list[float], np.ndarray, or torch.Tensor. The polygon is written as [x1, y1, x2, y2, …].

Returns

The converted bounding box [x1, y1, x2, y2].

Return type

numpy.array
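The bounding box is just the min/max envelope of the polygon’s points; a hypothetical numpy sketch (not the mmocr source):

```python
import numpy as np

def poly2bbox(polygon):
    # Reshape [x1, y1, x2, y2, ...] into (N, 2) and take the envelope.
    pts = np.asarray(polygon, dtype=np.float32).reshape(-1, 2)
    return np.array([pts[:, 0].min(), pts[:, 1].min(),
                     pts[:, 0].max(), pts[:, 1].max()])

bbox = poly2bbox([0, 0, 4, 0, 4, 3, 0, 3])
```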

mmocr.utils.polygon_utils.poly2shapely(polygon)[源代码]

Convert a polygon to shapely.geometry.Polygon.

Parameters

polygon (ArrayLike) – A polygon represented by 2k points, [x1, y1, …, xk, yk].

Returns

A polygon object.

Return type

polygon (Polygon)

mmocr.utils.polygon_utils.poly_intersection(poly_a, poly_b, invalid_ret=None, return_poly=False)[源代码]

Calculate the intersection area between two polygons.

Parameters
  • poly_a (Polygon) – Polygon a.

  • poly_b (Polygon) – Polygon b.

  • invalid_ret (float or int, optional) – The return value when an invalid polygon exists. If it is not specified, the function allows the computation to proceed with invalid polygons by cleaning their self-touching or self-crossing parts. Defaults to None.

  • return_poly (bool) – Whether to return the polygon of the intersection. Defaults to False.

Returns

Returns the intersection area or a tuple (area, Optional[poly_obj]), where area is the intersection area between two polygons and poly_obj is the Polygon object of the intersection area, set to None if the input is invalid. poly_obj will be returned only if return_poly is True.

Return type

float or tuple(float, Polygon)

mmocr.utils.polygon_utils.poly_iou(poly_a, poly_b, zero_division=0.0)[源代码]

Calculate the IOU between two polygons.

Parameters
  • poly_a (Polygon) – Polygon a.

  • poly_b (Polygon) – Polygon b.

  • zero_division (float) – The return value when invalid polygon exists.

Returns

The IoU between two polygons.

Return type

float

mmocr.utils.polygon_utils.poly_make_valid(poly)[源代码]

Convert a potentially invalid polygon to a valid one by eliminating self-crossing or self-touching parts.

Parameters

poly (Polygon) – A polygon needed to be converted.

Returns

A valid polygon.

Return type

Polygon

mmocr.utils.polygon_utils.poly_union(poly_a, poly_b, invalid_ret=None, return_poly=False)[源代码]

Calculate the union area between two polygons.

Parameters
  • poly_a (Polygon) – Polygon a.

  • poly_b (Polygon) – Polygon b.

  • invalid_ret (float or int, optional) – The return value when an invalid polygon exists. If it is not specified, the function allows the computation to proceed with invalid polygons by cleaning their self-touching or self-crossing parts. Defaults to None.

  • return_poly (bool) – Whether to return the polygon of the union. Defaults to False.

Returns

Returns a tuple (area, Optional[poly_obj]), where area is the union area between two polygons and poly_obj is the Polygon or MultiPolygon object of the union of the inputs. The type of object depends on whether they intersect or not. Set to None if the input is invalid. poly_obj will be returned only if return_poly is True.

Return type

tuple

mmocr.utils.polygon_utils.polys2shapely(polygons)[源代码]

Convert a nested list of boundaries to a list of Polygons.

Parameters

polygons (list) – The point coordinates of the instance boundary.

Returns

Converted shapely.Polygon objects.

Return type

list

mmocr.utils.polygon_utils.rescale_polygon(polygon, scale_factor, mode='mul')[源代码]

Rescale a polygon according to scale_factor.

The behavior is different depending on the mode. When mode is ‘mul’, the coordinates will be multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). The coordinates will be divided by scale_factor if mode is ‘div’. It can be used in postprocessors to recover the polygon in the original image size.

Parameters
  • polygon (ArrayLike) – A polygon in any form that can be converted to a 1-D numpy array, e.g. list[float], np.ndarray, or torch.Tensor. The polygon is written as [x1, y1, x2, y2, …].

  • scale_factor (tuple(int, int)) – (w_scale, h_scale).

  • mode (str) – Rescale mode. Can be ‘mul’ or ‘div’. Defaults to ‘mul’.

Returns

Rescaled polygon.

Return type

np.ndarray

mmocr.utils.polygon_utils.rescale_polygons(polygons, scale_factor, mode='mul')[源代码]

Rescale polygons according to scale_factor.

The behavior is different depending on the mode. When mode is ‘mul’, the coordinates will be multiplied by scale_factor, which is usually used in preprocessing transforms such as Resize(). The coordinates will be divided by scale_factor if mode is ‘div’. It can be used in postprocessors to recover the polygon in the original image size.

Parameters
  • polygons (list[ArrayLike]) – A list of polygons, each written as [x1, y1, x2, y2, …] and in any form that can be converted to a 1-D numpy array, e.g. list[list[float]], list[np.ndarray], or list[torch.Tensor].

  • scale_factor (tuple(int, int)) – (w_scale, h_scale).

  • mode (str) – Rescale mode. Can be ‘mul’ or ‘div’. Defaults to ‘mul’.

Returns

Rescaled polygons.

Return type

list[np.ndarray]

mmocr.utils.polygon_utils.shapely2poly(polygon)[源代码]

Convert a shapely.geometry.Polygon object to a numpy array of boundary points.

Parameters

polygon (Polygon) – A polygon represented by shapely.Polygon.

Returns

Converted numpy array.

Return type

np.array

mmocr.utils.polygon_utils.sort_points(points)[源代码]

Sort arbitrary points in clockwise order. Reference: https://stackoverflow.com/a/6989383.

Parameters

points (list[ndarray] or ndarray or list[list]) – A list of unsorted boundary points.

Returns

A list of points sorted in clockwise order.

Return type

list[ndarray]
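One common way to implement such a sort, and the idea behind the referenced answer, is to order points by their angle around the centroid; in image coordinates (y pointing down) ascending angle corresponds to clockwise order on screen. A hypothetical sketch (the actual mmocr implementation may differ in tie-breaking and return format):

```python
import numpy as np

def sort_points(points):
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    # Angle of each point around the centroid; with y pointing down,
    # ascending angle walks the boundary clockwise on screen.
    angles = np.arctan2(pts[:, 1] - center[1], pts[:, 0] - center[0])
    return pts[np.argsort(angles)]

ordered = sort_points([[1, 0], [0, 1], [0, 0], [1, 1]])
```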

mmocr.utils.polygon_utils.sort_vertex(points_x, points_y)[源代码]

Sort box vertices in clockwise order from left-top first.

Parameters
  • points_x (list[float]) – x of four vertices.

  • points_y (list[float]) – y of four vertices.

Returns

Sorted x and y of four vertices.

  • sorted_points_x (list[float]): x of sorted four vertices.

  • sorted_points_y (list[float]): y of sorted four vertices.

Return type

tuple[list[float], list[float]]

mmocr.utils.polygon_utils.sort_vertex8(points)[源代码]

Sort vertex with 8 points [x1 y1 x2 y2 x3 y3 x4 y4]

Mask utils

mmocr.utils.mask_utils.fill_hole(input_mask)[源代码]

Fill holes in a matrix.

Input:

    [[0, 0, 0, 0, 0, 0, 0],
     [0, 1, 1, 1, 1, 1, 0],
     [0, 1, 0, 0, 0, 1, 0],
     [0, 1, 1, 1, 1, 1, 0],
     [0, 0, 0, 0, 0, 0, 0]]

Output:

    [[0, 0, 0, 0, 0, 0, 0],
     [0, 1, 1, 1, 1, 1, 0],
     [0, 1, 1, 1, 1, 1, 0],
     [0, 1, 1, 1, 1, 1, 0],
     [0, 0, 0, 0, 0, 0, 0]]

Parameters

input_mask (ArrayLike) – The input mask.

Returns

The output mask that has been filled.

Return type

np.array
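One way to realize this is a flood fill from the border: any zero reachable from the outside stays zero, and every other cell becomes one. The following is a hypothetical pure-Python sketch of that idea (the actual mmocr implementation may rely on a library routine such as a binary hole-filling function):

```python
from collections import deque

def fill_hole(mask):
    """Fill enclosed zero-regions of a binary matrix (list of lists)."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with all zero cells on the border.
    for r in range(h):
        for c in range(w):
            if (r in (0, h - 1) or c in (0, w - 1)) and mask[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    # Breadth-first flood fill through zero cells.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not outside[nr][nc] \
                    and mask[nr][nc] == 0:
                outside[nr][nc] = True
                queue.append((nr, nc))
    # Everything not reachable from outside becomes foreground.
    return [[0 if outside[r][c] else 1 for c in range(w)] for r in range(h)]

filled = fill_hole([[0, 0, 0, 0, 0, 0, 0],
                    [0, 1, 1, 1, 1, 1, 0],
                    [0, 1, 0, 0, 0, 1, 0],
                    [0, 1, 1, 1, 1, 1, 0],
                    [0, 0, 0, 0, 0, 0, 0]])
```

Run on the input matrix from the docstring, this reproduces the documented output.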

String utils

class mmocr.utils.string_utils.StringStripper(strip=True, strip_pos='both', strip_str=None)[源代码]

Removing the leading and/or the trailing characters based on the string argument passed.

Parameters
  • strip (bool) – Whether remove characters from both left and right of the string. Default: True.

  • strip_pos (str) – Which position for removing, can be one of (‘both’, ‘left’, ‘right’), Default: ‘both’.

  • strip_str (str|None) – A string specifying the set of characters to be removed from the left and right part of the string. If None, all leading and trailing whitespaces are removed from the string. Default: None.

Image utils

mmocr.utils.img_utils.crop_img(src_img, box, long_edge_pad_ratio=0.4, short_edge_pad_ratio=0.2)[源代码]

Crop text region given the bounding box which might be slightly padded. The bounding box is assumed to be a quadrangle and tightly bound the text region.

Parameters
  • src_img (np.array) – The original image.

  • box (list[float | int]) – Points of quadrangle.

  • long_edge_pad_ratio (float) – The ratio of padding to the long edge. The padding will be the length of the short edge * long_edge_pad_ratio. Defaults to 0.4.

  • short_edge_pad_ratio (float) – The ratio of padding to the short edge. The padding will be the length of the long edge * short_edge_pad_ratio. Defaults to 0.2.

Returns

The cropped image.

Return type

np.array

mmocr.utils.img_utils.warp_img(src_img, box, jitter=False, jitter_ratio_x=0.5, jitter_ratio_y=0.1)[源代码]

Crop box area from image using opencv warpPerspective.

Parameters
  • src_img (np.array) – Image before cropping.

  • box (list[float | int]) – Coordinates of quadrangle.

  • jitter (bool) – Whether to jitter the box.

  • jitter_ratio_x (float) – Horizontal jitter ratio relative to the height.

  • jitter_ratio_y (float) – Vertical jitter ratio relative to the height.

Returns

The warped image.

Return type

np.array

File IO utils

mmocr.utils.fileio.list_from_file(filename, encoding='utf-8')[源代码]

Load a text file and parse the content as a list of strings. The trailing “\r” and “\n” of each line will be removed.

Note

This will be replaced by mmcv’s version after it supports encoding.

Parameters
  • filename (str) – Filename.

  • encoding (str) – Encoding used to open the file. Default utf-8.

Returns

A list of strings.

Return type

list[str]

mmocr.utils.fileio.list_to_file(filename, lines)[源代码]

Write a list of strings to a text file.

Parameters
  • filename (str) – The output filename. It will be created/overwritten.

  • lines (list(str)) – Data to be written.
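The two helpers form a simple round trip; a hypothetical sketch of their documented behavior (not the mmocr source), writing one string per line and stripping the trailing newline characters on read:

```python
import os
import tempfile

def list_to_file(filename, lines):
    # Write each string on its own line.
    with open(filename, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(f'{line}\n')

def list_from_file(filename, encoding='utf-8'):
    # Strip only the trailing "\r" and "\n" of each line.
    with open(filename, encoding=encoding) as f:
        return [line.rstrip('\r\n') for line in f]

path = os.path.join(tempfile.mkdtemp(), 'labels.txt')
list_to_file(path, ['1.jpg abc', '2.jpg def'])
lines = list_from_file(path)
```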

Others

mmocr.utils.data_converter_utils.dump_ocr_data(image_infos, out_json_name, task_name, **kwargs)[源代码]

Dump the annotation in openmmlab style.

Parameters
  • image_infos (list) – List of image information dicts. Read the example section for the format illustration.

  • out_json_name (str) – Output json filename.

  • task_name (str) – Task name. Options are ‘textdet’, ‘textrecog’ and ‘textspotter’.

Return type

Dict

Examples

Here is the general structure of image_infos for textdet/textspotter tasks:

[  # A list of dicts. Each dict stands for a single image.
    {
        "file_name": "1.jpg",
        "height": 100,
        "width": 200,
        "segm_file": "seg.txt", # (optional) path to segmap
        "anno_info": [  # a list of dicts. Each dict
                        # stands for a single text instance.
            {
                "iscrowd": 0,  # 0: don't ignore this instance
                               # 1: ignore
                "category_id": 0,  # Instance class id. Must be 0
                                   # for OCR tasks to permanently
                                   # be mapped to 'text' category
                "bbox": [x, y, w, h],
                "segmentation": [x1, y1, x2, y2, ...],
                "text": "demo_text"  # for textspotter only.
            }
        ]
    },
]

The input for textrecog task is much simpler:

[   # A list of dicts. Each dict stands for a single image.
    {
        "file_name": "1.jpg",
        "anno_info": [  # a list of dicts. Each dict
                        # stands for a single text instance.
                        # However, in textrecog, usually each
                        # image only has one text instance.
            {
                "text": "demo_text"
            }
        ]
    },
]
Returns

The openmmlab-style annotation.

Return type

out_json (dict)

Parameters
  • image_infos (Sequence[Dict]) –

  • out_json_name (str) –

  • task_name (str) –

mmocr.utils.data_converter_utils.recog_anno_to_imginfo(file_paths, labels)[源代码]

Convert a list of file_paths and labels for recognition tasks into the format of image_infos acceptable by dump_ocr_data(). It’s meant to maintain compatibility with the legacy annotation format in MMOCR 0.x.

In MMOCR 0.x, data converters for recognition usually convert the annotations into a list of file paths and a list of labels, which look like the following:

file_paths = ['1.jpg', '2.jpg', ...]
labels = ['aaa', 'bbb', ...]

This utility merges them into a list of dictionaries parsable by dump_ocr_data():

[   # A list of dicts. Each dict stands for a single image.
    {
        "file_name": "1.jpg",
        "anno_info": [
            {
                "text": "aaa"
            }
        ]
    },
    {
        "file_name": "2.jpg",
        "anno_info": [
            {
                "text": "bbb"
            }
        ]
    },
    ...
]
参数
  • file_paths (list[str]) – A list of file paths to images.

  • labels (list[str]) – A list of text labels.

返回

Annotations parsable by dump_ocr_data().

返回类型

list[dict]
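The merge itself is a simple zip of the two lists. A minimal sketch of the behavior (not MMOCR's source):

```python
# Sketch of recog_anno_to_imginfo's behavior: pair each file path with
# its label and wrap the label in a one-element anno_info list.
def recog_anno_to_imginfo_sketch(file_paths, labels):
    assert len(file_paths) == len(labels), "lists must be the same length"
    return [
        {"file_name": path, "anno_info": [{"text": text}]}
        for path, text in zip(file_paths, labels)
    ]

infos = recog_anno_to_imginfo_sketch(['1.jpg', '2.jpg'], ['aaa', 'bbb'])
```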

class mmocr.utils.parsers.LineJsonParser(keys=['filename', 'text'])[源代码]

Parse a json-string of one line in the annotation file to dict format.

参数

keys (list[str]) – Keys in both json-string and result dict. Defaults to [‘filename’, ‘text’].

返回类型

None
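A sketch of what such a parser does for one annotation line (illustrative only, not the class's source):

```python
import json

# Parse one JSON-encoded annotation line, keeping only the given keys.
# A missing key raises KeyError, mirroring a strict parser.
def parse_json_line(line, keys=('filename', 'text')):
    obj = json.loads(line)
    return {key: obj[key] for key in keys}

parsed = parse_json_line('{"filename": "1.jpg", "text": "hello"}')
```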

class mmocr.utils.parsers.LineStrParser(keys=['filename', 'text'], keys_idx=[0, 1], separator=' ', **kwargs)[源代码]

Parse a string of one line in the annotation file to dict format.

参数
  • keys (list[str]) – Keys in result dict. Defaults to [‘filename’, ‘text’].

  • keys_idx (list[int]) – Value index in sub-string list for each key above. Defaults to [0, 1].

  • separator (str) – Separator used to split the string into a list of sub-strings. Defaults to ‘ ‘.
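As a sketch of how keys, keys_idx and separator interact (illustrative only; this naive version does not handle texts that themselves contain the separator):

```python
# Split one annotation line and map each key to the sub-string at the
# corresponding index in keys_idx.
def parse_str_line(line, keys=('filename', 'text'), keys_idx=(0, 1), separator=' '):
    parts = line.split(separator)
    return {key: parts[idx] for key, idx in zip(keys, keys_idx)}

parsed_line = parse_str_line('1.jpg hello')
```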

mmocr.models

Common

class mmocr.models.common.backbones.UNet(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': ['_BatchNorm', 'GroupNorm'], 'val': 1}])[源代码]

UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf

参数
  • in_channels (int) – Number of input image channels. Default: 3.

  • base_channels (int) – Number of base channels of each stage. The output channels of the first stage. Default: 64.

  • num_stages (int) – Number of stages in encoder, normally 5. Default: 5.

  • strides (Sequence[int 1 | 2]) – Strides of each stage in encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in encoder is 1. If strides[i]=2, a stride convolution is used to downsample in the corresponding encoder stage. Default: (1, 1, 1, 1, 1).

  • enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding encoder stage. Default: (2, 2, 2, 2, 2).

  • dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding decoder stage. Default: (2, 2, 2, 2).

  • downsamples (Sequence[bool]) – Whether to use MaxPool to downsample the feature map after the first stage of the encoder (stages: [1, num_stages)). If the corresponding encoder stage uses stride convolution (strides[i]=2), it will never use MaxPool to downsample, even if downsamples[i-1]=True. Default: (True, True, True, True).

  • enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).

  • dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • conv_cfg (dict | None) – Config dict for convolution layer. Default: None.

  • norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).

  • upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type=’InterpConv’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.

  • plugins (dict) – plugins for convolutional layers. Default: None.

Notice:

The input image size should be divisible by the whole downsample rate of the encoder. More detail of the whole downsample rate can be found in UNet._check_input_divisible.
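Under the assumption that each stage after the first halves the resolution when it uses either a stride-2 convolution or MaxPool downsampling, the whole downsample rate can be sketched like this (illustrative, not UNet._check_input_divisible itself):

```python
# Compute the whole downsample rate of the encoder from strides and
# downsamples; the input H and W should be divisible by this rate.
def whole_downsample_rate(strides, downsamples):
    rate = 1
    for stride, down in zip(strides[1:], downsamples):
        if stride == 2 or down:
            rate *= 2
    return rate

# With the default config the encoder downsamples four times: rate = 16.
rate = whole_downsample_rate((1, 1, 1, 1, 1), (True, True, True, True))
```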

forward(x)[源代码]

Defines the computation performed at every call.

Should be overridden by all subclasses.

注解

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

train(mode=True)[源代码]

Convert the model into training mode while keeping the normalization layers frozen.

class mmocr.models.common.losses.CrossEntropyLoss(weight=None, size_average=None, ignore_index=- 100, reduce=None, reduction='mean', label_smoothing=0.0)[源代码]

Cross entropy loss.

返回类型

None

class mmocr.models.common.losses.MaskedBCELoss(eps=1e-06)[源代码]

Masked BCE loss.

参数

eps (float) – Eps to avoid zero-division error. Defaults to 1e-6.

返回类型

None

forward(pred, gt, mask=None)[源代码]

Forward function.

参数
  • pred (torch.Tensor) – The prediction in any shape.

  • gt (torch.Tensor) – The learning target of the prediction in the same shape as pred.

  • mask (torch.Tensor, optional) – Binary mask in the same shape of pred, indicating positive regions to calculate the loss. Whole region will be taken into account if not provided. Defaults to None.

返回

The loss value.

返回类型

torch.Tensor
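The masked losses in this module share one idea: compute the element-wise loss, then average it only over positions where the mask is 1. A NumPy sketch of masked BCE (illustrative, not MMOCR's torch implementation):

```python
import numpy as np

# Mean binary cross-entropy over positions where mask == 1; the whole
# region is used when mask is None, matching the documented behavior.
def masked_bce(pred, gt, mask=None, eps=1e-6):
    if mask is None:
        mask = np.ones_like(pred)
    bce = -(gt * np.log(pred + eps) + (1 - gt) * np.log(1 - pred + eps))
    return (bce * mask).sum() / (mask.sum() + eps)

pred = np.array([0.9, 0.1, 0.5])
gt = np.array([1.0, 0.0, 1.0])
# The third position is masked out, so only the first two contribute.
loss = masked_bce(pred, gt, mask=np.array([1.0, 1.0, 0.0]))
```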

class mmocr.models.common.losses.MaskedBCEWithLogitsLoss(eps=1e-06)[源代码]

This loss combines a Sigmoid layer and a masked BCE loss in one single class. It’s AMP-eligible.

参数

eps (float) – Eps to avoid zero-division error. Defaults to 1e-6.

返回类型

None

forward(pred, gt, mask=None)[源代码]

Forward function.

参数
  • pred (torch.Tensor) – The prediction in any shape.

  • gt (torch.Tensor) – The learning target of the prediction in the same shape as pred.

  • mask (torch.Tensor, optional) – Binary mask in the same shape of pred, indicating positive regions to calculate the loss. Whole region will be taken into account if not provided. Defaults to None.

返回

The loss value.

返回类型

torch.Tensor

class mmocr.models.common.losses.MaskedBalancedBCELoss(reduction='none', negative_ratio=3, fallback_negative_num=0, eps=1e-06)[源代码]

Masked Balanced BCE loss.

参数
  • reduction (str, optional) – The method to reduce the loss. Options are ‘none’, ‘mean’ and ‘sum’. Defaults to ‘none’.

  • negative_ratio (float or int) – Maximum ratio of negative samples to positive ones. Defaults to 3.

  • fallback_negative_num (int) – When the mask contains no positive samples, the number of negative samples to be sampled. Defaults to 0.

  • eps (float) – Eps to avoid zero-division error. Defaults to 1e-6.

返回类型

None

forward(pred, gt, mask=None)[源代码]

Forward function.

参数
  • pred (torch.Tensor) – The prediction in any shape.

  • gt (torch.Tensor) – The learning target of the prediction in the same shape as pred.

  • mask (torch.Tensor, optional) – Binary mask in the same shape of pred, indicating positive regions to calculate the loss. Whole region will be taken into account if not provided. Defaults to None.

返回

The loss value.

返回类型

torch.Tensor
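The "balanced" part refers to online hard negative mining: only the hardest negatives, at most negative_ratio times the number of positives, contribute to the loss. A NumPy sketch of the idea (illustrative, not MMOCR's implementation):

```python
import numpy as np

# Balanced masked BCE: keep all positive losses, but only the
# n_pos * negative_ratio largest negative losses (hard negatives).
def masked_balanced_bce(pred, gt, mask, negative_ratio=3, eps=1e-6):
    bce = -(gt * np.log(pred + eps) + (1 - gt) * np.log(1 - pred + eps))
    pos = (gt > 0.5) & (mask > 0.5)
    neg = (gt <= 0.5) & (mask > 0.5)
    n_pos = int(pos.sum())
    n_neg = min(int(neg.sum()), n_pos * negative_ratio)
    pos_loss = bce[pos].sum()
    neg_loss = np.sort(bce[neg])[::-1][:n_neg].sum()  # hardest negatives
    return (pos_loss + neg_loss) / (n_pos + n_neg + eps)

pred = np.array([0.9, 0.2, 0.8, 0.3, 0.6])
gt = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
# 1 positive, 4 negatives -> only the 3 hardest negatives are kept.
loss = masked_balanced_bce(pred, gt, np.ones(5))
```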

class mmocr.models.common.losses.MaskedBalancedBCEWithLogitsLoss(reduction='none', negative_ratio=3, fallback_negative_num=0, eps=1e-06)[源代码]

This loss combines a Sigmoid layer and a masked balanced BCE loss in one single class. It’s AMP-eligible.

参数
  • reduction (str, optional) – The method to reduce the loss. Options are ‘none’, ‘mean’ and ‘sum’. Defaults to ‘none’.

  • negative_ratio (float or int, optional) – Maximum ratio of negative samples to positive ones. Defaults to 3.

  • fallback_negative_num (int, optional) – When the mask contains no positive samples, the number of negative samples to be sampled. Defaults to 0.

  • eps (float, optional) – Eps to avoid zero-division error. Defaults to 1e-6.

返回类型

None

forward(pred, gt, mask=None)[源代码]

Forward function.

参数
  • pred (torch.Tensor) – The prediction in any shape.

  • gt (torch.Tensor) – The learning target of the prediction in the same shape as pred.

  • mask (torch.Tensor, optional) – Binary mask in the same shape of pred, indicating positive regions to calculate the loss. Whole region will be taken into account if not provided. Defaults to None.

返回

The loss value.

返回类型

torch.Tensor

class mmocr.models.common.losses.MaskedDiceLoss(eps=1e-06)[源代码]

Masked dice loss.

参数

eps (float, optional) – Eps to avoid zero-division error. Defaults to 1e-6.

返回类型

None

forward(pred, gt, mask=None)[源代码]

Forward function.

参数
  • pred (torch.Tensor) – The prediction in any shape.

  • gt (torch.Tensor) – The learning target of the prediction in the same shape as pred.

  • mask (torch.Tensor, optional) – Binary mask in the same shape of pred, indicating positive regions to calculate the loss. Whole region will be taken into account if not provided. Defaults to None.

返回

The loss value.

返回类型

torch.Tensor
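A NumPy sketch of the masked dice loss (illustrative, not MMOCR's torch implementation): the mask zeroes out ignored regions before the intersection-over-sum is computed.

```python
import numpy as np

# Dice loss: 1 - 2 * |pred ∩ gt| / (|pred| + |gt|), restricted to the
# masked region; eps guards against a zero denominator.
def masked_dice_loss(pred, gt, mask=None, eps=1e-6):
    if mask is None:
        mask = np.ones_like(pred)
    pred, gt = pred * mask, gt * mask
    intersection = (pred * gt).sum()
    return 1 - 2 * intersection / (pred.sum() + gt.sum() + eps)

# Perfect overlap gives a loss very close to 0.
d = masked_dice_loss(np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0, 0.0]))
```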

class mmocr.models.common.losses.MaskedSmoothL1Loss(beta=1, eps=1e-06)[源代码]

Masked Smooth L1 loss.

参数
  • beta (float, optional) – The threshold in the piecewise function. Defaults to 1.

  • eps (float, optional) – Eps to avoid zero-division error. Defaults to 1e-6.

返回类型

None

forward(pred, gt, mask=None)[源代码]

Forward function.

参数
  • pred (torch.Tensor) – The prediction in any shape.

  • gt (torch.Tensor) – The learning target of the prediction in the same shape as pred.

  • mask (torch.Tensor, optional) – Binary mask in the same shape of pred, indicating positive regions to calculate the loss. Whole region will be taken into account if not provided. Defaults to None.

返回

The loss value.

返回类型

torch.Tensor

class mmocr.models.common.losses.MaskedSquareDiceLoss(eps=0.001)[源代码]

Masked square dice loss.

参数

eps (float, optional) – Eps to avoid zero-division error. Defaults to 1e-3.

返回类型

None

forward(pred, gt, mask=None)[源代码]

Forward function.

参数
  • pred (torch.Tensor) – The prediction in any shape.

  • gt (torch.Tensor) – The learning target of the prediction in the same shape as pred.

  • mask (torch.Tensor, optional) – Binary mask in the same shape of pred, indicating positive regions to calculate the loss. Whole region will be taken into account if not provided. Defaults to None.

返回

The loss value.

返回类型

torch.Tensor

class mmocr.models.common.losses.SmoothL1Loss(size_average=None, reduce=None, reduction='mean', beta=1.0)[源代码]

Smooth L1 loss.

返回类型

None

class mmocr.models.common.dictionary.Dictionary(dict_file, with_start=False, with_end=False, same_start_end=False, with_padding=False, with_unknown=False, start_token='<BOS>', end_token='<EOS>', start_end_token='<BOS/EOS>', padding_token='<PAD>', unknown_token='<UKN>')[源代码]

The class generates a dictionary for recognition. It pre-defines four special tokens: start_token, end_token, padding_token, and unknown_token, which will be sequentially placed at the end of the dictionary when their corresponding flags are True.

参数
  • dict_file (str) – The path of the character dict file, in which each line contains a single character.

  • with_start (bool) – The flag to control whether to include the start token. Defaults to False.

  • with_end (bool) – The flag to control whether to include the end token. Defaults to False.

  • same_start_end (bool) – The flag to control whether the start token and end token are the same. It only works when both with_start and with_end are True. Defaults to False.

  • with_padding (bool) – The padding token may represent more than a padding. It can also represent tokens like the blank token in CTC or the background token in SegOCR. Defaults to False.

  • with_unknown (bool) – The flag to control whether to include the unknown token. Defaults to False.

  • start_token (str) – The start token as a string. Defaults to ‘<BOS>’.

  • end_token (str) – The end token as a string. Defaults to ‘<EOS>’.

  • start_end_token (str) – The start/end token as a string, used if the start and end tokens are the same. Defaults to ‘<BOS/EOS>’.

  • padding_token (str) – The padding token as a string. Defaults to ‘<PAD>’.

  • unknown_token (str, optional) – The unknown token as a string. If it’s set to None and with_unknown is True, the unknown token will be skipped when converting string to index. Defaults to ‘<UKN>’.

返回类型

None

char2idx(char, strict=True)[源代码]

Convert a character to an index via Dictionary.dict.

参数
  • char (str) – The character to convert to index.

  • strict (bool) – The flag to control whether to raise an exception when the character is not in the dictionary. Defaults to True.

返回

The index of the character.

返回类型

int

property dict: list

Returns a list of characters to recognize, where special tokens are counted.

Type

list

idx2str(index)[源代码]

Convert a list of indexes to a string.

参数

index (list[int]) – The list of indexes to convert to string.

返回

The converted string.

返回类型

str

property num_classes: int

Number of output classes. Special tokens are counted.

Type

int

str2idx(string)[源代码]

Convert a string to a list of indexes via Dictionary.dict.

参数

string (str) – The string to convert to indexes.

返回

The list of indexes of the string.

返回类型

list
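The mapping behind char2idx, str2idx and idx2str can be sketched with a toy stand-in (illustrative only; field names and the single appended special token are assumptions, not MMOCR's source):

```python
class TinyDictionary:
    """Toy stand-in for Dictionary: plain chars first, specials appended."""

    def __init__(self, chars, with_padding=True, padding_token='<PAD>'):
        self._chars = list(chars) + ([padding_token] if with_padding else [])
        self._char2idx = {c: i for i, c in enumerate(self._chars)}

    @property
    def num_classes(self):
        # Special tokens are counted, as documented above.
        return len(self._chars)

    def char2idx(self, char, strict=True):
        if strict:
            return self._char2idx[char]  # raises KeyError if unknown
        return self._char2idx.get(char)

    def str2idx(self, string):
        return [self._char2idx[c] for c in string]

    def idx2str(self, index):
        return ''.join(self._chars[i] for i in index)

d = TinyDictionary('abc')
```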

class mmocr.models.common.layers.TFDecoderLayer(d_model=512, d_inner=256, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, act_cfg={'type': 'mmengine.GELU'}, operation_order=None)[源代码]

Transformer Decoder Layer.

参数
  • d_model (int) – The number of expected features in the decoder inputs (default=512).

  • d_inner (int) – The dimension of the feedforward network model (default=256).

  • n_head (int) – The number of heads in the multiheadattention models (default=8).

  • d_k (int) – Total number of features in key.

  • d_v (int) – Total number of features in value.

  • dropout (float) – Dropout layer on attn_output_weights.

  • qkv_bias (bool) – Add bias in projection layer. Default: False.

  • act_cfg (dict) – Activation cfg for feedforward module.

  • operation_order (tuple[str]) – The execution order of operation in transformer. Such as (‘self_attn’, ‘norm’, ‘enc_dec_attn’, ‘norm’, ‘ffn’, ‘norm’) or (‘norm’, ‘self_attn’, ‘norm’, ‘enc_dec_attn’, ‘norm’, ‘ffn’). Default: None.

forward(dec_input, enc_output, self_attn_mask=None, dec_enc_attn_mask=None)[源代码]

Defines the computation performed at every call.

Should be overridden by all subclasses.

注解

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.common.layers.TFEncoderLayer(d_model=512, d_inner=256, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, act_cfg={'type': 'mmengine.GELU'}, operation_order=None)[源代码]

Transformer Encoder Layer.

参数
  • d_model (int) – The number of expected features in the decoder inputs (default=512).

  • d_inner (int) – The dimension of the feedforward network model (default=256).

  • n_head (int) – The number of heads in the multiheadattention models (default=8).

  • d_k (int) – Total number of features in key.

  • d_v (int) – Total number of features in value.

  • dropout (float) – Dropout layer on attn_output_weights.

  • qkv_bias (bool) – Add bias in projection layer. Default: False.

  • act_cfg (dict) – Activation cfg for feedforward module.

  • operation_order (tuple[str]) – The execution order of operation in transformer. Such as (‘self_attn’, ‘norm’, ‘ffn’, ‘norm’) or (‘norm’, ‘self_attn’, ‘norm’, ‘ffn’). Default: None.

forward(x, mask=None)[源代码]

Defines the computation performed at every call.

Should be overridden by all subclasses.

注解

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.common.modules.MultiHeadAttention(n_head=8, d_model=512, d_k=64, d_v=64, dropout=0.1, qkv_bias=False)[源代码]

Multi-Head Attention module.

参数
  • n_head (int) – The number of heads in the multiheadattention models (default=8).

  • d_model (int) – The number of expected features in the decoder inputs (default=512).

  • d_k (int) – Total number of features in key.

  • d_v (int) – Total number of features in value.

  • dropout (float) – Dropout layer on attn_output_weights.

  • qkv_bias (bool) – Add bias in projection layer. Default: False.

forward(q, k, v, mask=None)[源代码]

Defines the computation performed at every call.

Should be overridden by all subclasses.

注解

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.common.modules.PositionalEncoding(d_hid=512, n_position=200, dropout=0)[源代码]

Fixed positional encoding with sine and cosine functions.

forward(x)[源代码]
参数

x (Tensor) – Tensor of shape (batch_size, pos_len, d_hid, …)
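The standard sine/cosine positional encoding can be sketched in NumPy (illustrative; MMOCR's module additionally adds the encoding to the input and applies dropout):

```python
import numpy as np

# pe[pos, 2i]   = sin(pos / 10000^(2i / d_hid))
# pe[pos, 2i+1] = cos(pos / 10000^(2i / d_hid))
def sinusoidal_pe(n_position, d_hid):
    pos = np.arange(n_position)[:, None]
    i = np.arange(d_hid)[None, :]
    angles = pos / np.power(10000, 2 * (i // 2) / d_hid)
    pe = np.zeros((n_position, d_hid))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = sinusoidal_pe(200, 512)  # defaults: n_position=200, d_hid=512
```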

class mmocr.models.common.modules.PositionwiseFeedForward(d_in, d_hid, dropout=0.1, act_cfg={'type': 'Relu'})[源代码]

Two-layer feed-forward module.

参数
  • d_in (int) – The dimension of the input for feedforward network model.

  • d_hid (int) – The dimension of the feedforward network model.

  • dropout (float) – Dropout layer on feedforward output.

  • act_cfg (dict) – Activation cfg for feedforward module.

forward(x)[源代码]

Defines the computation performed at every call.

Should be overridden by all subclasses.

注解

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.common.modules.ScaledDotProductAttention(temperature, attn_dropout=0.1)[源代码]

Scaled Dot-Product Attention Module. This code is adapted from https://github.com/jadore801120/attention-is-all-you-need-pytorch.

参数
  • temperature (float) – The scale factor for softmax input.

  • attn_dropout (float) – Dropout layer on attn_output_weights.

forward(q, k, v, mask=None)[源代码]

Defines the computation performed at every call.

Should be overridden by all subclasses.

注解

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
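The computation itself is compact enough to sketch in NumPy (illustrative single-head version without dropout; MMOCR's module works on batched torch tensors):

```python
import numpy as np

# attn = softmax(q @ k^T / temperature) @ v; masked-out positions are
# pushed to -1e9 before the softmax so they receive ~zero weight.
def scaled_dot_product_attention(q, k, v, temperature, mask=None):
    scores = q @ k.T / temperature
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

q = np.eye(2)
k = np.eye(2)
v = np.array([[1.0, 0.0], [0.0, 1.0]])
out = scaled_dot_product_attention(q, k, v, temperature=np.sqrt(2))
```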

Text Detection Detectors

class mmocr.models.textdet.detectors.DBNet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[源代码]

The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.

[https://arxiv.org/abs/1911.08947].

参数
  • backbone (Dict) –

  • det_head (Dict) –

  • neck (Optional[Dict]) –

  • data_preprocessor (Optional[Dict]) –

  • init_cfg (Optional[Dict]) –

返回类型

None

class mmocr.models.textdet.detectors.DRRG(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[源代码]

The class for implementing DRRG text detector. Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

[https://arxiv.org/abs/2003.07493]

参数
  • backbone (Dict) –

  • det_head (Dict) –

  • neck (Optional[Dict]) –

  • data_preprocessor (Optional[Dict]) –

  • init_cfg (Optional[Dict]) –

返回类型

None

class mmocr.models.textdet.detectors.FCENet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[源代码]

The class for implementing FCENet text detector (CVPR 2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection.

[https://arxiv.org/abs/2104.10442]

参数
  • backbone (Dict) –

  • det_head (Dict) –

  • neck (Optional[Dict]) –

  • data_preprocessor (Optional[Dict]) –

  • init_cfg (Optional[Dict]) –

返回类型

None

class mmocr.models.textdet.detectors.MMDetWrapper(cfg, text_repr_type='poly')[源代码]

A wrapper of MMDet’s model.

参数
  • cfg (dict) – The config of the model.

  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.

返回类型

None

adapt_predictions(data, data_samples)[源代码]

Convert instance data from MMDet into MMOCR’s format.

参数
  • data (List[mmdet.structures.det_data_sample.DetDataSample]) – Detection results of the input images. Each DetDataSample usually contains ‘pred_instances’, which usually contains the following keys:

    • scores (Tensor): Classification scores with shape (num_instances, ).

    • labels (Tensor): Labels of bboxes with shape (num_instances, ).

    • bboxes (Tensor): Shape (num_instances, 4); the last dimension is arranged as (x1, y1, x2, y2).

    • masks (Tensor, optional): Shape (num_instances, H, W).

  • data_samples (list[TextDetDataSample]) – The annotation data of every samples.

返回

A list of N datasamples containing ground truth and prediction results. The polygon results are saved in TextDetDataSample.pred_instances.polygons and the confidence scores in TextDetDataSample.pred_instances.scores.

返回类型

list[TextDetDataSample]

forward(inputs, data_samples=None, mode='tensor', **kwargs)[源代码]

The unified entry for a forward process in both training and test.

The method works in three modes: “tensor”, “predict” and “loss”:

  • “tensor”: Forward the whole network and return a tensor or tuple of tensors without any post-processing, same as a common nn.Module.

  • “predict”: Forward and return the predictions, which are fully processed to a list of DetDataSample.

  • “loss”: Forward and return a dict of losses according to the given inputs and data samples.

Note that this method doesn’t handle either back propagation or parameter update, which are supposed to be done in train_step().

参数
  • data_samples (list[DetDataSample] or list[TextDetDataSample]) – The annotation data of every sample. When in “predict” mode, it should be a list of TextDetDataSample. Otherwise they are DetDataSample s. Defaults to None.

返回

The return type depends on mode.

  • If mode="tensor", return a tensor or a tuple of tensor.

  • If mode="predict", return a list of TextDetDataSample.

  • If mode="loss", return a dict of tensor.

返回类型

Union[Dict[str, torch.Tensor], List[mmdet.structures.det_data_sample.DetDataSample], Tuple[torch.Tensor], torch.Tensor]

class mmocr.models.textdet.detectors.PANet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[源代码]

The class for implementing PANet text detector:

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900].

参数
  • backbone (Dict) –

  • det_head (Dict) –

  • neck (Optional[Dict]) –

  • data_preprocessor (Optional[Dict]) –

  • init_cfg (Optional[Dict]) –

返回类型

None

class mmocr.models.textdet.detectors.PSENet(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[源代码]

The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.

[https://arxiv.org/abs/1806.02559].

参数
  • backbone (Dict) –

  • det_head (Dict) –

  • neck (Optional[Dict]) –

  • data_preprocessor (Optional[Dict]) –

  • init_cfg (Optional[Dict]) –

返回类型

None

class mmocr.models.textdet.detectors.SingleStageTextDetector(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[源代码]

The class for implementing single stage text detector.

Single-stage text detectors directly and densely predict bounding boxes or polygons on the output features of the backbone + neck (optional).

参数
  • backbone (dict) – Backbone config.

  • neck (dict, optional) – Neck config. If None, the output from backbone will be directly fed into det_head.

  • det_head (dict) – Head config.

  • data_preprocessor (dict, optional) – Model preprocessing config for processing the input image data. Keys allowed are to_rgb (bool), pad_size_divisor (int), pad_value (int or float), mean (int or float) and std (int or float). Preprocessing order: 1. to rgb; 2. normalization; 3. pad. Defaults to None.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

返回类型

None

extract_feat(inputs)[源代码]

Extract features.

参数

inputs (Tensor) – Image tensor with shape (N, C, H ,W).

返回

Multi-level features that may have different resolutions.

返回类型

Tensor or tuple[Tensor]

loss(inputs, data_samples)[源代码]

Calculate losses from a batch of inputs and data samples.

参数
  • inputs (torch.Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • data_samples (list[TextDetDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.

返回

A dictionary of loss components.

返回类型

dict[str, Tensor]

predict(inputs, data_samples)[源代码]

Predict results from a batch of inputs and data samples with post- processing.

参数
  • inputs (torch.Tensor) – Images of shape (N, C, H, W).

  • data_samples (list[TextDetDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.

返回

A list of N datasamples of prediction results. Each DetDataSample usually contains ‘pred_instances’, which usually contains the following keys:

  • scores (Tensor): Classification scores with shape (num_instances, ).

  • labels (Tensor): Labels of bboxes with shape (num_instances, ).

  • bboxes (Tensor): Shape (num_instances, 4); the last dimension is arranged as (x1, y1, x2, y2).

  • polygons (list[np.ndarray]): The length is num_instances. Each element represents the polygon of the instance, in (xn, yn) order.

返回类型

list[TextDetDataSample]

class mmocr.models.textdet.detectors.TextSnake(backbone, det_head, neck=None, data_preprocessor=None, init_cfg=None)[源代码]

The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

[https://arxiv.org/abs/1807.01544]

参数
  • backbone (Dict) –

  • det_head (Dict) –

  • neck (Optional[Dict]) –

  • data_preprocessor (Optional[Dict]) –

  • init_cfg (Optional[Dict]) –

返回类型

None

Text Detection Heads

class mmocr.models.textdet.heads.BaseTextDetHead(module_loss, postprocessor, init_cfg=None)[源代码]

Base head for text detection, build the loss and postprocessor.

1. The init_weights method is used to initialize head’s model parameters. After detector initialization, init_weights is triggered when detector.init_weights() is called externally.

2. The loss method is used to calculate the loss of head, which includes two steps: (1) the head model performs forward propagation to obtain the feature maps (2) The module_loss method is called based on the feature maps to calculate the loss.

loss(): forward() -> module_loss()

3. The predict method is used to predict detection results, which includes two steps: (1) the head model performs forward propagation to obtain the feature maps (2) The postprocessor method is called based on the feature maps to predict detection results including post-processing.

predict(): forward() -> postprocessor()

4. The loss_and_predict method is used to return loss and detection results at the same time. It will call head’s forward, module_loss and postprocessor methods in order.

loss_and_predict(): forward() -> module_loss() -> postprocessor()
参数
  • loss (dict) – Config to build loss.

  • postprocessor (dict) – Config to build postprocessor.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

  • module_loss (Dict) –

返回类型

None

loss(x, data_samples)[源代码]

Perform forward propagation and loss calculation of the detection head on the features of the upstream network.

参数
  • x (tuple[Tensor]) – Features from the upstream network, each is a 4D-tensor.

  • data_samples (List[DetDataSample]) – The Data Samples. It usually includes information such as gt_instance, gt_panoptic_seg and gt_sem_seg.

返回

A dictionary of loss components.

返回类型

dict

loss_and_predict(x, data_samples)[源代码]

Perform forward propagation of the head, then calculate loss and predictions from the features and data samples.

参数
  • x (tuple[Tensor]) – Features from FPN.

  • data_samples (list[DetDataSample]) – Each item contains the meta information of each image and corresponding annotations.

返回

A tuple containing:

  • losses (dict[str, Tensor]): A dictionary of loss components.

  • predictions (list[InstanceData]): Detection results of each image after the post process.

返回类型

tuple

predict(x, data_samples)[源代码]

Perform forward propagation of the detection head and predict detection results on the features of the upstream network.

参数
  • x (tuple[Tensor]) – Multi-level features from the upstream network, each is a 4D-tensor.

  • data_samples (List[DetDataSample]) – The Data Samples. It usually includes information such as gt_instance, gt_panoptic_seg and gt_sem_seg.

返回

Detection results of each image after the post process.

返回类型

SampleList

class mmocr.models.textdet.heads.DBHead(in_channels, with_bias=False, module_loss={'type': 'DBModuleLoss'}, postprocessor={'text_repr_type': 'quad', 'type': 'DBPostprocessor'}, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv'}, {'type': 'Constant', 'layer': 'BatchNorm', 'val': 1.0, 'bias': 0.0001}])[源代码]

The class for DBNet head.

This was partially adapted from https://github.com/MhLiao/DB

参数
  • in_channels (int) – The number of input channels.

  • with_bias (bool) – Whether add bias in Conv2d layer. Defaults to False.

  • module_loss (dict) – Config of loss for dbnet. Defaults to dict(type='DBModuleLoss')

  • postprocessor (dict) – Config of postprocessor for dbnet.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

返回类型

None

forward(img, data_samples=None, mode='predict')[源代码]
参数
  • img (Tensor) – Shape \((N, C, H, W)\).

  • data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.

  • mode (str) –

    Forward mode. It affects the return values. Options are “loss”, “predict” and “both”. Defaults to “predict”.

    • loss: Run the full network and return the prob logits, threshold map and binary map.

    • predict: Run the binarization part and return the prob map only.

    • both: Run the full network and return prob logits, threshold map, binary map and prob map.

返回

Its type depends on mode, read its docstring for details. Each has the shape of \((N, 4H, 4W)\).

返回类型

Tensor or tuple(Tensor)

loss(x, batch_data_samples)[source]

Perform forward propagation and loss calculation of the detection head on the features of the upstream network.

Parameters
  • x (tuple[Tensor]) – Features from the upstream network, each a 4D tensor.

  • batch_data_samples (List[DetDataSample]) – The data samples. They usually include information such as gt_instance, gt_panoptic_seg and gt_sem_seg.

Returns

A dictionary of loss components.

Return type

dict

loss_and_predict(x, batch_data_samples)[source]

Perform forward propagation of the head, then calculate loss and predictions from the features and data samples.

Parameters
  • x (tuple[Tensor]) – Features from FPN.

  • batch_data_samples (list[DetDataSample]) – Each item contains the meta information of each image and corresponding annotations.

Returns

A tuple containing:

  • losses (dict[str, Tensor]): A dictionary of loss components.

  • predictions (list[InstanceData]): Detection results of each image after post-processing.

Return type

tuple

predict(x, batch_data_samples)[source]

Perform forward propagation of the detection head and predict detection results on the features of the upstream network.

Parameters
  • x (tuple[Tensor]) – Multi-level features from the upstream network, each a 4D tensor.

  • batch_data_samples (List[DetDataSample]) – The data samples. They usually include information such as gt_instance, gt_panoptic_seg and gt_sem_seg.

Returns

Detection results of each image after post-processing.

Return type

SampleList

class mmocr.models.textdet.heads.DRRGHead(in_channels, k_at_hops=(8, 4), num_adjacent_linkages=3, node_geo_feat_len=120, pooling_scale=1.0, pooling_output_size=(4, 3), nms_thr=0.3, min_width=8.0, max_width=24.0, comp_shrink_ratio=1.03, comp_ratio=0.4, comp_score_thr=0.3, text_region_thr=0.2, center_region_thr=0.2, center_region_area_thr=50, local_graph_thr=0.7, module_loss={'type': 'DRRGModuleLoss'}, postprocessor={'link_thr': 0.85, 'type': 'DRRGPostprocessor'}, init_cfg={'mean': 0, 'override': {'name': 'out_conv'}, 'std': 0.01, 'type': 'Normal'})[source]

The class for DRRG head: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

Parameters
  • in_channels (int) – The number of input channels.

  • k_at_hops (tuple(int)) – The number of i-hop neighbors, i = 1, 2. Defaults to (8, 4).

  • num_adjacent_linkages (int) – The number of linkages when constructing adjacent matrix. Defaults to 3.

  • node_geo_feat_len (int) – The length of embedded geometric feature vector of a component. Defaults to 120.

  • pooling_scale (float) – The spatial scale of rotated RoI-Align. Defaults to 1.0.

  • pooling_output_size (tuple(int)) – The output size of RRoI-Aligning. Defaults to (4, 3).

  • nms_thr (float) – The locality-aware NMS threshold of text components. Defaults to 0.3.

  • min_width (float) – The minimum width of text components. Defaults to 8.0.

  • max_width (float) – The maximum width of text components. Defaults to 24.0.

  • comp_shrink_ratio (float) – The shrink ratio of text components. Defaults to 1.03.

  • comp_ratio (float) – The reciprocal of aspect ratio of text components. Defaults to 0.4.

  • comp_score_thr (float) – The score threshold of text components. Defaults to 0.3.

  • text_region_thr (float) – The threshold for text region probability map. Defaults to 0.2.

  • center_region_thr (float) – The threshold for text center region probability map. Defaults to 0.2.

  • center_region_area_thr (int) – The threshold for filtering small-sized text center region. Defaults to 50.

  • local_graph_thr (float) – The threshold to filter identical local graphs. Defaults to 0.7.

  • module_loss (dict) – The config of loss that DRRGHead uses. Defaults to dict(type='DRRGModuleLoss').

  • postprocessor (dict) – Config of postprocessor for Drrg. Defaults to dict(type='DrrgPostProcessor', link_thr=0.85).

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type='Normal', override=dict(name='out_conv'), mean=0, std=0.01).

Return type

None

forward(inputs, data_samples=None)[source]

Run the DRRG head in prediction mode, and return the raw tensors only.

Parameters
  • inputs (Tensor) – Shape of \((1, C, H, W)\).

  • data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.

Returns

Returns (edge, score, text_comps).

  • edge (ndarray): The edge array of shape \((N_{edges}, 2)\), where each row is a pair of text component indices that make up an edge in the graph.

  • score (ndarray): The score array of shape \((N_{edges},)\), corresponding to the edges above.

  • text_comps (ndarray): The text components of shape \((M, 9)\), where each row corresponds to one box and its score: (x1, y1, x2, y2, x3, y3, x4, y4, score).

Return type

tuple
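The (edge, score) output can be turned into text instances by keeping only confident links and grouping connected components. A minimal union-find sketch of that idea (illustrative only; MMOCR's actual DRRGPostprocessor with its link_thr is more involved):

```python
# Hedged sketch: group text components into instances from (edge, score)
# pairs using union-find. `link_thr` mirrors the postprocessor's threshold.
def group_components(edges, scores, num_comps, link_thr=0.85):
    parent = list(range(num_comps))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for (a, b), s in zip(edges, scores):
        if s >= link_thr:  # keep only confident links
            parent[find(a)] = find(b)

    groups = {}
    for i in range(num_comps):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

# Components 0-1-2 are linked confidently; component 3 stays alone.
clusters = group_components([(0, 1), (1, 2), (2, 3)], [0.9, 0.95, 0.2], 4)
```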

loss(inputs, data_samples)[source]

Loss function.

Parameters
  • inputs (Tensor) – Shape of \((N, C, H, W)\).

  • data_samples (List[TextDetDataSample]) – List of data samples.

Returns

  • pred_maps (Tensor): Prediction map with shape \((N, 6, H, W)\).

  • gcn_pred (Tensor): Prediction from the GCN module, with shape \((N, 2)\).

  • gt_labels (Tensor): Ground-truth labels of shape \((m, n)\), where \(m * n = N\).

Return type

tuple(pred_maps, gcn_pred, gt_labels)

class mmocr.models.textdet.heads.FCEHead(in_channels, fourier_degree=5, module_loss={'num_sample': 50, 'type': 'FCEModuleLoss'}, postprocessor={'alpha': 1.0, 'beta': 2.0, 'num_reconstr_points': 50, 'score_thr': 0.3, 'text_repr_type': 'poly', 'type': 'FCEPostprocessor'}, init_cfg={'mean': 0, 'override': [{'name': 'out_conv_cls'}, {'name': 'out_conv_reg'}], 'std': 0.01, 'type': 'Normal'})[source]

The class for implementing FCENet head.

FCENet(CVPR2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection

Parameters
  • in_channels (int) – The number of input channels.

  • fourier_degree (int) – The maximum Fourier transform degree k. Defaults to 5.

  • module_loss (dict) – Config of loss for FCENet. Defaults to dict(type='FCEModuleLoss', num_sample=50).

  • postprocessor (dict) – Config of postprocessor for FCENet.

  • init_cfg (dict, optional) – Initialization configs.

Return type

None
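The fourier_degree parameter bounds how detailed a contour the head can express: a closed contour is represented by complex Fourier coefficients c_k for k = -K..K, and points are recovered as f(t) = Σ c_k·exp(2πi·k·t). A small numpy sketch of that reconstruction (illustrating the FCENet idea, not MMOCR's exact implementation):

```python
import numpy as np

# Hedged sketch: reconstruct a closed contour from Fourier coefficients.
def reconstruct_contour(coeffs, num_points=50):
    """coeffs: dict mapping degree k -> complex coefficient c_k."""
    t = np.linspace(0, 1, num_points, endpoint=False)
    pts = np.zeros(num_points, dtype=complex)
    for k, c_k in coeffs.items():
        pts += c_k * np.exp(2j * np.pi * k * t)
    # Real part -> x, imaginary part -> y.
    return np.stack([pts.real, pts.imag], axis=-1)  # (num_points, 2)

# A single degree-1 coefficient of magnitude r traces a circle of radius r.
circle = reconstruct_contour({1: 5.0 + 0.0j}, num_points=50)
radii = np.linalg.norm(circle, axis=1)
```

Higher degrees add higher-frequency wiggles, which is why larger fourier_degree values can fit more irregular text boundaries.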

forward(inputs, data_samples=None)[source]
Parameters
  • inputs (List[Tensor]) – Each tensor has the shape of \((N, C_i, H_i, W_i)\).

  • data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.

Returns

A list of dicts with keys cls_res and reg_res, corresponding to the classification and regression results computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).

Return type

list[dict]

forward_single(x)[source]

Forward function for a single feature level.

Parameters

x (Tensor) – The input tensor with the shape of \((N, C_i, H_i, W_i)\).

Returns

The classification and regression results with the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).

Return type

Tensor

class mmocr.models.textdet.heads.PANHead(in_channels, hidden_dim, out_channel, module_loss={'type': 'PANModuleLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'PANPostprocessor'}, init_cfg=[{'type': 'Normal', 'mean': 0, 'std': 0.01, 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'bias': 0, 'layer': 'BN'}])[source]

The class for PANet head.

Parameters
  • in_channels (list[int]) – A list of 4 numbers of input channels.

  • hidden_dim (int) – The hidden dimension of the first convolutional layer.

  • out_channel (int) – Number of output channels.

  • module_loss (dict) – Configuration dictionary for loss type. Defaults to dict(type=’PANModuleLoss’)

  • postprocessor (dict) – Config of postprocessor for PANet. Defaults to dict(type=’PANPostprocessor’, text_repr_type=’poly’).

  • init_cfg (list[dict]) – Initialization configs. Defaults to [dict(type='Normal', mean=0, std=0.01, layer='Conv2d'), dict(type='Constant', val=1, bias=0, layer='BN')].

Return type

None

forward(inputs, data_samples=None)[source]

PAN head forward.

Parameters
  • inputs (list[Tensor] | Tensor) – Each tensor has the shape of \((N, C_i, W, H)\), where \(\sum_iC_i=C_{in}\) and \(C_{in}\) is input_channels.

  • data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.

Returns

A tensor of shape \((N, C_{out}, W, H)\) where \(C_{out}\) is output_channels.

Return type

Tensor

class mmocr.models.textdet.heads.PSEHead(in_channels, hidden_dim, out_channel, module_loss={'type': 'PSEModuleLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'PSEPostprocessor'}, init_cfg=None)[source]

The class for PSENet head.

Parameters
  • in_channels (list[int]) – A list of numbers of input channels.

  • hidden_dim (int) – The hidden dimension of the first convolutional layer.

  • out_channel (int) – Number of output channels.

  • module_loss (dict) – Configuration dictionary for loss type. Supported loss types are “PANModuleLoss” and “PSEModuleLoss”. Defaults to PSEModuleLoss.

  • postprocessor (dict) – Config of postprocessor for PSENet.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

class mmocr.models.textdet.heads.TextSnakeHead(in_channels, out_channels=5, downsample_ratio=1.0, module_loss={'type': 'TextSnakeModuleLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'TextSnakePostprocessor'}, init_cfg={'mean': 0, 'override': {'name': 'out_conv'}, 'std': 0.01, 'type': 'Normal'})[source]

The class for the TextSnake head.

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • downsample_ratio (float) – Downsample ratio.

  • module_loss (dict) – Configuration dictionary for loss type. Defaults to dict(type='TextSnakeModuleLoss').

  • postprocessor (dict) – Config of postprocessor for TextSnake.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

forward(inputs, data_samples=None)[source]
Parameters
  • inputs (torch.Tensor) – Shape \((N, C_{in}, H, W)\), where \(C_{in}\) is in_channels. \(H\) and \(W\) should be the same as the input of backbone.

  • data_samples (list[TextDetDataSample], optional) – A list of data samples. Defaults to None.

Returns

A tensor of shape \((N, 5, H, W)\), where the five channels represent [0]: text score, [1]: center score, [2]: sin, [3]: cos, [4]: radius, respectively.

Return type

Tensor
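The sin and cos channels jointly encode the local text orientation; a decoding step can renormalize them and recover an angle. A minimal sketch of that decoding (illustrative only, not MMOCR's postprocessor):

```python
import math

# Hedged sketch: recover the local orientation angle from the raw
# sin/cos channel values, which are not guaranteed to be normalized.
def decode_orientation(sin_val, cos_val):
    scale = math.sqrt(sin_val ** 2 + cos_val ** 2) or 1.0
    return math.atan2(sin_val / scale, cos_val / scale)

angle = decode_orientation(1.0, 1.0)  # pi/4: text tilted 45 degrees
```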

Text Detection Necks

class mmocr.models.textdet.necks.FPEM_FFM(in_channels, conv_out=128, fpem_repeat=2, align_corners=False, init_cfg={'distribution': 'uniform', 'layer': 'Conv2d', 'type': 'Xavier'})[source]

This code is from https://github.com/WenmuZhou/PAN.pytorch.

Parameters
  • in_channels (list[int]) – A list of 4 numbers of input channels.

  • conv_out (int) – Number of output channels.

  • fpem_repeat (int) – Number of FPEM layers before FFM operations.

  • align_corners (bool) – The interpolation behaviour in FFM operation, used in torch.nn.functional.interpolate().

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(x)[source]
Parameters

x (list[Tensor]) – A list of four tensors of shape \((N, C_i, H_i, W_i)\), representing C2, C3, C4, C5 features respectively. \(C_i\) should match the number in in_channels.

Returns

Four tensors of shape \((N, C_{out}, H_0, W_0)\) where \(C_{out}\) is conv_out.

Return type

list[Tensor]

class mmocr.models.textdet.necks.FPNC(in_channels, lateral_channels=256, out_channels=64, bias_on_lateral=False, bn_re_on_lateral=False, bias_on_smooth=False, bn_re_on_smooth=False, asf_cfg=None, conv_after_concat=False, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv'}, {'type': 'Constant', 'layer': 'BatchNorm', 'val': 1.0, 'bias': 0.0001}])[source]

FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.

This was partially adapted from https://github.com/MhLiao/DB and https://github.com/WenmuZhou/DBNet.pytorch.

Parameters
  • in_channels (list[int]) – A list of numbers of input channels.

  • lateral_channels (int) – Number of channels for lateral layers.

  • out_channels (int) – Number of output channels.

  • bias_on_lateral (bool) – Whether to use bias on lateral convolutional layers.

  • bn_re_on_lateral (bool) – Whether to use BatchNorm and ReLU on lateral convolutional layers.

  • bias_on_smooth (bool) – Whether to use bias on smoothing layer.

  • bn_re_on_smooth (bool) – Whether to use BatchNorm and ReLU on smoothing layer.

  • asf_cfg (dict, optional) – Adaptive Scale Fusion module configs. The attention_type can be ‘ScaleChannelSpatial’.

  • conv_after_concat (bool) – Whether to add a convolution layer after the concatenation of predictions.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

forward(inputs)[source]
Parameters

inputs (list[Tensor]) – Each tensor has the shape of \((N, C_i, H_i, W_i)\). It usually expects 4 tensors (C2-C5 features) from ResNet.

Returns

A tensor of shape \((N, C_{out}, H_0, W_0)\) where \(C_{out}\) is out_channels.

Return type

Tensor

class mmocr.models.textdet.necks.FPNF(in_channels=[256, 512, 1024, 2048], out_channels=256, fusion_type='concat', init_cfg={'distribution': 'uniform', 'layer': 'Conv2d', 'type': 'Xavier'})[source]

FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.

Parameters
  • in_channels (list[int]) – A list of number of input channels. Defaults to [256, 512, 1024, 2048].

  • out_channels (int) – The number of output channels. Defaults to 256.

  • fusion_type (str) – Type of the final feature fusion layer. Available options are “concat” and “add”. Defaults to “concat”.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type=’Xavier’, layer=’Conv2d’, distribution=’uniform’)

Return type

None
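The two fusion_type options differ only in how the aligned per-level features are merged: "concat" stacks them along the channel axis (so the channel count grows), while "add" sums them element-wise (the channel count stays fixed). A numpy sketch of the distinction, assuming the levels have already been resized and projected to a common shape:

```python
import numpy as np

# Hedged sketch of FPNF's final fusion step; array shapes are (N, C, H, W).
def fuse(features, fusion_type='concat'):
    if fusion_type == 'concat':
        return np.concatenate(features, axis=1)  # channels: len(features) * C
    elif fusion_type == 'add':
        return sum(features)                     # channels: C
    raise ValueError(f'Unknown fusion_type: {fusion_type}')

levels = [np.ones((1, 256, 40, 40)) for _ in range(4)]  # 4 aligned levels
cat = fuse(levels, 'concat')
added = fuse(levels, 'add')
```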

forward(inputs)[source]
Parameters

inputs (list[Tensor]) – Each tensor has the shape of \((N, C_i, H_i, W_i)\). It usually expects 4 tensors (C2-C5 features) from ResNet.

Returns

A tensor of shape \((N, C_{out}, H_0, W_0)\) where \(C_{out}\) is out_channels.

Return type

Tensor

class mmocr.models.textdet.necks.FPN_UNet(in_channels, out_channels, init_cfg={'distribution': 'uniform', 'layer': ['Conv2d', 'ConvTranspose2d'], 'type': 'Xavier'})[source]

The class for implementing DRRG and TextSnake U-Net-like FPN.

DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

Parameters
  • in_channels (list[int]) – Number of input channels at each scale. The length of the list should be 4.

  • out_channels (int) – The number of output channels.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

forward(x)[source]
Parameters

x (list[Tensor] | tuple[Tensor]) – A list of four tensors of shape \((N, C_i, H_i, W_i)\), representing C2, C3, C4, C5 features respectively. \(C_i\) should match the number in in_channels.

Returns

Shape \((N, C, H, W)\) where \(H=4H_0\) and \(W=4W_0\).

Return type

Tensor

Text Detection Module Losses

class mmocr.models.textdet.module_losses.DBModuleLoss(loss_prob={'type': 'MaskedBalancedBCEWithLogitsLoss'}, loss_thr={'beta': 0, 'type': 'MaskedSmoothL1Loss'}, loss_db={'type': 'MaskedDiceLoss'}, weight_prob=5.0, weight_thr=10.0, shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_sidelength=8)[source]

The class for implementing DBNet loss.

This is partially adapted from https://github.com/MhLiao/DB.

Parameters
  • loss_prob (dict) – The loss config for probability map. Defaults to dict(type=’MaskedBalancedBCEWithLogitsLoss’).

  • loss_thr (dict) – The loss config for threshold map. Defaults to dict(type=’MaskedSmoothL1Loss’, beta=0).

  • loss_db (dict) – The loss config for binary map. Defaults to dict(type=’MaskedDiceLoss’).

  • weight_prob (float) – The weight of probability map loss. Denoted as \(\alpha\) in paper. Defaults to 5.

  • weight_thr (float) – The weight of threshold map loss. Denoted as \(\beta\) in paper. Defaults to 10.

  • shrink_ratio (float) – The ratio of shrunk text region. Defaults to 0.4.

  • thr_min (float) – The minimum threshold map value. Defaults to 0.3.

  • thr_max (float) – The maximum threshold map value. Defaults to 0.7.

  • min_sidelength (int or float) – The minimum sidelength of the minimum rotated rectangle around any text region. Defaults to 8.

Return type

None
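The roles of weight_prob (α), weight_thr (β), thr_min and thr_max can be sketched in a few lines: the three loss terms are combined with the α/β weights, and the target threshold map is clamped to [thr_min, thr_max]. A hedged sketch with illustrative numbers (the exact weighting arrangement inside MMOCR may differ):

```python
import numpy as np

# Hedged sketch: combine DB's three loss terms with alpha/beta weights.
def total_db_loss(loss_prob, loss_thr, loss_db,
                  weight_prob=5.0, weight_thr=10.0):
    return weight_prob * loss_prob + weight_thr * loss_thr + loss_db

# Hedged sketch: the threshold map is bounded to [thr_min, thr_max].
def clamp_thr_map(thr_map, thr_min=0.3, thr_max=0.7):
    return np.clip(thr_map, thr_min, thr_max)

loss = total_db_loss(loss_prob=0.2, loss_thr=0.05, loss_db=0.1)
thr = clamp_thr_map(np.array([0.0, 0.5, 1.0]))
```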

forward(preds, data_samples)[source]

Compute DBNet loss.

Parameters
  • preds (tuple(tensor)) – Raw predictions from model, containing prob_logits, thr_map and binary_map. Each is a tensor of shape \((N, H, W)\).

  • data_samples (list[TextDetDataSample]) – The data samples.

Returns

The dict of DBNet losses with loss_prob, loss_db and loss_thr.

Return type

dict

get_targets(data_samples)[source]

Generate loss targets from data samples.

Parameters

data_samples (list(TextDetDataSample)) – Ground truth data samples.

Returns

A tuple of four tensors as DBNet targets.

Return type

tuple

class mmocr.models.textdet.module_losses.DRRGModuleLoss(ohem_ratio=3.0, downsample_ratio=1.0, orientation_thr=2.0, resample_step=8.0, num_min_comps=9, num_max_comps=600, min_width=8.0, max_width=24.0, center_region_shrink_ratio=0.3, comp_shrink_ratio=1.0, comp_w_h_ratio=0.3, text_comp_nms_thr=0.25, min_rand_half_height=8.0, max_rand_half_height=24.0, jitter_level=0.2, loss_text={'eps': 1e-05, 'fallback_negative_num': 100, 'type': 'MaskedBalancedBCEWithLogitsLoss'}, loss_center={'type': 'MaskedBCEWithLogitsLoss'}, loss_top={'reduction': 'none', 'type': 'SmoothL1Loss'}, loss_btm={'reduction': 'none', 'type': 'SmoothL1Loss'}, loss_sin={'type': 'MaskedSmoothL1Loss'}, loss_cos={'type': 'MaskedSmoothL1Loss'}, loss_gcn={'type': 'CrossEntropyLoss'})[source]

The class for implementing DRRG loss. This is partially adapted from https://github.com/GXYM/DRRG licensed under the MIT license.

DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

Parameters
  • ohem_ratio (float) – The negative/positive ratio in ohem. Defaults to 3.0.

  • downsample_ratio (float) – Downsample ratio. Defaults to 1.0. TODO: remove it.

  • orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle. Defaults to 2.0.

  • resample_step (float) – The step size for resampling the text center line. Defaults to 8.0.

  • num_min_comps (int) – The minimum number of text components, which should be larger than k_hop1 mentioned in paper. Defaults to 9.

  • num_max_comps (int) – The maximum number of text components. Defaults to 600.

  • min_width (float) – The minimum width of text components. Defaults to 8.0.

  • max_width (float) – The maximum width of text components. Defaults to 24.0.

  • center_region_shrink_ratio (float) – The shrink ratio of text center regions. Defaults to 0.3.

  • comp_shrink_ratio (float) – The shrink ratio of text components. Defaults to 1.0.

  • comp_w_h_ratio (float) – The width to height ratio of text components. Defaults to 0.3.

  • min_rand_half_height (float) – The minimum half-height of random text components. Defaults to 8.0.

  • max_rand_half_height (float) – The maximum half-height of random text components. Defaults to 24.0.

  • jitter_level (float) – The jitter level of text component geometric features. Defaults to 0.2.

  • loss_text (dict) – The loss config used to calculate the text loss. Defaults to dict(type='MaskedBalancedBCEWithLogitsLoss', fallback_negative_num=100, eps=1e-5).

  • loss_center (dict) – The loss config used to calculate the center loss. Defaults to dict(type='MaskedBCEWithLogitsLoss').

  • loss_top (dict) – The loss config used to calculate the top loss, which is a part of the height loss. Defaults to dict(type='SmoothL1Loss', reduction='none').

  • loss_btm (dict) – The loss config used to calculate the bottom loss, which is a part of the height loss. Defaults to dict(type='SmoothL1Loss', reduction='none').

  • loss_sin (dict) – The loss config used to calculate the sin loss. Defaults to dict(type='MaskedSmoothL1Loss').

  • loss_cos (dict) – The loss config used to calculate the cos loss. Defaults to dict(type='MaskedSmoothL1Loss').

  • loss_gcn (dict) – The loss config used to calculate the GCN loss. Defaults to dict(type='CrossEntropyLoss').

  • text_comp_nms_thr (float) – The NMS threshold of text components. Defaults to 0.25.

Return type

None

forward(preds, data_samples)[source]

Compute DRRG loss.

Parameters
  • preds (tuple) – The prediction tuple (pred_maps, gcn_pred, gt_labels), each of shape \((N, 6, H, W)\), \((N, 2)\) and \((m, n)\), where \(m * n = N\).

  • data_samples (list[TextDetDataSample]) – The data samples.

Returns

A loss dict with loss_text, loss_center, loss_height, loss_sin, loss_cos, and loss_gcn.

Return type

dict

get_targets(data_samples)[source]

Generate loss targets from data samples.

Parameters

data_samples (list(TextDetDataSample)) – Ground truth data samples.

Returns

A tuple of 8 lists of tensors as DRRG targets. Read docstring of _get_target_single for more details.

Return type

tuple

class mmocr.models.textdet.module_losses.FCEModuleLoss(fourier_degree, num_sample, negative_ratio=3.0, resample_step=4.0, center_region_shrink_ratio=0.3, level_size_divisors=(8, 16, 32), level_proportion_range=((0, 0.4), (0.3, 0.7), (0.6, 1.0)), loss_tr={'type': 'MaskedBalancedBCELoss'}, loss_tcl={'type': 'MaskedBCELoss'}, loss_reg_x={'reduction': 'none', 'type': 'SmoothL1Loss'}, loss_reg_y={'reduction': 'none', 'type': 'SmoothL1Loss'})[source]

The class for implementing FCENet loss.

FCENet(CVPR2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • num_sample (int) – The sampling points number of regression loss. If it is too small, fcenet tends to be overfitting.

  • negative_ratio (float or int) – Maximum ratio of negative samples to positive ones in OHEM. Defaults to 3.

  • resample_step (float) – The step size for resampling the text center line (TCL). It’s better not to exceed half of the minimum width.

  • center_region_shrink_ratio (float) – The shrink ratio of text center region.

  • level_size_divisors (tuple(int)) – The downsample ratio on each level.

  • level_proportion_range (tuple(tuple(int))) – The range of text sizes assigned to each level.

  • loss_tr (dict) – The loss config used to calculate the text region loss. Defaults to dict(type=’MaskedBalancedBCELoss’).

  • loss_tcl (dict) – The loss config used to calculate the text center line loss. Defaults to dict(type=’MaskedBCELoss’).

  • loss_reg_x (dict) – The loss config used to calculate the regression loss on the x axis. Defaults to dict(type='SmoothL1Loss', reduction='none').

  • loss_reg_y (dict) – The loss config used to calculate the regression loss on the y axis. Defaults to dict(type='SmoothL1Loss', reduction='none').

Return type

None

forward(preds, data_samples)[source]

Compute FCENet loss.

Parameters
  • preds (list[dict]) – A list of dicts with keys cls_res and reg_res, corresponding to the classification and regression results computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).

  • data_samples (list[TextDetDataSample]) – The data samples.

Returns

The dict of FCENet losses with loss_text, loss_center, loss_reg_x and loss_reg_y.

Return type

dict

forward_single(pred, gt)[source]

Compute loss for one feature level.

Parameters
  • pred (dict) – A dict with keys cls_res and reg_res, corresponding to the classification and regression results from one feature level.

  • gt (Tensor) – Ground truth for one feature level. Cls and reg targets are concatenated along the channel dimension.

Returns

A list of losses for each feature level.

Return type

list[Tensor]

get_targets(data_samples)[source]

Generate loss targets for FCENet from data samples.

Parameters

data_samples (list(TextDetDataSample)) – Ground truth data samples.

Returns

A tuple of three tensors from three different feature levels as FCENet targets.

Return type

tuple[Tensor]

class mmocr.models.textdet.module_losses.PANModuleLoss(loss_text={'type': 'MaskedSquareDiceLoss'}, loss_kernel={'type': 'MaskedSquareDiceLoss'}, loss_embedding={'type': 'PANEmbLossV1'}, weight_text=1.0, weight_kernel=0.5, weight_embedding=0.25, ohem_ratio=3, shrink_ratio=(1.0, 0.5), max_shrink_dist=20, reduction='mean')[source]

The class for implementing PANet loss. This was partially adapted from https://github.com/whai362/pan_pp.pytorch and https://github.com/WenmuZhou/PAN.pytorch.

PANet: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network.

Parameters
  • loss_text (dict) – The text loss config. Defaults to dict(type='MaskedSquareDiceLoss').

  • loss_kernel (dict) – The kernel loss config. Defaults to dict(type='MaskedSquareDiceLoss').

  • loss_embedding (dict) – The embedding loss config. Defaults to dict(type='PANEmbLossV1').

  • weight_text (float) – The weight of text loss. Defaults to 1.

  • weight_kernel (float) – The weight of kernel loss. Defaults to 0.5.

  • weight_embedding (float) – The weight of embedding loss. Defaults to 0.25.

  • ohem_ratio (float) – The negative/positive ratio in ohem. Defaults to 3.

  • shrink_ratio (tuple[float]) – The ratio of shrinking kernel. Defaults to (1.0, 0.5).

  • max_shrink_dist (int or float) – The maximum shrinking distance. Defaults to 20.

  • reduction (str) – The way to reduce the loss. Available options are “mean” and “sum”. Defaults to ‘mean’.

Return type

None
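The ohem_ratio controls online hard example mining: all positive pixels are kept, while only the hardest ohem_ratio × num_positives negative pixels contribute to the loss. A hedged numpy sketch of that selection rule (illustrative; MMOCR's actual OHEM operates on loss maps with masks):

```python
import numpy as np

# Hedged sketch of OHEM with ohem_ratio=3: keep every positive pixel and
# only the highest-loss (hardest) negatives, capped at ratio * num_pos.
def ohem_select(losses, pos_mask, ohem_ratio=3.0):
    losses = np.asarray(losses, dtype=float)
    pos_mask = np.asarray(pos_mask, dtype=bool)
    num_pos = int(pos_mask.sum())
    num_neg = min(int(ohem_ratio * num_pos), int((~pos_mask).sum()))
    neg_losses = np.sort(losses[~pos_mask])[::-1][:num_neg]  # hardest first
    selected = np.concatenate([losses[pos_mask], neg_losses])
    return float(selected.mean()) if selected.size else 0.0

# One positive pixel -> at most 3 negatives are kept.
loss = ohem_select([0.9, 0.1, 0.8, 0.2, 0.3, 0.05], [1, 0, 0, 0, 0, 0])
```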

forward(preds, data_samples)[source]

Compute PAN loss.

Parameters
  • preds (dict) – Raw predictions from model with shape \((N, C, H, W)\).

  • data_samples (list[TextDetDataSample]) – The data samples.

Returns

The dict of PAN losses with loss_text, loss_kernel, loss_aggregation and loss_discrimination.

Return type

dict

get_targets(data_samples)[source]

Generate the gt targets for PANet.

Parameters

data_samples (list[TextDetDataSample]) – Ground truth data samples.

Returns

The output result dictionary.

Return type

dict

class mmocr.models.textdet.module_losses.PSEModuleLoss(weight_text=0.7, weight_kernel=0.3, loss_text={'type': 'MaskedSquareDiceLoss'}, loss_kernel={'type': 'MaskedSquareDiceLoss'}, ohem_ratio=3, reduction='mean', kernel_sample_type='adaptive', shrink_ratio=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4), max_shrink_dist=20)[source]

The class for implementing PSENet loss. This is partially adapted from https://github.com/whai362/PSENet.

PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network.

Parameters
  • weight_text (float) – The weight of text loss. Defaults to 0.7.

  • weight_kernel (float) – The weight of text kernel. Defaults to 0.3.

  • loss_text (dict) – Loss type for text. Defaults to dict(type='MaskedSquareDiceLoss').

  • loss_kernel (dict) – Loss type for kernel. Defaults to dict(type='MaskedSquareDiceLoss').

  • ohem_ratio (int or float) – The negative/positive ratio in ohem. Defaults to 3.

  • reduction (str) – The way to reduce the loss. Defaults to ‘mean’. Options are ‘mean’ and ‘sum’.

  • kernel_sample_type (str) – The way to sample kernels. Defaults to 'adaptive'. Options are 'adaptive' and 'hard'.

  • shrink_ratio (tuple) – The ratios for shrinking text instances. Defaults to (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4).

  • max_shrink_dist (int or float) – The maximum shrinking distance. Defaults to 20.

Return type

None
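The shrink_ratio tuple produces the progressively smaller "kernels" PSENet expands from. A common offset rule for shrinking a polygon (from the Vatti-clipping formulation used by this family of methods) is d = area × (1 − r²) / perimeter for shrink ratio r. A hedged sketch for a rectangle, with illustrative numbers:

```python
# Hedged sketch: the per-ratio inward offset used to build shrunk kernels,
# d = area * (1 - r^2) / perimeter. Real kernels are built with polygon
# clipping (e.g. pyclipper); this only shows the offset magnitudes.
def shrink_offset(area, perimeter, ratio):
    return area * (1.0 - ratio ** 2) / perimeter

# A 100x20 rectangle under the default PSE ratios: offsets grow as r falls.
offsets = [round(shrink_offset(100 * 20, 2 * (100 + 20), r), 3)
           for r in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4)]
```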

forward(preds, data_samples)[source]

Compute PSENet loss.

Parameters
  • preds (dict) – Raw predictions from the model.

  • data_samples (list[TextDetDataSample]) – The data samples.

Returns

The dict of PSE losses with loss_text and loss_kernel.

Return type

dict

class mmocr.models.textdet.module_losses.SegBasedModuleLoss[source]

Base class for the module loss of segmentation-based text detection algorithms with some handy utilities.

Return type

None

class mmocr.models.textdet.module_losses.TextSnakeModuleLoss(ohem_ratio=3.0, downsample_ratio=1.0, orientation_thr=2.0, resample_step=4.0, center_region_shrink_ratio=0.3, loss_text={'eps': 1e-05, 'fallback_negative_num': 100, 'type': 'MaskedBalancedBCEWithLogitsLoss'}, loss_center={'type': 'MaskedBCEWithLogitsLoss'}, loss_radius={'type': 'MaskedSmoothL1Loss'}, loss_sin={'type': 'MaskedSmoothL1Loss'}, loss_cos={'type': 'MaskedSmoothL1Loss'})[source]

The class for implementing TextSnake loss. This is partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

Parameters
  • ohem_ratio (float) – The negative/positive ratio in ohem.

  • downsample_ratio (float) – Downsample ratio. Defaults to 1.0. TODO: remove it.

  • orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.

  • resample_step (float) – The step of resampling.

  • center_region_shrink_ratio (float) – The shrink ratio of text center.

  • loss_text (dict) – The loss config used to calculate the text loss.

  • loss_center (dict) – The loss config used to calculate the center loss.

  • loss_radius (dict) – The loss config used to calculate the radius loss.

  • loss_sin (dict) – The loss config used to calculate the sin loss.

  • loss_cos (dict) – The loss config used to calculate the cos loss.

Return type

None

forward(preds, data_samples)[source]
Parameters
  • preds (Tensor) – The prediction map of shape \((N, 5, H, W)\), where each dimension is the map of “text_region”, “center_region”, “sin_map”, “cos_map”, and “radius_map” respectively.

  • data_samples (list[TextDetDataSample]) – The data samples.

Returns

A loss dict with loss_text, loss_center, loss_radius, loss_sin and loss_cos.

Return type

dict

get_targets(data_samples)[source]

Generate loss targets from data samples.

Parameters

data_samples (list(TextDetDataSample)) – Ground truth data samples.

Returns

tuple(gt_text_masks, gt_masks, gt_center_region_masks, gt_radius_maps, gt_sin_maps, gt_cos_maps): A tuple of six lists of ndarrays as the targets.

Return type

Tuple

vector_angle(vec1, vec2)[source]

Compute the angle between two vectors.

Parameters
  • vec1 (numpy.ndarray) –

  • vec2 (numpy.ndarray) –

Return type

numpy.ndarray

vector_cos(vec)[source]

Compute the cos of the angle between a vector and the x-axis.

Parameters

vec (numpy.ndarray) –

Return type

float

vector_sin(vec)[source]

Compute the sin of the angle between a vector and the x-axis.

Parameters

vec (numpy.ndarray) –

Return type

float

vector_slope(vec)[source]

Compute the slope of a vector.

Parameters

vec (numpy.ndarray) –

Return type

float
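The geometry behind these vector utilities is small enough to sketch: for a vector (x, y), cos is x/|v|, sin is y/|v|, and the slope is |y/x|. A hedged numpy sketch (signatures mirror the docs above; MMOCR's exact epsilon handling may differ):

```python
import numpy as np

# Hedged sketch of the vector utilities; eps guards against zero division.
def vec_cos(vec):
    return vec[0] / (np.linalg.norm(vec) + 1e-8)

def vec_sin(vec):
    return vec[1] / (np.linalg.norm(vec) + 1e-8)

def vec_slope(vec):
    return abs(vec[1] / (vec[0] + 1e-8))

v = np.array([3.0, 4.0])  # a 3-4-5 triangle: cos=0.6, sin=0.8
c, s, k = vec_cos(v), vec_sin(v), vec_slope(v)
```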

Text Detection Data Preprocessors

class mmocr.models.textdet.data_preprocessors.TextDetDataPreprocessor(mean=None, std=None, pad_size_divisor=1, pad_value=0, bgr_to_rgb=False, rgb_to_bgr=False, batch_augments=None)[source]

Image pre-processor for detection tasks.

Compared with mmengine.ImgDataPreprocessor, it:

  1. Supports batch augmentations.

  2. Additionally appends batch_input_shape and pad_shape to data_samples, considering the object detection task.

It provides the data pre-processing as follows:

  • Collate and move data to the target device.

  • Pad inputs to the maximum size of the current batch with the defined pad_value, such that the padded size is divisible by pad_size_divisor.

  • Stack inputs to batch_inputs.

  • Convert inputs from bgr to rgb if the shape of input is (3, H, W).

  • Normalize image with defined std and mean.

  • Do batch augmentations during training.
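The padding rule above can be sketched in a couple of lines: each spatial dimension is rounded up to the next multiple of pad_size_divisor (assuming a divisor of 32, a common choice for FPN-style backbones):

```python
import math

# Hedged sketch: compute the padded (H, W) a batch image would get so that
# both dimensions are divisible by pad_size_divisor.
def padded_shape(h, w, pad_size_divisor=32):
    return (math.ceil(h / pad_size_divisor) * pad_size_divisor,
            math.ceil(w / pad_size_divisor) * pad_size_divisor)

shape = padded_shape(730, 1333, pad_size_divisor=32)
```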

Parameters
  • mean (Sequence[Number], optional) – The pixel mean of R, G, B channels. Defaults to None.

  • std (Sequence[Number], optional) – The pixel standard deviation of R, G, B channels. Defaults to None.

  • pad_size_divisor (int) – The size of padded image should be divisible by pad_size_divisor. Defaults to 1.

  • pad_value (Number) – The padded pixel value. Defaults to 0.

  • pad_mask (bool) – Whether to pad instance masks. Defaults to False.

  • mask_pad_value (int) – The padded pixel value for instance masks. Defaults to 0.

  • pad_seg (bool) – Whether to pad semantic segmentation maps. Defaults to False.

  • seg_pad_value (int) – The padded pixel value for semantic segmentation maps. Defaults to 255.

  • bgr_to_rgb (bool) – Whether to convert images from BGR to RGB. Defaults to False.

  • rgb_to_bgr (bool) – Whether to convert images from RGB to BGR. Defaults to False.

  • batch_augments (list[dict], optional) – Batch-level augmentations

Return type

None

forward(data, training=False)[source]

Perform normalization, padding and BGR-to-RGB conversion based on BaseDataPreprocessor.

Parameters
  • data (dict) – data sampled from dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns

Data in the same format as the model input.

Return type

dict

Text Detection Postprocessors

class mmocr.models.textdet.postprocessors.BaseTextDetPostProcessor(text_repr_type='poly', rescale_fields=None, train_cfg=None, test_cfg=None)[source]

Base postprocessor for text detection models.

Parameters
  • text_repr_type (str) – The boundary encoding type, ‘poly’ or ‘quad’. Defaults to ‘poly’.

  • rescale_fields (list[str], optional) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed.

  • train_cfg (dict, optional) – The parameters to be passed to self.get_text_instances in training. Defaults to None.

  • test_cfg (dict, optional) – The parameters to be passed to self.get_text_instances in testing. Defaults to None.

Return type

None

get_text_instances(pred_results, data_sample, **kwargs)[source]

Get text instance predictions of one image.

Parameters
  • pred_results (Tensor or list[Tensor]) – Prediction results of an image.

  • data_sample (TextDetDataSample) – Datasample of an image.

  • **kwargs – Other parameters. Configurable via __init__.train_cfg and __init__.test_cfg.

Returns

A new DataSample with predictions filled in. The polygon/bbox results are usually saved in TextDetDataSample.pred_instances.polygons or TextDetDataSample.pred_instances.bboxes. The confidence scores are saved in TextDetDataSample.pred_instances.scores.

Return type

TextDetDataSample

poly_nms(polygons, scores, threshold)[source]

Non-maximum suppression for text detection.

Parameters
  • polygons (list[ndarray]) – List of polygons.

  • scores (list[float]) – List of scores.

  • threshold (float) – Threshold for NMS.

Returns

  • keep_polys (list[ndarray]): List of preserved polygons after NMS.

  • keep_scores (list[float]): List of preserved scores after NMS.

Return type

tuple(keep_polys, keep_scores)
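The suppression loop that poly_nms performs can be sketched as a greedy pass over score-sorted candidates. This is a simplified illustration, not the mmocr implementation: the polygon IoU computation is abstracted behind a caller-supplied iou_fn, since the real code operates on polygon geometry.

```python
def greedy_nms(polygons, scores, threshold, iou_fn):
    # Visit candidates from highest to lowest score; keep a candidate
    # only if it does not overlap an already-kept one beyond threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep_polys, keep_scores, suppressed = [], [], set()
    for i in order:
        if i in suppressed:
            continue
        keep_polys.append(polygons[i])
        keep_scores.append(scores[i])
        for j in order:
            if j != i and j not in suppressed and \
                    iou_fn(polygons[i], polygons[j]) > threshold:
                suppressed.add(j)
    return keep_polys, keep_scores
```

Any IoU function over the polygon representation can be plugged in; the greedy structure is what matters here.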

rescale(results, scale_factor)[source]

Rescale results in results.pred_instances according to scale_factor, whose keys are defined in self.rescale_fields. Usually used to rescale bboxes and/or polygons.

Parameters
Returns

Prediction results with rescaled results.

Return type

TextDetDataSample

split_results(pred_results)[source]

Split batched tensor(s) along the first dimension and pack the split tensors into a list.

Parameters

pred_results (tensor or list[tensor]) – Raw result tensor(s) from detection head. Each tensor usually has the shape of (N, …)

Returns

N tensors if pred_results is a tensor, or a list of N lists of tensors if pred_results is a list of tensors.

Return type

list[tensor] or list[list[tensor]]
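The splitting logic can be sketched in plain Python, with tuples standing in for tensors (illustrative only, not the mmocr implementation):

```python
def split_results(pred_results):
    # A single batched "tensor" (here a tuple) of shape (N, ...) is split
    # into its N per-image elements; a list of such tensors is regrouped
    # so each of the N outputs gathers one slice from every tensor.
    if isinstance(pred_results, list):
        n = len(pred_results[0])
        return [[t[i] for t in pred_results] for i in range(n)]
    return list(pred_results)
```

So two head outputs for a batch of two images become two per-image lists, each holding one slice from every head.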

class mmocr.models.textdet.postprocessors.DBPostprocessor(text_repr_type='poly', rescale_fields=['polygons'], mask_thr=0.3, min_text_score=0.3, min_text_width=5, unclip_ratio=1.5, epsilon_ratio=0.01, max_candidates=3000, **kwargs)[source]

Decoding predictions of DbNet to instances. This is partially adapted from https://github.com/MhLiao/DB.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.

  • rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].

  • mask_thr (float) – The mask threshold value for binarization. Defaults to 0.3.

  • min_text_score (float) – The threshold value for converting binary map to shrink text regions. Defaults to 0.3.

  • min_text_width (int) – The minimum width of boundary polygon/box predicted. Defaults to 5.

  • unclip_ratio (float) – The unclip ratio for text regions dilation. Defaults to 1.5.

  • epsilon_ratio (float) – The epsilon ratio for approximation accuracy. Defaults to 0.01.

  • max_candidates (int) – The maximum candidate number. Defaults to 3000.

Return type

None
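The first decoding step, thresholding the probability map at mask_thr, can be sketched on a nested-list stand-in for the tensor. This is an illustrative helper, not the mmocr implementation; contour extraction and unclipping follow on the resulting mask.

```python
def binarize(prob_map, mask_thr=0.3):
    # Pixels above the mask threshold form the binary text mask from
    # which candidate contours are later extracted and unclipped.
    return [[1 if p > mask_thr else 0 for p in row] for row in prob_map]
```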

get_text_instances(prob_map, data_sample)[source]

Get text instance predictions of one image.

Parameters
  • prob_map (Tensor) – DBNet's output prob_map of shape \((H, W)\).

  • data_sample (TextDetDataSample) – Datasample of an image.

Returns

A new DataSample with predictions filled in. The polygon results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.

Return type

TextDetDataSample

class mmocr.models.textdet.postprocessors.DRRGPostprocessor(link_thr=0.8, edge_len_thr=50.0, rescale_fields=['polygons'], **kwargs)[source]

Merge text components and construct boundaries of text instances.

Parameters
  • link_thr (float) – The edge score threshold. Defaults to 0.8.

  • edge_len_thr (int or float) – The edge length threshold. Defaults to 50.

  • rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to ['polygons'].

Return type

None

get_text_instances(pred_results, data_sample)[source]

Get text instance predictions of one image.

Parameters
  • pred_results (tuple(ndarray, ndarray, ndarray)) – Prediction results edge, score and text_comps. Each of shape \((N_{edges}, 2)\), \((N_{edges},)\) and \((M, 9)\), respectively.

  • data_sample (TextDetDataSample) – Datasample of an image.

Returns

The original DataSample with predictions filled in. The polygon results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.

Return type

TextDetDataSample

split_results(pred_results)[source]

Split batched elements in pred_results along the first dimension into batch_num sub-elements and regather them into a list of dicts.

However, DRRG only outputs one batch at inference time, so this function is a no-op.

Parameters

pred_results (Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]) –

Return type

List[Tuple]

class mmocr.models.textdet.postprocessors.FCEPostprocessor(fourier_degree, num_reconstr_points, rescale_fields=['polygons'], scales=[8, 16, 32], text_repr_type='poly', alpha=1.0, beta=2.0, score_thr=0.3, nms_thr=0.1, **kwargs)[source]

Decoding predictions of FCENet to instances.

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • num_reconstr_points (int) – The points number of the polygon reconstructed from predicted Fourier coefficients.

  • rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].

  • scales (list[int]) – The down-sample scale of each layer. Defaults to [8, 16, 32].

  • text_repr_type (str) – Boundary encoding type 'poly' or 'quad'. Defaults to 'poly'.

  • alpha (float) – The parameter to calculate final scores \(Score_{final} = Score_{text region}^{alpha} \cdot Score_{text center region}^{beta}\). Defaults to 1.0.

  • beta (float) – The parameter to calculate final score. Defaults to 2.0.

  • score_thr (float) – The threshold used to filter out the final candidates. Defaults to 0.3.

  • nms_thr (float) – The threshold of nms. Defaults to 0.1.

Return type

None
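The final-score formula above can be written out directly; this is a one-line illustrative helper mirroring the documented combination of the text-region and text-center-region scores (not the mmocr source):

```python
def final_score(text_region_score, center_region_score, alpha=1.0, beta=2.0):
    # Combine the two region scores into the final confidence that is
    # then filtered against score_thr.
    return (text_region_score ** alpha) * (center_region_score ** beta)
```

With the defaults alpha=1.0 and beta=2.0, the center-region score is weighted more heavily than the text-region score.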

get_text_instances(pred_results, data_sample)[source]

Get text instance predictions of one image.

Parameters
  • pred_results (List[dict]) – A list of dict with keys of cls_res, reg_res corresponding to the classification result and regression result computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).

  • data_sample (TextDetDataSample) – Datasample of an image.

Returns

A new DataSample with predictions filled in. The polygon results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.

Return type

TextDetDataSample

split_results(pred_results)[source]

Split batched elements in pred_results along the first dimension into batch_num sub-elements and regather them into a list of dicts.

Parameters

pred_results (list[dict]) – A list of dict with keys of cls_res, reg_res corresponding to the classification result and regression result computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).

Returns

N lists. Each list contains three dicts from different feature levels.

Return type

list[list[dict]]

class mmocr.models.textdet.postprocessors.PANPostprocessor(text_repr_type='poly', score_threshold=0.3, rescale_fields=['polygons'], min_text_confidence=0.5, min_kernel_confidence=0.5, distance_threshold=3.0, min_text_area=16, downsample_ratio=0.25)[source]

Convert scores to quadrangles via post-processing in PANet. This is partially adapted from https://github.com/WenmuZhou/PAN.pytorch.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.

  • score_threshold (float) – The minimal text score. Defaults to 0.3.

  • rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].

  • min_text_confidence (float) – The minimal text confidence. Defaults to 0.5.

  • min_kernel_confidence (float) – The minimal kernel confidence. Defaults to 0.5.

  • distance_threshold (float) – The minimal distance between a point and the mean of the text kernel. Defaults to 3.0.

  • min_text_area (int) – The minimal text instance region area. Defaults to 16.

  • downsample_ratio (float) – Downsample ratio. Defaults to 0.25.

Return type

None

get_text_instances(pred_results, data_sample, **kwargs)[source]

Get text instance predictions of one image.

Parameters
Returns

A new DataSample with predictions filled in. The polygon results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.

Return type

TextDetDataSample

split_results(pred_results)[source]

Split the prediction results into text score and kernel score.

Parameters

pred_results (torch.Tensor) – The prediction results.

Returns

The text score and kernel score.

Return type

List[torch.Tensor]
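The split can be sketched in plain Python on a nested-list stand-in for the tensor. Note the channel layout below (channel 0 as the text score map, remaining channels as kernel score maps) is an assumption for illustration, not taken from the mmocr source:

```python
def split_pan_results(pred):
    # Assumed layout: pred[0] is the text score map, pred[1:] are the
    # kernel score maps (this layout is hypothetical, for illustration).
    return pred[0], pred[1:]
```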

class mmocr.models.textdet.postprocessors.PSEPostprocessor(text_repr_type='poly', rescale_fields=['polygons'], min_kernel_confidence=0.5, score_threshold=0.3, min_kernel_area=0, min_text_area=16, downsample_ratio=0.25)[source]

Decoding predictions of PSENet to instances. This is partially adapted from https://github.com/whai362/PSENet.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’. Defaults to ‘poly’.

  • rescale_fields (list[str]) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed. Defaults to [‘polygons’].

  • min_kernel_confidence (float) – The minimal kernel confidence. Defaults to 0.5.

  • score_threshold (float) – The minimal text average confidence. Defaults to 0.3.

  • min_kernel_area (int) – The minimal text kernel area. Defaults to 0.

  • min_text_area (int) – The minimal text instance region area. Defaults to 16.

  • downsample_ratio (float) – Downsample ratio. Defaults to 0.25.

Return type

None

get_text_instances(pred_results, data_sample, **kwargs)[source]
Parameters
Returns

A new DataSample with predictions filled in. The polygon results are saved in TextDetDataSample.pred_instances.polygons. The confidence scores are saved in TextDetDataSample.pred_instances.scores.

Return type

TextDetDataSample

class mmocr.models.textdet.postprocessors.TextSnakePostprocessor(text_repr_type='poly', min_text_region_confidence=0.6, min_center_region_confidence=0.2, min_center_area=30, disk_overlap_thr=0.03, radius_shrink_ratio=1.03, rescale_fields=['polygons'], **kwargs)[source]

Decoding predictions of TextSnake to instances. This was partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’.

  • min_text_region_confidence (float) – The confidence threshold of text region in TextSnake.

  • min_center_region_confidence (float) – The confidence threshold of text center region in TextSnake.

  • min_center_area (int) – The minimal text center region area.

  • disk_overlap_thr (float) – The radius overlap threshold for merging disks.

  • radius_shrink_ratio (float) – The shrink ratio of ordered disks radii.

  • rescale_fields (list[str], optional) – The bbox/polygon field names to be rescaled. If None, no rescaling will be performed.

Return type

None

get_text_instances(pred_results, data_sample)[source]
Parameters
Returns

The instance boundary and its confidence.

Return type

list[list[float]]

split_results(pred_results)[source]

Split the prediction results into text score and kernel score.

Parameters

pred_results (torch.Tensor) – The prediction results.

Returns

The text score and kernel score.

Return type

List[torch.Tensor]

Text Recognition Recognizer

class mmocr.models.textrecog.recognizers.ABINet(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

Implementation of `Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition <https://arxiv.org/pdf/2103.06495.pdf>`_.

Parameters
Return type

None

class mmocr.models.textrecog.recognizers.BaseRecognizer(data_preprocessor=None, init_cfg=None)[source]

Base class for recognizer.

Parameters
  • data_preprocessor (dict or ConfigDict, optional) – The pre-process config of BaseDataPreprocessor. It usually includes pad_size_divisor, pad_value, mean and std.

  • init_cfg (dict or ConfigDict or List[dict], optional) – The config to control the initialization. Defaults to None.

abstract extract_feat(inputs)[source]

Extract features from images.

Parameters

inputs (torch.Tensor) –

Return type

torch.Tensor

forward(inputs, data_samples=None, mode='tensor', **kwargs)[source]

The unified entry for a forward process in both training and test.

The method should accept three modes: "tensor", "predict" and "loss":

  • "tensor": Forward the whole network and return tensor or tuple of tensor without any post-processing, same as a common nn.Module.

  • "predict": Forward and return the predictions, which are fully processed to a list of DetDataSample.

  • "loss": Forward and return a dict of losses according to the given inputs and data samples.

Note that this method handles neither back-propagation nor optimizer updating, which are done in train_step().

Parameters
  • inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.

  • data_samples (list[DetDataSample], optional) – The annotation data of every sample. Defaults to None.

  • mode (str) – Return what kind of value. Defaults to ‘tensor’.

Returns

The return type depends on mode.

  • If mode="tensor", return a tensor or a tuple of tensor.

  • If mode="predict", return a list of DetDataSample.

  • If mode="loss", return a dict of tensor.

Return type

Union[Dict[str, torch.Tensor], List[mmocr.structures.textrecog_data_sample.TextRecogDataSample], Tuple[torch.Tensor], torch.Tensor]
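The mode dispatch above can be sketched as follows. This is an illustrative stand-alone sketch, not the mmocr implementation: run_network and the dict shapes are hypothetical stand-ins for the real network and data structures.

```python
def run_network(inputs):
    # Hypothetical stand-in for the real forward pass of the recognizer.
    return [x * 2 for x in inputs]

def forward(inputs, data_samples=None, mode='tensor'):
    feats = run_network(inputs)
    if mode == 'tensor':
        return feats                           # raw output, no post-processing
    if mode == 'predict':
        return [{'pred': f, 'sample': s}       # fully processed predictions
                for f, s in zip(feats, data_samples or [])]
    if mode == 'loss':
        return {'loss_ce': sum(feats)}         # a dict of losses
    raise RuntimeError(f'Invalid mode "{mode}"')
```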

abstract loss(inputs, data_samples, **kwargs)[source]

Calculate losses from a batch of inputs and data samples.

Parameters
Return type

Union[dict, tuple]

abstract predict(inputs, data_samples, **kwargs)[source]

Predict results from a batch of inputs and data samples with post-processing.

Parameters
Return type

List[mmocr.structures.textrecog_data_sample.TextRecogDataSample]

property with_backbone

whether the recognizer has a backbone

Type

bool

property with_decoder

whether the recognizer has a decoder

Type

bool

property with_encoder

whether the recognizer has an encoder

Type

bool

property with_preprocessor

whether the recognizer has a preprocessor

Type

bool

class mmocr.models.textrecog.recognizers.CRNN(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

CTC-loss based recognizer.

Parameters
Return type

None

class mmocr.models.textrecog.recognizers.EncoderDecoderRecognizer(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

Base class for encoder-decoder recognizers.

Parameters
  • preprocessor (dict, optional) – Config dict for preprocessor. Defaults to None.

  • backbone (dict, optional) – Backbone config. Defaults to None.

  • encoder (dict, optional) – Encoder config. If None, the output from backbone will be directly fed into decoder. Defaults to None.

  • decoder (dict, optional) – Decoder config. Defaults to None.

  • data_preprocessor (dict, optional) – Model preprocessing config for processing the input image data. Keys allowed are to_rgb (bool), pad_size_divisor (int), pad_value (int or float), mean (int or float) and std (int or float). Preprocessing order: 1. to rgb; 2. normalization; 3. pad. Defaults to None.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None
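The optional-encoder contract described above can be sketched as a small pipeline. This is an illustrative sketch, not the mmocr source; backbone, encoder and decoder stand in for the configured modules:

```python
def encode_decode(inputs, backbone, encoder=None, decoder=None):
    # When no encoder is configured, backbone features feed the decoder
    # directly, matching the "encoder (dict, optional)" contract above.
    feat = backbone(inputs)
    out_enc = encoder(feat) if encoder is not None else feat
    return decoder(out_enc) if decoder is not None else out_enc
```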

extract_feat(inputs)[source]

Directly extract features from the backbone.

Parameters

inputs (torch.Tensor) –

Return type

torch.Tensor

loss(inputs, data_samples, **kwargs)[source]

Calculate losses from a batch of inputs and data samples.

Parameters
  • inputs (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

predict(inputs, data_samples, **kwargs)[source]

Predict results from a batch of inputs and data samples with post-processing.

Parameters
  • inputs (torch.Tensor) – Image input tensor.

  • data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.

Returns

A list of N datasamples of prediction results. Results are stored in pred_text.

Return type

list[TextRecogDataSample]

class mmocr.models.textrecog.recognizers.MASTER(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

Implementation of MASTER

Parameters
Return type

None

class mmocr.models.textrecog.recognizers.NRTR(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

Implementation of NRTR

Parameters
Return type

None

class mmocr.models.textrecog.recognizers.RobustScanner(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

Implementation of `RobustScanner <https://arxiv.org/pdf/2007.07542.pdf>`_.

Parameters
Return type

None

class mmocr.models.textrecog.recognizers.SARNet(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

Implementation of SAR

Parameters
Return type

None

class mmocr.models.textrecog.recognizers.SATRN(preprocessor=None, backbone=None, encoder=None, decoder=None, data_preprocessor=None, init_cfg=None)[source]

Implementation of SATRN

Parameters
Return type

None

Text Recognition Backbones

class mmocr.models.textrecog.backbones.MiniVGG(leaky_relu=True, input_channels=3, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

A mini VGG backbone for text recognition, modified from `VGG-VeryDeep <https://arxiv.org/pdf/1409.1556.pdf>`_.

Parameters
  • leaky_relu (bool) – Whether to use LeakyReLU.

  • input_channels (int) – Number of channels of input image tensor.

forward(x)[source]
Parameters

x (Tensor) – Images of shape \((N, C, H, W)\).

Returns

The feature Tensor of shape \((N, 512, H/32, W/4+1)\).

Return type

Tensor

class mmocr.models.textrecog.backbones.MobileNetV2(pooling_layers=[3, 4, 5], init_cfg=None)[source]

See mmdet.models.backbones.MobileNetV2 for details.

Parameters
  • pooling_layers (list) – List of indices of pooling layers.

  • init_cfg (InitConfigType, optional) – Initialization config dict.

Return type

None

forward(x)[source]

Forward function.

Parameters

x (torch.Tensor) –

Return type

torch.Tensor

class mmocr.models.textrecog.backbones.NRTRModalityTransform(in_channels=3, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

Modality transform in NRTR.

Parameters
  • in_channels (int) – Input channel of image. Defaults to 3.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

forward(x)[source]

Backbone forward.

Parameters

x (torch.Tensor) – Image tensor of shape \((N, C, W, H)\), where W and H are the width and height of the image.

Returns

Output tensor.

Return type

Tensor

class mmocr.models.textrecog.backbones.ResNet(in_channels, stem_channels, block_cfgs, arch_layers, arch_channels, strides, out_indices=None, plugins=None, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': 'BatchNorm2d'}])[source]
Parameters
  • in_channels (int) – Number of channels of input image tensor.

  • stem_channels (list[int]) – List of channels in each stem layer. E.g., [64, 128] stands for 64 and 128 channels in the first and second stem layers.

  • block_cfgs (dict) – Configs of block

  • arch_layers (list[int]) – List of Block number for each stage.

  • arch_channels (list[int]) – List of channels for each stage.

  • strides (Sequence[int] or Sequence[tuple]) – Strides of the first block of each stage.

  • out_indices (Sequence[int], optional) – Indices of output stages. If not specified, only the last stage will be returned.

  • plugins (dict, optional) – Configs of stage plugins

  • init_cfg (dict or list[dict], optional) – Initialization config dict.

forward(x)[source]

Parameters

x (Tensor) – Image tensor of shape \((N, 3, H, W)\).

Returns

Feature tensor. It can be a list of feature outputs at specific layers if out_indices is specified.

Return type

Tensor or list[Tensor]

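The out_indices behavior described above can be sketched as a small stage loop. This is an illustrative sketch, not the mmocr source; the stages are hypothetical stand-ins for the real residual stages:

```python
def gather_stage_outputs(x, stages, out_indices=None):
    # Run the stages sequentially; return the outputs at the requested
    # indices, or only the final output when out_indices is None.
    outs = []
    for stage in stages:
        x = stage(x)
        outs.append(x)
    return x if out_indices is None else [outs[i] for i in out_indices]
```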

forward_plugin(x, plugin_name)[source]

Forward tensor through plugin.

Parameters
Returns

Output tensor.

Return type

torch.Tensor

class mmocr.models.textrecog.backbones.ResNet31OCR(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]
Implement ResNet backbone for text recognition, modified from ResNet.

Parameters
  • base_channels (int) – Number of channels of input image tensor.

  • layers (list[int]) – List of BasicBlock number for each stage.

  • channels (list[int]) – List of out_channels of Conv2d layer.

  • out_indices (None | Sequence[int]) – Indices of output stages.

  • stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.

  • last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.backbones.ResNetABI(in_channels=3, stem_channels=32, base_channels=32, arch_settings=[3, 4, 6, 6, 3], strides=[2, 1, 2, 1, 1], out_indices=None, last_stage_pool=False, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': 'BatchNorm2d'}])[source]

Implement ResNet backbone for text recognition, modified from `ResNet <https://arxiv.org/pdf/1512.03385.pdf>`_ and https://github.com/FangShancheng/ABINet.

Parameters
  • in_channels (int) – Number of channels of input image tensor.

  • stem_channels (int) – Number of stem channels.

  • base_channels (int) – Number of base channels.

  • arch_settings (list[int]) – List of BasicBlock number for each stage.

  • strides (Sequence[int]) – Strides of the first block of each stage.

  • out_indices (None | Sequence[int]) – Indices of output stages. If not specified, only the last stage will be returned.

  • last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

forward(x)[source]
Parameters

x (Tensor) – Image tensor of shape \((N, 3, H, W)\).

Returns

Feature tensor. Its shape depends on ResNetABI’s config. It can be a list of feature outputs at specific layers if out_indices is specified.

Return type

Tensor or list[Tensor]

class mmocr.models.textrecog.backbones.ShallowCNN(input_channels=1, hidden_dim=512, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

Implement Shallow CNN block for SATRN.

SATRN: On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention.

Parameters
  • input_channels (int) – Number of channels of input image tensor \(D_i\). Defaults to 1.

  • hidden_dim (int) – Size of hidden layers of the model \(D_m\). Defaults to 512.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

forward(x)[source]
Parameters

x (Tensor) – Input image feature \((N, D_i, H, W)\).

Returns

A tensor of shape \((N, D_m, H/4, W/4)\).

Return type

Tensor
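The shape relation documented above (input \((N, D_i, H, W)\) to output \((N, D_m, H/4, W/4)\)) can be written as a small helper. This is illustrative only; it just encodes the documented 4x spatial reduction, not the layers themselves:

```python
def shallow_cnn_out_shape(n, d_i, h, w, hidden_dim=512):
    # Each spatial dim shrinks by 4x while the channel dim becomes the
    # hidden size D_m; d_i is consumed by the first convolution.
    return (n, hidden_dim, h // 4, w // 4)
```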

Text Recognition Data Preprocessors

class mmocr.models.textrecog.data_preprocessors.TextRecogDataPreprocessor(mean=None, std=None, pad_size_divisor=1, pad_value=0, bgr_to_rgb=False, rgb_to_bgr=False, batch_augments=None)[source]

Image pre-processor for recognition tasks.

Comparing with the mmengine.ImgDataPreprocessor,

  1. It supports batch augmentations.

  2. It will additionally append batch_input_shape and valid_ratio to data_samples considering the object recognition task.

It provides the data pre-processing as follows:

  • Collate and move data to the target device.

  • Pad inputs to the maximum size of the current batch with the defined pad_value. The padded size can be made divisible by a defined pad_size_divisor.

  • Stack inputs to inputs.

  • Convert inputs from bgr to rgb if the shape of input is (3, H, W).

  • Normalize image with defined std and mean.

  • Do batch augmentations during training.

Parameters
  • mean (Sequence[Number], optional) – The pixel mean of R, G, B channels. Defaults to None.

  • std (Sequence[Number], optional) – The pixel standard deviation of R, G, B channels. Defaults to None.

  • pad_size_divisor (int) – The size of padded image should be divisible by pad_size_divisor. Defaults to 1.

  • pad_value (Number) – The padded pixel value. Defaults to 0.

  • bgr_to_rgb (bool) – Whether to convert image from BGR to RGB. Defaults to False.

  • rgb_to_bgr (bool) – Whether to convert image from RGB to BGR. Defaults to False.

  • batch_augments (list[dict], optional) – Batch-level augmentations.

Return type

None

forward(data, training=False)[source]

Perform normalization, padding and BGR-to-RGB conversion based on BaseDataPreprocessor.

Parameters
  • data (dict) – Data sampled from dataloader.

  • training (bool) – Whether to enable training time augmentation.

Returns

Data in the same format as the model input.

Return type

dict
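The valid_ratio appended to data_samples can be sketched as the fraction of the padded width occupied by real image content. The exact formula below is an assumption for illustration, not taken from the mmocr source:

```python
def valid_ratio(valid_width, padded_width):
    # Assumed definition: fraction of the padded width covered by real
    # image content, so decoders can mask out the padding.
    return min(1.0, valid_width / padded_width)
```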

Text Recognition Layers

class mmocr.models.textrecog.layers.Adaptive2DPositionalEncoding(d_hid=512, n_height=100, n_width=100, dropout=0.1, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}])[source]

Implement the adaptive 2D positional encoder for SATRN, see `SATRN <https://arxiv.org/abs/1910.04396>`_. Modified from https://github.com/Media-Smart/vedastr. Licensed under the Apache License, Version 2.0 (the "License").

Parameters
  • d_hid (int) – Dimensions of hidden layer. Defaults to 512.

  • n_height (int) – Max height of the 2D feature output. Defaults to 100.

  • n_width (int) – Max width of the 2D feature output. Defaults to 100.

  • dropout (float) – Dropout rate. Defaults to 0.1.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to [dict(type=’Xavier’, layer=’Conv2d’)]

Return type

None

forward(x)[source]

Forward propagation of Locality Aware Feedforward module.

Parameters

x (Tensor) – Feature tensor.

Returns

Feature tensor after Locality Aware Feedforward.

Return type

Tensor

class mmocr.models.textrecog.layers.BasicBlock(inplanes, planes, stride=1, downsample=None, use_conv1x1=False, plugins=None)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

make_block_plugins(in_channels, plugins)[source]

Make plugins for the block.

Parameters
  • in_channels (int) – Input channels of plugin.

  • plugins (list[dict]) – List of plugins cfg to build.

Returns

List of the names of the plugins.

Return type

list[str]

class mmocr.models.textrecog.layers.BidirectionalLSTM(nIn, nHidden, nOut)[source]
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.Bottleneck(inplanes, planes, stride=1, downsample=False)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.DotProductAttentionLayer(dim_model=None)[source]
forward(query, key, value, mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.PositionAwareLayer(dim_model, rnn_layers=2)[source]
forward(img_feature)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.RobustScannerFusionLayer(dim_model, dim=-1, init_cfg=None)[source]
forward(x0, x1)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.SATRNEncoderLayer(d_model=512, d_inner=512, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, init_cfg=None)[source]

Implement the encoder layer for SATRN, see `SATRN <https://arxiv.org/abs/1910.04396>`_.

Parameters
  • d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.

  • d_inner (int) – Hidden dimension of feedforward layers. Defaults to 512.

  • n_head (int) – Number of parallel attention heads. Defaults to 8.

  • d_k (int) – Dimension of the key vector. Defaults to 64.

  • d_v (int) – Dimension of the value vector. Defaults to 64.

  • dropout (float) – Dropout rate. Defaults to 0.1.

  • qkv_bias (bool) – Whether to use bias. Defaults to False.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward(x, h, w, mask=None)[source]

Forward propagation of encoder.

Parameters
  • x (Tensor) – Feature tensor of shape \((N, h*w, D_m)\).

  • h (int) – Height of the original feature.

  • w (int) – Width of the original feature.

  • mask (Tensor, optional) – Mask used for masked multi-head attention. Defaults to None.

Returns

A tensor of shape \((N, h*w, D_m)\).

Return type

Tensor

Text Recognition Plugins

class mmocr.models.textrecog.plugins.GCAModule(in_channels, ratio, n_head, pooling_type='att', scale_attn=False, fusion_type='channel_add', **kwargs)[source]

GCAModule in MASTER.

Parameters
  • in_channels (int) – Channels of input tensor.

  • ratio (float) – Scale ratio of in_channels.

  • n_head (int) – Number of attention heads.

  • pooling_type (str) – Spatial pooling type. Options are [avg, att].

  • scale_attn (bool) – Whether to scale the attention map. Defaults to False.

  • fusion_type (str) – Fusion type of input and context. Options are [channel_add, channel_mul, channel_concat].

Return type

None

forward(x)[source]

Forward function.

Parameters

x (Tensor) – Input feature map.

Returns

Output tensor after GCAModule.

Return type

Tensor

spatial_pool(x)[source]

Spatial pooling function.

Parameters

x (Tensor) – Input feature map.

Returns

Output tensor after spatial pooling.

Return type

Tensor
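To illustrate GCAModule's `channel_add` fusion mode, the sketch below applies a per-channel global context vector to a tiny feature map using plain Python lists. This is only an illustration of the broadcast-add step; the real module computes the context with attention or average pooling over spatial positions.

```python
# Minimal sketch of GCA-style 'channel_add' fusion on a tiny feature map,
# using plain Python lists instead of torch tensors (illustrative only;
# the real GCAModule derives `context` via attention/average pooling).

def channel_add_fuse(x, context):
    """x: [C][H*W] feature map, context: [C] per-channel context vector.

    Returns x with the broadcast context added at every spatial position.
    """
    return [[v + context[c] for v in row] for c, row in enumerate(x)]

x = [[1.0, 2.0], [3.0, 4.0]]   # 2 channels, 2 spatial positions
ctx = [0.5, -1.0]              # global context per channel
print(channel_add_fuse(x, ctx))  # [[1.5, 2.5], [2.0, 3.0]]
```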

class mmocr.models.textrecog.plugins.Maxpool2d(kernel_size, stride, padding=0, **kwargs)[source]

A wrapper around nn.MaxPool2d.

Parameters
  • kernel_size (int or tuple(int)) – Kernel size for max pooling layer.

  • stride (int or tuple(int)) – Stride for max pooling layer.

  • padding (int or tuple(int)) – Padding for max pooling layer. Defaults to 0.

Return type

None

forward(x)[source]

Forward function.

Parameters

x (Tensor) – Input feature map.

Returns

Output tensor after the max pooling layer.

Return type

Tensor
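The output size of a max pooling layer follows the standard rule floor((size + 2*padding - kernel_size) / stride) + 1. A small helper, assuming square kernels as in the common usage above, makes the arithmetic concrete:

```python
# Output-size arithmetic for a 2D max pooling layer, matching the
# floor((size + 2*padding - kernel) / stride) + 1 rule used by nn.MaxPool2d
# (square kernel/stride assumed for simplicity).

def maxpool2d_out_size(size, kernel_size, stride, padding=0):
    h, w = size
    out_h = (h + 2 * padding - kernel_size) // stride + 1
    out_w = (w + 2 * padding - kernel_size) // stride + 1
    return out_h, out_w

print(maxpool2d_out_size((32, 100), kernel_size=2, stride=2))  # (16, 50)
```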

Text Recognition Encoders

class mmocr.models.textrecog.encoders.ABIEncoder(n_layers=2, n_head=8, d_model=512, d_inner=2048, dropout=0.1, max_len=256, init_cfg=None)[source]

Implement transformer encoder for text recognition, modified from `ABINet <https://github.com/FangShancheng/ABINet>`_.

Parameters
  • n_layers (int) – Number of attention layers. Defaults to 2.

  • n_head (int) – Number of parallel attention heads. Defaults to 8.

  • d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.

  • d_inner (int) – Hidden dimension of feedforward layers. Defaults to 2048.

  • dropout (float) – Dropout rate. Defaults to 0.1.

  • max_len (int) – Maximum output sequence length \(T\). Defaults to 8 * 32.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

forward(feature, data_samples)[source]
Parameters
  • feature (Tensor) – Feature tensor of shape \((N, D_m, H, W)\).

  • data_samples (List[TextRecogDataSample]) – List of data samples.

Returns

Features of shape \((N, D_m, H, W)\).

Return type

Tensor

class mmocr.models.textrecog.encoders.BaseEncoder(init_cfg=None)[source]

Base Encoder class for text recognition.

forward(feat, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.encoders.ChannelReductionEncoder(in_channels, out_channels, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'})[source]

Change the channel number with a 1x1 convolutional layer.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type=’Xavier’, layer=’Conv2d’).

Return type

None

forward(feat, data_samples=None)[source]
Parameters
  • feat (Tensor) – Image features with the shape of \((N, C_{in}, H, W)\).

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.

Returns

A tensor of shape \((N, C_{out}, H, W)\).

Return type

Tensor
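A 1x1 convolution is just a per-position linear map over channels, which is why this encoder changes \(C_{in}\) to \(C_{out}\) while leaving \((H, W)\) untouched. The torch-free sketch below, with a hypothetical `conv1x1` helper, makes that explicit:

```python
# A 1x1 convolution is a per-position linear map over channels; this
# torch-free sketch shows why ChannelReductionEncoder keeps the spatial
# layout intact while mapping C_in channels to C_out (bias omitted).

def conv1x1(x, weight):
    """x: [C_in][N_pos] flattened feature map, weight: [C_out][C_in]."""
    n_pos = len(x[0])
    return [
        [sum(w_ci * x[ci][p] for ci, w_ci in enumerate(w_row)) for p in range(n_pos)]
        for w_row in weight
    ]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # C_in=3, 2 spatial positions
w = [[1.0, 0.0, 1.0]]                     # C_out=1
print(conv1x1(x, w))  # [[6.0, 8.0]] -- same 2 positions, fewer channels
```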

class mmocr.models.textrecog.encoders.NRTREncoder(n_layers=6, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, dropout=0.1, init_cfg=None)[source]

Transformer Encoder block with self attention mechanism.

Parameters
  • n_layers (int) – The number of sub-encoder-layers in the encoder. Defaults to 6.

  • n_head (int) – The number of heads in the multi-head attention models. Defaults to 8.

  • d_k (int) – Total number of features in key. Defaults to 64.

  • d_v (int) – Total number of features in value. Defaults to 64.

  • d_model (int) – The number of expected features in the decoder inputs. Defaults to 512.

  • d_inner (int) – The dimension of the feedforward network model. Defaults to 256.

  • dropout (float) – Dropout rate for MHSA and FFN. Defaults to 0.1.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

forward(feat, data_samples=None)[source]
Parameters
  • feat (Tensor) – Backbone output of shape \((N, C, H, W)\).

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.

Returns

The encoder output tensor. Shape \((N, T, C)\).

Return type

Tensor

class mmocr.models.textrecog.encoders.SAREncoder(enc_bi_rnn=False, rnn_dropout=0.0, enc_gru=False, d_model=512, d_enc=512, mask=True, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}], **kwargs)[source]

Implementation of encoder module in `SAR <https://arxiv.org/abs/1811.00751>`_.

Parameters
  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder. Defaults to False.

  • rnn_dropout (float) – Dropout probability of RNN layer in encoder. Defaults to 0.0.

  • enc_gru (bool) – If True, use GRU, else LSTM in encoder. Defaults to False.

  • d_model (int) – Dim \(D_i\) of channels from backbone. Defaults to 512.

  • d_enc (int) – Dim \(D_m\) of encoder RNN layer. Defaults to 512.

  • mask (bool) – If True, mask padding in RNN sequence. Defaults to True.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to [dict(type=’Xavier’, layer=’Conv2d’), dict(type=’Uniform’, layer=’BatchNorm2d’)].

Return type

None

forward(feat, data_samples=None)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.

Returns

A tensor of shape \((N, D_m)\).

Return type

Tensor
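When mask=True, the encoder uses each sample's valid_ratio (the fraction of the padded width that is real image) to ignore padded positions. A common way to turn that ratio into a count of valid width positions, sketched here as a hypothetical helper, is:

```python
# Sketch of converting a data sample's valid_ratio into the number of
# valid width positions to keep when masking padded regions (a common
# pattern for mask=True; the exact rounding used internally may differ).
import math

def valid_width(w, valid_ratio):
    """w: padded feature width; valid_ratio: real-width / padded-width."""
    return min(w, math.ceil(w * valid_ratio))

print(valid_width(40, 0.5))   # 20 -- only half the width is real content
print(valid_width(40, 1.0))   # 40 -- no padding, keep everything
```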

class mmocr.models.textrecog.encoders.SATRNEncoder(n_layers=12, n_head=8, d_k=64, d_v=64, d_model=512, n_position=100, d_inner=256, dropout=0.1, init_cfg=None)[source]

Implement encoder for SATRN, see `SATRN <https://arxiv.org/abs/1910.04396>`_.

Parameters
  • n_layers (int) – Number of attention layers. Defaults to 12.

  • n_head (int) – Number of parallel attention heads. Defaults to 8.

  • d_k (int) – Dimension of the key vector. Defaults to 64.

  • d_v (int) – Dimension of the value vector. Defaults to 64.

  • d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.

  • n_position (int) – Length of the positional encoding vector. Must be greater than max_seq_len. Defaults to 100.

  • d_inner (int) – Hidden dimension of feedforward layers. Defaults to 256.

  • dropout (float) – Dropout rate. Defaults to 0.1.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward(feat, data_samples=None)[source]

Forward propagation of encoder.

Parameters
  • feat (Tensor) – Feature tensor of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.

Returns

A tensor of shape \((N, T, D_m)\).

Return type

Tensor

Text Recognition Decoders

class mmocr.models.textrecog.decoders.ABIFuser(dictionary, vision_decoder, language_decoder=None, d_model=512, num_iters=1, max_seq_len=40, module_loss=None, postprocessor=None, init_cfg=None, **kwargs)[source]

A special decoder responsible for mixing and aligning visual features and linguistic features, as in ABINet.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary. The dictionary must have an end token.

  • vision_decoder (dict) – The config for vision decoder.

  • language_decoder (dict, optional) – The config for language decoder.

  • num_iters (int) – Rounds of iterative correction. Defaults to 1.

  • d_model (int) – Hidden size \(E\) of model. Defaults to 512.

  • max_seq_len (int) – Maximum sequence length \(T\). The sequence is usually generated from decoder. Defaults to 40.

  • module_loss (dict, optional) – Config to build loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward_test(feat, logits, data_samples=None)[source]
Parameters
  • feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.

  • logits (Tensor) – Raw language logits. Shape \((N, T, C)\).

  • data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_train(feat=None, out_enc=None, data_samples=None)[source]
Parameters
  • feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.

  • out_enc (Tensor) – Raw language logits. Shape \((N, T, C)\). Defaults to None.

  • data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.

Returns

A dict with keys out_vis, out_langs and out_fusers.

  • out_vis (dict): Dict from self.vision_decoder with keys feature, logits and attn_scores.

  • out_langs (dict or list): Dict from self.language_decoder with keys feature, logits if applicable, or an empty list otherwise.

  • out_fusers (dict or list): Dict of fused visual and language features with keys feature, logits if applicable, or an empty list otherwise.

Return type

Dict

fuse(l_feature, v_feature)[source]

Mix and align visual feature and linguistic feature.

Parameters
  • l_feature (torch.Tensor) – (N, T, E) where T is length, N is batch size and E is dim of model.

  • v_feature (torch.Tensor) – (N, T, E) shape the same as l_feature.

Returns

A dict with key logits of shape \((N, T, C)\), where N is batch size, T is length and C is the number of characters.

Return type

dict

class mmocr.models.textrecog.decoders.ABILanguageDecoder(dictionary, d_model=512, n_head=8, d_inner=2048, n_layers=4, dropout=0.1, detach_tokens=True, use_self_attn=False, max_seq_len=40, module_loss=None, postprocessor=None, init_cfg=None, **kwargs)[source]

Transformer-based language model responsible for spell correction. Implementation of the language model in ABINet.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary. The dictionary must have an end token.

  • d_model (int) – Hidden size \(E\) of model. Defaults to 512.

  • n_head (int) – Number of multi-attention heads.

  • d_inner (int) – Hidden size of feedforward network model.

  • n_layers (int) – The number of similar decoding layers.

  • dropout (float) – Dropout rate.

  • detach_tokens (bool) – Whether to block the gradient flow at input tokens.

  • use_self_attn (bool) – If True, use self attention in decoder layers, otherwise cross attention will be used.

  • max_seq_len (int) – Maximum sequence length \(T\). The sequence is usually generated from decoder. Defaults to 40.

  • module_loss (dict, optional) – Config to build loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward_test(feat=None, logits=None, data_samples=None)[source]
Parameters
  • feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.

  • logits (Tensor) – Raw language logits. Shape \((N, T, C)\). Defaults to None.

  • data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.

Returns

A dict with keys feature and logits.

  • feature (Tensor): Shape \((N, T, E)\). Raw textual features for vision language aligner.

  • logits (Tensor): Shape \((N, T, C)\). The raw logits for characters after spell correction.

Return type

Dict

forward_train(feat=None, out_enc=None, data_samples=None)[source]
Parameters
  • feat (torch.Tensor, optional) – Not required. Feature map placeholder. Defaults to None.

  • out_enc (torch.Tensor) – Logits with shape \((N, T, C)\). Defaults to None.

  • data_samples (list[TextRecogDataSample], optional) – Not required. DataSample placeholder. Defaults to None.

Returns

A dict with keys feature and logits.

  • feature (Tensor): Shape \((N, T, E)\). Raw textual features for vision language aligner.

  • logits (Tensor): Shape \((N, T, C)\). The raw logits for characters after spell correction.

Return type

Dict

class mmocr.models.textrecog.decoders.ABIVisionDecoder(dictionary, in_channels=512, num_channels=64, attn_height=8, attn_width=32, attn_mode='nearest', module_loss=None, postprocessor=None, max_seq_len=40, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'}, **kwargs)[source]

Converts visual features into text characters.

Implementation of VisionEncoder in ABINet.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • in_channels (int) – Number of channels \(E\) of input vector. Defaults to 512.

  • num_channels (int) – Number of channels of hidden vectors in mini U-Net. Defaults to 64.

  • attn_height (int) – Height \(H\) of input image features. Defaults to 8.

  • attn_width (int) – Width \(W\) of input image features. Defaults to 32.

  • attn_mode (str) – Upsampling mode for torch.nn.Upsample in mini U-Net. Defaults to ‘nearest’.

  • module_loss (dict, optional) – Config to build loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 40.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type=’Xavier’, layer=’Conv2d’).

Return type

None

forward_test(feat=None, out_enc=None, data_samples=None)[source]
Parameters
  • feat (torch.Tensor, optional) – Image features of shape (N, E, H, W). Defaults to None.

  • out_enc (torch.Tensor) – Encoder output. Defaults to None.

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

A dict with keys feature, logits and attn_scores.

  • feature (Tensor): Shape (N, T, E). Raw visual features for language decoder.

  • logits (Tensor): Shape (N, T, C). The raw logits for characters.

  • attn_scores (Tensor): Shape (N, T, H, W). Intermediate result for vision-language aligner.

Return type

dict

forward_train(feat=None, out_enc=None, data_samples=None)[source]
Parameters
  • feat (Tensor, optional) – Image features of shape (N, E, H, W). Defaults to None.

  • out_enc (torch.Tensor) – Encoder output. Defaults to None.

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

A dict with keys feature, logits and attn_scores.

  • feature (Tensor): Shape (N, T, E). Raw visual features for language decoder.

  • logits (Tensor): Shape (N, T, C). The raw logits for characters.

  • attn_scores (Tensor): Shape (N, T, H, W). Intermediate result for vision-language aligner.

Return type

dict

class mmocr.models.textrecog.decoders.BaseDecoder(dictionary, module_loss=None, postprocessor=None, max_seq_len=40, init_cfg=None)[source]

Base decoder for text recognition, build the loss and postprocessor.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 40.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.


Return type

None

forward(feat=None, out_enc=None, data_samples=None)[source]

Decoder forward.

Parameters
  • feat (Tensor, optional) – Features from the backbone. Defaults to None.

  • out_enc (Tensor, optional) – Features from the encoder. Defaults to None.

  • data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images. Defaults to None.

Returns

Features from decoder forward.

Return type

Tensor
forward_test(feat=None, out_enc=None, data_samples=None)[source]

Forward for testing.

Parameters
  • feat (torch.Tensor, optional) – The feature map from backbone of shape \((N, E, H, W)\). Defaults to None.

  • out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.

  • data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Return type

torch.Tensor

forward_train(feat=None, out_enc=None, data_samples=None)[source]

Forward for training.

Parameters
  • feat (torch.Tensor, optional) – The feature map from backbone of shape \((N, E, H, W)\). Defaults to None.

  • out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.

  • data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Return type

torch.Tensor

loss(feat=None, out_enc=None, data_samples=None)[source]

Calculate losses from a batch of inputs and data samples.

Parameters
  • feat (Tensor, optional) – Features from the backbone. Defaults to None.

  • out_enc (Tensor, optional) – Features from the encoder. Defaults to None.

  • data_samples (list[TextRecogDataSample], optional) – A list of N datasamples, containing meta information and gold annotations for each of the images. Defaults to None.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

predict(feat=None, out_enc=None, data_samples=None)[source]

Perform forward propagation of the decoder and postprocessor.

Parameters
  • feat (Tensor, optional) – Features from the backbone. Defaults to None.

  • out_enc (Tensor, optional) – Features from the encoder. Defaults to None.

  • data_samples (list[TextRecogDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images. Defaults to None.

Returns

A list of N datasamples of prediction results. Results are stored in pred_text.

Return type

list[TextRecogDataSample]
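The BaseDecoder contract is a small dispatch pattern: loss() routes through forward_train() and the module loss, while predict() routes through forward_test() and the postprocessor, so subclasses only implement the two forward_* methods. A toy decoder (hypothetical names, placeholder bodies) sketches that split:

```python
# Hypothetical mini-decoder showing the BaseDecoder contract: loss() goes
# through forward_train() and predict() through forward_test(); subclasses
# override only the forward_* methods. Bodies are toy placeholders, not
# the real mmocr implementation.

class ToyDecoder:
    def forward_train(self, feat=None, out_enc=None, data_samples=None):
        return "train-logits"

    def forward_test(self, feat=None, out_enc=None, data_samples=None):
        return "test-probs"

    def loss(self, feat=None, out_enc=None, data_samples=None):
        logits = self.forward_train(feat, out_enc, data_samples)
        # a real module_loss would compare logits against gt_text here
        return {"loss_ce": logits}

    def predict(self, feat=None, out_enc=None, data_samples=None):
        probs = self.forward_test(feat, out_enc, data_samples)
        # a real postprocessor would decode probs into pred_text here
        return [probs]

d = ToyDecoder()
print(d.loss())     # {'loss_ce': 'train-logits'}
print(d.predict())  # ['test-probs']
```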

class mmocr.models.textrecog.decoders.CRNNDecoder(in_channels, dictionary, rnn_flag=False, module_loss=None, postprocessor=None, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'}, **kwargs)[source]

Decoder for CRNN.

Parameters
  • in_channels (int) – Number of input channels.

  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • rnn_flag (bool) – Use RNN or CNN as the decoder. Defaults to False.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to dict(type='Xavier', layer='Conv2d').

forward_test(feat=None, out_enc=None, data_samples=None)[source]
Parameters
  • feat (Tensor) – A Tensor of shape \((N, C, 1, W)\).

  • out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_train(feat, out_enc=None, data_samples=None)[source]
Parameters
  • feat (Tensor) – A Tensor of shape \((N, C, 1, W)\).

  • out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

The raw logit tensor. Shape \((N, W, C)\) where \(C\) is num_classes.

Return type

Tensor

class mmocr.models.textrecog.decoders.MasterDecoder(n_layers=3, n_head=8, d_model=512, feat_size=240, d_inner=2048, attn_drop=0.0, ffn_drop=0.0, feat_pe_drop=0.2, module_loss=None, postprocessor=None, dictionary=None, max_seq_len=30, init_cfg=None)[source]

Decoder module in MASTER.

Code is partially modified from https://github.com/wenwenyu/MASTER-pytorch.

Parameters
  • n_layers (int) – Number of attention layers. Defaults to 3.

  • n_head (int) – Number of parallel attention heads. Defaults to 8.

  • d_model (int) – Dimension \(E\) of the input from previous model. Defaults to 512.

  • feat_size (int) – The size of the input feature from previous model, usually \(H * W\). Defaults to 6 * 40.

  • d_inner (int) – Hidden dimension of feedforward layers. Defaults to 2048.

  • attn_drop (float) – Dropout rate of the attention layer. Defaults to 0.

  • ffn_drop (float) – Dropout rate of the feedforward layer. Defaults to 0.

  • feat_pe_drop (float) – Dropout rate of the feature positional encoding layer. Defaults to 0.2.

  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary. Defaults to None.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 30.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

decode(tgt_seq, feature, src_mask, tgt_mask)[source]

Decode the input sequence.

Parameters
  • tgt_seq (Tensor) – Target sequence of shape \((N, T, C)\).

  • feature (Tensor) – Input feature map from encoder of shape \((N, C, H, W)\).

  • src_mask (BoolTensor) – The source mask of shape \((N, H*W)\).

  • tgt_mask (BoolTensor) – The target mask of shape \((N, T, T)\).

Returns

The decoded sequence.

Return type

Tensor

forward_test(feat=None, out_enc=None, data_samples=None)[source]

Forward for testing.

Parameters
  • feat (Tensor, optional) – Input feature map from backbone.

  • out_enc (Tensor) – Unused.

  • data_samples (list[TextRecogDataSample]) – Unused.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_train(feat=None, out_enc=None, data_samples=None)[source]

Forward for training. Source mask will not be used here.

Parameters
  • feat (Tensor, optional) – Input feature map from backbone.

  • out_enc (Tensor) – Unused.

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information.

Returns

The raw logit tensor. Shape \((N, T, C)\) where \(C\) is num_classes.

Return type

Tensor

make_target_mask(tgt, device)[source]

Make target mask for self attention.

Parameters
  • tgt (Tensor) – Shape [N, l_tgt]

  • device (torch.device) – Mask device.

Returns

Mask of shape [N * self.n_head, l_tgt, l_tgt].

Return type

Tensor
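The target mask for decoder self-attention is the usual causal (lower-triangular) mask: position i may attend only to positions up to and including i. One (l_tgt, l_tgt) slice of such a mask can be sketched in plain Python:

```python
# Pure-Python sketch of one (l_tgt, l_tgt) slice of a causal target mask:
# entry [i][j] is True iff position i is allowed to attend to position j.
# The real make_target_mask builds this as a tensor tiled over heads.

def causal_mask(l_tgt):
    return [[j <= i for j in range(l_tgt)] for i in range(l_tgt)]

for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```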

class mmocr.models.textrecog.decoders.NRTRDecoder(n_layers=6, d_embedding=512, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, n_position=200, dropout=0.1, module_loss=None, postprocessor=None, dictionary=None, max_seq_len=30, init_cfg=None)[source]

Transformer Decoder block with self attention mechanism.

Parameters
  • n_layers (int) – Number of attention layers. Defaults to 6.

  • d_embedding (int) – Language embedding dimension. Defaults to 512.

  • n_head (int) – Number of parallel attention heads. Defaults to 8.

  • d_k (int) – Dimension of the key vector. Defaults to 64.

  • d_v (int) – Dimension of the value vector. Defaults to 64.

  • d_model (int) – Dimension \(D_m\) of the input from previous model. Defaults to 512.

  • d_inner (int) – Hidden dimension of feedforward layers. Defaults to 256.

  • n_position (int) – Length of the positional encoding vector. Must be greater than max_seq_len. Defaults to 200.

  • dropout (float) – Dropout rate for text embedding, MHSA, FFN. Defaults to 0.1.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 30.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

forward_test(feat=None, out_enc=None, data_samples=None)[source]

Forward for testing.

Parameters
  • feat (Tensor, optional) – Unused.

  • out_enc (Tensor) – Encoder output of shape \((N, T, D_m)\) where \(D_m\) is d_model. Defaults to None.

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information. Defaults to None.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_train(feat=None, out_enc=None, data_samples=None)[source]

Forward for training. Source mask will be used here.

Parameters
  • feat (Tensor, optional) – Unused.

  • out_enc (Tensor) – Encoder output of shape \((N, T, D_m)\) where \(D_m\) is d_model. Defaults to None.

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information. Defaults to None.

Returns

The raw logit tensor. Shape \((N, T, C)\) where \(C\) is num_classes.

Return type

Tensor

class mmocr.models.textrecog.decoders.ParallelSARDecoder(dictionary, module_loss=None, postprocessor=None, enc_bi_rnn=False, dec_bi_rnn=False, dec_rnn_dropout=0.0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=30, mask=True, pred_concat=False, init_cfg=None, **kwargs)[source]

Implementation of Parallel Decoder module in `SAR <https://arxiv.org/abs/1811.00751>`_.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder. Defaults to False.

  • dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder. Defaults to False.

  • dec_rnn_dropout (float) – Dropout of RNN layer in decoder. Defaults to 0.0.

  • dec_gru (bool) – If True, use GRU, else LSTM in decoder. Defaults to False.

  • d_model (int) – Dim of channels from backbone \(D_i\). Defaults to 512.

  • d_enc (int) – Dim of encoder RNN layer \(D_m\). Defaults to 512.

  • d_k (int) – Dim of channels of attention module. Defaults to 64.

  • pred_dropout (float) – Dropout probability of prediction layer. Defaults to 0.0.

  • max_seq_len (int) – Maximum sequence length for decoding. Defaults to 30.

  • mask (bool) – If True, mask padding in feature map. Defaults to True.

  • pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state. Defaults to False.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward_test(feat, out_enc, data_samples=None)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_train(feat, out_enc, data_samples)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information.

Returns

A raw logit tensor of shape \((N, T, C)\).

Return type

Tensor

class mmocr.models.textrecog.decoders.ParallelSARDecoderWithBS(beam_width=5, num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=0, pred_concat=False, init_cfg=None, **kwargs)[source]

Parallel Decoder module with beam-search in SAR.

Parameters

beam_width (int) – Width for beam search.

forward_test(feat, out_enc, img_metas)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor
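What beam_width controls can be illustrated with a toy beam search over a fixed table of per-step log-probabilities: at every step, all one-token extensions of the surviving hypotheses are scored, and only the top beam_width are kept. This is a simplified sketch; the real decoder re-runs attention for each hypothesis and handles start/padding indices.

```python
# Toy beam search over a fixed table of per-step log-probabilities,
# illustrating the role of `beam_width` (simplified: no start/padding
# tokens, no per-hypothesis re-decoding as in the real module).
import math

def beam_search(step_logprobs, beam_width):
    beams = [((), 0.0)]  # (token sequence, cumulative log-prob)
    for logprobs in step_logprobs:
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in enumerate(logprobs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the best hypotheses
    return beams

steps = [[math.log(0.6), math.log(0.4)], [math.log(0.1), math.log(0.9)]]
best_seq, best_score = beam_search(steps, beam_width=2)[0]
print(best_seq)  # (0, 1) -- the sequence with probability 0.6 * 0.9
```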

class mmocr.models.textrecog.decoders.PositionAttentionDecoder(dictionary, module_loss=None, postprocessor=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, mask=True, return_feature=True, encode_value=False, init_cfg=None)[source]

Position attention decoder for RobustScanner.

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • rnn_layers (int) – Number of RNN layers. Defaults to 2.

  • dim_input (int) – Dimension \(D_i\) of input vector feat. Defaults to 512.

  • dim_model (int) – Dimension \(D_m\) of the model. Should also be the same as encoder output vector out_enc. Defaults to 128.

  • max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 40.

  • mask (bool) – Whether to mask input features according to img_meta['valid_ratio']. Defaults to True.

  • return_feature (bool) – Return feature or logits as the result. Defaults to True.

  • encode_value (bool) – Whether to use the output of encoder out_enc as value of attention layer. If False, the original feature feat will be used. Defaults to False.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward_test(feat, out_enc, img_metas)[source]
Returns

Character probabilities of shape \((N, T, C)\) if return_feature=False. Otherwise it would be the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).

Return type

Tensor

forward_train(feat, out_enc, data_samples)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

A raw logit tensor of shape \((N, T, C)\) if return_feature=False. Otherwise it will be the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).

Return type

Tensor

class mmocr.models.textrecog.decoders.RobustScannerFuser(dictionary, module_loss=None, postprocessor=None, hybrid_decoder={'type': 'SequenceAttentionDecoder'}, position_decoder={'type': 'PositionAttentionDecoder'}, max_seq_len=30, in_channels=[512, 512], dim=-1, init_cfg=None)[source]

Decoder for RobustScanner.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • hybrid_decoder (dict) – Config to build hybrid_decoder. Defaults to dict(type=’SequenceAttentionDecoder’).

  • position_decoder (dict) – Config to build position_decoder. Defaults to dict(type=’PositionAttentionDecoder’).


  • max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 30.

  • in_channels (list[int]) – List of input channels. Defaults to [512, 512].

  • dim (int) – The dimension on which to split the input. Defaults to -1.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward_test(feat=None, out_enc=None, data_samples=None)[source]

Forward for testing.

Parameters
  • feat (torch.Tensor, optional) – The feature map from backbone of shape \((N, E, H, W)\). Defaults to None.

  • out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.

  • data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information. Defaults to None.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_train(feat=None, out_enc=None, data_samples=None)[source]

Forward for training.

Parameters
  • feat (torch.Tensor, optional) – The feature map from backbone of shape \((N, E, H, W)\). Defaults to None.

  • out_enc (torch.Tensor, optional) – Encoder output. Defaults to None.

  • data_samples (Sequence[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Return type

torch.Tensor

class mmocr.models.textrecog.decoders.SequenceAttentionDecoder(dictionary, module_loss=None, postprocessor=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, mask=True, dropout=0, return_feature=True, encode_value=False, init_cfg=None)[source]

Sequence attention decoder for RobustScanner.

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • rnn_layers (int) – Number of RNN layers. Defaults to 2.

  • dim_input (int) – Dimension \(D_i\) of input vector feat. Defaults to 512.

  • dim_model (int) – Dimension \(D_m\) of the model. Should also be the same as encoder output vector out_enc. Defaults to 128.

  • max_seq_len (int) – Maximum output sequence length \(T\). Defaults to 40.

  • mask (bool) – Whether to mask input features according to data_sample.valid_ratio. Defaults to True.

  • dropout (float) – Dropout rate for LSTM layer. Defaults to 0.

  • return_feature (bool) – Return feature or logits as the result. Defaults to True.

  • encode_value (bool) – Whether to use the output of encoder out_enc as value of attention layer. If False, the original feature feat will be used. Defaults to False.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

forward_test(feat, out_enc, data_samples)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_test_step(feat, out_enc, decode_sequence, current_step, data_samples)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • decode_sequence (Tensor) – Shape \((N, T)\). The tensor that stores history decoding result.

  • current_step (int) – Current decoding step.

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

Shape \((N, C)\). The logit tensor of predicted tokens at current time step.

Return type

Tensor
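
The step-wise decoding that forward_test_step enables can be sketched in plain Python. This is an illustrative greedy loop, not MMOCR's implementation; step_fn, start_idx and end_idx are stand-ins for the decoder call and the dictionary's special tokens.

```python
# Illustrative greedy decoding loop in the style of forward_test_step:
# at each step, feed the history of decoded tokens and keep the argmax.
def greedy_decode(step_fn, max_seq_len, start_idx, end_idx):
    """step_fn(decode_sequence, step) -> per-class scores for this step."""
    decode_sequence = [start_idx] * max_seq_len  # history buffer of length T
    output = []
    for step in range(max_seq_len):
        scores = step_fn(decode_sequence, step)
        char_idx = max(range(len(scores)), key=scores.__getitem__)
        if char_idx == end_idx:           # stop once the end token is emitted
            break
        decode_sequence[step] = char_idx  # record history for the next step
        output.append(char_idx)
    return output

# A toy step function that scores token 2 highest for the first three steps,
# then emits the end token (index 0).
toy = lambda seq, step: [1.0, 0.0, 0.5] if step >= 3 else [0.0, 0.1, 0.9]
print(greedy_decode(toy, max_seq_len=40, start_idx=1, end_idx=0))  # [2, 2, 2]
```

In the real decoder, step_fn corresponds to a forward_test_step call that returns logits of shape \((N, C)\) for the current time step.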

forward_train(feat, out_enc, data_samples=None)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • data_samples (list[TextRecogDataSample], optional) – Batch of TextRecogDataSample, containing gt_text information. Defaults to None.

Returns

A raw logit tensor of shape \((N, T, C)\) if return_feature=False. Otherwise it would be the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).

Return type

Tensor

class mmocr.models.textrecog.decoders.SequentialSARDecoder(dictionary=None, module_loss=None, postprocessor=None, enc_bi_rnn=False, dec_bi_rnn=False, dec_gru=False, d_k=64, d_model=512, d_enc=512, pred_dropout=0.0, mask=True, max_seq_len=40, pred_concat=False, init_cfg=None, **kwargs)[source]

Implementation of the Sequential Decoder module in `SAR <https://arxiv.org/abs/1811.00751>`_.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • module_loss (dict, optional) – Config to build module_loss. Defaults to None.

  • postprocessor (dict, optional) – Config to build postprocessor. Defaults to None.

  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder. Defaults to False.

  • dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder. Defaults to False.

  • dec_do_rnn (float) – Dropout of RNN layer in decoder. Defaults to 0.

  • dec_gru (bool) – If True, use GRU, else LSTM in decoder. Defaults to False.

  • d_k (int) – Dim of conv layers in attention module. Defaults to 64.

  • d_model (int) – Dim of channels from backbone \(D_i\). Defaults to 512.

  • d_enc (int) – Dim of encoder RNN layer \(D_m\). Defaults to 512.

  • pred_dropout (float) – Dropout probability of prediction layer. Defaults to 0.

  • max_seq_len (int) – Maximum sequence length during decoding. Defaults to 40.

  • mask (bool) – If True, mask padding in feature map. Defaults to True.

  • pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state. Defaults to False.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

forward_test(feat, out_enc, data_samples=None)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing valid_ratio information.

Returns

Character probabilities of shape \((N, self.max_seq_len, C)\) where \(C\) is num_classes.

Return type

Tensor

forward_train(feat, out_enc, data_samples=None)[source]
Parameters
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • data_samples (list[TextRecogDataSample]) – Batch of TextRecogDataSample, containing gt_text and valid_ratio information.

Returns

A raw logit tensor of shape \((N, T, C)\).

Return type

Tensor

Text Recognition Module Losses

class mmocr.models.textrecog.module_losses.ABIModuleLoss(dictionary, max_seq_len=40, letter_case='unchanged', weight_vis=1.0, weight_lang=1.0, weight_fusion=1.0, **kwargs)[source]

Implementation of ABINet multiloss that allows mixing different types of losses with weights.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 40.

  • letter_case (str) –

    There are three options to alter the letter cases of gt texts. Usually, it only works for English characters. Defaults to ‘unchanged’. Options are:

    • unchanged: Do not change gt texts.

    • upper: Convert gt texts into uppercase characters.

    • lower: Convert gt texts into lowercase characters.

  • weight_vis (float or int) – The weight of vision decoder loss. Defaults to 1.0.

  • weight_lang (float or int) – The weight of language decoder loss. Defaults to 1.0.

  • weight_fusion (float or int) – The weight of fuser (aligner) loss. Defaults to 1.0.

Return type

None

forward(outputs, data_samples)[source]
Parameters
  • outputs (dict) – The output dictionary with at least one of out_vis, out_langs and out_fusers specified.

  • data_samples (list[TextRecogDataSample]) – List of TextRecogDataSample which are processed by get_target.

Returns

A loss dictionary with loss_visual, loss_lang and loss_fusion. Each should either be the loss tensor or None if the output of its corresponding module is not given.

Return type

dict
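
The weighting and None-passthrough behaviour described above can be sketched in plain Python. This is an illustration of the documented rules only; real sub-losses are tensors and combine_abi_losses is a hypothetical helper, not MMOCR code.

```python
# Sketch of ABI-style multi-loss weighting: weight each sub-loss and keep
# None when the corresponding module output was not given.
def combine_abi_losses(loss_visual=None, loss_lang=None, loss_fusion=None,
                       weight_vis=1.0, weight_lang=1.0, weight_fusion=1.0):
    return {
        'loss_visual': None if loss_visual is None else weight_vis * loss_visual,
        'loss_lang': None if loss_lang is None else weight_lang * loss_lang,
        'loss_fusion': None if loss_fusion is None else weight_fusion * loss_fusion,
    }

# Only the vision and fusion branches produced outputs here.
losses = combine_abi_losses(loss_visual=0.8, loss_fusion=0.4, weight_fusion=0.5)
print(losses)  # loss_lang stays None, loss_fusion is halved
```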

class mmocr.models.textrecog.module_losses.BaseTextRecogModuleLoss(dictionary, max_seq_len=40, letter_case='unchanged', pad_with='auto', **kwargs)[source]

Base recognition loss.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 40.

  • letter_case (str) –

    There are three options to alter the letter cases of gt texts. Usually, it only works for English characters. Defaults to ‘unchanged’. Options are:

    • unchanged: Do not change gt texts.

    • upper: Convert gt texts into uppercase characters.

    • lower: Convert gt texts into lowercase characters.

  • pad_with (str) –

    The padding strategy for gt_text.padded_indexes. Defaults to ‘auto’. Options are:

    • ’auto’: Use dictionary.padding_idx to pad gt texts, or dictionary.end_idx if dictionary.padding_idx is None.

    • ’padding’: Always use dictionary.padding_idx to pad gt texts.

    • ’end’: Always use dictionary.end_idx to pad gt texts.

    • ’none’: Do not pad gt texts.

Return type

None

get_targets(data_samples)[source]

Target generator.

Parameters

data_samples (list[TextRecogDataSample]) – It usually includes gt_text information.

Returns

Updated data_samples. Two keys will be added to data_sample:

  • indexes (torch.LongTensor): Character indexes representing gt texts. All special tokens are excluded, except for UKN.

  • padded_indexes (torch.LongTensor): Character indexes representing gt texts with BOS and EOS if applicable, following several padding indexes until the length reaches max_seq_len. In particular, if pad_with='none', no padding will be applied.

Return type

list[TextRecogDataSample]
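
The pad_with rules described above can be sketched in plain Python. This is an illustration of the documented strategies, not MMOCR's implementation; padding_idx and end_idx stand in for the dictionary attributes.

```python
# Sketch of padding character indexes up to max_seq_len following the
# 'auto' / 'padding' / 'end' / 'none' strategies.
def pad_indexes(indexes, max_seq_len, pad_with='auto',
                padding_idx=None, end_idx=None):
    if pad_with == 'none':
        return list(indexes)                 # no padding at all
    if pad_with == 'auto':                   # prefer padding_idx, else end_idx
        pad_idx = padding_idx if padding_idx is not None else end_idx
    elif pad_with == 'padding':
        pad_idx = padding_idx
    else:                                    # 'end'
        pad_idx = end_idx
    padded = list(indexes)[:max_seq_len]
    return padded + [pad_idx] * (max_seq_len - len(padded))

print(pad_indexes([3, 7, 5], max_seq_len=6, padding_idx=0))
# [3, 7, 5, 0, 0, 0]
print(pad_indexes([3, 7, 5], max_seq_len=6, pad_with='end', end_idx=1))
# [3, 7, 5, 1, 1, 1]
```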

class mmocr.models.textrecog.module_losses.CEModuleLoss(dictionary, max_seq_len=40, letter_case='unchanged', pad_with='auto', ignore_char='padding', flatten=False, reduction='none', ignore_first_char=False)[source]

Implementation of loss module for encoder-decoder based text recognition method with CrossEntropy loss.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • max_seq_len (int) – Maximum sequence length. The sequence is usually generated from decoder. Defaults to 40.

  • letter_case (str) –

    There are three options to alter the letter cases of gt texts. Usually, it only works for English characters. Defaults to ‘unchanged’. Options are:

    • unchanged: Do not change gt texts.

    • upper: Convert gt texts into uppercase characters.

    • lower: Convert gt texts into lowercase characters.

  • pad_with (str) –

    The padding strategy for gt_text.padded_indexes. Defaults to ‘auto’. Options are:

    • ’auto’: Use dictionary.padding_idx to pad gt texts, or dictionary.end_idx if dictionary.padding_idx is None.

    • ’padding’: Always use dictionary.padding_idx to pad gt texts.

    • ’end’: Always use dictionary.end_idx to pad gt texts.

    • ’none’: Do not pad gt texts.

  • ignore_char (int or str) – Specifies a target value that is ignored and does not contribute to the input gradient. ignore_char can be int or str. If int, it is the index of the ignored char. If str, it is the character to ignore. Apart from single characters, each item can be one of the following reserved keywords: ‘padding’, ‘start’, ‘end’, and ‘unknown’, which refer to their corresponding special tokens in the dictionary. It will not ignore any special tokens when ignore_char == -1 or ‘none’. Defaults to ‘padding’.

  • flatten (bool) – Whether to flatten the output and target before computing CE loss. Defaults to False.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’). Defaults to ‘none’.

  • ignore_first_char (bool) – Whether to ignore the first token in target ( usually the start token). If True, the last token of the output sequence will also be removed to be aligned with the target length. Defaults to False.

forward(outputs, data_samples)[source]
Parameters
  • outputs (Tensor) – A raw logit tensor of shape \((N, T, C)\).

  • data_samples (list[TextRecogDataSample]) – List of TextRecogDataSample which are processed by get_target.

Returns

A loss dict with the key loss_ce.

Return type

dict
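
How an ignore_char value might be resolved into a target index can be sketched as follows. This is a sketch of the documented rules only, not the actual MMOCR code; char2idx and special_idx are hypothetical lookup tables.

```python
# Sketch of resolving CEModuleLoss's ignore_char argument into an index.
def resolve_ignore_index(ignore_char, char2idx, special_idx):
    if ignore_char in (-1, 'none'):
        return -1                        # do not ignore any special token
    if isinstance(ignore_char, int):
        return ignore_char               # already an index
    if ignore_char in ('padding', 'start', 'end', 'unknown'):
        return special_idx[ignore_char]  # reserved keyword -> special token
    return char2idx[ignore_char]         # a literal character to ignore

special = {'padding': 36, 'start': 37, 'end': 38, 'unknown': 39}
print(resolve_ignore_index('padding', {'a': 0}, special))  # 36
print(resolve_ignore_index('a', {'a': 0}, special))        # 0
```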

class mmocr.models.textrecog.module_losses.CTCModuleLoss(dictionary, letter_case='unchanged', flatten=True, reduction='mean', zero_infinity=False, **kwargs)[source]

Implementation of loss module for CTC-loss based text recognition.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • letter_case (str) –

    There are three options to alter the letter cases of gt texts. Usually, it only works for English characters. Defaults to ‘unchanged’. Options are:

    • unchanged: Do not change gt texts.

    • upper: Convert gt texts into uppercase characters.

    • lower: Convert gt texts into lowercase characters.

  • flatten (bool) – If True, use flattened targets, else padded targets. Defaults to True.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

  • zero_infinity (bool) – Whether to zero infinite losses and the associated gradients. Default: False. Infinite losses mainly occur when the inputs are too short to be aligned to the targets.

Return type

None

forward(outputs, data_samples)[source]
Parameters
  • outputs (Tensor) – A raw logit tensor of shape \((N, T, C)\).

  • data_samples (list[TextRecogDataSample]) – List of TextRecogDataSample which are processed by get_target.

Returns

The loss dict with key loss_ctc.

Return type

dict

get_targets(data_samples)[source]

Target generator.

Parameters

data_samples (list[TextRecogDataSample]) – It usually includes gt_text information.

Returns

Updated data_samples. One key will be added to each data_sample:

  • indexes (torch.LongTensor): The index corresponding to the item.

Return type

list[TextRecogDataSample]
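
The difference between flattened and padded CTC targets can be sketched in plain Python. This is an illustrative sketch of the two target layouts (the flattened form, together with per-sample lengths, is what torch.nn.CTCLoss consumes); ctc_targets is a hypothetical helper, not MMOCR code.

```python
# Sketch of building CTC targets for a batch of gt texts:
# flatten=True  -> one concatenated 1-D target plus per-sample lengths;
# flatten=False -> every sequence padded to the same length.
def ctc_targets(texts, char2idx, flatten=True, padding_idx=0, max_seq_len=8):
    indexes = [[char2idx[c] for c in t] for t in texts]
    lengths = [len(seq) for seq in indexes]
    if flatten:
        # The lengths list recovers the per-sample boundaries.
        return [i for seq in indexes for i in seq], lengths
    padded = [seq + [padding_idx] * (max_seq_len - len(seq)) for seq in indexes]
    return padded, lengths

char2idx = {'a': 1, 'b': 2, 'c': 3}
print(ctc_targets(['ab', 'c'], char2idx))  # ([1, 2, 3], [2, 1])
print(ctc_targets(['ab', 'c'], char2idx, flatten=False, max_seq_len=4))
# ([[1, 2, 0, 0], [3, 0, 0, 0]], [2, 1])
```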

KIE Extractors

class mmocr.models.kie.extractors.SDMGR(backbone=None, roi_extractor=None, neck=None, kie_head=None, dictionary=None, data_preprocessor=None, init_cfg=None)[source]

The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.

Parameters
  • backbone (dict, optional) – Config of backbone. If None, None will be passed to kie_head during training and testing. Defaults to None.

  • roi_extractor (dict, optional) – Config of roi extractor. Only applicable when backbone is not None. Defaults to None.

  • neck (dict, optional) – Config of neck. Defaults to None.

  • kie_head (dict) – Config of KIE head. Defaults to None.

  • dictionary (dict, optional) – Config of dictionary. Defaults to None.

  • data_preprocessor (dict or ConfigDict, optional) – The pre-process config of BaseDataPreprocessor. It usually includes pad_size_divisor, pad_value, mean and std. It has to be None when working in non-visual mode. Defaults to None.

  • init_cfg (dict or list[dict], optional) – Initialization configs. Defaults to None.

Return type

None

extract_feat(img, gt_bboxes)[source]

Extract features from images if self.backbone is not None. It returns None otherwise.

Parameters
  • img (torch.Tensor) – The input image with shape (N, C, H, W).

  • gt_bboxes (list[torch.Tensor]) – A list of ground truth bounding boxes, each of shape \((N_i, 4)\).

Returns

The extracted features with shape (N, E).

Return type

torch.Tensor

forward(inputs, data_samples=None, mode='tensor', **kwargs)[source]

The unified entry for a forward process in both training and test.

The method should accept three modes: “tensor”, “predict” and “loss”:

  • “tensor”: Forward the whole network and return a tensor or tuple of tensors without any post-processing, same as a common nn.Module.

  • “predict”: Forward and return the predictions, which are fully processed to a list of DetDataSample.

  • “loss”: Forward and return a dict of losses according to the given inputs and data samples.

Note that this method handles neither back propagation nor optimizer updating, which are done in the train_step().

Parameters
  • inputs (torch.Tensor) – The input tensor with shape (N, C, …) in general.

  • data_samples (list[DetDataSample], optional) – The annotation data of every samples. Defaults to None.

  • mode (str) – Return what kind of value. Defaults to ‘tensor’.

Returns

The return type depends on mode.

  • If mode="tensor", return a tensor or a tuple of tensor.

  • If mode="predict", return a list of DetDataSample.

  • If mode="loss", return a dict of tensor.

Return type

torch.Tensor
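
The three-mode entry point can be sketched as a simple dispatch. This is illustrative only; forward_dispatch and the run_* callables are stand-ins for the network's tensor, predict and loss paths.

```python
# Minimal sketch of a mode-dispatching forward entry point.
def forward_dispatch(mode, run_tensor, run_predict, run_loss):
    handlers = {'tensor': run_tensor, 'predict': run_predict, 'loss': run_loss}
    if mode not in handlers:
        raise RuntimeError(f'Invalid mode "{mode}". '
                           'Only supports tensor, predict and loss.')
    return handlers[mode]()

result = forward_dispatch('loss',
                          run_tensor=lambda: ('t',),
                          run_predict=lambda: ['sample'],
                          run_loss=lambda: {'loss': 0.1})
print(result)  # {'loss': 0.1}
```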

loss(inputs, data_samples, **kwargs)[source]

Calculate losses from a batch of inputs and data samples.

Parameters
  • inputs (torch.Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • data_samples (list[KIEDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.

Returns

A dictionary of loss components.

Return type

dict[str, Tensor]

predict(inputs, data_samples, **kwargs)[source]

Predict results from a batch of inputs and data samples with post-processing.

Parameters
  • inputs (torch.Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • data_samples (list[KIEDataSample]) – A list of N datasamples, containing meta information and gold annotations for each of the images.

Returns

A list of datasamples of prediction results. Results are stored in pred_instances.labels and pred_instances.edge_labels.

Return type

List[KIEDataSample]

KIE Heads

class mmocr.models.kie.heads.SDMGRHead(dictionary, num_classes=26, visual_dim=64, fusion_dim=1024, node_input=32, node_embed=256, edge_input=5, edge_embed=256, num_gnn=2, bidirectional=False, relation_norm=10.0, module_loss={'type': 'SDMGRModuleLoss'}, postprocessor={'type': 'SDMGRPostProcessor'}, init_cfg={'mean': 0, 'override': {'name': 'edge_embed'}, 'std': 0.01, 'type': 'Normal'})[source]

SDMGR Head.

Parameters
  • dictionary (dict or Dictionary) – The config for Dictionary or the instance of Dictionary.

  • num_classes (int) – Number of class labels. Defaults to 26.

  • visual_dim (int) – Dimension of visual features \(E\). Defaults to 64.

  • fusion_dim (int) – Dimension of fusion layer. Defaults to 1024.

  • node_input (int) – Dimension of raw node embedding. Defaults to 32.

  • node_embed (int) – Dimension of node embedding. Defaults to 256.

  • edge_input (int) – Dimension of raw edge embedding. Defaults to 5.

  • edge_embed (int) – Dimension of edge embedding. Defaults to 256.

  • num_gnn (int) – Number of GNN layers. Defaults to 2.

  • bidirectional (bool) – Whether to use bidirectional RNN to embed nodes. Defaults to False.

  • relation_norm (float) – Norm to map value from one range to another. Defaults to 10.

  • module_loss (dict) – Module Loss config. Defaults to dict(type='SDMGRModuleLoss').

  • postprocessor (dict) – Postprocessor config. Defaults to dict(type='SDMGRPostProcessor').

  • init_cfg (dict or list[dict], optional) – Initialization configs.

Return type

None

compute_relations(data_samples)[source]

Compute the relations between every two boxes for each datasample, then return the concatenated relations.

Parameters

data_samples (List[mmocr.structures.kie_data_sample.KIEDataSample]) –

Return type

torch.Tensor
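
The pairwise structure that compute_relations produces can be sketched in plain Python. This is illustrative only: the exact relation features used by SDMGR differ, and box_relations is a hypothetical helper showing normalized offsets and size ratios between every pair of axis-aligned boxes (x1, y1, x2, y2).

```python
# Sketch of pairwise geometric relation features between boxes.
def box_relations(boxes, norm=10.0):
    feats = []
    for (ax1, ay1, ax2, ay2) in boxes:
        aw, ah = ax2 - ax1, ay2 - ay1
        row = []
        for (bx1, by1, bx2, by2) in boxes:
            dx = (bx1 - ax1) / norm            # normalized horizontal offset
            dy = (by1 - ay1) / norm            # normalized vertical offset
            row.append((dx, dy, (bx2 - bx1) / aw, (by2 - by1) / ah))
        feats.append(row)
    return feats  # nested list of shape (N, N, 4)

rel = box_relations([(0, 0, 10, 10), (20, 0, 30, 10)])
print(rel[0][1])  # (2.0, 0.0, 1.0, 1.0)
```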

convert_texts(data_samples)[source]

Extract texts in datasamples and pack them into a batch.

Parameters

data_samples (List[KIEDataSample]) – List of data samples.

Returns

  • node_nums (List[int]): A list of node numbers for each sample.

  • char_nums (List[Tensor]): A list of character numbers for each sample.

  • nodes (Tensor): A tensor of shape \((N, C)\) where \(C\) is the maximum number of characters in a sample.

Return type

tuple(List[int], List[Tensor], Tensor)

forward(inputs, data_samples)[source]
Parameters
Returns

  • node_cls (Tensor): Raw logits scores for nodes. Shape \((N, C_{l})\) where \(C_{l}\) is number of classes.

  • edge_cls (Tensor): Raw logits scores for edges. Shape \((N * N, 2)\).

Return type

tuple(Tensor, Tensor)

loss(inputs, data_samples)[source]

Calculate losses from a batch of inputs and data samples.

Parameters
  • inputs (torch.Tensor) – Shape \((N, E)\).

  • data_samples (List[KIEDataSample]) – List of data samples.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

predict(inputs, data_samples)[source]

Predict results from a batch of inputs and data samples with post-processing.

Parameters
Returns

A list of datasamples of prediction results. Results are stored in pred_instances.labels, pred_instances.scores, pred_instances.edge_labels and pred_instances.edge_scores.

  • labels (Tensor): An integer tensor of shape (N, ) indicating bbox labels for each image.

  • scores (Tensor): A float tensor of shape (N, ), indicating the confidence scores for node label predictions.

  • edge_labels (Tensor): An integer tensor of shape (N, N) indicating the connection between nodes. Options are 0, 1.

  • edge_scores (Tensor): A float tensor of shape (N, ), indicating the confidence scores for edge predictions.

Return type

List[KIEDataSample]
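
How raw edge scores of shape \((N * N, 2)\) turn into an \((N, N)\) label matrix can be sketched in plain Python. This is an illustrative argmax over flat score pairs; the real postprocessor also produces softmax scores, and edge_labels_from_logits is a hypothetical helper, not MMOCR code.

```python
# Sketch: convert a flat list of N*N (score_no_edge, score_edge) pairs
# into an (N, N) nested list of 0/1 edge labels by argmax.
def edge_labels_from_logits(edge_cls, n):
    labels = []
    for i in range(n):
        row = edge_cls[i * n:(i + 1) * n]      # scores from node i to all nodes
        labels.append([1 if s1 > s0 else 0 for (s0, s1) in row])
    return labels

print(edge_labels_from_logits(
    [(0.9, 0.1), (0.2, 0.8), (0.3, 0.7), (0.6, 0.4)], n=2))
# [[0, 1], [1, 0]]
```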

KIE Module Losses

class mmocr.models.kie.module_losses.SDMGRModuleLoss(weight_node=1.0, weight_edge=1.0, ignore_idx=-100)[source]

The implementation of the loss for key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.

Parameters
  • weight_node (float) – Weight of node loss. Defaults to 1.0.

  • weight_edge (float) – Weight of edge loss. Defaults to 1.0.

  • ignore_idx (int) – Node label to ignore. Defaults to -100.

Return type

None

forward(preds, data_samples)[source]

Forward function.

Parameters
  • preds (tuple(Tensor, Tensor)) –

  • data_samples (list[KIEDataSample]) – A list of datasamples containing gt_instances.labels and gt_instances.edge_labels.

Returns

Loss dict, containing loss_node, loss_edge, acc_node and acc_edge.

Return type

dict(str, Tensor)
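
The weight_node/weight_edge weighting can be sketched in plain Python. This illustrates only the weighted combination; the accuracy entries are omitted, real values are tensors, and sdmgr_loss is a hypothetical helper, not MMOCR code.

```python
# Sketch of the weighted node/edge loss combination.
def sdmgr_loss(loss_node, loss_edge, weight_node=1.0, weight_edge=1.0):
    return {
        'loss_node': weight_node * loss_node,
        'loss_edge': weight_edge * loss_edge,
    }

print(sdmgr_loss(0.5, 0.25, weight_edge=2.0))
# {'loss_node': 0.5, 'loss_edge': 0.5}
```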

mmocr.structures

Text Detection Data Sample

class mmocr.structures.textdet_data_sample.TextDetDataSample(*, metainfo=None, **kwargs)[source]

A data structure interface of MMOCR. They are used as interfaces between different components.

The attributes in TextDetDataSample are divided into two parts:

  • ``gt_instances`` (InstanceData): Ground truth of instance annotations.

  • ``pred_instances`` (InstanceData): Instances of model predictions.

Examples

>>> import torch
>>> import numpy as np
>>> from mmengine.structures import InstanceData
>>> from mmocr.structures import TextDetDataSample
>>> # gt_instances
>>> data_sample = TextDetDataSample()
>>> img_meta = dict(img_shape=(800, 1196, 3),
...                 pad_shape=(800, 1216, 3))
>>> gt_instances = InstanceData(metainfo=img_meta)
>>> gt_instances.bboxes = torch.rand((5, 4))
>>> gt_instances.labels = torch.rand((5,))
>>> data_sample.gt_instances = gt_instances
>>> assert 'img_shape' in data_sample.gt_instances.metainfo_keys()
>>> len(data_sample.gt_instances)
5
>>> print(data_sample)
<TextDetDataSample(
    META INFORMATION
    DATA FIELDS
    gt_instances: <InstanceData(
            META INFORMATION
            pad_shape: (800, 1216, 3)
            img_shape: (800, 1196, 3)
            DATA FIELDS
            labels: tensor([0.8533, 0.1550, 0.5433, 0.7294, 0.5098])
            bboxes: tensor([[9.7725e-01, 5.8417e-01, 1.7269e-01, 6.5694e-01],
                    [1.7894e-01, 5.1780e-01, 7.0590e-01, 4.8589e-01],
                    [7.0392e-01, 6.6770e-01, 1.7520e-01, 1.4267e-01],
                    [2.2411e-01, 5.1962e-01, 9.6953e-01, 6.6994e-01],
                    [4.1338e-01, 2.1165e-01, 2.7239e-04, 6.8477e-01]])
        ) at 0x7f21fb1b9190>
) at 0x7f21fb1b9880>
>>> # pred_instances
>>> pred_instances = InstanceData(metainfo=img_meta)
>>> pred_instances.bboxes = torch.rand((5, 4))
>>> pred_instances.scores = torch.rand((5,))
>>> data_sample = TextDetDataSample(pred_instances=pred_instances)
>>> assert 'pred_instances' in data_sample
>>> data_sample = TextDetDataSample()
>>> gt_instances_data = dict(
...                        bboxes=torch.rand(2, 4),
...                        labels=torch.rand(2),
...                        masks=np.random.rand(2, 2, 2))
>>> gt_instances = InstanceData(**gt_instances_data)
>>> data_sample.gt_instances = gt_instances
>>> assert 'gt_instances' in data_sample
>>> assert 'masks' in data_sample.gt_instances
Parameters

metainfo (Optional[dict]) –

Return type

None

property gt_instances: mmengine.structures.instance_data.InstanceData

groundtruth instances.

Type

InstanceData

property pred_instances: mmengine.structures.instance_data.InstanceData

prediction instances.

Type

InstanceData

Text Recognition Data Sample

class mmocr.structures.textrecog_data_sample.TextRecogDataSample(*, metainfo=None, **kwargs)[source]

A data structure interface of MMOCR for text recognition. They are used as interfaces between different components.

The attributes in TextRecogDataSample are divided into two parts:

  • ``gt_text`` (LabelData): Ground truth text.

  • ``pred_text`` (LabelData): Predicted text.

Examples

>>> import torch
>>> import numpy as np
>>> from mmengine.structures import LabelData
>>> from mmocr.structures import TextRecogDataSample
>>> # gt_text
>>> data_sample = TextRecogDataSample()
>>> img_meta = dict(img_shape=(800, 1196, 3),
...                 pad_shape=(800, 1216, 3))
>>> gt_text = LabelData(metainfo=img_meta)
>>> gt_text.item = 'mmocr'
>>> data_sample.gt_text = gt_text
>>> assert 'img_shape' in data_sample.gt_text.metainfo_keys()
>>> print(data_sample)
<TextRecogDataSample(
    META INFORMATION
    DATA FIELDS
    gt_text: <LabelData(
            META INFORMATION
            pad_shape: (800, 1216, 3)
            img_shape: (800, 1196, 3)
            DATA FIELDS
            item: 'mmocr'
        ) at 0x7f21fb1b9190>
) at 0x7f21fb1b9880>
>>> # pred_text
>>> pred_text = LabelData(metainfo=img_meta)
>>> pred_text.item = 'mmocr'
>>> data_sample = TextRecogDataSample(pred_text=pred_text)
>>> assert 'pred_text' in data_sample
>>> data_sample = TextRecogDataSample()
>>> gt_text_data = dict(item='mmocr')
>>> gt_text = LabelData(**gt_text_data)
>>> data_sample.gt_text = gt_text
>>> assert 'gt_text' in data_sample
>>> assert 'item' in data_sample.gt_text
Parameters

metainfo (Optional[dict]) –

Return type

None

property gt_text: mmengine.structures.label_data.LabelData

ground truth text.

Type

LabelData

property pred_text: mmengine.structures.label_data.LabelData

prediction text.

Type

LabelData

KIE Data Sample

class mmocr.structures.kie_data_sample.KIEDataSample(*, metainfo=None, **kwargs)[source]

A data structure interface of MMOCR. They are used as interfaces between different components.

The attributes in KIEDataSample are divided into two parts:

  • ``gt_instances`` (InstanceData): Ground truth of instance annotations.

  • ``pred_instances`` (InstanceData): Instances of model predictions.

Examples

>>> import torch
>>> import numpy as np
>>> from mmengine.structures import InstanceData
>>> from mmocr.structures import KIEDataSample
>>> # gt_instances
>>> data_sample = KIEDataSample()
>>> img_meta = dict(img_shape=(800, 1196, 3),
...                 pad_shape=(800, 1216, 3))
>>> gt_instances = InstanceData(metainfo=img_meta)
>>> gt_instances.bboxes = torch.rand((5, 4))
>>> gt_instances.labels = torch.rand((5,))
>>> data_sample.gt_instances = gt_instances
>>> assert 'img_shape' in data_sample.gt_instances.metainfo_keys()
>>> len(data_sample.gt_instances)
5
>>> print(data_sample)
<KIEDataSample(
    META INFORMATION
    DATA FIELDS
    gt_instances: <InstanceData(
            META INFORMATION
            pad_shape: (800, 1216, 3)
            img_shape: (800, 1196, 3)
            DATA FIELDS
            labels: tensor([0.8533, 0.1550, 0.5433, 0.7294, 0.5098])
            bboxes: tensor([[9.7725e-01, 5.8417e-01, 1.7269e-01, 6.5694e-01],
                    [1.7894e-01, 5.1780e-01, 7.0590e-01, 4.8589e-01],
                    [7.0392e-01, 6.6770e-01, 1.7520e-01, 1.4267e-01],
                    [2.2411e-01, 5.1962e-01, 9.6953e-01, 6.6994e-01],
                    [4.1338e-01, 2.1165e-01, 2.7239e-04, 6.8477e-01]])
        ) at 0x7f21fb1b9190>
) at 0x7f21fb1b9880>
>>> # pred_instances
>>> pred_instances = InstanceData(metainfo=img_meta)
>>> pred_instances.bboxes = torch.rand((5, 4))
>>> pred_instances.scores = torch.rand((5,))
>>> data_sample = KIEDataSample(pred_instances=pred_instances)
>>> assert 'pred_instances' in data_sample
>>> data_sample = KIEDataSample()
>>> gt_instances_data = dict(
...                        bboxes=torch.rand(2, 4),
...                        labels=torch.rand(2))
>>> gt_instances = InstanceData(**gt_instances_data)
>>> data_sample.gt_instances = gt_instances
>>> assert 'gt_instances' in data_sample
Parameters

metainfo (Optional[dict]) –

Return type

None

property gt_instances: mmengine.structures.instance_data.InstanceData

groundtruth instances.

Type

InstanceData

property pred_instances: mmengine.structures.instance_data.InstanceData

prediction instances.

Type

InstanceData

mmocr.visualization

Text Detection Visualizer

class mmocr.visualization.textdet_visualizer.TextDetLocalVisualizer(name='visualizer', image=None, with_poly=True, with_bbox=False, vis_backends=None, save_dir=None, gt_color='g', pred_color='r', line_width=2, alpha=0.8)[source]

The MMOCR Text Detection Local Visualizer.

Parameters
  • name (str) – Name of the instance. Defaults to ‘visualizer’.

  • image (np.ndarray, optional) – The origin image to draw. The format should be RGB. Defaults to None.

  • with_poly (bool) – Whether to draw polygons. Defaults to True.

  • with_bbox (bool) – Whether to draw bboxes. Defaults to False.

  • vis_backends (list, optional) – Visual backend config list. Defaults to None.

  • save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.

  • gt_color (Union[str, tuple, list[str], list[tuple]]) – The colors of GT polygons and bboxes. colors can have the same length with lines or just single value. If colors is single value, all the lines will have the same colors. Refer to matplotlib.colors for full list of formats that are accepted. Defaults to ‘g’.

  • pred_color (Union[str, tuple, list[str], list[tuple]]) – The colors of pred polygons and bboxes. colors can have the same length with lines or just single value. If colors is single value, all the lines will have the same colors. Refer to matplotlib.colors for full list of formats that are accepted. Defaults to ‘r’.

  • line_width (int, float) – The linewidth of lines. Defaults to 2.

  • alpha (float) – The transparency of bboxes or polygons. Defaults to 0.8.

Return type

None

add_datasample(name, image, data_sample=None, draw_gt=True, draw_pred=True, show=False, wait_time=0, out_file=None, pred_score_thr=0.3, step=0)[source]

Draw datasample and save to all backends.

  • If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.

  • If show is True, all storage backends are ignored, and the images will be displayed in a local window.

  • If out_file is specified, the drawn image will be saved to out_file. This is usually used when the display is not available.

Parameters
  • name (str) – The image identifier.

  • image (np.ndarray) – The image to draw.

  • data_sample (TextDetDataSample, optional) – TextDetDataSample which contains gt and prediction. Defaults to None.

  • draw_gt (bool) – Whether to draw GT TextDetDataSample. Defaults to True.

  • draw_pred (bool) – Whether to draw Predicted TextDetDataSample. Defaults to True.

  • show (bool) – Whether to display the drawn image. Defaults to False.

  • wait_time (float) – The interval of show (s). Defaults to 0.

  • out_file (str) – Path to output file. Defaults to None.

  • pred_score_thr (float) – The threshold to visualize the bboxes and masks. Defaults to 0.3.

  • step (int) – Global step value to record. Defaults to 0.

Return type

None

Text Recognition Visualizer

class mmocr.visualization.textrecog_visualizer.TextRecogLocalVisualizer(name='visualizer', image=None, vis_backends=None, save_dir=None, gt_color='g', pred_color='r')[source]

The MMOCR Text Recognition Local Visualizer.

Parameters
  • name (str) – Name of the instance. Defaults to ‘visualizer’.

  • image (np.ndarray, optional) – The origin image to draw. The format should be RGB. Defaults to None.

  • vis_backends (list, optional) – Visual backend config list. Defaults to None.

  • save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.

  • gt_color (str or tuple[int, int, int]) – Colors of GT text. The tuple of color should be in RGB order. Or using an abbreviation of color, such as ‘g’ for ‘green’. Defaults to ‘g’.

  • pred_color (str or tuple[int, int, int]) – Colors of Predicted text. The tuple of color should be in RGB order. Or using an abbreviation of color, such as ‘r’ for ‘red’. Defaults to ‘r’.

Return type

None

add_datasample(name, image, data_sample=None, draw_gt=True, draw_pred=True, show=False, wait_time=0, pred_score_thr=None, out_file=None, step=0)[source]

Visualize datasample and save to all backends.

  • If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.

  • If show is True, all storage backends are ignored, and the images will be displayed in a local window.

  • If out_file is specified, the drawn image will be saved to out_file. This is usually used when the display is not available.

Parameters
  • name (str) – The image title. Defaults to ‘image’.

  • image (np.ndarray) – The image to draw.

  • data_sample (TextRecogDataSample, optional) – TextRecogDataSample which contains gt and prediction. Defaults to None.

  • draw_gt (bool) – Whether to draw GT TextRecogDataSample. Defaults to True.

  • draw_pred (bool) – Whether to draw Predicted TextRecogDataSample. Defaults to True.

  • show (bool) – Whether to display the drawn image. Defaults to False.

  • wait_time (float) – The interval of show (s). Defaults to 0.

  • out_file (str) – Path to output file. Defaults to None.

  • step (int) – Global step value to record. Defaults to 0.

  • pred_score_thr (float) – Threshold of prediction score. It’s not used in this function. Defaults to None.

Return type

None

Text Spotting Visualizer

class mmocr.visualization.textspotting_visualizer.TextSpottingLocalVisualizer(name='visualizer', image=None, vis_backends=None, save_dir=None, fig_save_cfg={'frameon': False}, fig_show_cfg={'frameon': False})[source]
Parameters
  • image (Optional[numpy.ndarray]) –

  • vis_backends (Optional[List[Dict]]) –

  • save_dir (Optional[str]) –

Return type

None
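A typical way to wire such a visualizer into an MMOCR config file is through a vis_backends list. This is a minimal config-style sketch; the registry type names follow MMEngine/MMOCR conventions but are assumptions here, not verified against a specific version, and the save_dir path is hypothetical:

```python
# Config fragment: attach a local (disk-writing) backend to the visualizer.
vis_backends = [dict(type='LocalVisBackend')]

visualizer = dict(
    type='TextSpottingLocalVisualizer',  # registry name; assumed, see lead-in
    name='visualizer',
    vis_backends=vis_backends,
    save_dir='work_dirs/vis')  # where backends write images; hypothetical path

print(visualizer['type'])  # TextSpottingLocalVisualizer
```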

add_datasample(name, image, data_sample=None, draw_gt=True, draw_pred=True, show=False, wait_time=0, pred_score_thr=None, out_file=None, step=0)[source]

Draw datasample.

Parameters

Return type

None

KIE Visualizer

class mmocr.visualization.kie_visualizer.KIELocalVisualizer(name='kie_visualizer', is_openset=False, **kwargs)[source]

The MMOCR Key Information Extraction (KIE) Local Visualizer.

Parameters
  • name (str) – Name of the instance. Defaults to ‘visualizer’.

  • image (np.ndarray, optional) – The origin image to draw. The format should be RGB. Defaults to None.

  • vis_backends (list, optional) – Visual backend config list. Defaults to None.

  • save_dir (str, optional) – Save file dir for all storage backends. If it is None, the backend storage will not save any data.

  • fig_save_cfg (dict) – Keyword parameters of figure for saving. Defaults to dict(frameon=False).

  • fig_show_cfg (dict) – Keyword parameters of figure for showing. Defaults to dict(frameon=False).

  • is_openset (bool, optional) – Whether the visualizer is used in OpenSet. Defaults to False.

Return type

None

add_datasample(name, image, data_sample=None, draw_gt=True, draw_pred=True, show=False, wait_time=0, pred_score_thr=None, out_file=None, step=0)[source]

Draw datasample and save to all backends.

  • If GT and prediction are plotted at the same time, they are displayed in a stitched image where the left image is the ground truth and the right image is the prediction.

  • If show is True, all storage backends are ignored, and the images will be displayed in a local window.

  • If out_file is specified, the drawn image will be saved to out_file. This is usually used when the display is not available.

Parameters
  • name (str) – The image identifier.

  • image (np.ndarray) – The image to draw.

  • data_sample (KIEDataSample, optional) – KIEDataSample which contains gt and prediction. Defaults to None.

  • draw_gt (bool) – Whether to draw GT KIEDataSample. Defaults to True.

  • draw_pred (bool) – Whether to draw Predicted KIEDataSample. Defaults to True.

  • show (bool) – Whether to display the drawn image. Defaults to False.

  • wait_time (float) – The interval of show (s). Defaults to 0.

  • pred_score_thr (float) – The threshold to visualize the bboxes and masks. Defaults to 0.3.

  • out_file (str) – Path to output file. Defaults to None.

  • step (int) – Global step value to record. Defaults to 0.

Return type

None

draw_arrows(x_data, y_data, colors='C1', line_widths=1, line_styles='-', arrow_tail_widths=0.001, arrow_head_widths=None, arrow_head_lengths=None, arrow_shapes='full', overhangs=0)[source]

Draw single or multiple arrows.

Parameters
  • x_data (np.ndarray or torch.Tensor) – The x coordinates of each line’s start and end points.

  • y_data (np.ndarray or torch.Tensor) – The y coordinates of each line’s start and end points.

  • colors (str or tuple or list[str or tuple]) – The colors of lines. colors can have the same length as the lines or be a single value, in which case all the lines will have the same color. Refer to https://matplotlib.org/stable/gallery/color/named_colors.html for more details. Defaults to ‘C1’.

  • line_widths (int or float or list[int or float]) – The linewidth of lines. line_widths can have the same length as the lines or be a single value, in which case all the lines will have the same linewidth. Defaults to 1.

  • line_styles (str or list[str]) – The linestyle of lines. line_styles can have the same length as the lines or be a single value, in which case all the lines will have the same linestyle. Defaults to ‘-’.

  • arrow_tail_widths (int or float or list[int or float]) – The width of arrow tails. arrow_tail_widths can have the same length as the lines or be a single value, in which case all the arrows will have the same tail width. Defaults to 0.001.

  • arrow_head_widths (int or float or list[int or float]) – The width of arrow heads. arrow_head_widths can have the same length as the lines or be a single value, in which case all the arrows will have the same head width. Defaults to None.

  • arrow_head_lengths (int or float or list[int or float]) – The length of arrow heads. arrow_head_lengths can have the same length as the lines or be a single value, in which case all the arrows will have the same head length. Defaults to None.

  • arrow_shapes (str or list[str]) – The shapes of arrow heads. arrow_shapes can have the same length as the lines or be a single value, in which case all the arrows will have the same shape. Defaults to ‘full’.

  • overhangs (int or list[int]) – The overhangs of arrow heads. overhangs can have the same length as the lines or be a single value, in which case all the arrows will have the same overhang. Defaults to 0.

Return type

mmengine.visualization.visualizer.Visualizer
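The “same length as the lines or a single value” convention used by the style arguments above amounts to broadcasting a scalar to a per-line list. A minimal sketch of that rule (the helper name is hypothetical and not part of MMOCR):

```python
def broadcast_style(value, num_lines):
    """Expand a single style value to one entry per line, or validate a
    per-line list, mirroring how draw_arrows-style arguments are documented."""
    if isinstance(value, (list, tuple)):
        if len(value) != num_lines:
            raise ValueError(
                f'expected {num_lines} values, got {len(value)}')
        return list(value)
    return [value] * num_lines  # a single value applies to every line

print(broadcast_style('C1', 3))       # ['C1', 'C1', 'C1']
print(broadcast_style([1, 2, 3], 3))  # [1, 2, 3]
```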
