Dataset Types¶
General Introduction¶
To support the tasks of text detection, text recognition and key information extraction, we have designed some new types of dataset which consist of loader and parser to load and parse different types of annotation files.
loader: Load the annotation file. There are two types of loader,
HardDiskLoader
andLmdbLoader
HardDiskLoader
: Loadtxt
format annotation file from hard disk to memory.LmdbLoader
: Loadlmdb
format annotation file with lmdb backend, which is very useful for extremely large annotation files to avoid out-of-memory problem when ten or more GPUs are used, since each GPU will start multiple processes to load annotation file to memory.
parser: Parse the annotation file line-by-line and return with
dict
format. There are two types of parser,LineStrParser
andLineJsonParser
.LineStrParser
: Parse one line in ann file while treating it as a string and separating it to several parts by aseparator
. It can be used on tasks with simple annotation files such as text recognition where each line of the annotation files contains thefilename
andlabel
attribute only.LineJsonParser
: Parse one line in ann file while treating it as a json-string and usingjson.loads
to convert it todict
. It can be used on tasks with complex annotation files such as text detection where each line of the annotation files contains multiple attributes (e.g.filename
,height
,width
,box
,segmentation
,iscrowd
,category_id
, etc.).
Here we show some examples of using different combination of loader
and parser
.
General Task¶
UniformConcatDataset¶
UniformConcatDataset
is a dataset wrapper which allows users to apply a universal pipeline on multiple datasets without specifying the pipeline for each of them.
For example, to apply train_pipeline
on both train1
and train2
,
data = dict(
...
train=dict(
type='UniformConcatDataset',
datasets=[train1, train2],
pipeline=train_pipeline))
Text Detection Task¶
TextDetDataset¶
Dataset with annotation file in line-json txt format
dataset_type = 'TextDetDataset'
img_prefix = 'tests/data/toy_dataset/imgs'
test_anno_file = 'tests/data/toy_dataset/instances_test.txt'
test = dict(
type=dataset_type,
img_prefix=img_prefix,
ann_file=test_anno_file,
loader=dict(
type='HardDiskLoader',
repeat=4,
parser=dict(
type='LineJsonParser',
keys=['file_name', 'height', 'width', 'annotations'])),
pipeline=test_pipeline,
test_mode=True)
The results are generated in the same way as the segmentation-based text recognition task above.
You can check the content of the annotation file in tests/data/toy_dataset/instances_test.txt
.
The combination of HardDiskLoader
and LineJsonParser
will return a dict for each file by calling __getitem__
:
{"file_name": "test/img_10.jpg", "height": 720, "width": 1280, "annotations": [{"iscrowd": 1, "category_id": 1, "bbox": [260.0, 138.0, 24.0, 20.0], "segmentation": [[261, 138, 284, 140, 279, 158, 260, 158]]}, {"iscrowd": 0, "category_id": 1, "bbox": [288.0, 138.0, 129.0, 23.0], "segmentation": [[288, 138, 417, 140, 416, 161, 290, 157]]}, {"iscrowd": 0, "category_id": 1, "bbox": [743.0, 145.0, 37.0, 18.0], "segmentation": [[743, 145, 779, 146, 780, 163, 746, 163]]}, {"iscrowd": 0, "category_id": 1, "bbox": [783.0, 129.0, 50.0, 26.0], "segmentation": [[783, 129, 831, 132, 833, 155, 785, 153]]}, {"iscrowd": 1, "category_id": 1, "bbox": [831.0, 133.0, 43.0, 23.0], "segmentation": [[831, 133, 870, 135, 874, 156, 835, 155]]}, {"iscrowd": 1, "category_id": 1, "bbox": [159.0, 204.0, 72.0, 15.0], "segmentation": [[159, 205, 230, 204, 231, 218, 159, 219]]}, {"iscrowd": 1, "category_id": 1, "bbox": [785.0, 158.0, 75.0, 21.0], "segmentation": [[785, 158, 856, 158, 860, 178, 787, 179]]}, {"iscrowd": 1, "category_id": 1, "bbox": [1011.0, 157.0, 68.0, 16.0], "segmentation": [[1011, 157, 1079, 160, 1076, 173, 1011, 170]]}]}
IcdarDataset¶
Dataset with annotation file in coco-like json format
For text detection, you can also use an annotation file in a COCO format that is defined in MMDetection:
dataset_type = 'IcdarDataset'
prefix = 'tests/data/toy_dataset/'
test=dict(
type=dataset_type,
ann_file=prefix + 'instances_test.json',
img_prefix=prefix + 'imgs',
pipeline=test_pipeline)
You can check the content of the annotation file in tests/data/toy_dataset/instances_test.json
.
注解
Icdar 2015/2017 and ctw1500 annotations need to be converted into the COCO format following the steps in datasets.md.
Text Recognition Task¶
OCRDataset¶
Dataset for encoder-decoder based recognizer
dataset_type = 'OCRDataset'
img_prefix = 'tests/data/ocr_toy_dataset/imgs'
train_anno_file = 'tests/data/ocr_toy_dataset/label.txt'
train = dict(
type=dataset_type,
img_prefix=img_prefix,
ann_file=train_anno_file,
loader=dict(
type='HardDiskLoader',
repeat=10,
parser=dict(
type='LineStrParser',
keys=['filename', 'text'],
keys_idx=[0, 1],
separator=' ')),
pipeline=train_pipeline,
test_mode=False)
You can check the content of the annotation file in tests/data/ocr_toy_dataset/label.txt
.
The combination of HardDiskLoader
and LineStrParser
will return a dict for each file by calling __getitem__
: {'filename': '1223731.jpg', 'text': 'GRAND'}
.
Optional Arguments:
repeat
: The number of repeated lines in the annotation files. For example, if there are10
lines in the annotation file, settingrepeat=10
will generate a corresponding annotation file with size100
.
If the annotation file is extremely large, you can convert it from txt format to lmdb format with the following command:
python tools/data_converter/txt2lmdb.py -i ann_file.txt -o ann_file.lmdb
After that, you can use LmdbLoader
in dataset like below.
img_prefix = 'tests/data/ocr_toy_dataset/imgs'
train_anno_file = 'tests/data/ocr_toy_dataset/label.lmdb'
train = dict(
type=dataset_type,
img_prefix=img_prefix,
ann_file=train_anno_file,
loader=dict(
type='LmdbLoader',
repeat=10,
parser=dict(
type='LineStrParser',
keys=['filename', 'text'],
keys_idx=[0, 1],
separator=' ')),
pipeline=train_pipeline,
test_mode=False)
OCRSegDataset¶
Dataset for segmentation-based recognizer
prefix = 'tests/data/ocr_char_ann_toy_dataset/'
train = dict(
type='OCRSegDataset',
img_prefix=prefix + 'imgs',
ann_file=prefix + 'instances_train.txt',
loader=dict(
type='HardDiskLoader',
repeat=10,
parser=dict(
type='LineJsonParser',
keys=['file_name', 'annotations', 'text'])),
pipeline=train_pipeline,
test_mode=True)
You can check the content of the annotation file in tests/data/ocr_char_ann_toy_dataset/instances_train.txt
.
The combination of HardDiskLoader
and LineJsonParser
will return a dict for each file by calling __getitem__
each time:
{"file_name": "resort_88_101_1.png", "annotations": [{"char_text": "F", "char_box": [11.0, 0.0, 22.0, 0.0, 12.0, 12.0, 0.0, 12.0]}, {"char_text": "r", "char_box": [23.0, 2.0, 31.0, 1.0, 24.0, 11.0, 16.0, 11.0]}, {"char_text": "o", "char_box": [33.0, 2.0, 43.0, 2.0, 36.0, 12.0, 25.0, 12.0]}, {"char_text": "m", "char_box": [46.0, 2.0, 61.0, 2.0, 53.0, 12.0, 39.0, 12.0]}, {"char_text": ":", "char_box": [61.0, 2.0, 69.0, 2.0, 63.0, 12.0, 55.0, 12.0]}], "text": "From:"}