Merge Datasets
Consider the following task: there is a set of images (the original dataset) we want to annotate. Suppose we did this manually and/or automated it using models, and now we have few sets of annotations for the same images. We want to merge them and produce a single set of high-precision annotations.
Another use case: there are few datasets with different sets of images and labels, which we need to combine in a single dataset. If the labels were the same, we could just join the datasets. But in this case we need to merge labels and adjust the annotations in the resulting dataset.
In Datumaro, it can be done with the merge
command. This command merges 2
or more datasets and checks annotations for errors.
In simple cases, when dataset images do not intersect and new labels are not added, the recommended way of merging is using the
patch
command. It will offer better performance and provide the same results.
Datasets are merged by items, and item annotations are merged by finding the
unique ones across datasets. Annotations are matched between matching dataset
items by distance. Spatial annotations are compared by the applicable distance
measure (IoU, OKS, PDJ etc.), labels and annotation attributes are selected
by voting. Each set of matching annotations produces a single annotation in
the resulting dataset. The score
(a number in the range [0; 1]) attribute
indicates the agreement between different sources in the produced annotation.
The working time of the function can be estimated as
O( (summary dataset length) * (dataset count) ^ 2 * (item annotations) ^ 2 )
This command also allows to merge datasets with different, or partially overlapping sets of labels (which is impossible by simple joining).
During the process, some merge conflicts can appear. For example,
it can be mismatching dataset images having the same ids, label voting
can be unsuccessful if quorum is not reached (the --quorum
parameter),
bboxes may be too close (the -iou
parameter) etc. Found merge
conflicts, missing items or annotations, and other errors are saved into
an output .json
file.
In Datumaro, annotations can be grouped. It can be useful to represent
different parts of a single object - for example, it can be different parts
of a human body, parts of a vehicle etc. This command allows to check
annotation groups for completeness with the -g/--groups
option. If used,
this parameter must specify a list of labels for annotations that must be
in the same group. It can be particularly useful to check if separate
keypoints are grouped and all the necessary object components in the same
group.
This command has multiple forms:
1) datum merge <revpath>
2) datum merge <revpath> <revpath> ...
<revpath> - either a dataset path or a revision path.
1 - Merges the current project’s main target (“project”) in the working tree with the specified dataset.
2 - Merges the specified datasets. Note that the current project is not included in the list of merged sources automatically.
The command supports passing extra exporting options for the output
dataset. The format can be specified with the -f/--format
option.
Extra options should be passed after the main arguments
and after the --
separator. Particularly, this is useful to include
images in the output dataset with --save-images
.
Usage:
datum merge [-h] [-iou IOU_THRESH] [-oconf OUTPUT_CONF_THRESH]
[--quorum QUORUM] [-g GROUPS] [-o DST_DIR] [--overwrite]
[-p PROJECT_DIR] [-f FORMAT]
target [target ...] [-- EXTRA_FORMAT_ARGS]
Parameters:
<target>
(string) - Target dataset revpaths (repeatable)-iou
,--iou-thresh
(number) - IoU matching threshold for spatial annotations (both maximum inter-cluster and pairwise). Default is 0.25.--quorum
(number) - Minimum count of votes for a label or attribute to be counted. Default is 0.-g, --groups
(string) - A comma-separated list of label names in annotation groups to check. The?
postfix can be added to a label to make it optional in the group (repeatable)-oconf
,--output-conf-thresh
(number) - Confidence threshold for output annotations to be included in the resulting dataset. Default is 0.-o, --output-dir
(string) - Output directory. By default, a new directory is created in the current directory.--overwrite
- Allows to overwrite existing files in the output directory, when it is specified and is not empty.-f, --format
(string) - Output format. The default format isdatumaro
.-p, --project
(string) - Directory of the project to operate on (default: current directory).-h, --help
- Print the help message and exit.-- <extra format args>
- Additional arguments for the format writer (use-- -h
for help). Must be specified after the main command arguments.
Examples:
Merge 4 (partially-)intersecting projects,
- consider voting successful when there are no less than 3 same votes
- consider shapes intersecting when IoU >= 0.6
- check annotation groups to have
person
,hand
,head
andfoot
(?
is used for optional parts)
datum merge project1/ project2/ project3/ project4/ \
--quorum 3 \
-iou 0.6 \
--groups 'person,hand?,head,foot?'
Merge images and annotations from 2 datasets in COCO format:
datum merge dataset1/:image_dir dataset2/:coco dataset3/:coco
Check groups of the merged dataset for consistency:
look for groups consisting of person
, hand
head
, foot
datum merge project1/ project2/ -g 'person,hand?,head,foot?'
Merge two datasets, specify formats:
datum merge path/to/dataset1:voc path/to/dataset2:coco
Merge the current working tree and a dataset:
datum merge path/to/dataset2:coco
Merge a source from a previous revision and a dataset:
datum merge HEAD~2:source-2 path/to/dataset2:yolo
Merge datasets and save in different format:
datum merge -f voc dataset1/:yolo path2/:coco -- --save-images