Command reference
%%{init { 'theme':'neutral' }}%%
flowchart LR
d(("#0009; datum #0009;")):::mainclass
s(source):::nofillclass
m(model):::nofillclass
p(project):::nofillclass
d===s
s===id1[add]:::hideclass
s===id2[remove]:::hideclass
s===id3[info]:::hideclass
d===m
m===id4[add]:::hideclass
m===id5[remove]:::hideclass
m===id6[run]:::hideclass
m===id7[info]:::hideclass
d===p
p===migrate:::hideclass
p===info:::hideclass
d====str1[create]:::filloneclass
d====str2[import]:::filloneclass
d====str3[export]:::filloneclass
d====str4[add]:::filloneclass
d====str5[remove]:::filloneclass
d====str6[info]:::filloneclass
d====str7[transform]:::filltwoclass
d====str8[filter]:::filltwoclass
d====str9[diff]:::fillthreeclass
d====str10[merge]:::fillthreeclass
d====str11[patch]:::fillthreeclass
d====str12[validate]:::fillthreeclass
d====str13[explain]:::fillthreeclass
d====str14[stats]:::fillthreeclass
d====str15[commit]:::fillfourclass
d====str16[checkout]:::fillfourclass
d====str17[status]:::fillfourclass
d====str18[log]:::fillfourclass
classDef nofillclass fill-opacity:0;
classDef hideclass fill-opacity:0,stroke-opacity:0;
classDef filloneclass fill:#CCCCFF,stroke-opacity:0;
classDef filltwoclass fill:#FFFF99,stroke-opacity:0;
classDef fillthreeclass fill:#CCFFFF,stroke-opacity:0;
classDef fillfourclass fill:#CCFFCC,stroke-opacity:0;
The command line is split into separate commands and command contexts.
Contexts group multiple commands related to a specific topic, e.g.
project operations, data source operations etc. Almost all the commands
operate on projects, so the project
context and commands without a context
are mostly the same. By default, commands look for a project in the current
directory. If the project you’re working on is located somewhere else, you
can pass the -p/--project <path>
argument to the command.
Note: command behavior is subject to change, so this text might be
outdated. Always check the --help output of the specific command.
Note: command parameters must be passed prior to the positional arguments.
Datumaro functionality is available with the datum
command.
Usage:
datum [-h] [--version] [--loglevel LOGLEVEL] [command] [command args]
Parameters:
--loglevel
(string) - Logging level, one of debug, info, warning, error, critical
(default: info)
--version
- Print the version number and exit.
-h, --help
- Print the help message and exit.
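The ordering rule from the note above (parameters first, positional arguments next, extra format arguments after the -- separator) can be sketched in Python when commands are assembled from a script. The helper name build_datum_command and its arguments are hypothetical and not part of Datumaro; the sketch only builds an argument list and does not run anything.

```python
import shlex


def build_datum_command(command, options=None, positionals=(), extra_args=()):
    """Assemble an argv list: options first, then positionals, then '--' extras.

    Hypothetical helper for illustration; it is not part of Datumaro.
    """
    argv = ["datum", command]
    for flag, value in (options or {}).items():
        argv.append(flag)
        if value is not None:  # None marks a flag without a value, e.g. --overwrite
            argv.append(str(value))
    argv.extend(positionals)
    if extra_args:
        argv.append("--")  # everything after this goes to the format writer
        argv.extend(extra_args)
    return argv


argv = build_datum_command(
    "export",
    options={"-p": "test_project", "-o": "test_project-export", "-f": "voc"},
    extra_args=["--save-images", "--image-ext=.png"],
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues in filter expressions.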
1 - Convert datasets
This command converts a dataset from one format to another.
It is a convenience alias for create, add and export, and provides
a simpler way to obtain the same results in simple cases. A list of supported
formats can be found in the --help output of this command.
Usage:
datum convert [-h] [-i SOURCE] [-if INPUT_FORMAT] -f OUTPUT_FORMAT
[-o DST_DIR] [--overwrite] [-e FILTER] [--filter-mode FILTER_MODE]
[-- EXTRA_EXPORT_ARGS]
Parameters:
-i, --input-path
(string) - Input dataset path. The current directory is
used by default.
-if, --input-format
(string) - Input dataset format. Will try to detect,
if not specified.
-f, --output-format
(string) - Output format
-o, --output-dir
(string) - Output directory. By default, a subdirectory
in the current directory is used.
--overwrite
- Allows overwriting existing files in the output directory,
when it is not empty.
-e, --filter
(string) - XML XPath filter expression for dataset items
--filter-mode
(string) - The filtering mode. Default is the i
mode.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
-- <extra export args>
- Additional arguments for the format writer
(use -- -h
for help). Must be specified after the main command arguments.
Example: convert a VOC-like dataset to a COCO-like one:
datum convert --input-format voc --input-path <path/to/voc/> \
--output-format coco \
-- --save-images
2 - Create project
The command creates an empty project. A project is required for most of
Datumaro's functionality.
By default, the project is created in the current directory. To specify
another output directory, pass the -o/--output-dir parameter. If the output
directory already contains a Datumaro project, an error is raised, unless
--overwrite is used.
Usage:
datum create [-h] [-o DST_DIR] [--overwrite]
Parameters:
-o, --output-dir
(string) - Output directory.
The current directory is used by default.
--overwrite
- Allows overwriting existing project files in the output
directory. Any other files are not touched.
-h, --help
- Print the help message and exit.
Examples:
Example: create an empty project in the my_dataset
directory
datum create -o my_dataset/
Example: create a new empty project in the current directory, remove the
existing one
datum create
...
datum create --overwrite
3 - Export Datasets
This command exports a project or a source as a dataset in some format.
Check supported formats for more info
about format specifications, supported options and other details.
The list of formats can be extended by custom plugins, check extending tips
for information on this topic.
Available formats are listed in the command help output.
Dataset format writers support additional export options. To pass
such options, use the --
separator after the main command arguments.
The usage information can be printed with datum export -f <format> -- --help.
Common export options:
- Most formats (where applicable) support the --save-images option, which
exports dataset images along with annotations. The option is
disabled by default.
- If --save-images is used, the --image-ext option can be passed to
specify the output image file extension (.jpg, .png etc.). By default,
Datumaro tries to keep the original image extension. This option
allows converting all the images from one format into another.
This command allows using the -e/--filter parameter to select dataset
elements needed for exporting. Read the filter command
description for more info about this functionality.
The command can only be applied to a project build target, a stage
or the combined project target, in which case all the targets will
be affected.
Usage:
datum export [-h] [-e FILTER] [--filter-mode FILTER_MODE] [-o DST_DIR]
[--overwrite] [-p PROJECT_DIR] -f FORMAT [target] [-- EXTRA_FORMAT_ARGS]
Parameters:
<target>
(string) - A project build target to be exported.
By default, all project targets are affected.
-f, --format
(string) - Output format.
-e, --filter
(string) - XML XPath filter expression for dataset items
--filter-mode
(string) - The filtering mode. Default is the i
mode.
-o, --output-dir
(string) - Output directory. By default, a subdirectory
in the current directory is used.
--overwrite
- Allows overwriting existing files in the output directory,
when it is not empty.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
-- <extra format args>
- Additional arguments for the format writer
(use -- -h
for help). Must be specified after the main command arguments.
Example: save a project as a VOC-like dataset, include images, and convert
images from other formats to PNG.
datum export \
-p test_project \
-o test_project-export \
-f voc \
-- --save-images --image-ext='.png'
4 - Filter datasets
This command extracts a sub-dataset from a dataset. The new dataset
includes only items satisfying some condition. XML XPath
is used as the query format.
The command can be applied to a dataset or a project build target,
a stage or the combined project
target, in which case all the project
targets will be affected. A build tree stage will be recorded
if --stage
is enabled, and the resulting dataset(-s) will be
saved if --apply
is enabled.
By default, datasets are updated in-place. The -o/--output-dir
option can be used to specify another output directory. When
updating in-place, use the --overwrite
parameter (in-place
updates fail by default to prevent data loss), unless a project
target is modified.
The current project (-p/--project
) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project’s working tree is used.
There are several filtering modes available (the -m/--mode
parameter).
Supported modes:
i
, items
a
, annotations
i+a
, a+i
, items+annotations
, annotations+items
When filtering annotations, use the items+annotations
mode to indicate that annotation-less dataset items should be
removed; otherwise they will be kept in the resulting dataset.
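The difference between the annotations and items+annotations modes can be illustrated with a small Python sketch. It works on plain dictionaries rather than Datumaro datasets, and the function filter_dataset and the item layout are assumptions made for this example only.

```python
ANNOTATION_MODES = ("a", "annotations")
COMBINED_MODES = ("i+a", "a+i", "items+annotations", "annotations+items")


def filter_dataset(items, keep_annotation, mode):
    """Filter annotations in plain-dict items (illustrative, not Datumaro code).

    items: list of {"id": str, "annotations": [dict, ...]}
    keep_annotation: predicate called for each annotation dict
    """
    if mode not in ANNOTATION_MODES + COMBINED_MODES:
        raise ValueError(f"unsupported mode for this sketch: {mode}")
    remove_empty = mode in COMBINED_MODES
    result = []
    for item in items:
        kept = [a for a in item["annotations"] if keep_annotation(a)]
        if remove_empty and not kept:
            continue  # combined mode drops items left without annotations
        result.append({"id": item["id"], "annotations": kept})
    return result


items = [
    {"id": "img1", "annotations": [{"occluded": False}, {"occluded": True}]},
    {"id": "img2", "annotations": [{"occluded": True}]},
]


def not_occluded(annotation):
    return not annotation["occluded"]


print([i["id"] for i in filter_dataset(items, not_occluded, "a")])
print([i["id"] for i in filter_dataset(items, not_occluded, "i+a")])
```

In the annotations mode both items remain (the second one with an empty annotation list), while the combined mode keeps only the first item.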
To select an annotation, write an XPath that returns annotation
elements (see examples).
Item representations can be printed with the --dry-run
parameter:
<item>
<id>290768</id>
<subset>minival2014</subset>
<image>
<width>612</width>
<height>612</height>
<depth>3</depth>
</image>
<annotation>
<id>80154</id>
<type>bbox</type>
<label_id>39</label_id>
<x>264.59</x>
<y>150.25</y>
<w>11.19</w>
<h>42.31</h>
<area>473.87</area>
</annotation>
<annotation>
<id>669839</id>
<type>bbox</type>
<label_id>41</label_id>
<x>163.58</x>
<y>191.75</y>
<w>76.98</w>
<h>73.63</h>
<area>5668.77</area>
</annotation>
...
</item>
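To experiment with a filter condition before running the command, the printed item representation can be inspected with the Python standard library. This is only a sketch: Datumaro evaluates the XPath expression itself, while xml.etree.ElementTree supports a limited XPath subset, so the numeric comparison is done in Python here. The function name large_annotation_ids is hypothetical, and the XML is a shortened copy of the item shown above.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the item representation shown above.
ITEM_XML = """
<item>
  <id>290768</id>
  <subset>minival2014</subset>
  <image><width>612</width><height>612</height><depth>3</depth></image>
  <annotation>
    <id>80154</id><type>bbox</type><label_id>39</label_id><area>473.87</area>
  </annotation>
  <annotation>
    <id>669839</id><type>bbox</type><label_id>41</label_id><area>5668.77</area>
  </annotation>
</item>
"""


def large_annotation_ids(item_xml, min_area):
    """Return ids of annotations whose <area> exceeds min_area."""
    item = ET.fromstring(item_xml.strip())
    return [
        ann.findtext("id")
        for ann in item.findall("annotation")
        if float(ann.findtext("area")) > min_area
    ]


print(large_annotation_ids(ITEM_XML, 1000.0))
```

This roughly corresponds to an expression of the form /item/annotation[area > 1000] used with the annotations mode.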
The command can only be applied to a project build target, a stage or the
combined project
target, in which case all the targets will be affected.
A build tree stage will be added if --stage
is enabled, and the resulting
dataset(-s) will be saved if --apply
is enabled.
Usage:
datum filter [-h] [-e FILTER] [-m MODE] [--dry-run] [--stage STAGE]
[--apply APPLY] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR] [target]
Parameters:
<target>
(string) - Target
dataset revpath.
By default, filters all targets of the current project.
-e, --filter
(string) - XML XPath filter expression for dataset items
-m, --mode
(string) - The filtering mode. Default is the i
mode.
--dry-run
- Print XML representations of the filtered dataset and exit.
--stage
(bool) - Include this action as a project build step.
If true, this operation will be saved in the project
build tree, allowing to reproduce the resulting dataset later.
Applicable only to main project targets (i.e. data sources
and the project
target, but not intermediate stages). Enabled by default.
--apply
(bool) - Run this command immediately. If disabled, only the
build tree stage will be written. Enabled by default.
-o, --output-dir
(string) - Output directory. Can be omitted for
main project targets (i.e. data sources and the project
target, but not
intermediate stages) and dataset targets. If not specified, the results
will be saved in place.
--overwrite
- Allows to overwrite existing files in the output directory,
when it is specified and is not empty.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example: extract a dataset containing only images whose width is less than
their height
datum filter \
-p test_project \
-e '/item[image/width < image/height]'
Example: extract a dataset with images of the train
subset
datum filter \
-p test_project \
-e '/item[subset="train"]'
Example: extract a dataset with only large annotations of the cat
class and
any non-persons
datum filter \
-p test_project \
--mode annotations \
-e '/item/annotation[(label="cat" and area > 99.5) or label!="person"]'
Example: extract a dataset with non-occluded annotations, remove empty images.
Use data only from the “s1” source of the project.
datum create
datum import --format voc -i <path/to/dataset1/> --name s1
datum import --format voc -i <path/to/dataset2/> --name s2
datum filter s1 \
-m i+a -e '/item/annotation[occluded="False"]'
5 - Merge Datasets
Consider the following task: there is a set of images (the original dataset)
we want to annotate. Suppose we did this manually and/or automated it
using models, and now we have a few sets of annotations for the same images.
We want to merge them and produce a single set of high-precision annotations.
Another use case: there are a few datasets with different sets of images
and labels, which we need to combine into a single dataset. If the labels
were the same, we could just join the datasets. But in this case we need
to merge labels and adjust the annotations in the resulting dataset.
In Datumaro, it can be done with the merge
command. This command merges 2
or more datasets and checks annotations for errors.
In simple cases, when dataset images do not intersect and new
labels are not added, the recommended way of merging is using
the patch
command.
It will offer better performance and provide the same results.
Datasets are merged by items, and item annotations are merged by finding the
unique ones across datasets. Annotations are matched between matching dataset
items by distance. Spatial annotations are compared by the applicable distance
measure (IoU, OKS, PDJ etc.), labels and annotation attributes are selected
by voting. Each set of matching annotations produces a single annotation in
the resulting dataset. The score
(a number in the range [0; 1]) attribute
indicates the agreement between different sources in the produced annotation.
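The matching and voting steps described above can be sketched in Python for the simplest case of bounding boxes. This is not Datumaro's implementation: the (x, y, w, h) box layout follows the item representation used elsewhere in this reference, and computing the score as the fraction of sources agreeing with the winning label is an assumption made for illustration.

```python
from collections import Counter


def bbox_iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def vote_label(labels, quorum=0):
    """Return (winning label or None, agreement score in [0, 1]).

    The score definition (share of votes for the winner) is an assumption.
    """
    if not labels:
        return None, 0.0
    label, votes = Counter(labels).most_common(1)[0]
    score = votes / len(labels)
    if votes < quorum:
        return None, score  # voting failed: the quorum is not reached
    return label, score


# Two boxes from different sources overlap enough for a 0.25 threshold
print(bbox_iou((0, 0, 10, 10), (5, 0, 10, 10)) >= 0.25)
print(vote_label(["cat", "cat", "dog"], quorum=2))
```

A failed vote (None) corresponds to the kind of merge conflict that the command reports in its output file.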
The working time of the function can be estimated as
O( (summary dataset length) * (dataset count) ^ 2 * (item annotations) ^ 2 )
This command can also merge datasets with different or partially
overlapping sets of labels (which is impossible by simple joining).
Merge conflicts can appear during the process. For example,
dataset images with the same ids may not match, label voting
can fail if the quorum is not reached (the --quorum parameter),
bboxes may be too close (the -iou parameter) etc. Merge
conflicts, missing items or annotations, and other errors found are saved
into an output .json file.
In Datumaro, annotations can be grouped. This is useful for representing
different parts of a single object, for example, different parts
of a human body or parts of a vehicle. This command can check
annotation groups for completeness with the -g/--groups option. If used,
this parameter must specify a list of labels for annotations that must be
in the same group. It is particularly useful for checking that separate
keypoints are grouped and that all the necessary object components are in
the same group.
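The group specification format (a comma-separated list of labels, with a ? postfix marking optional ones) can be sketched in Python as follows. The functions parse_group_spec and check_group are hypothetical and only illustrate what a completeness check means; they do not reproduce Datumaro's checker.

```python
def parse_group_spec(spec):
    """Split 'person,hand?,head,foot?' into required and optional label sets."""
    required, optional = set(), set()
    for name in spec.split(","):
        name = name.strip()
        if not name:
            continue
        if name.endswith("?"):
            optional.add(name[:-1])  # the '?' postfix marks an optional label
        else:
            required.add(name)
    return required, optional


def check_group(labels, spec):
    """Return (missing required labels, labels not mentioned in the spec)."""
    required, optional = parse_group_spec(spec)
    present = set(labels)
    missing = sorted(required - present)
    unexpected = sorted(present - required - optional)
    return missing, unexpected


spec = "person,hand?,head,foot?"
print(check_group(["person", "head", "hand"], spec))
print(check_group(["person", "foot", "car"], spec))
```

The first group is complete because only optional parts are absent; the second one lacks the required head label and contains a label outside the specification.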
This command has multiple forms:
1) datum merge <revpath>
2) datum merge <revpath> <revpath> ...
<revpath> - either a dataset path or a revision path.
1 - Merges the current project’s main target (“project”)
in the working tree with the specified dataset.
2 - Merges the specified datasets.
Note that the current project is not included in the list of merged
sources automatically.
The command supports passing extra exporting options for the output
dataset. The format can be specified with the -f/--format
option.
Extra options should be passed after the main arguments
and after the --
separator. Particularly, this is useful to include
images in the output dataset with --save-images
.
Usage:
datum merge [-h] [-iou IOU_THRESH] [-oconf OUTPUT_CONF_THRESH]
[--quorum QUORUM] [-g GROUPS] [-o DST_DIR] [--overwrite]
[-p PROJECT_DIR] [-f FORMAT]
target [target ...] [-- EXTRA_FORMAT_ARGS]
Parameters:
<target>
(string) - Target dataset revpaths (repeatable)
-iou
, --iou-thresh
(number) - IoU matching threshold for spatial
annotations (both maximum inter-cluster and pairwise). Default is 0.25.
--quorum
(number) - Minimum count of votes for a label or attribute
to be counted. Default is 0.
-g, --groups
(string) - A comma-separated list of label names in
annotation groups to check. The ?
postfix can be added to a label to
make it optional in the group (repeatable)
-oconf
, --output-conf-thresh
(number) - Confidence threshold for output
annotations to be included in the resulting dataset. Default is 0.
-o, --output-dir
(string) - Output directory. By default, a new directory
is created in the current directory.
--overwrite
- Allows to overwrite existing files in the output directory,
when it is specified and is not empty.
-f, --format
(string) - Output format. The default format is datumaro
.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
-- <extra format args>
- Additional arguments for the format writer
(use -- -h
for help). Must be specified after the main command arguments.
Examples:
Merge 4 (partially-)intersecting projects,
- consider voting successful when there are at least 3 identical votes
- consider shapes intersecting when IoU >= 0.6
- check annotation groups to have person, hand, head and foot
(? is used for optional parts)
datum merge project1/ project2/ project3/ project4/ \
--quorum 3 \
-iou 0.6 \
--groups 'person,hand?,head,foot?'
Merge images and annotations from 2 datasets in COCO format:
datum merge dataset1/:image_dir dataset2/:coco dataset3/:coco
Check groups of the merged dataset for consistency:
look for groups consisting of person, hand, head, foot
datum merge project1/ project2/ -g 'person,hand?,head,foot?'
Merge two datasets, specify formats:
datum merge path/to/dataset1:voc path/to/dataset2:coco
Merge the current working tree and a dataset:
datum merge path/to/dataset2:coco
Merge a source from a previous revision and a dataset:
datum merge HEAD~2:source-2 path/to/dataset2:yolo
Merge datasets and save in different format:
datum merge -f voc dataset1/:yolo path2/:coco -- --save-images
6 - Patch Datasets
Updates items of the first dataset with items from the second one.
By default, datasets are updated in-place. The -o/--output-dir
option can be used to specify another output directory. When
updating in-place, use the --overwrite
parameter along with the
--save-images
export option (in-place updates fail by default
to prevent data loss).
Unlike the regular project data source joining,
the datasets are not required to have the same labels. The labels from
the “patch” dataset are projected onto the labels of the patched dataset,
so only the annotations with the matching labels are used, i.e.
all the annotations having unknown labels are ignored. Currently,
this command does not allow updating the label information in the
patched dataset.
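The label projection described above can be pictured with a short Python sketch. It assumes annotations are plain dictionaries with a label_id index into an ordered label list; the function project_annotations is hypothetical and is not Datumaro's implementation.

```python
def project_annotations(patch_annotations, patch_labels, target_labels):
    """Re-index patch annotations to target label ids; drop unknown labels.

    Illustrative only: label lists are ordered names, label_id is an index.
    """
    target_index = {name: i for i, name in enumerate(target_labels)}
    projected = []
    for ann in patch_annotations:
        name = patch_labels[ann["label_id"]]
        if name not in target_index:
            continue  # labels unknown to the patched dataset are ignored
        projected.append({**ann, "label_id": target_index[name]})
    return projected


target_labels = ["cat", "dog"]
patch_labels = ["dog", "bird", "cat"]
patch_annotations = [{"label_id": 0}, {"label_id": 1}, {"label_id": 2}]
print(project_annotations(patch_annotations, patch_labels, target_labels))
```

The bird annotation disappears, and the remaining ones use the label indices of the patched dataset, whose label list stays unchanged.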
The command supports passing extra exporting options for the output
dataset. The extra options should be passed after the main arguments
and after the --
separator. Particularly, this is useful to include
images in the output dataset with --save-images
.
This command can be applied to the current project targets or
arbitrary datasets outside a project. Note that if the target dataset
is read-only (e.g. if it is a project, stage or a cache entry),
the output directory must be provided.
Usage:
datum patch [-h] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR]
target patch
[-- EXPORT_ARGS]
<revpath> - either a dataset path or a revision path.
The current project (-p/--project
) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project’s working tree is used.
Parameters:
<target dataset>
(string) - Target dataset revpath
<patch dataset>
(string) - Patch dataset revpath
-o, --output-dir
(string) - Output directory. By default, saves in-place
--overwrite
- Allows to overwrite existing files in the output directory,
when it is not empty.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
-- <export args>
- Additional arguments for the format writer
(use -- -h
for help). Must be specified after the main command arguments.
Examples:
- Update a VOC-like dataset with COCO-like annotations:
datum patch --overwrite dataset1/:voc dataset2/:coco -- --save-images
- Generate a patched dataset, based on a project:
datum patch -o patched_proj1/ proj1/ proj2/
- Update the “source1” source in the current project with a dataset:
datum patch -p proj/ --overwrite source1 path/to/dataset2:coco
- Generate a patched source from a previous revision and a dataset:
datum patch -o new_src2/ HEAD~2:source-2 path/to/dataset2:yolo
- Update a dataset in a custom format, described in a project plugin:
datum patch -p proj/ --overwrite dataset/:my_format dataset2/:coco
7 - Compare datasets
The command compares two datasets and saves the results in the
specified directory. The current project is considered to be
“ground truth”.
Datasets can be compared using different methods:
equality
- Annotations are compared to be equal
distance
- A distance metric is used
This command has multiple forms:
1) datum diff <revpath>
2) datum diff <revpath> <revpath>
1 - Compares the current project’s main target (project
)
in the working tree with the specified dataset.
2 - Compares two specified datasets.
<revpath> - a dataset path or a revision path.
Usage:
datum diff [-h] [-o DST_DIR] [-m METHOD] [--overwrite] [-p PROJECT_DIR]
[--iou-thresh IOU_THRESH] [-f FORMAT]
[-iia IGNORE_ITEM_ATTR] [-ia IGNORE_ATTR] [-if IGNORE_FIELD]
[--match-images] [--all]
first_target [second_target]
Parameters:
<target>
(string) - Target dataset revpaths
-m, --method
(string) - Comparison method.
-o, --output-dir
(string) - Output directory. By default, a new directory
is created in the current directory.
--overwrite
- Allows overwriting existing files in the output directory,
when it is specified and is not empty.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Distance comparison options:
--iou-thresh
(number) - The IoU threshold for spatial annotations
(default is 0.5).
-f, --format
(string) - Output format, one of simple
(text files and images) and tensorboard
(a TB log directory)
Equality comparison options:
-iia, --ignore-item-attr
(string) - Ignore an item attribute (repeatable)
-ia, --ignore-attr
(string) - Ignore an annotation attribute (repeatable)
-if, --ignore-field
(string) - Ignore an annotation field (repeatable).
Default is id and group
--match-images
- Match dataset items by image pixels instead of ids
--all
- Include matches in the output. By default, only differences are
printed.
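The effect of the equality comparison options can be sketched in Python with annotations represented as plain dictionaries. The function annotations_equal and the dictionary layout are assumptions for this example; the defaults mirror the documented default ignored fields (id and group).

```python
DEFAULT_IGNORED_FIELDS = ("id", "group")


def annotations_equal(a, b, ignore_fields=DEFAULT_IGNORED_FIELDS, ignore_attrs=()):
    """Compare two annotation dicts, skipping ignored fields and attributes.

    Illustrative sketch; annotations are plain dicts with an optional
    'attributes' mapping.
    """
    def normalized(ann):
        fields = {
            k: v for k, v in ann.items()
            if k not in ignore_fields and k != "attributes"
        }
        attrs = {
            k: v for k, v in ann.get("attributes", {}).items()
            if k not in ignore_attrs
        }
        return fields, attrs

    return normalized(a) == normalized(b)


first = {"id": 1, "group": 0, "type": "bbox", "label": "cat",
         "attributes": {"is_crowd": False}}
second = {"id": 7, "group": 3, "type": "bbox", "label": "cat",
          "attributes": {"is_crowd": True}}

print(annotations_equal(first, second))
print(annotations_equal(first, second, ignore_attrs=("is_crowd",)))
```

The second call corresponds to passing -ia is_crowd on the command line: the annotations differ only in ignored parts, so they compare as equal.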
Examples:
- Compare two projects by distance, match boxes if IoU > 0.7,
save results to TensorBoard:
datum diff other/project -o diff/ -f tensorboard --iou-thresh 0.7
- Compare two projects for equality, exclude annotation groups
and the is_crowd attribute from comparison:
datum diff other/project/ -if group -ia is_crowd
- Compare two datasets, specify formats:
datum diff path/to/dataset1:voc path/to/dataset2:coco
- Compare the current working tree and a dataset:
datum diff path/to/dataset2:coco
- Compare a source from a previous revision and a dataset:
datum diff HEAD~2:source-2 path/to/dataset2:yolo
- Compare a dataset with model inference:
datum create
datum import <...>
datum model add mymodel <...>
datum transform <...> -o inference
datum diff inference -o diff
8 - Print dataset info
This command outputs high-level dataset information such as sample count,
categories and subsets.
Usage:
datum info [-h] [--all] [-p PROJECT_DIR] [revpath]
Parameters:
<target>
(string) - Target dataset revpath.
By default, prints info about the joined project
dataset.
--all
- Print all the information: do not fold long lists of labels etc.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Examples:
Sample output:
length: 5000
categories: label
label:
count: 80
labels: person, bicycle, car, motorcycle (and 76 more)
subsets: minival2014
'minival2014':
length: 5000
categories: label
label:
count: 80
labels: person, bicycle, car, motorcycle (and 76 more)
9 - Get Project Statistics
This command computes various project statistics, such as:
- image mean and std. dev.
- class and attribute balance
- mask pixel balance
- segment area distribution
Usage:
datum stats [-h] [-p PROJECT_DIR] [target]
Parameters:
<target>
(string) - Target
source revpath.
By default, computes statistics of the merged dataset.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example:
datum stats -p test_project
Sample output:
{
"annotations": {
"labels": {
"attributes": {
"gender": {
"count": 358,
"distribution": {
"female": [
149,
0.41620111731843573
],
"male": [
209,
0.5837988826815642
]
},
"values count": 2,
"values present": [
"female",
"male"
]
},
"view": {
"count": 340,
"distribution": {
"__undefined__": [
4,
0.011764705882352941
],
"front": [
54,
0.1588235294117647
],
"left": [
14,
0.041176470588235294
],
"rear": [
235,
0.6911764705882353
],
"right": [
33,
0.09705882352941177
]
},
"values count": 5,
"values present": [
"__undefined__",
"front",
"left",
"rear",
"right"
]
}
},
"count": 2038,
"distribution": {
"car": [
340,
0.16683022571148184
],
"cyclist": [
194,
0.09519136408243375
],
"head": [
354,
0.17369970559371933
],
"ignore": [
100,
0.04906771344455348
],
"left_hand": [
238,
0.11678115799803729
],
"person": [
358,
0.17566241413150147
],
"right_hand": [
77,
0.037782139352306184
],
"road_arrows": [
326,
0.15996074582924436
],
"traffic_sign": [
51,
0.025024533856722278
]
}
},
"segments": {
"area distribution": [
{
"count": 1318,
"max": 11425.1,
"min": 0.0,
"percent": 0.9627465303140978
},
{
"count": 1,
"max": 22850.2,
"min": 11425.1,
"percent": 0.0007304601899196494
},
{
"count": 0,
"max": 34275.3,
"min": 22850.2,
"percent": 0.0
},
{
"count": 0,
"max": 45700.4,
"min": 34275.3,
"percent": 0.0
},
{
"count": 0,
"max": 57125.5,
"min": 45700.4,
"percent": 0.0
},
{
"count": 0,
"max": 68550.6,
"min": 57125.5,
"percent": 0.0
},
{
"count": 0,
"max": 79975.7,
"min": 68550.6,
"percent": 0.0
},
{
"count": 0,
"max": 91400.8,
"min": 79975.7,
"percent": 0.0
},
{
"count": 0,
"max": 102825.90000000001,
"min": 91400.8,
"percent": 0.0
},
{
"count": 50,
"max": 114251.0,
"min": 102825.90000000001,
"percent": 0.036523009495982466
}
],
"avg. area": 5411.624543462382,
"pixel distribution": {
"car": [
13655,
0.0018431496518735067
],
"cyclist": [
939005,
0.12674674030446592
],
"head": [
0,
0.0
],
"ignore": [
5501200,
0.7425510702956085
],
"left_hand": [
0,
0.0
],
"person": [
954654,
0.12885903974805205
],
"right_hand": [
0,
0.0
],
"road_arrows": [
0,
0.0
],
"traffic_sign": [
0,
0.0
]
}
}
},
"annotations by type": {
"bbox": {
"count": 548
},
"caption": {
"count": 0
},
"label": {
"count": 0
},
"mask": {
"count": 0
},
"points": {
"count": 669
},
"polygon": {
"count": 821
},
"polyline": {
"count": 0
}
},
"annotations count": 2038,
"dataset": {
"image mean": [
107.06903686941979,
79.12831698580979,
52.95829558185416
],
"image std": [
49.40237673503467,
43.29600731496902,
35.47373007603151
],
"images count": 100
},
"images count": 100,
"subsets": {},
"unannotated images": [
"img00051",
"img00052",
"img00053",
"img00054",
"img00055"
],
"unannotated images count": 5,
"unique images count": 97,
"repeating images count": 3,
"repeating images": [
[["img00057", "default"], ["img00058", "default"]],
[["img00059", "default"], ["img00060", "default"]],
[["img00061", "default"], ["img00062", "default"]]
]
}
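In the sample output, each "distribution" entry appears to be a pair of the absolute count and its share of the total (for example, 149 of 358 for "female"). A minimal Python sketch of such a computation, under that reading of the output, is given below; label_distribution is a hypothetical name and not a Datumaro function.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to [count, fraction of all labels]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        name: [count, count / total]
        for name, count in sorted(counts.items())
    }


# Reproduce the "gender" attribute numbers from the sample output
values = ["female"] * 149 + ["male"] * 209
distribution = label_distribution(values)
print(distribution["female"])
print(distribution["male"])
```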
10 - Validate Dataset
This command inspects annotations with respect to the task type
and stores the results in a JSON file.
The supported task types are classification, detection, and
segmentation (the -t/--task-type parameter).
The validation result contains:
- annotation statistics based on the task type
- validation reports, such as
  - items not having annotations
  - items having undefined annotations
  - imbalanced distribution in class/attributes
  - too small or large values
- summary
Usage:
datum validate [-h] -t TASK [-s SUBSET_NAME] [-p PROJECT_DIR]
[target] [-- EXTRA_ARGS]
Parameters:
<target>
(string) - Target
dataset revpath.
By default, validates the current project.
-t, --task-type
(string) - Task type for validation
-s, --subset
(string) - Dataset subset to be validated
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
<extra args>
- The list of extra validation parameters. Should be passed
after the --
separator after the main command arguments:
-fs, --few-samples-thr
(number) - The threshold for giving a warning
for minimum number of samples per class
-ir, --imbalance-ratio-thr
(number) - The threshold for giving
imbalance data warning
-m, --far-from-mean-thr
(number) - The threshold for giving
a warning that data is far from mean
-dr, --dominance-ratio-thr
(number) - The threshold for giving
a warning about bounding box imbalance
-k, --topk-bins
(number) - The ratio of bins with the highest
number of data to total bins in the histogram
Example: give a warning when the imbalance ratio of the data exceeds 40
for the classification task
datum validate -p prj/ -t classification -- -ir 40
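To get a feeling for what such a threshold means, the check can be sketched in Python. The sketch assumes that the imbalance ratio is the sample count of the most frequent label divided by that of the least frequent one; this definition is an assumption made for illustration, so check the validator for the exact rule. Both function names are hypothetical.

```python
def imbalance_ratio(label_counts):
    """Ratio of the largest to the smallest non-zero label count (assumed definition)."""
    counts = [c for c in label_counts.values() if c > 0]
    if not counts:
        return 0.0
    return max(counts) / min(counts)


def is_imbalanced(label_counts, threshold):
    """True when the ratio is over the threshold, as with '-ir 40'."""
    return imbalance_ratio(label_counts) > threshold


print(is_imbalanced({"cat": 400, "dog": 8}, threshold=40))
print(is_imbalanced({"cat": 100, "dog": 50}, threshold=40))
```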
Here is the list of validation items (a.k.a. anomaly types).
| Anomaly Type | Description | Task Type |
| --- | --- | --- |
| MissingLabelCategories | Metadata (e.g. LabelCategories) should be defined | common |
| MissingAnnotation | No annotation found for an item | common |
| MissingAttribute | An attribute key is missing for an item | common |
| MultiLabelAnnotations | Item needs a single label | classification |
| UndefinedLabel | A label not defined in the metadata is found for an item | common |
| UndefinedAttribute | An attribute not defined in the metadata is found for an item | common |
| LabelDefinedButNotFound | A label is defined, but not actually found | common |
| AttributeDefinedButNotFound | An attribute is defined, but not actually found | common |
| OnlyOneLabel | The dataset consists of only one label | common |
| OnlyOneAttributeValue | The dataset consists of only one attribute value | common |
| FewSamplesInLabel | The number of samples in a label might be too low | common |
| FewSamplesInAttribute | The number of samples in an attribute might be too low | common |
| ImbalancedLabels | There is an imbalance in the label distribution | common |
| ImbalancedAttribute | There is an imbalance in the attribute distribution | common |
| ImbalancedDistInLabel | Values (e.g. bbox width) are not evenly distributed for a label | detection, segmentation |
| ImbalancedDistInAttribute | Values (e.g. bbox width) are not evenly distributed for an attribute | detection, segmentation |
| NegativeLength | The width or height of a bounding box is negative | detection |
| InvalidValue | There is an invalid (e.g. inf, nan) value in the bounding box info | detection |
| FarFromLabelMean | An annotation value is much smaller or larger than the average for a label | detection, segmentation |
| FarFromAttrMean | An annotation value is much smaller or larger than the average for an attribute | detection, segmentation |
Validation Result Format:
{
'statistics': {
## common statistics
'label_distribution': {
'defined_labels': <dict>, # <label:str>: <count:int>
'undefined_labels': <dict>
# <label:str>: {
# 'count': <int>,
# 'items_with_undefined_label': [<item_key>, ]
# }
},
'attribute_distribution': {
'defined_attributes': <dict>,
# <label:str>: {
# <attribute:str>: {
# 'distribution': {<attr_value:str>: <count:int>, },
# 'items_missing_attribute': [<item_key>, ]
# }
# }
'undefined_attributes': <dict>
# <label:str>: {
# <attribute:str>: {
# 'distribution': {<attr_value:str>: <count:int>, },
# 'items_with_undefined_attr': [<item_key>, ]
# }
# }
},
'total_ann_count': <int>,
'items_missing_annotation': <list>, # [<item_key>, ]
## statistics for classification task
'items_with_multiple_labels': <list>, # [<item_key>, ]
## statistics for detection task
'items_with_invalid_value': <dict>,
# '<item_key>': {<ann_id:int>: [ <property:str>, ], }
# - properties: 'x', 'y', 'width', 'height',
# 'area(wxh)', 'ratio(w/h)', 'short', 'long'
# - 'short' is min(w,h) and 'long' is max(w,h).
'items_with_negative_length': <dict>,
# '<item_key>': { <ann_id:int>: { <'width'|'height'>: <value>, }, }
'bbox_distribution_in_label': <dict>, # <label:str>: <bbox_template>
'bbox_distribution_in_attribute': <dict>,
# <label:str>: {<attribute:str>: { <attr_value>: <bbox_template>, }, }
'bbox_distribution_in_dataset_item': <dict>,
# '<item_key>': <bbox count:int>
## statistics for segmentation task
'items_with_invalid_value': <dict>,
# '<item_key>': {<ann_id:int>: [ <property:str>, ], }
# - properties: 'area', 'width', 'height'
'mask_distribution_in_label': <dict>, # <label:str>: <mask_template>
'mask_distribution_in_attribute': <dict>,
# <label:str>: {
# <attribute:str>: { <attr_value>: <mask_template>, }
# }
'mask_distribution_in_dataset_item': <dict>,
# '<item_key>': <mask/polygon count: int>
},
'validation_reports': <list>, # [ <validation_error_format>, ]
# validation_error_format = {
# 'anomaly_type': <str>,
# 'description': <str>,
# 'severity': <str>, # 'warning' or 'error'
# 'item_id': <str>, # optional, when it is related to a DatasetItem
# 'subset': <str>, # optional, when it is related to a DatasetItem
# }
'summary': {
'errors': <count: int>,
'warnings': <count: int>
}
}
item_key
is defined as,
item_key = (<DatasetItem.id:str>, <DatasetItem.subset:str>)
bbox_template
and mask_template
are defined as,
bbox_template = {
'width': <numerical_stat_template>,
'height': <numerical_stat_template>,
'area(wxh)': <numerical_stat_template>,
'ratio(w/h)': <numerical_stat_template>,
'short': <numerical_stat_template>, # short = min(w, h)
'long': <numerical_stat_template> # long = max(w, h)
}
mask_template = {
'area': <numerical_stat_template>,
'width': <numerical_stat_template>,
'height': <numerical_stat_template>
}
numerical_stat_template
is defined as,
numerical_stat_template = {
'items_far_from_mean': <dict>,
# {'<item_key>': {<ann_id:int>: <value:float>, }, }
'mean': <float>,
'stddev': <float>,
'min': <float>,
'max': <float>,
'median': <float>,
'histogram': {
'bins': <list>, # [<float>, ]
'counts': <list>, # [<int>, ]
}
}
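The result structure above can be post-processed with a short script. The sketch below counts reports by severity and anomaly type. It assumes the result is already loaded as a Python dict of the documented shape; the sample values and the `summarize_reports` helper are made up for illustration.

```python
from collections import Counter

def summarize_reports(result):
    """Count validation reports by anomaly type and by severity."""
    reports = result['validation_reports']
    by_type = Counter(
        (r['severity'], r['anomaly_type']) for r in reports)
    by_severity = Counter(r['severity'] for r in reports)
    return by_type, by_severity

# A hand-written sample in the documented format (values are made up)
result = {
    'statistics': {},
    'validation_reports': [
        {'anomaly_type': 'NegativeLength', 'description': '...',
         'severity': 'error', 'item_id': 'img1', 'subset': 'train'},
        {'anomaly_type': 'FarFromLabelMean', 'description': '...',
         'severity': 'warning', 'item_id': 'img2', 'subset': 'train'},
        {'anomaly_type': 'FarFromLabelMean', 'description': '...',
         'severity': 'warning', 'item_id': 'img3', 'subset': 'val'},
    ],
    'summary': {'errors': 1, 'warnings': 2},
}

by_type, by_severity = summarize_reports(result)
for (severity, anomaly_type), count in sorted(by_type.items()):
    print(severity, anomaly_type, count)
```

The per-severity counts are expected to agree with the `summary` field of the same result.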
11 - Commit
This command records the current state of a project and
creates a new revision from the working tree.
By default, this command checks sources in the working tree for
changes. If there are unknown changes found, an error will be raised,
unless --allow-foreign
is used. If such changes are committed,
the source will only be available for reproduction from the project
cache, because Datumaro will not know how to repeat them.
The command will add the sources into the project cache. If you only
need to record revision metadata, you can use the --no-cache
parameter.
This can be useful if you want to save disk space and/or have a backup copy
of datasets used in the project.
If there are no changes found, the command will stop. To allow empty
commits, use --allow-empty
.
Usage:
datum commit [-h] -m MESSAGE [--allow-empty] [--allow-foreign]
[--no-cache] [-p PROJECT_DIR]
Parameters:
--allow-empty
- Allow commits with no changes
--allow-foreign
- Allow commits with changes not made by Datumaro
--no-cache
- Don’t put committed datasets into cache, save only metadata
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example:
datum create
datum import -f coco <path/to/coco/>
datum commit -m "Added COCO"
12 - Transform Dataset
Often datasets need to be modified during preparation for model training and
experimenting. In trivial cases it can be done manually - e.g. image renaming
or label renaming. However, in more complex cases even simple modifications
can require too much effort, distracting the user from the real work.
Datumaro provides the datum transform
command to help in such cases.
This command allows modifying dataset images or annotations all at once.
It is designed for batch dataset processing, so if you only
need to modify a few elements of a dataset, you might want to use
other approaches for better performance. A possible solution can be
a simple script, which uses Datumaro API.
The command can be applied to a dataset or a project build target,
a stage or the combined project
target, in which case all the project
targets will be affected. A build tree stage will be recorded
if --stage
is enabled, and the resulting dataset(-s) will be
saved if --apply
is enabled.
By default, datasets are updated in-place. The -o/--output-dir
option can be used to specify another output directory. When
updating in-place, use the --overwrite
parameter (in-place
updates fail by default to prevent data loss), unless a project
target is modified.
The current project (-p/--project
) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project’s working tree is used.
Usage:
datum transform [-h] -t TRANSFORM [-o DST_DIR] [--overwrite]
[-p PROJECT_DIR] [--stage STAGE] [--apply APPLY] [target] [-- EXTRA_ARGS]
Parameters:
<target>
(string) - Target
dataset revpath.
By default, transforms all targets of the current project.
-t, --transform
(string) - Transform method name
--stage
(bool) - Include this action as a project build step.
If true, this operation will be saved in the project
build tree, so that the resulting dataset can be reproduced later.
Applicable only to main project targets (i.e. data sources
and the project
target, but not intermediate stages). Enabled by default.
--apply
(bool) - Run this command immediately. If disabled, only the
build tree stage will be written. Enabled by default.
-o, --output-dir
(string) - Output directory. Can be omitted for
main project targets (i.e. data sources and the project
target, but not
intermediate stages) and dataset targets. If not specified, the results
will be saved in place.
--overwrite
- Allows overwriting existing files in the output directory,
when it is specified and is not empty.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
<extra args>
- The list of extra transformation parameters. Should be
passed after the --
separator after the main command arguments. See
transform descriptions for info about extra parameters. Use the --help
option to print parameter info.
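When the command is called from a script, the argument order from the usage line matters: options first, then the target, then the extra arguments after the `--` separator. The following is a minimal sketch of a helper that assembles such a command line. The `build_transform_cmd` function is hypothetical and only covers a subset of the options listed above.

```python
import shlex

def build_transform_cmd(transform, target=None, output_dir=None,
        overwrite=False, project_dir=None, extra_args=()):
    """Assemble an argument list for `datum transform`."""
    cmd = ['datum', 'transform', '-t', transform]
    if output_dir is not None:
        cmd += ['-o', output_dir]
    if overwrite:
        cmd.append('--overwrite')
    if project_dir is not None:
        cmd += ['-p', project_dir]
    # Positional arguments go after the options
    if target is not None:
        cmd.append(target)
    # Transform-specific parameters go after the '--' separator
    if extra_args:
        cmd.append('--')
        cmd += list(extra_args)
    return cmd

cmd = build_transform_cmd('random_split',
    target='path/to/dataset:voc', overwrite=True,
    extra_args=['--subset', 'train:.67', '--subset', 'test:.33'])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Datumaro is installed.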
Examples:
- Split a VOC-like dataset randomly:
datum transform -t random_split --overwrite path/to/dataset:voc
- Rename images in a project data source by a regex from
frame_XXX
to XXX
:
datum create <...>
datum import <...> -n source-1
datum transform -t rename source-1 -- -e '|frame_(\d+)|\\1|'
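To see what a rename expression of the form `|pattern|replacement|` does to item ids, the effect can be approximated with Python's `re` module. This is only a sketch: the `rename_id` helper is hypothetical, the actual transform may parse the expression differently, and shell escaping of the backreference is not modeled here (the replacement is written as `\1` in a raw string).

```python
import re

def rename_id(item_id, expr):
    """Apply a '<sep>pattern<sep>replacement<sep>' expression to an id.

    The first character of the expression is used as the separator,
    which is an assumption of this sketch.
    """
    sep = expr[0]
    _, pattern, replacement, *_ = expr.split(sep)
    return re.sub(pattern, replacement, item_id)

# Remove the 'frame_' prefix from item ids
print(rename_id('frame_001', r'|frame_(\d+)|\1|'))
# Plain substring replacement
print(rename_id('my_pattern_1', '|pattern|replacement|'))
```

Ids that do not match the pattern are returned unchanged.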
Basic dataset item manipulations:
rename
- Renames dataset items by regular expression
id_from_image_name
- Renames dataset items to their image filenames
reindex
- Renames dataset items with numbers
ndr
- Removes duplicated images from dataset
sampler
- Runs inference and leaves only the most representative images
resize
- Resizes images and annotations in the dataset
Subset manipulations:
random_split
- Splits dataset into subsets randomly
split
- Splits dataset into subsets for classification, detection,
segmentation or re-identification
map_subsets
- Renames and removes subsets
Annotation manipulations:
remap_labels
- Renames, adds or removes labels in dataset
project_labels
- Sets dataset labels to the requested sequence
shapes_to_boxes
- Replaces spatial annotations with bounding boxes
boxes_to_masks
- Converts bounding boxes to instance masks
polygons_to_masks
- Converts polygons to instance masks
masks_to_polygons
- Converts instance masks to polygons
anns_to_labels
- Replaces annotations having labels with label annotations
merge_instance_segments
- Merges grouped spatial annotations into a mask
crop_covered_segments
- Removes occluded segments of covered masks
bbox_value_decrement
- Subtracts 1 from bbox coordinates
Examples:
- Split a dataset randomly to
train
and test
subsets, ratio is 2:1
datum transform -t random_split -- --subset train:.67 --subset test:.33
- Split a dataset for a specific task. The tasks supported are
classification, detection, segmentation and re-identification.
datum transform -t split -- \
-t classification --subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- \
-t detection --subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- \
-t segmentation --subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- \
-t reid --subset train:.5 --subset val:.2 --subset test:.3 \
--query .5
- Convert spatial annotations between each other
datum transform -t boxes_to_masks
datum transform -t masks_to_polygons
datum transform -t polygons_to_masks
datum transform -t shapes_to_boxes
- Set dataset labels to {
person
, cat
, dog
}, remove others, add missing.
Original labels (can be any): cat
, dog
, elephant
, human
New labels: person
(added), cat
(kept), dog
(kept)
datum transform -t project_labels -- -l person -l cat -l dog
- Remap dataset labels,
person
to car
and cat
to dog
,
keep bus
, remove others
datum transform -t remap_labels -- \
-l person:car -l bus:bus -l cat:dog \
--default delete
- Rename dataset items by a regular expression
- Replace
pattern
with replacement
- Remove
frame_
from item ids
datum transform -t rename -- -e '|pattern|replacement|'
datum transform -t rename -- -e '|frame_(\d+)|\\1|'
- Create a dataset from the K hardest items for a model. The dataset will
be split into the
sampled
and unsampled
subsets, based on the model
confidence, which is stored in the scores
annotation attribute.
There are five methods of sampling (the -m/--method
option):
topk
- Return the k items with the highest uncertainty
lowk
- Return the k items with the lowest uncertainty
randk
- Return k random items
mixk
- Select half of the items with the topk method and the rest with the lowk method
randtopk
- First, select 3 * k items randomly, then return
the topk among them.
datum transform -t sampler -- \
-a entropy \
-i train \
-o sampled \
-u unsampled \
-m topk \
-k 20
- Remove duplicated images from a dataset. Keep at most N resulting images.
- Available sampling options (the
-e
parameter):
random
- sample from removed data randomly
similarity
- sample from removed data with ascending
- Available sampling methods (the
-u
parameter):
uniform
- sample data with uniform distribution
inverse
- sample data with reciprocal of the number
datum transform -t ndr -- \
-w train \
-a gradient \
-k 100 \
-e random \
-u uniform
- Resize dataset images and annotations. Supports upscaling, downscaling
and mixed variants.
datum transform -t resize -- -dw 256 -dh 256
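The sampling methods of the `sampler` transform listed above can be sketched in a few lines of Python. This is an illustration of the described selection rules only: the `sample_ids` function, its input layout (a dict of per-item uncertainty scores) and the fixed random seed are assumptions, and the way the uncertainty is computed from model outputs is not modeled.

```python
import random

def sample_ids(scores, k, method, seed=0):
    """Select k item ids from {item_id: uncertainty} with a given method."""
    rng = random.Random(seed)
    ids = list(scores)
    k = min(k, len(ids))
    # Most uncertain items first
    ranked = sorted(ids, key=lambda i: scores[i], reverse=True)
    if method == 'topk':
        return ranked[:k]
    if method == 'lowk':
        return ranked[::-1][:k]
    if method == 'randk':
        return rng.sample(ids, k)
    if method == 'mixk':
        # Half from the top, the rest from the bottom
        n_top = k // 2
        return ranked[:n_top] + ranked[::-1][:k - n_top]
    if method == 'randtopk':
        # Random pool of 3 * k items, then the top k of the pool
        pool = rng.sample(ids, min(3 * k, len(ids)))
        return sorted(pool, key=lambda i: scores[i], reverse=True)[:k]
    raise ValueError("Unknown method '%s'" % method)

scores = {'a': 0.9, 'b': 0.1, 'c': 0.5, 'd': 0.7, 'e': 0.3}
for method in ['topk', 'lowk', 'randk', 'mixk', 'randtopk']:
    print(method, sample_ids(scores, 2, method))
```

In the command itself, the selected items end up in the subset given by `-o`, and the remaining ones in the subset given by `-u`.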
13 - Checkout
This command restores a specific project revision in the project
tree or to restore separate revisions of sources. A revision can be a commit
hash, branch, tag, or any relative reference in the Git format.
This command has multiple forms:
1) datum checkout <revision>
2) datum checkout [--] <source1> ...
3) datum checkout <revision> [--] <source1> <source2> ...
1 - Restores a revision and all the corresponding sources in the
working directory. If there are conflicts between modified files in the
working directory and the target revision, an error is raised, unless
--force
is used.
2, 3 - Restores only selected sources from the specified revision.
The current revision is used, when not set.
“--” can be used to separate source names and revisions:
datum checkout name
- will look for revision “name”
datum checkout -- name
- will look for source “name” in the current
revision
Usage:
datum checkout [-h] [-f] [-p PROJECT_DIR] [rev] [--] [sources [sources ...]]
Parameters:
-f, --force
- Allows overwriting unsaved changes in case of conflicts
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Examples:
-
Restore the previous revision:
datum checkout HEAD~1
-
Restore the saved version of a source in the working tree
datum checkout -- source-1
-
Restore a previous version of a source
datum checkout 33fbfbe my-source
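The three command forms can be summarized as a small argument interpretation routine. The sketch below is a simplified model of the rules described above, not the actual implementation: the `interpret_checkout_args` function is hypothetical, and the real command may additionally check whether the first name is an existing revision.

```python
def interpret_checkout_args(args):
    """Split positional arguments into (revision, sources).

    A revision of None stands for the current revision.
    """
    args = list(args)
    if '--' in args:
        pos = args.index('--')
        before, sources = args[:pos], args[pos + 1:]
        if len(before) > 1:
            raise ValueError("At most one revision can precede '--'")
        return (before[0] if before else None), sources
    if not args:
        return None, []
    # Without '--', the first name is treated as a revision
    return args[0], args[1:]

print(interpret_checkout_args(['HEAD~1']))
print(interpret_checkout_args(['--', 'source-1']))
print(interpret_checkout_args(['33fbfbe', 'my-source']))
```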
14 - Status
This command prints the summary of the source changes between
the working tree of a project and its HEAD revision.
Prints lines in the following format:
<status> <source name>
The list of possible status
values:
modified
- the source data exists and it is changed
foreign_modified
- the source data exists and it is changed,
but Datumaro does not know about the way the differences were made.
If changes are committed, they will only be available for reproduction
from the project cache.
added
- the source was added in the working tree
removed
- the source was removed from the working tree. This status won’t
be reported if just the source data is removed in the working tree.
In such situation the status will be missing
.
missing
- the source data is removed from the working directory.
The source still can be restored from the project cache or reproduced.
Usage:
datum status [-h] [-p PROJECT_DIR]
Parameters:
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example output:
added source-1
modified source-2
foreign_modified source-3
removed source-4
missing source-5
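Since the output has a fixed `<status> <source name>` layout, it is easy to process in a script. Below is a minimal sketch that parses the output into a dict and lists the sources that cannot be reproduced or are missing. The `parse_status` helper is hypothetical, and the sample text is the example output above.

```python
def parse_status(output):
    """Parse `datum status` output into {source name: status}."""
    statuses = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        status, name = line.split(maxsplit=1)
        statuses[name] = status
    return statuses

sample = """\
added source-1
modified source-2
foreign_modified source-3
removed source-4
missing source-5
"""

statuses = parse_status(sample)
needs_attention = sorted(name for name, status in statuses.items()
    if status in ('foreign_modified', 'missing'))
print(needs_attention)
```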
15 - Log
This command prints the history of the current project revision.
Prints lines in the following format:
<short commit hash> <commit message>
Usage:
datum log [-h] [-n MAX_COUNT] [-p PROJECT_DIR]
Parameters:
-n, --max-count
(number, default: 10) - The maximum number of
previous revisions in the output
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example output:
affbh33 Added COCO dataset
eeffa35 Added VOC dataset
16 - Run model inference explanation (explain)
Runs an explainable AI algorithm for a model.
This tool is supposed to help an AI developer to debug a model and a dataset.
Basically, it executes model inference and tries to find the relation between
inputs and outputs of the trained model, i.e. determine decision boundaries
and belief intervals for the classifier.
Currently, the only available algorithm is RISE (article),
which runs the model once and then re-runs it multiple times on
each image to produce a heatmap of activations for each output of the
first inference. Each time, a part of the input image is masked. As a result,
we obtain a number of heatmaps, which show how specific image pixels affected
the inference result. This algorithm doesn’t require any special information
about the model, but it requires the model to return all the outputs and
confidences. The original algorithm supports only classification scenario,
but Datumaro extends it for detection models.
The following use cases are available:
- RISE for classification
- RISE for object detection
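The masking idea behind RISE can be illustrated with a small pure-Python sketch. It is a simplified model for a single scalar output: the masks are upsampled with nearest-neighbour interpolation, while the original algorithm uses smooth, randomly shifted masks. The `rise_heatmap` function, its parameters and the toy model are assumptions made for this example and are not part of Datumaro.

```python
import random

def rise_heatmap(image, model, n_samples=200, cells=4, prob=0.5, seed=0):
    """Estimate pixel importance for a model with a scalar output.

    image: 2D list of floats (H x W)
    model: callable(image) -> float score
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    heatmap = [[0.0] * width for _ in range(height)]
    for _ in range(n_samples):
        # Coarse random grid, each cell is kept with probability `prob`
        grid = [[rng.random() < prob for _ in range(cells)]
                for _ in range(cells)]
        # Upsample the grid to the image size (nearest neighbour)
        mask = [[grid[y * cells // height][x * cells // width]
                 for x in range(width)] for y in range(height)]
        masked = [[image[y][x] if mask[y][x] else 0.0
                   for x in range(width)] for y in range(height)]
        score = model(masked)
        # Pixels visible in this run accumulate the score
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    heatmap[y][x] += score
    norm = n_samples * prob
    return [[v / norm for v in row] for row in heatmap]

def toy_model(img):
    # Hypothetical "classifier": mean brightness of the top-left 4x4 block
    return sum(img[y][x] for y in range(4) for x in range(4)) / 16.0

image = [[1.0] * 8 for _ in range(8)]
heatmap = rise_heatmap(image, toy_model, n_samples=2000)
```

For the toy model, pixels in the top-left block are expected to get higher heatmap values than the rest of the image, because only they influence the score.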
Usage:
datum explain [-h] -m MODEL [-o SAVE_DIR] [-p PROJECT_DIR]
[target] {rise} [RISE_ARGS]
Parameters:
-
<target>
(string) - Target
dataset revpath. By default,
uses the whole current project. An image path can be specified instead.
<image path> - a path to the file.
<revpath> - a dataset path or a revision path.
-
<method>
(string) - The algorithm to use. Currently, only rise
is supported.
-
-m, --model
(string) - The model to use for inference
-
-o, --output-dir
(string) - Directory to save results to
(default: display only)
-
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-
-h, --help
- Print the help message and exit.
-
RISE options:
-s, --max-samples
(number) - Number of algorithm model runs per image
(default: mask size ^ 2).
--mw, --mask-width
(number) - Mask width in pixels (default: 7)
--mh, --mask-height
(number) - Mask height in pixels (default: 7)
--prob
(number) - Mask pixel inclusion probability, controls
mask density (default: 0.5)
--iou, --iou-thresh
(number) - IoU match threshold for detections
(default: 0.9)
--nms, --nms-iou-thresh
(number) - IoU match threshold for detections
for non-maxima suppression (default: no NMS)
--conf, --det-conf-thresh
(number) - Confidence threshold for
detections (default: include all)
-b, --batch-size
(number) - Batch size for inference (default: 1)
--display
- Visualize results during computations
Examples:
-
Run RISE on an image, display results:
datum explain path/to/image.jpg -m mymodel rise --max-samples 50
-
Run RISE on a source revision:
datum explain HEAD~1:source-1 -m model rise
-
Run inference explanation on a single image with online visualization
datum create <...>
datum model add -n mymodel <...>
datum explain image.png -m mymodel \
rise --max-samples 1000 --display
Note: this algorithm requires the model to return
all (or a reasonable number of) the outputs and confidences unfiltered,
i.e. all the Label
annotations for classification models and
all the Bbox
es for detection models.
You can find examples of the expected model outputs in tests/test_RISE.py
For OpenVINO models the output processing script would look like this:
Classification scenario:
from datumaro.components.extractor import *
from datumaro.util.annotation_util import softmax

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, logits, shape = (N, n_classes)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for input, output in zip(inputs, outputs):
        # output holds the logits of a single image, shape = (n_classes,)
        confs = softmax(output)
        # return a Label for every class, with the confidence as 'score'
        image_results = []
        for label, conf in enumerate(confs):
            image_results.append(Label(int(label),
                attributes={'score': float(conf)}))
        results.append(image_results)
    return results
Object Detection scenario:
from datumaro.components.extractor import *

# return a significant number of output boxes to make multiple runs
# statistically correct and meaningful
max_det = 1000

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, shape = (N, 1, K, 7)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for input, output in zip(inputs, outputs):
        input_height, input_width = input.shape[:2]
        detections = output[0]
        image_results = []
        for i, det in enumerate(detections):
            label = int(det[1])
            conf = float(det[2])
            x = max(int(det[3] * input_width), 0)
            y = max(int(det[4] * input_height), 0)
            w = min(int(det[5] * input_width - x), input_width)
            h = min(int(det[6] * input_height - y), input_height)
            image_results.append(Bbox(x, y, w, h,
                label=label, attributes={'score': conf} ))
        results.append(image_results[:max_det])
    return results
17 - Models
Register model
Datumaro can execute deep learning models in various frameworks. Check
the plugins section
for more info.
Supported frameworks:
- OpenVINO
- Custom models via custom
launchers
Models need to be added to the Datumaro project first. It can be done with
the datum model add
command.
Usage:
datum model add [-h] [-n NAME] -l LAUNCHER [--copy] [--no-check]
[-p PROJECT_DIR] [-- EXTRA_ARGS]
Parameters:
-l, --launcher
(string) - Model launcher name
--copy
- Copy model data into project. By default, only the link is saved.
--no-check
- Don’t check that the model can be loaded
-n, --name
(string) - Name of the new model (default: generate
automatically)
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
<extra args>
- Additional arguments for the model launcher
(use -- -h
for help). Must be specified after the main command arguments.
Example: register an OpenVINO model
A model consists of a graph description and weights. There is also a script
used to convert model outputs to internal data structures.
datum create
datum model add \
-n <model_name> -l openvino -- \
-d <path_to_xml> -w <path_to_bin> -i <path_to_interpretation_script>
Interpretation script for an OpenVINO detection model (convert.py
):
You can find OpenVINO model interpreter samples in
datumaro/plugins/openvino/samples
(instruction).
from datumaro.components.extractor import *

max_det = 10
conf_thresh = 0.1

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, shape = (N, 1, K, 7)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for input, output in zip(inputs, outputs):
        input_height, input_width = input.shape[:2]
        detections = output[0]
        image_results = []
        for i, det in enumerate(detections):
            label = int(det[1])
            conf = float(det[2])
            if conf <= conf_thresh:
                continue
            x = max(int(det[3] * input_width), 0)
            y = max(int(det[4] * input_height), 0)
            w = min(int(det[5] * input_width - x), input_width)
            h = min(int(det[6] * input_height - y), input_height)
            image_results.append(Bbox(x, y, w, h,
                label=label, attributes={'score': conf} ))
        results.append(image_results[:max_det])
    return results

def get_categories():
    # Optionally, provide output categories - label map etc.
    # Example:
    label_categories = LabelCategories()
    label_categories.add('person')
    label_categories.add('car')
    return { AnnotationType.label: label_categories }
Remove Models
To remove a model from a project, use the datum model remove
command.
Usage:
datum model remove [-h] [-p PROJECT_DIR] name
Parameters:
<name>
(string) - The name of the model to be removed
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example:
datum create
datum model add <...> -n model1
datum model remove model1
Run Model
This command applies a model to dataset images and produces a new dataset.
Usage:
datum model run [-h] -m MODEL [-o DST_DIR] [--overwrite] [-p PROJECT_DIR] [target]
Parameters:
<target>
(string) - A project build target to be used.
By default, uses the combined project
target.
-m, --model
(string) - Model name
-o, --output-dir
(string) - Output directory. By default, results will
be stored in an auto-generated directory in the current directory.
--overwrite
- Allows overwriting existing files in the output directory,
when it is specified and is not empty.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example: launch inference on a dataset
datum create
datum import <...>
datum model add -n mymodel <...>
datum model run -m mymodel -o inference
18 - Sources
These commands are specific for Data Sources. Read more about them here.
Import Dataset
Datasets can be added to a Datumaro project with the import
command,
which adds a dataset link into the project and downloads (or copies)
the dataset. If you need to add a dataset already copied into the project,
use the add
command.
Dataset format readers can provide some additional import options. To pass
such options, use the --
separator after the main command arguments.
The usage information can be printed with datum import -f <format> -- --help
.
The list of currently available formats is listed in the command help output.
A dataset is imported by its URL. Currently, only local filesystem
paths are supported. The URL can be a file or a directory path
to a dataset. When the dataset is read, it is read as a whole.
However, many formats can have multiple subsets like train
, val
, test
etc. If you want to limit reading only to a specific subset, use
the -r/--path
parameter. It can also be useful when subset files have
non-standard placement or names.
When a dataset is imported, the following things are done:
- URL is saved in the project config
- data is copied into the project
Each data source has a name assigned, which can be used in other commands. To
set a specific name, use the -n/--name
parameter.
The dataset is added into the working tree of the project. A new commit
is not done automatically.
Usage:
datum import [-h] [-n NAME] -f FORMAT [-r PATH] [--no-check]
[-p PROJECT_DIR] url [-- EXTRA_FORMAT_ARGS]
Parameters:
<url>
(string) - A file or directory path to the dataset.
-f, --format
(string) - Dataset format
-r, --path
(string) - A path to the data source, relative to the source URL.
Useful to specify a path to a subset, subtask, or a specific file in URL.
--no-check
- Don’t try to read the source after importing
-n, --name
(string) - Name of the new source (default: generate
automatically)
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
-- <extra format args>
- Additional arguments for the format reader
(use -- -h
for help). Must be specified after the main command arguments.
Example: create a project from images and annotations in different formats,
export as TFrecord for TF Detection API for model training
# 'default' is the name of the subset below
datum create
datum import -f coco_instances -r annotations/instances_default.json path/to/coco
datum import -f cvat <path/to/cvat/default.xml>
datum import -f voc_detection -r custom_subset_dir/default.txt <path/to/voc>
datum import -f datumaro <path/to/datumaro/default.json>
datum import -f image_dir <path/to/images/dir>
datum export -f tf_detection_api -- --save-images
Add Dataset
Existing datasets can be added to a Datumaro project with the add
command.
The command adds a project-local directory as a data source in the project.
Unlike the import
command, it does not copy datasets and only works with local directories.
The source name is defined by the directory name.
Dataset format readers can provide some additional import options. To pass
such options, use the --
separator after the main command arguments.
The usage information can be printed with datum add -f <format> -- --help
.
The list of currently available formats is listed in the command help output.
A dataset is imported as a directory. When the dataset is read, it is read
as a whole. However, many formats can have multiple subsets like train
,
val
, test
etc. If you want to limit reading only to a specific subset,
use the -r/--path
parameter. It can also be useful when subset files have
non-standard placement or names.
The dataset is added into the working tree of the project. A new commit
is not done automatically.
Usage:
datum add [-h] -f FORMAT [-r PATH] [--no-check]
[-p PROJECT_DIR] path [-- EXTRA_FORMAT_ARGS]
Parameters:
<path>
(string) - A file or directory path to the dataset.
-f, --format
(string) - Dataset format
-r, --path
(string) - A path to the data source, relative to the source URL.
Useful to specify a path to a subset, subtask, or a specific file in URL.
--no-check
- Don’t try to read the source after importing
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
-- <extra format args>
- Additional arguments for the format reader
(use -- -h
for help). Must be specified after the main command arguments.
Example: create a project from images and annotations in different formats,
export in YOLO for model training
datum create
datum add -f coco -r annotations/instances_train.json dataset1/
datum add -f cvat dataset2/train.xml
datum export -f yolo -- --save-images
Remove Datasets
To remove a data source from a project, use the remove
command.
Usage:
datum remove [-h] [--force] [--keep-data] [-p PROJECT_DIR] name [name ...]
Parameters:
<name>
(string) - The name of the source to be removed (repeatable)
-f, --force
- Do not fail and stop on errors during removal
--keep-data
- Do not remove source data from the working directory, remove
only project metainfo.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Example:
datum create
datum import -f voc -n src1 <path/to/dataset/>
datum remove src1
19 - Projects
Migrate project
Updates the project from an old version to the current one and saves the
resulting project in the output directory. Projects cannot be updated
in place.
The command tries to map the old source configuration to the new one.
This can fail in some cases, so the command will exit with an error,
unless -f/--force
is specified. With this flag, the command will
skip these errors and continue its work.
Usage:
datum project migrate [-h] -o DST_DIR [-f] [-p PROJECT_DIR] [--overwrite]
Parameters:
-o, --output-dir
(string) - Output directory for the updated project
-f, --force
- Ignore source import errors (default: False)
--overwrite
- Overwrite existing files in the save directory.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Examples:
- Migrate a project from v1 to v2, save the new project in other dir:
datum project migrate -o <output/dir>
Print project info
Prints project configuration info such as available plugins, registered models,
imported sources and build tree.
Usage:
datum project info [-h] [-p PROJECT_DIR] [revision]
Parameters:
<revision>
(string) - Target project revision. By default,
uses the working tree.
-p, --project
(string) - Directory of the project to operate on
(default: current directory).
-h, --help
- Print the help message and exit.
Examples:
Sample output:
Project:
location: /test_proj
Plugins:
extractors: ade20k2017, ade20k2020, camvid, cifar, cityscapes, coco, coco_captions, coco_image_info, coco_instances, coco_labels, coco_panoptic, coco_person_keypoints, coco_stuff, cvat, datumaro, icdar_text_localization, icdar_text_segmentation, icdar_word_recognition, image_dir, image_zip, imagenet, imagenet_txt, kitti, kitti_detection, kitti_raw, kitti_segmentation, label_me, lfw, market1501, mnist, mnist_csv, mot_seq, mots, mots_png, open_images, sly_pointcloud, tf_detection_api, vgg_face2, voc, voc_action, voc_classification, voc_detection, voc_layout, voc_segmentation, wider_face, yolo
converters: camvid, mot_seq_gt, coco_captions, coco, coco_image_info, coco_instances, coco_labels, coco_panoptic, coco_person_keypoints, coco_stuff, kitti, kitti_detection, kitti_segmentation, icdar_text_localization, icdar_text_segmentation, icdar_word_recognition, lfw, datumaro, open_images, image_zip, cifar, yolo, voc_action, voc_classification, voc, voc_detection, voc_layout, voc_segmentation, tf_detection_api, label_me, mnist, cityscapes, mnist_csv, kitti_raw, wider_face, vgg_face2, sly_pointcloud, mots_png, image_dir, imagenet_txt, market1501, imagenet, cvat
launchers:
Models:
Sources:
'source-2':
format: voc
url: /datasets/pascal/VOC2012
location: /test_proj/source-2/
options: {}
hash: 3eb282cdd7339d05b75bd932a1fd3201
stages:
'root':
type: source
hash: 3eb282cdd7339d05b75bd932a1fd3201
'source-3':
format: imagenet
url: /datasets/imagenet/ILSVRC2012_img_val/train
location: /test_proj/source-3/
options: {}
hash: e47804a3ec1a54c9b145e5f1007ec72f
stages:
'root':
type: source
hash: e47804a3ec1a54c9b145e5f1007ec72f