User Manual
- 1: Installation
- 2: How to use Datumaro
- 3: Supported Formats
- 4: Media formats
- 5: Command reference
- 5.1: Checkout
- 5.2: Commit
- 5.3: Convert datasets
- 5.4: Create project
- 5.5: Describe downloadable datasets
- 5.6: Detect dataset format
- 5.7: Compare datasets
- 5.8: Download datasets
- 5.9: Run model inference explanation (explain)
- 5.10: Export Datasets
- 5.11: Filter datasets
- 5.12: Generate Datasets
- 5.13: Print dataset info
- 5.14: Log
- 5.15: Merge Datasets
- 5.16: Models
- 5.17: Patch Datasets
- 5.18: Projects
- 5.19: Sources
- 5.20: Get Project Statistics
- 5.21: Status
- 5.22: Transform Dataset
- 5.23: Utilities
- 5.24: Validate Dataset
- 6: Extending
- 7: Links
- 8: How to control telemetry data collection
1 - Installation
Dependencies
- Python (3.7+)
- Optional: OpenVINO, TensorFlow, PyTorch, MxNet, Caffe, Accuracy Checker, Git
Installation steps
Optionally, set up a virtual environment:
python -m pip install virtualenv
python -m virtualenv venv
. venv/bin/activate
Install:
# From PyPI:
pip install datumaro[default]
# From the GitHub repository:
pip install 'git+https://github.com/cvat-ai/datumaro[default]'
Read more about choosing between datumaro and datumaro[default] in the Customizing installation section below.
Plugins
Datumaro has many plugins, which are responsible for dataset formats, model launchers and other optional components. If a plugin has dependencies, they can require additional installation. You can find the list of all the plugin dependencies in the plugins section.
Customizing installation
- Datumaro has the following installation options:
  - pip install datumaro - for core library functionality
  - pip install datumaro[default] - for normal CLI experience
  In restricted installation environments, where some dependencies are not available, or if you need only the core library functionality, you can install Datumaro without extra plugins.
  The CLI variant (datumaro[default]) requires Git to be installed and available to work with Datumaro projects and dataset versioning features. You can find installation instructions for your platform here.
  In some cases, installing just the core library may not be enough, because the options for installing graphical libraries can be limited on the system (various Docker environments, servers etc.). You can select between opencv-python and opencv-python-headless by setting the DATUMARO_HEADLESS environment variable to 0 or 1 before installing the package. This requires installation from sources (using --no-binary):
  DATUMARO_HEADLESS=1 pip install datumaro --no-binary=datumaro
  This option can’t be covered by extras due to Python packaging system limitations.
- When installing directly from the repository, you can change the installation branch with ...@<branch_name>. Also use the --force-reinstall parameter in this case. This can be useful for testing unreleased versions from GitHub pull requests.
2 - How to use Datumaro
As a standalone tool or a Python module:
datum --help
python -m datumaro --help
python datumaro/ --help
python datum.py --help
As a Python library:
import datumaro as dm
...
dataset = dm.Dataset.import_from(path, format)
...
Glossary
-
Basic concepts:
- Dataset - A collection of dataset items, which consist of media and associated annotations.
- Dataset item - A basic single element of the dataset. Also known as “sample”, “entry”. In different datasets it can be an image, a video frame, a whole video, a 3d point cloud etc. Typically, has corresponding annotations.
- (Datumaro) Project - A combination of multiple datasets, plugins, models and metadata.
-
Project versioning concepts:
- Data source - A link to a dataset or a copy of a dataset inside a project. Basically, a URL + dataset format name.
- Project revision - A commit or a reference from Git (branch, tag, HEAD~3 etc.). A revision is referenced by data hash. The HEAD revision is the currently selected revision of the project.
- Revision tree - A project build tree and plugins at a specified revision.
- Working tree - The revision tree in the working directory of a project.
- Data source revision - A state of a data source at a specific stage. A revision is referenced by the data hash.
- Object - The data of a revision tree or a data source revision. An object is referenced by the data hash.
- Dataset revpath - A path to a dataset in a special format. Revpaths are meant to specify paths to files, directories or data source revisions in a uniform way in the CLI.
  - Dataset path - A path to a dataset in the following format: <dataset path>:<format>
    - format is optional. If it is not specified, Datumaro will try to detect it automatically
  - Revision path - A path to a data source revision in a project. The syntax is: <project path>@<revision>:<target name>, any part can be omitted.
    - Default project is the current project (-p/--project CLI arg.) Local revpaths imply that the current project is used and this part should be omitted.
    - Default revision is the working tree of the project
    - Default build target is project
    If a path refers to project (i.e. target name is not set, or this target is exactly specified), the target dataset is the result of joining all the project data sources. Otherwise, if the path refers to a data source revision, the corresponding stage from the revision build tree will be used.
Dataset building concepts:
- Stage - A revision of a dataset - the original dataset or its modification after transformation, filtration or something else. A build tree node. A stage is referred to by a name.
- Build tree - A directed graph (tree) with root nodes at data sources and a single top node called project, which represents a joined dataset. Each data source has a starting root node, which corresponds to the original dataset. The internal graph nodes are stages.
- Build target - A data source or a stage name. Data source names correspond to the last stages of data sources.
- Pipeline - A subgraph of a stage, which includes all the ancestors.
-
Other:
- Transform - A transformation operation over dataset elements. Examples are image renaming, image flipping, image and subset renaming, label remapping etc. Corresponds to the transform command.
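The revision path syntax from the glossary can be illustrated with a small parser. This is only a sketch and not Datumaro's actual implementation: RevPath and parse_revpath are hypothetical names, and the sketch assumes that neither the project path nor the revision contains "@" or ":", and that a bare name is a build target in the current project.

```python
from typing import NamedTuple, Optional


class RevPath(NamedTuple):
    project: Optional[str]   # None means "the current project"
    revision: Optional[str]  # None means "the working tree"
    target: str              # build target, "project" by default


def parse_revpath(text: str) -> RevPath:
    """Split '<project path>@<revision>:<target name>' into its parts."""
    project: Optional[str] = None
    rest = text
    has_project_part = "@" in text
    if has_project_part:
        project, rest = text.split("@", 1)

    if ":" in rest:
        revision, target = rest.split(":", 1)
    elif has_project_part:
        # "<project>@<revision>": no target name was given
        revision, target = rest, ""
    else:
        # Assumption: a bare name is a target in the current project
        revision, target = "", rest

    # Empty parts fall back to the defaults described in the glossary
    return RevPath(project or None, revision or None, target or "project")


print(parse_revpath("proj/@HEAD~1:source1"))
print(parse_revpath("HEAD~1:source1"))
print(parse_revpath("source1"))
```

Omitted parts come back as None (current project, working tree) or as the default build target name.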
Command-line workflow
In Datumaro, most command-line commands operate on projects, but there are also a few commands that operate on datasets directly. There are 2 basic ways to use Datumaro from the command line:
- Use the convert, diff, merge commands directly on existing datasets
- Create a Datumaro project and operate on it:
  - Create an empty project with create
  - Import existing datasets with import
  - Modify the project with transform and filter
  - Create new revisions of the project with commit, navigate over them using checkout, compare with diff, compute statistics with stats
  - Export the resulting dataset with export
  - Check project config with project info
Basically, a project is a combination of datasets, models and environment.
A project can contain an arbitrary number of datasets (data sources). A project acts as a manager for them and allows you to manipulate them separately or as a whole, in which case it combines dataset items from all the sources into one composite dataset. You can manage separate datasets in a project with commands in the datum source command line context.
Note that modifying operations (transform, filter, patch) are applied in-place to the datasets by default.
If you want to interact with models, you need to add them to the project first using the model add command.
A typical way to obtain Datumaro projects is to export tasks in CVAT UI.
Project data model
Datumaro tries to combine a “Git for datasets” and a build system like make or CMake for datasets in a single solution. Currently, Project represents a Version Control System for datasets, which is based on Git and DVC projects. Each project Revision describes a build tree of a dataset with all the related metadata. A build tree consists of a number of data sources and transformation stages. Each data source has its own set of build steps (stages). By default, Datumaro copies datasets into the project and works on them in-place. Modifying operations are recorded in the project, so any of the dataset revisions can be reproduced when needed. Multiple dataset versions can be stored in different branches with the common data shared.
Let’s consider an example of a build tree. There are 2 data sources in the example project. The resulting dataset is obtained by simply merging (joining) the results of the input datasets. “Source 1” and “Source 2” are the names of data sources in the project. Each source has several stages with their own names. The first stage (called “root”) represents the original contents of a data source - the data at the user-provided URL. The following stages represent operations that need to be done with the data source to prepare the resulting dataset.
Roughly, such a build tree can be created by the following commands (arguments are omitted for simplicity):
datum create
# describe the first source
datum import <...> -n source1
datum filter <...> source1
datum transform <...> source1
datum transform <...> source1
# describe the second source
datum import <...> -n source2
datum model add <...>
datum transform <...> source2
datum transform <...> source2
Now, the resulting dataset can be built with:
datum export <...>
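The build tree created by these commands can be modelled as a small graph. The sketch below is only an illustration of the glossary terms (stage, build target, pipeline); the stage names are hypothetical, and the dictionary layout is an assumption rather than Datumaro's internal representation.

```python
# Hypothetical stage names; each node maps to the list of its parent nodes.
build_tree = {
    "source1.root": [],
    "source1.filter": ["source1.root"],
    "source1.transform1": ["source1.filter"],
    "source1.transform2": ["source1.transform1"],
    "source2.root": [],
    "source2.transform1": ["source2.root"],
    "source2.transform2": ["source2.transform1"],
    # The single top node joins the last stages of all sources
    "project": ["source1.transform2", "source2.transform2"],
}


def pipeline(tree, target):
    """Return the target node and all of its ancestors, parents first."""
    ordered = []

    def visit(node):
        if node in ordered:
            return
        for parent in tree[node]:
            visit(parent)
        ordered.append(node)

    visit(target)
    return ordered


print(pipeline(build_tree, "source1.transform1"))
print(len(pipeline(build_tree, "project")))
```

The pipeline of the project node covers every stage of both sources, which matches the idea that exporting the project joins all data sources.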
Project layout
project/
├── .dvc/
├── .dvcignore
├── .git/
├── .gitignore
├── .datumaro/
│ ├── cache/ # object cache
│ │ └── <2 leading symbols of obj hash>/
│ │ └── <remaining symbols of obj hash>/
│ │ └── <object data>
│ │
│ ├── models/ # project-specific models
│ │
│ ├── plugins/ # project-specific plugins
│ │ ├── plugin1/ # composite plugin, a directory
│ │ | ├── __init__.py
│ │ | └── file2.py
│ │ ├── plugin2.py # simple plugin, a file
│ │ └── ...
│ │
│ ├── tmp/ # temp files
│ └── tree/ # working tree metadata
│ ├── config.yml
│ └── sources/
│ ├── <source name 1>.dvc
│ ├── <source name 2>.dvc
│ └── ...
│
├── <source name 1>/ # working directory for the source 1
│ └── <source data>
└── <source name 2>/ # working directory for the source 2
└── <source data>
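The object cache layout shown above splits each object hash into a 2-symbol directory and a directory named after the remaining symbols. A minimal sketch of that mapping follows; cache_object_dir is a hypothetical helper, and the hash value in the example is made up.

```python
from pathlib import PurePosixPath


def cache_object_dir(obj_hash, cache_root=".datumaro/cache"):
    """Map an object hash to its directory inside the project cache."""
    if len(obj_hash) < 3:
        raise ValueError("the hash is too short to be split")
    # <2 leading symbols of obj hash>/<remaining symbols of obj hash>/
    return PurePosixPath(cache_root) / obj_hash[:2] / obj_hash[2:]


print(cache_object_dir("ab12cd34"))
```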
Datasets and Data Sources
A project can contain an arbitrary number of Data Sources. Each Data Source describes a dataset in a specific format. A project acts as a manager for the data sources and allows you to manipulate them separately or as a whole, in which case it combines dataset items from all the sources into one composite dataset. You can manage separate sources in a project with commands in the datum source command line context.
Datasets come in a wide variety of formats. Each dataset format defines its own data structure and rules on how to interpret the data. For example, the following data structure is used in COCO format:
/dataset/
- /images/<id>.jpg
- /annotations/
Datumaro supports complete datasets, having both image data and annotations, or incomplete ones, having annotations only. Incomplete datasets can be used to prepare images and annotations independently of each other, or to analyze or modify just the lightweight annotations without the need to download the whole dataset.
Check supported formats for more info about format specifications, supported import and export options and other details. The list of formats can be extended by custom plugins, check extending tips for information on this topic.
Use cases
Let’s consider a few examples describing what Datumaro does for you behind the scenes.
The first example explains how working trees, working directories and the cache interact. Suppose, there is a dataset which we want to modify and export in some other format. To do it with Datumaro, we need to create a project and register the dataset as a data source:
datum create
datum import <...> -n source1
The dataset will be copied to the working directory inside the project. It will be added to the project working tree.
After the dataset is added, we want to transform it and filter out some irrelevant samples, so we run the following commands:
datum transform <...> source1
datum filter <...> source1
The commands modify the data source inside the working directory, in place. The operations done are recorded in the working tree.
Now, we want to make a new version of the dataset and make a snapshot in the project cache. So we commit the working tree:
datum commit <...>
At this time, the data source is copied into the project cache and a new project revision is created. The dataset operation history is saved, so the dataset can be reproduced even if it is removed from the cache and the working directory. Note, however, that the original dataset hash was not computed, so Datumaro won’t be able to compare dataset hashes on re-downloading. If this is desired, consider making a commit with an unmodified data source.
After this, we make some other modifications to the dataset and make a new commit. Note that the dataset is not cached until a commit is done.
When the dataset is ready and all the required operations are done, we can export it to the required format. We can export the resulting dataset, or any previous stage.
datum export <...> source1
datum export <...> source1.stage3
Let’s extend the example. Imagine we have a project with 2 data sources. Roughly, it corresponds to the following set of commands:
datum create
datum import <...> -n source1
datum import <...> -n source2
datum transform <...> source1 # used 3 times
datum transform <...> source2 # used 5 times
Then, for some reason, the project cache was cleaned of source1 revisions. We also don’t have anything in the project working directories - suppose the user removed them to save disk space.
Let’s see what happens if we call the diff command with 2 different revisions now.
Datumaro needs to reproduce 2 dataset revisions requested so that they could be read and compared. Let’s see how the first dataset is reproduced step-by-step:
- source1.stage2 will be looked for in the project cache. It won’t be found, since the cache was cleaned.
- Then, Datumaro will look for previous source revisions in the cache and won’t find any.
- The project can be marked read-only, if we are not working with the “current” project (which is specified by the -p/--project command parameter). In the example, the command is datum diff rev1:... rev2:..., which means there is a project in the current directory, so the project we are working with is not read-only. If a command target was specified as datum diff <project>@<rev>:<source>, the project would be loaded as read-only. If a project is read-only, we can’t do anything more to reproduce the dataset and can only exit with an error (3a). The reason for such behavior is that the dataset downloading can be quite expensive (in terms of time, disk space etc.). Such side effects are expected to be controlled manually.
- If the project is not read-only (3b), Datumaro will try to download the original dataset and reproduce the resulting dataset. The data hash will be computed and hashes will be compared (if the data source had hash computed on addition). On success, the data will be put into the cache.
- The downloaded dataset will be read and the remaining operations from the source history will be re-applied.
- The resulting dataset might be cached in some cases.
- The resulting dataset is returned.
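The steps above can be sketched as a short function. This is a simplified model and not Datumaro's code: restore_revision, download and apply_stage are hypothetical names, the cache is a plain dictionary, and hash verification of the downloaded data is left out.

```python
def restore_revision(stages, cache, read_only, download, apply_stage):
    """Reproduce the last stage in `stages` (ordered from root to the target).

    cache: dict mapping stage names to cached data
    download: callable returning the original (root) dataset
    apply_stage: callable(stage_name, data) returning the modified data
    """
    # Look for the requested stage, then for earlier stages, in the cache
    for index in range(len(stages) - 1, -1, -1):
        if stages[index] in cache:
            data = cache[stages[index]]
            break
    else:
        # Nothing is cached: a read-only project cannot download anything
        if read_only:
            raise RuntimeError("cannot reproduce the dataset in a read-only project")
        # Otherwise, download the original dataset and put it into the cache
        index = 0
        data = download()
        cache[stages[0]] = data

    # Re-apply the remaining operations from the source history
    for stage in stages[index + 1:]:
        data = apply_stage(stage, data)
    return data


stages = ["root", "stage1", "stage2"]
cache = {}
result = restore_revision(
    stages, cache, read_only=False,
    download=lambda: ["original"],
    apply_stage=lambda stage, data: data + [stage],
)
print(result)
```

In this toy run the data is a list that records which operations were applied, so the result shows the original data followed by the re-applied stages.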
source2 will be looked for in the same way. In our case, it will be found in the cache and returned. Once both datasets are restored and read, they are compared.
Consider another situation. Let’s try to export source1. Suppose we have a clear project cache and source1 has a copy in the working directory.
Again, Datumaro needs to reproduce the requested dataset revision (stage).
- It looks for the dataset in the working directory and finds some data. If there is no source working directory, Datumaro will try to reproduce the source using the approach described above (1b).
- The data hash is computed and compared with the one saved in the history. If the hashes match, the dataset is read and returned (4). Note: we can’t use the cached hash stored in the working tree info - it can be outdated, so we need to compute it again.
- Otherwise, Datumaro tries to detect the stage by the data hash. If the current stage is not cached, the tree is the working tree and the working directory is not empty, the working copy is hashed and matched against the source stage list. If there is a matching stage, it will be read and the missing stages will be added. The result might be cached in some cases. If there is no matching stage in the source history, the situation can be contradictory. Currently, an error is raised (3b).
- The resulting dataset is returned.
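Matching a working copy against the recorded stages can be sketched as a hash lookup. The sketch assumes SHA-256 over raw bytes purely for illustration; the hashing scheme Datumaro actually uses is not described here, and data_hash and detect_stage are hypothetical names.

```python
import hashlib


def data_hash(data: bytes) -> str:
    # Assumption: a plain SHA-256 digest stands in for the real data hash
    return hashlib.sha256(data).hexdigest()


def detect_stage(working_copy: bytes, stage_hashes: dict) -> str:
    """Find the stage whose recorded hash matches the working copy."""
    current = data_hash(working_copy)  # always recomputed, never taken from stale metadata
    for stage, known_hash in stage_hashes.items():
        if known_hash == current:
            return stage
    raise LookupError("the working copy does not match any recorded stage")


history = {"root": data_hash(b"v1"), "stage1": data_hash(b"v2")}
print(detect_stage(b"v2", history))
```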
After the requested dataset is obtained, it is exported in the requested format.
To sum up, Datumaro tries to restore a dataset from the project cache or reproduce it from sources. It can be done as long as the source operations are recorded and any step data is available. Note that cache objects share common files, so if there are only annotation differences between datasets, or data sources contain the same images, there will only be a single copy of the related media files. This helps to keep storage use reasonable and avoid unnecessary data copies.
Examples
Example: create a project, add dataset, modify, restore an old version
datum create
datum import <path/to/dataset> -f coco -n source1
datum commit -m "Added a dataset"
datum transform -t shapes_to_boxes
datum filter -e '/item/annotation[label="cat" or label="dog"]' -m i+a
datum commit -m "Transformed"
datum checkout HEAD~1 -- source1 # restore a previous revision
datum status # prints "modified source1"
datum checkout source1 # restore the last revision
datum export -f voc -- --save-images
3 - Supported Formats
List of supported formats:
- ADE20k (v2017) (import-only)
- ADE20k (v2020) (import-only)
- Align CelebA (classification, landmarks) (import-only)
- BraTS (segmentation) (import-only)
- BraTS Numpy (detection, segmentation) (import-only)
- CamVid (segmentation)
- CelebA (classification, detection, landmarks) (import-only)
- CIFAR-10/100 (classification (python version))
- Cityscapes (segmentation)
- Common Semantic Segmentation (segmentation)
- Common Super Resolution
- CVAT (for images, for video (import-only))
- ICDAR13/15 (word_recognition, text_localization, text_segmentation)
- ImageNet (classification, detection)
  - Dataset example
  - Dataset example (txt for classification)
  - Detection format is the same as in PASCAL VOC
  - Format documentation
- KITTI (segmentation, detection)
- KITTI 3D (raw/tracklets/velodyne points)
- Kinetics 400/600/700
- LabelMe (labels, boxes, masks)
- LFW (classification, person re-identification, landmarks)
- Mapillary Vistas (import-only)
- Market-1501 (person re-identification)
- MARS (import-only)
- MNIST (classification)
- MNIST in CSV (classification)
- MOT sequences
- MOTS (png)
- MPII Human Pose Dataset (detection, pose estimation) (import-only)
- MPII Human Pose Dataset (JSON) (detection, pose estimation) (import-only)
- MS COCO (image_info, instances, person_keypoints, captions, labels, panoptic, stuff)
  - Format specification
  - Dataset example
  - labels are our extension - like instances with only category_id
  - Format documentation
- NYU Depth Dataset V2 (depth estimation) (import-only)
- Open Images (classification, detection, segmentation)
- PASCAL VOC (classification, detection, segmentation (class, instances), action_classification, person_layout)
- Supervisely (pointcloud)
- SYNTHIA (segmentation) (import-only)
- TF Detection API (bboxes, masks)
  - Format specifications: bboxes, masks
  - Dataset example
- VGGFace2 (landmarks, bboxes)
- VoTT CSV (detection) (import-only)
- VoTT JSON (detection) (import-only)
- WIDER Face (bboxes)
- YOLO (bboxes)
Supported annotation types
- Labels
- Bounding boxes
- Polygons
- Polylines
- (Segmentation) Masks
- (Key-)Points
- Captions
- 3D cuboids
- Super Resolution Annotation
- Depth Annotation
Datumaro does not separate datasets by tasks like classification, detection etc. Instead, datasets can have any annotations. When a dataset is exported in a specific format, only relevant annotations are exported.
Dataset meta info file
It is possible to use classes that are not original to the format. To do this, use dataset_meta.json.
{
"label_map": {"0": "background", "1": "car", "2": "person"},
"segmentation_colors": [[0, 0, 0], [255, 0, 0], [0, 0, 255]],
"background_label": "0"
}
- label_map is a dictionary where the class ID is the key and the class name is the value.
- segmentation_colors is a list of channel-wise values for each class. This is only necessary for the segmentation task.
- background_label is a background label ID in the dataset.
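As an illustration, a dataset_meta.json file like the one above can be read with the standard json module. The sketch assumes that segmentation_colors lists colors in ascending class ID order; that ordering is an assumption of this example and is not stated by the format description.

```python
import json

meta_text = """
{
    "label_map": {"0": "background", "1": "car", "2": "person"},
    "segmentation_colors": [[0, 0, 0], [255, 0, 0], [0, 0, 255]],
    "background_label": "0"
}
"""

meta = json.loads(meta_text)

# JSON object keys are strings, so convert class IDs to integers
labels = {int(class_id): name for class_id, name in meta["label_map"].items()}

# Assumption: colors are listed in ascending class ID order
colors = {
    labels[class_id]: tuple(color)
    for class_id, color in zip(sorted(labels), meta["segmentation_colors"])
}

background = labels[int(meta["background_label"])]

print(labels)
print(colors)
print(background)
```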
4 - Media formats
Datumaro supports the following media types:
- 2D RGB(A) images
- KITTI Point Clouds
To create an unlabelled dataset from an arbitrary directory with images, use the image_dir and image_zip formats:
datum create -o <project/dir>
datum import -p <project/dir> -f image_dir <directory/path/>
or, if you work with Datumaro API:
- for using with a project:
  from datumaro.project import Project
  project = Project.init()
  project.import_source('source1', format='image_dir', url='directory/path/')
  dataset = project.working_tree.make_dataset()
- for using as a dataset:
  from datumaro import Dataset
  dataset = Dataset.import_from('directory/path/', 'image_dir')
This will search for images in the directory recursively and add them as dataset entries with names like <subdir1>/<subsubdir1>/<image_name1>.
The list of formats matches the list of supported image formats in OpenCV:
.jpg, .jpeg, .jpe, .jp2, .png, .bmp, .dib, .tif, .tiff, .tga, .webp, .pfm,
.sr, .ras, .exr, .hdr, .pic, .pbm, .pgm, .ppm, .pxm, .pnm
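The recursive search and the naming rule can be approximated with the standard library. This is a rough sketch and not the image_dir implementation: find_image_items is a hypothetical helper, and matching extensions case-insensitively is an assumption of this example.

```python
import tempfile
from pathlib import Path

# The extension list given above
IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".jpe", ".jp2", ".png", ".bmp", ".dib", ".tif", ".tiff",
    ".tga", ".webp", ".pfm", ".sr", ".ras", ".exr", ".hdr", ".pic", ".pbm",
    ".pgm", ".ppm", ".pxm", ".pnm",
}


def find_image_items(root):
    """Return item names like '<subdir1>/<subsubdir1>/<image_name1>'."""
    root = Path(root)
    return sorted(
        path.relative_to(root).with_suffix("").as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


# Build a throwaway directory tree to try the function on
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "subdir1" / "subsubdir1"
    nested.mkdir(parents=True)
    (nested / "image_name1.jpg").touch()
    (Path(tmp) / "notes.txt").touch()  # not an image, should be skipped
    found = find_image_items(tmp)

print(found)
```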
Once there is a Dataset instance, its items can be split into subsets, renamed, filtered, joined with annotations, exported in various formats etc.
To import frames from a video, you can split the video into frames with the split_video command and then use the image_dir format described above. In more complex cases, consider using FFmpeg and other tools for video processing.
Alternatively, you can use the video_frames format directly:
Note, however, that it can produce different results if the system environment changes. If you want to obtain reproducible results, consider splitting the video into frames by any method.
datum create -o <project/dir>
datum import -p <project/dir> -f video_frames <video/path.avi>
from datumaro import Dataset
dataset = Dataset.import_from('video.mp4', 'video_frames')
Datumaro supports the following video formats:
.3gp, .3g2, .asf, .wmv, .avi, .divx, .evo, .f4v, .flv, .mkv, .mk3d,
.mp4, .mpg, .mpeg, .m2p, .ps, .ts, .m2ts, .mxf, .ogg, .ogv, .ogx,
.mov, .qt, .rmvb, .vob, .webm
5 - Command reference
%%{init { 'theme':'neutral' }}%%
flowchart LR
d(("#0009; datum #0009;")):::mainclass
m(model):::nofillclass
p(project):::nofillclass
s(source):::nofillclass
d===m
m===m_add[add]:::hideclass
m===m_info[info]:::hideclass
m===m_remove[remove]:::hideclass
m===m_run[run]:::hideclass
d===p
p===p_info[info]:::hideclass
p===p_migrate[migrate]:::hideclass
d===s
s===s_add[add]:::hideclass
s===s_info[info]:::hideclass
s===s_remove[remove]:::hideclass
d====_add[add]:::filloneclass
d====_create[create]:::filloneclass
d====_describe_downloads[describe-downloads]:::filloneclass
d====_detect_format[detect-format]:::filloneclass
d====_download[download]:::filloneclass
d====_export[export]:::filloneclass
d====_import[import]:::filloneclass
d====_info[info]:::filloneclass
d====_remove[remove]:::filloneclass
d====_generate[generate]:::filloneclass
d====_filter[filter]:::filltwoclass
d====_transform[transform]:::filltwoclass
d====_diff[diff]:::fillthreeclass
d====_explain[explain]:::fillthreeclass
d====_merge[merge]:::fillthreeclass
d====_patch[patch]:::fillthreeclass
d====_stats[stats]:::fillthreeclass
d====_validate[validate]:::fillthreeclass
d====_checkout[checkout]:::fillfourclass
d====_commit[commit]:::fillfourclass
d====_log[log]:::fillfourclass
d====_status[status]:::fillfourclass
classDef nofillclass fill-opacity:0;
classDef hideclass fill-opacity:0,stroke-opacity:0;
classDef filloneclass fill:#CCCCFF,stroke-opacity:0;
classDef filltwoclass fill:#FFFF99,stroke-opacity:0;
classDef fillthreeclass fill:#CCFFFF,stroke-opacity:0;
classDef fillfourclass fill:#CCFFCC,stroke-opacity:0;
The command line is split into separate commands and command contexts. Contexts group multiple commands related to a specific topic, e.g. project operations, data source operations etc. Almost all the commands operate on projects, so the project context and commands without a context are mostly the same. By default, commands look for a project in the current directory. If the project you’re working on is located somewhere else, you can pass the -p/--project <path> argument to the command.
Note: command behavior is subject to change, so this text might be outdated. Always check the --help output of the specific command.
Note: command parameters must be passed prior to the positional arguments.
Datumaro functionality is available with the datum command.
Usage:
datum [-h] [--version] [--loglevel LOGLEVEL] [command] [command args]
Parameters:
- --loglevel (string) - Logging level, one of debug, info, warning, error, critical (default: info)
- --version - Print the version number and exit.
- -h, --help - Print the help message and exit.
5.1 - Checkout
This command restores a specific project revision in the project tree, or restores separate revisions of sources. A revision can be a commit hash, branch, tag, or any relative reference in the Git format.
This command has multiple forms:
1) datum checkout <revision>
2) datum checkout [--] <source1> ...
3) datum checkout <revision> [--] <source1> <source2> ...
1 - Restores a revision and all the corresponding sources in the working directory. If there are conflicts between modified files in the working directory and the target revision, an error is raised, unless --force is used.
2, 3 - Restores only selected sources from the specified revision. The current revision is used, when not set.
"--" can be used to separate source names and revisions:
- datum checkout name - will look for revision “name”
- datum checkout -- name - will look for source “name” in the current revision
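The role of the "--" separator can be illustrated with a small function that splits positional arguments into a revision and source names. This is a simplified model of the rule described above and not the actual argument parser of the command; split_checkout_args is a hypothetical name, and treating the first argument as a revision when there is no separator is an assumption of this sketch.

```python
def split_checkout_args(args):
    """Split checkout positionals into (revision, sources)."""
    if "--" in args:
        separator = args.index("--")
        before, sources = args[:separator], args[separator + 1:]
        if len(before) > 1:
            raise ValueError("at most one revision can be given before '--'")
        # No revision before the separator means "the current revision"
        return (before[0] if before else None), sources
    if not args:
        return None, []
    # Without a separator, the first argument is looked up as a revision
    return args[0], args[1:]


print(split_checkout_args(["HEAD~1"]))
print(split_checkout_args(["--", "source-1"]))
print(split_checkout_args(["33fbfbe", "my-source"]))
```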
Usage:
datum checkout [-h] [-f] [-p PROJECT_DIR] [rev] [--] [sources [sources ...]]
Parameters:
- --force - Allows to overwrite unsaved changes in case of conflicts
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Examples:
- Restore the previous revision:
  datum checkout HEAD~1
- Restore the saved version of a source in the working tree:
  datum checkout -- source-1
- Restore a previous version of a source:
  datum checkout 33fbfbe my-source
5.2 - Commit
This command records the current state of a project and creates a new revision from the working tree.
By default, this command checks sources in the working tree for changes. If unknown changes are found, an error is raised, unless --allow-foreign is used. If such changes are committed, the source will only be available for reproduction from the project cache, because Datumaro will not know how to repeat them.
The command will add the sources into the project cache. If you only need to record revision metadata, you can use the --no-cache parameter. This can be useful if you want to save disk space and/or have a backup copy of datasets used in the project.
If there are no changes found, the command will stop. To allow empty commits, use --allow-empty.
Usage:
datum commit [-h] -m MESSAGE [--allow-empty] [--allow-foreign]
[--no-cache] [-p PROJECT_DIR]
Parameters:
- --allow-empty - Allow commits with no changes
- --allow-foreign - Allow commits with changes made not by Datumaro
- --no-cache - Don’t put committed datasets into cache, save only metadata
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example:
datum create
datum import -f coco <path/to/coco/>
datum commit -m "Added COCO"
5.3 - Convert datasets
This command converts a dataset from one format to another. The command is a usability alias for create, add and export and just provides a simpler way to obtain the same results in simple cases. A list of supported formats can be found in the --help output of this command.
Usage:
datum convert [-h] [-i SOURCE] [-if INPUT_FORMAT] -f OUTPUT_FORMAT
[-o DST_DIR] [--overwrite] [-e FILTER] [--filter-mode FILTER_MODE]
[-- EXTRA_EXPORT_ARGS]
Parameters:
- -i, --input-path (string) - Input dataset path. The current directory is used by default.
- -if, --input-format (string) - Input dataset format. Will try to detect, if not specified.
- -f, --output-format (string) - Output format
- -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used.
- --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
- -e, --filter (string) - XML XPath filter expression for dataset items
- --filter-mode (string) - The filtering mode. Default is the i mode.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- -- <extra export args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.
Example: convert a VOC-like dataset to a COCO-like one:
datum convert --input-format voc --input-path <path/to/voc/> \
--output-format coco \
-- --save-images
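When the conversion is driven from a script, the argument order matters: the main parameters come first and the extra export arguments follow the "--" separator. The sketch below only assembles the argument list using the flags documented above; build_convert_command is a hypothetical helper, and the paths are placeholders.

```python
def build_convert_command(input_path, input_format, output_format,
                          output_dir=None, extra_export_args=()):
    """Assemble a 'datum convert' argument list without running it."""
    command = [
        "datum", "convert",
        "--input-path", input_path,
        "--input-format", input_format,
        "--output-format", output_format,
    ]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    if extra_export_args:
        # Format writer arguments must come last, after the separator
        command += ["--", *extra_export_args]
    return command


command = build_convert_command(
    "path/to/voc/", "voc", "coco",
    extra_export_args=["--save-images"],
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where the datum command is installed.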
5.4 - Create project
The command creates an empty project. A project is required for most of Datumaro’s functionality.
By default, the project is created in the current directory. To specify another output directory, pass the -o/--output-dir parameter. If the output directory already contains a Datumaro project, an error is raised, unless --overwrite is used.
Usage:
datum create [-h] [-o DST_DIR] [--overwrite]
Parameters:
- -o, --output-dir (string) - Allows to specify an output directory. The current directory is used by default.
- --overwrite - Allows to overwrite existing project files in the output directory. Any other files are not touched.
- -h, --help - Print the help message and exit.
Examples:
Example: create an empty project in the my_dataset directory
datum create -o my_dataset/
Example: create a new empty project in the current directory, remove the existing one
datum create
...
datum create --overwrite
5.5 - Describe downloadable datasets
This command reports various information about datasets that can be downloaded with the download command. The information is reported either as human-readable text (the default) or as a JSON object. The format can be selected with the --report-format option.
When the JSON output format is selected, the output document has the following schema:
{
"<dataset name>": {
"default_output_format": "<Datumaro format name>",
"description": "<human-readable description>",
"download_size": <total size of the downloaded files in bytes>,
"home_url": "<URL of a web page describing the dataset>",
"human_name": "<human-readable dataset name>",
"num_classes": <number of classes in the dataset>,
"subsets": {
"<subset name>": {
"num_items": <number of items in the subset>
},
...
},
"version": "<version number>"
},
...
}
home_url may be null if there is no suitable web page for the dataset.
num_classes may be null if the dataset does not involve classification.
version currently contains the version number supplied by TFDS. In future versions of Datumaro, datasets might come from other sources; the way version numbers will be set for those is to be determined.
New object members may be added in future versions of Datumaro.
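As a rough illustration of consuming the JSON report, the sketch below parses a document shaped like the schema above and builds one summary line per dataset. The sample values (dataset name, sizes, version) are made up for the example, and summarize is a hypothetical helper, not part of Datumaro. It tolerates null values and ignores members it does not know about.

```python
import json

# Made-up sample shaped like the schema above; not real command output.
SAMPLE_REPORT = """
{
  "tfds:example": {
    "default_output_format": "imagenet_txt",
    "description": "An illustrative dataset entry.",
    "download_size": 1048576,
    "home_url": null,
    "human_name": "Example",
    "num_classes": 10,
    "subsets": {
      "train": {"num_items": 600},
      "test": {"num_items": 100}
    },
    "version": "1.0.0"
  }
}
"""


def summarize(report_text):
    """Return one summary line per dataset (hypothetical helper)."""
    report = json.loads(report_text)
    lines = []
    for name, info in sorted(report.items()):
        total_items = sum(s["num_items"] for s in info["subsets"].values())
        # home_url and num_classes may be null (None after parsing).
        url = info.get("home_url") or "n/a"
        classes = info.get("num_classes")
        classes_text = "n/a" if classes is None else str(classes)
        size_mib = info["download_size"] / (1024 * 1024)
        lines.append(
            f"{name}: {total_items} items, {classes_text} classes, "
            f"{size_mib:.1f} MiB, home page: {url}"
        )
    return lines


for line in summarize(SAMPLE_REPORT):
    print(line)
```

Reading only the members that are needed keeps such a script working if new members are added to the report later.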
Usage:
datum describe-downloads [-h] [--report-format {text,json}]
[--report-file REPORT_FILE]
Parameters:
- -h, --help - Print the help message and exit.
- --report-format (text or json) - Format in which to report the information. By default, text is used.
- --report-file (string) - File to which to write the report. By default, the report is written to the standard output stream.
5.6 - Detect dataset format
This command attempts to detect the format of a dataset in a directory. Currently, only local directories are supported.
The detection result may be one of:
- a single format being detected;
- no formats being detected (if the dataset doesn’t match any known format);
- multiple formats being detected (if the dataset is ambiguous).
The command outputs this result in a human-readable form and optionally as a machine-readable JSON report (see --json-report).
The format of the machine-readable report is as follows:
{
"detected_formats": [
"detected-format-name-1", "detected-format-name-2", ...
],
"rejected_formats": {
"rejected-format-name-1": {
"reason": <reason-code>,
"message": "line 1\nline 2\n...\nline N"
},
"rejected-format-name-2": ...,
...
}
}
The <reason-code> can be one of:
- "detection_unsupported": the corresponding format does not support detection.
- "insufficient_confidence": the dataset matched the corresponding format, but it matched at least one other format better.
- "unmet_requirements": the dataset didn’t meet at least one requirement of the corresponding format.
Other reason codes may be defined in the future.
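A script that reads the machine-readable report might look like the sketch below. The report content, including the format names and messages, is invented for illustration, and classify and rejections_by_reason are hypothetical helper names. Unknown reason codes are kept as-is, since more codes may be defined later.

```python
import json
from collections import defaultdict

# Made-up report following the structure above; names and messages are examples.
SAMPLE_REPORT = """
{
  "detected_formats": ["voc"],
  "rejected_formats": {
    "coco": {"reason": "unmet_requirements", "message": "no annotation files"},
    "yolo": {"reason": "unmet_requirements", "message": "obj.data not found"},
    "custom": {"reason": "detection_unsupported", "message": ""}
  }
}
"""


def classify(report):
    """Map the report to one of the three possible detection outcomes."""
    detected = report["detected_formats"]
    if len(detected) == 1:
        return "single"
    return "none" if not detected else "ambiguous"


def rejections_by_reason(report):
    """Group rejected format names by reason code, keeping unknown codes."""
    grouped = defaultdict(list)
    for name, details in report["rejected_formats"].items():
        grouped[details["reason"]].append(name)
    return {reason: sorted(names) for reason, names in grouped.items()}


report = json.loads(SAMPLE_REPORT)
print(classify(report), report["detected_formats"])
print(rejections_by_reason(report))
```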
Usage:
datum detect-format [-h] [-p PROJECT_DIR] [--show-rejections]
[--json-report JSON_REPORT]
url
Parameters:
- <url> - Path to the dataset to analyse.
- -h, --help - Print the help message and exit.
- -p, --project (string) - Directory of the project to use as the context (default: current directory). The project might contain local plugins with custom formats, which will be used for detection.
- --show-rejections - Describe why each supported format that wasn’t detected was rejected. This only affects the human-readable output; the machine-readable report always includes rejection information.
- --json-report (string) - Path to which to save a JSON report describing detected and rejected formats. By default, no report is saved.
Example: detect the format of a dataset in a given directory, showing rejection information:
datum detect-format --show-rejections path/to/dataset
5.7 - Compare datasets
The command compares two datasets and saves the results in the specified directory. The current project is considered to be “ground truth”.
Datasets can be compared using different methods:
- equality - Annotations are compared for equality
- distance - A distance metric is used
This command has multiple forms:
1) datum diff <revpath>
2) datum diff <revpath> <revpath>
1 - Compares the current project’s main target (project) in the working tree with the specified dataset.
2 - Compares two specified datasets.
<revpath> - a dataset path or a revision path.
Usage:
datum diff [-h] [-o DST_DIR] [-m METHOD] [--overwrite] [-p PROJECT_DIR]
[--iou-thresh IOU_THRESH] [-f FORMAT]
[-iia IGNORE_ITEM_ATTR] [-ia IGNORE_ATTR] [-if IGNORE_FIELD]
[--match-images] [--all]
first_target [second_target]
Parameters:
- <target> (string) - Target dataset revpaths
- -m, --method (string) - Comparison method.
- -o, --output-dir (string) - Output directory. By default, a new directory is created in the current directory.
- --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- Distance comparison options:
  - --iou-thresh (number) - The IoU threshold for spatial annotations (default is 0.5).
  - -f, --format (string) - Output format, one of simple (text files and images) and tensorboard (a TB log directory)
- Equality comparison options:
  - -iia, --ignore-item-attr (string) - Ignore an item attribute (repeatable)
  - -ia, --ignore-attr (string) - Ignore an annotation attribute (repeatable)
  - -if, --ignore-field (string) - Ignore an annotation field (repeatable). Default is id and group
  - --match-images - Match dataset items by image pixels instead of ids
  - --all - Include matches in the output. By default, only differences are printed.
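The --iou-thresh option relies on intersection over union (IoU). As a minimal sketch of the usual definition for axis-aligned boxes, assuming boxes are given as (x, y, w, h) tuples, the following computes IoU and applies a threshold. The function name bbox_iou and the sample boxes are made up for illustration; this is not Datumaro's own matching code.

```python
def bbox_iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h) tuples."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


IOU_THRESH = 0.5  # the documented default of --iou-thresh

ground_truth = (10, 10, 100, 100)
prediction = (20, 20, 100, 100)
iou = bbox_iou(ground_truth, prediction)
print(f"IoU = {iou:.3f}, matched = {iou > IOU_THRESH}")
```

Raising the threshold makes matching stricter: fewer boxes are paired, and more are reported as unmatched.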
Examples:
- Compare two projects by distance, match boxes if IoU > 0.7, save results to TensorBoard:
datum diff other/project -o diff/ -f tensorboard --iou-thresh 0.7
- Compare two projects for equality, exclude annotation groups and the is_crowd attribute from comparison:
datum diff other/project/ -if group -ia is_crowd
- Compare two datasets, specify formats:
datum diff path/to/dataset1:voc path/to/dataset2:coco
- Compare the current working tree and a dataset:
datum diff path/to/dataset2:coco
- Compare a source from a previous revision and a dataset:
datum diff HEAD~2:source-2 path/to/dataset2:yolo
- Compare a dataset with model inference:
datum create
datum import <...>
datum model add mymodel <...>
datum transform <...> -o inference
datum diff inference -o diff
5.8 - Download datasets
This command downloads a publicly available dataset and saves it to a local
directory.
In terms of syntax, this command is similar to convert, but instead of taking a local directory as the source, it takes a dataset ID. A list of supported datasets and output formats can be found in the --help output of this command.
Currently, the only source of datasets is the TensorFlow Datasets library. Therefore, to use this command you must install TensorFlow & TFDS, which you can do as follows:
pip install datumaro[tf,tfds]
To use a proxy for downloading, configure it with the conventional curl environment variables.
Usage:
datum download [-h] -i DATASET_ID [-f OUTPUT_FORMAT] [-o DST_DIR]
[--overwrite] [-s SUBSET] [-- EXTRA_EXPORT_ARGS]
Parameters:
- -h, --help - Print the help message and exit.
- -i, --dataset-id (string) - ID of the dataset to download.
- -f, --output-format (string) - Output format. By default, the format of the original dataset is used.
- -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used.
- --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
- --subset (string) - Which subset of the dataset to save. By default, all subsets are saved. Note that due to limitations of TFDS, all subsets are downloaded even if this option is specified.
- -- <extra export args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.
Example: download the MNIST dataset, saving it in the ImageNet text format:
datum download -i tfds:mnist -f imagenet_txt -- --save-images
5.9 - Run model inference explanation (explain)
Runs an explainable AI algorithm for a model.
This tool is intended to help an AI developer debug a model and a dataset. It executes model inference and tries to find the relation between the inputs and outputs of the trained model, i.e. to determine decision boundaries and belief intervals for the classifier.
Currently, the only available algorithm is RISE (article). It runs the model once on the original image and then re-runs it multiple times on each image, masking a part of the input image each time, to produce a heatmap of activations for each output of the first inference. As a result, we obtain a number of heatmaps, which show how specific image pixels affected the inference result. This algorithm doesn’t require any special information about the model, but it requires the model to return all the outputs and confidences. The original algorithm supports only the classification scenario, but Datumaro extends it to detection models.
The following use cases are available:
- RISE for classification
- RISE for object detection
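The masking idea behind RISE can be sketched in a few lines. The example below is a simplified, pure-Python illustration under stated assumptions: toy_model is a stand-in for a real classifier, masks are upscaled with nearest-neighbor sampling rather than the smooth, randomly shifted masks of the original algorithm, and all function and parameter names are hypothetical. It is not the implementation used by Datumaro.

```python
import random


def toy_model(image):
    """Stand-in classifier: confidence is the mean brightness of the image center."""
    h, w = len(image), len(image[0])
    center = [image[y][x]
              for y in range(h // 4, 3 * h // 4)
              for x in range(w // 4, 3 * w // 4)]
    return sum(center) / len(center)


def random_mask(height, width, cells, prob, rng):
    """Random binary grid of cells x cells, upscaled to the image size."""
    grid = [[1.0 if rng.random() < prob else 0.0 for _ in range(cells)]
            for _ in range(cells)]
    return [[grid[y * cells // height][x * cells // width] for x in range(width)]
            for y in range(height)]


def rise_heatmap(image, model, samples=2000, cells=4, prob=0.5, seed=0):
    """Average the masks, weighted by the model score on each masked image."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    heatmap = [[0.0] * w for _ in range(h)]
    for _ in range(samples):
        mask = random_mask(h, w, cells, prob, rng)
        masked = [[image[y][x] * mask[y][x] for x in range(w)] for y in range(h)]
        score = model(masked)
        for y in range(h):
            for x in range(w):
                heatmap[y][x] += score * mask[y][x]
    norm = samples * prob  # expected number of times each pixel was visible
    return [[value / norm for value in row] for row in heatmap]


image = [[1.0] * 8 for _ in range(8)]  # uniform 8x8 "image"
heatmap = rise_heatmap(image, toy_model)
print(f"center: {heatmap[4][4]:.3f}, corner: {heatmap[0][0]:.3f}")
```

Pixels that the toy model actually looks at (the center) end up with higher heatmap values than pixels it ignores (the corners), which is the effect the saliency maps are meant to reveal.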
Usage:
datum explain [-h] -m MODEL [-o SAVE_DIR] [-p PROJECT_DIR]
[target] {rise} [RISE_ARGS]
Parameters:
- <target> (string) - Target dataset revpath. By default, uses the whole current project. An image path can be specified instead. <image path> - a path to the file. <revpath> - a dataset path or a revision path.
- <method> (string) - The algorithm to use. Currently, only rise is supported.
- -m, --model (string) - The model to use for inference
- -o, --output-dir (string) - Directory to save results to (default: display only)
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- RISE options:
  - -s, --max-samples (number) - Number of algorithm model runs per image (default: mask size ^ 2).
  - --mw, --mask-width (number) - Mask width in pixels (default: 7)
  - --mh, --mask-height (number) - Mask height in pixels (default: 7)
  - --prob (number) - Mask pixel inclusion probability, controls mask density (default: 0.5)
  - --iou, --iou-thresh (number) - IoU match threshold for detections (default: 0.9)
  - --nms, --nms-iou-thresh (number) - IoU match threshold for detections for non-maxima suppression (default: no NMS)
  - --conf, --det-conf-thresh (number) - Confidence threshold for detections (default: include all)
  - -b, --batch-size (number) - Batch size for inference (default: 1)
  - --display - Visualize results during computations
Examples:
- Run RISE on an image, display results:
datum explain path/to/image.jpg -m mymodel rise --max-samples 50
- Run RISE on a source revision:
datum explain HEAD~1:source-1 -m model rise
- Run inference explanation on a single image with online visualization:
datum create <...>
datum model add mymodel <...>
datum explain image.png -m mymodel \
    rise --max-samples 1000 --display
Note: this algorithm requires the model to return all (or a reasonable amount of) the outputs and confidences unfiltered, i.e. all the Label annotations for classification models and all the Bboxes for detection models. You can find examples of the expected model outputs in tests/test_RISE.py.
For OpenVINO models, the output processing script would look like this:
Classification scenario:
import datumaro as dm
from datumaro.util.annotation_util import softmax

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, logits, shape = (N, n_classes)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for output in outputs:
        # Convert the logits of one image to confidences
        confs = softmax(output)
        # Return a Label for every class, unfiltered, with its confidence
        image_results = [
            dm.Label(int(label), attributes={'score': float(conf)})
            for label, conf in enumerate(confs)
        ]
        results.append(image_results)
    return results
Object Detection scenario:
import datumaro as dm

# return a significant number of output boxes to make multiple runs
# statistically correct and meaningful
max_det = 1000

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, shape = (N, 1, K, 7)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for input, output in zip(inputs, outputs):
        # NOTE: shape[:2] is (H, W) only for height-first arrays;
        # use shape[-2:] if each input is channel-first (C, H, W)
        input_height, input_width = input.shape[:2]
        detections = output[0]
        image_results = []
        for det in detections:
            label = int(det[1])
            conf = float(det[2])
            # Box corners are relative; convert to absolute x, y, w, h
            x = max(int(det[3] * input_width), 0)
            y = max(int(det[4] * input_height), 0)
            w = min(int(det[5] * input_width - x), input_width)
            h = min(int(det[6] * input_height - y), input_height)
            image_results.append(dm.Bbox(x, y, w, h,
                label=label, attributes={'score': conf}))
        results.append(image_results[:max_det])
    return results
5.10 - Export Datasets
This command exports a project or a source as a dataset in some format.
Check supported formats for more info about format specifications, supported options and other details. The list of formats can be extended by custom plugins, check extending tips for information on this topic.
Available formats are listed in the command help output.
Dataset format writers support additional export options. To pass such options, use the -- separator after the main command arguments. The usage information can be printed with datum export -f <format> -- --help.
Common export options:
- Most formats (where applicable) support the --save-images option, which exports dataset images along with annotations. The option is disabled by default.
- If --save-images is used, the --image-ext option can be passed to specify the output image file extension (.jpg, .png etc.). By default, Datumaro tries to keep the original image extension. This option allows converting all the images from one format into another.
This command allows using the -e/--filter parameter to select the dataset elements needed for exporting. Read the filter command description for more info about this functionality.
The command can only be applied to a project build target, a stage or the combined project target, in which case all the targets will be affected.
Usage:
datum export [-h] [-e FILTER] [--filter-mode FILTER_MODE] [-o DST_DIR]
[--overwrite] [-p PROJECT_DIR] -f FORMAT [target] [-- EXTRA_FORMAT_ARGS]
Parameters:
- <target> (string) - A project build target to be exported. By default, all project targets are affected.
- -f, --format (string) - Output format.
- -e, --filter (string) - XML XPath filter expression for dataset items
- --filter-mode (string) - The filtering mode. Default is the i mode.
- -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used.
- --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- -- <extra format args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.
Example: save a project as a VOC-like dataset, include images, convert images to PNG from other formats.
datum export \
-p test_project \
-o test_project-export \
-f voc \
-- --save-images --image-ext='.png'
5.11 - Filter datasets
This command extracts a sub-dataset from a dataset. The new dataset includes only items satisfying some condition. XML XPath is used as the query format.
The command can be applied to a dataset or a project build target, a stage or the combined project target, in which case all the project targets will be affected. A build tree stage will be recorded if --stage is enabled, and the resulting dataset(-s) will be saved if --apply is enabled.
By default, datasets are updated in-place. The -o/--output-dir option can be used to specify another output directory. When updating in-place, use the --overwrite parameter (in-place updates fail by default to prevent data loss), unless a project target is modified.
The current project (-p/--project) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project’s working tree is used.
There are several filtering modes available (the -m/--mode parameter).
Supported modes:
- i, items
- a, annotations
- i+a, a+i, items+annotations, annotations+items
When filtering annotations, use the items+annotations mode to indicate that annotation-less dataset items should be removed, otherwise they will be kept in the resulting dataset. To select an annotation, write an XPath that returns annotation elements (see examples).
Item representations can be printed with the --dry-run parameter:
<item>
  <id>290768</id>
  <subset>minival2014</subset>
  <image>
    <width>612</width>
    <height>612</height>
    <depth>3</depth>
  </image>
  <annotation>
    <id>80154</id>
    <type>bbox</type>
    <label_id>39</label_id>
    <x>264.59</x>
    <y>150.25</y>
    <w>11.19</w>
    <h>42.31</h>
    <area>473.87</area>
  </annotation>
  <annotation>
    <id>669839</id>
    <type>bbox</type>
    <label_id>41</label_id>
    <x>163.58</x>
    <y>191.75</y>
    <w>76.98</w>
    <h>73.63</h>
    <area>5668.77</area>
  </annotation>
  ...
</item>
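To make the difference between the filtering modes more concrete, the sketch below applies a condition to a trimmed-down copy of the item above using Python's standard xml.etree.ElementTree. It only imitates the behavior described in this section: the predicate is written in Python as a stand-in for an XPath expression such as annotation[area > 1000], and filter_item and keep_annotation are hypothetical names, not Datumaro functions.

```python
import copy
import xml.etree.ElementTree as ET

ITEM_XML = """
<item>
  <id>290768</id>
  <subset>minival2014</subset>
  <annotation><id>80154</id><type>bbox</type><label_id>39</label_id><area>473.87</area></annotation>
  <annotation><id>669839</id><type>bbox</type><label_id>41</label_id><area>5668.77</area></annotation>
</item>
"""


def keep_annotation(ann):
    """Python stand-in for an XPath predicate such as annotation[area > 1000]."""
    return float(ann.findtext("area")) > 1000


def filter_item(item, mode):
    """Return the filtered item, or None if the item is dropped."""
    matches = [a for a in item.findall("annotation") if keep_annotation(a)]
    if mode == "items":
        # The whole item is kept or dropped; its annotations are left untouched.
        return item if matches else None
    result = copy.deepcopy(item)
    for ann in result.findall("annotation"):
        if not keep_annotation(ann):
            result.remove(ann)
    if mode == "items+annotations" and not result.findall("annotation"):
        return None  # annotation-less items are removed in this mode
    return result


item = ET.fromstring(ITEM_XML)
for mode in ("items", "annotations", "items+annotations"):
    filtered = filter_item(item, mode)
    ids = [a.findtext("id") for a in filtered.findall("annotation")]
    print(mode, ids)
```

In the items mode the matching item keeps both annotations, while the two annotation-aware modes keep only the large one.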
The command can only be applied to a project build target, a stage or the combined project target, in which case all the targets will be affected. A build tree stage will be added if --stage is enabled, and the resulting dataset(-s) will be saved if --apply is enabled.
Usage:
datum filter [-h] [-e FILTER] [-m MODE] [--dry-run] [--stage STAGE]
[--apply APPLY] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR] [target]
Parameters:
- <target> (string) - Target dataset revpath. By default, filters all targets of the current project.
- -e, --filter (string) - XML XPath filter expression for dataset items
- -m, --mode (string) - The filtering mode. Default is the i mode.
- --dry-run - Print XML representations of the filtered dataset and exit.
- --stage (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, allowing the resulting dataset to be reproduced later. Applicable only to main project targets (i.e. data sources and the project target, but not intermediate stages). Enabled by default.
- --apply (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
- -o, --output-dir (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the project target, but not intermediate stages) and dataset targets. If not specified, the results will be saved in place.
- --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example: extract a dataset with images with width < height
datum filter \
-p test_project \
-e '/item[image/width < image/height]'
Example: extract a dataset with images of the train subset
datum filter \
-p test_project \
-e '/item[subset="train"]'
Example: extract a dataset with only large annotations of the cat class and any non-persons
datum filter \
-p test_project \
--mode annotations \
-e '/item/annotation[(label="cat" and area > 99.5) or label!="person"]'
Example: extract a dataset with non-occluded annotations, remove empty images. Use data only from the “s1” source of the project.
datum create
datum import --format voc -i <path/to/dataset1/> --name s1
datum import --format voc -i <path/to/dataset2/> --name s2
datum filter s1 \
-m i+a -e '/item/annotation[occluded="False"]'
5.12 - Generate Datasets
Creates a synthetic dataset with elements of the specified type and shape, and saves it in the provided directory.
Currently, the command can only generate fractal images, which are useful for network compression. To create 3-channel images, provide the number of images, height and width. The images are colorized with a model, which is downloaded automatically. The command uses the algorithm from the article: https://arxiv.org/abs/2103.13023
Usage:
datum generate [-h] -o OUTPUT_DIR -k COUNT --shape SHAPE [SHAPE ...]
[-t {image}] [--overwrite] [--model-dir MODEL_PATH]
Parameters:
- -o, --output-dir (string) - Output directory
- -k, --count (integer) - Number of images to be generated
- --shape (integer, repeatable) - Dimensions of data to be generated (H, W)
- -t, --type (one of: image) - Specify the type of data to generate (default: image)
- --model-dir (path) - Path to load the colorization model from. If no model is found, the model will be downloaded (default: current dir)
- --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
- -h, --help - Print the help message and exit.
Examples:
Generate 300 3-channel fractal images with H=224, W=256 and store them in the images/ dir:
datum generate -o images/ --count 300 --shape 224 256
5.13 - Print dataset info
This command outputs high level dataset information such as sample count, categories and subsets.
Usage:
datum info [-h] [--json] [-p PROJECT_DIR] [revpath]
Parameters:
- <target> (string) - Target dataset revpath. By default, prints info about the joined project dataset.
- --json - Print output data in JSON format
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Examples:
-
Print info about a project dataset:
datum info -p test_project/
-
Print info about a COCO-like dataset:
datum info path/to/dataset:coco
Sample output:
format: voc
media type: image
length: 5
categories:
labels: background, aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair (and 12 more)
subsets:
trainval:
length: 5
JSON output format:
{
"format": string,
"media type": string,
"length": integer,
"categories": {
"count": integer,
"labels": [
{
"id": integer,
"name": string,
"parent": string,
"attributes": [ string, ... ]
},
...
]
},
"subsets": [
{
"name": string,
"length": integer
},
...
]
}
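A consumer of the --json output could read it along these lines. The document below is a made-up instance of the structure above, loosely modeled on the sample text output, and the consistency check assumes that every item belongs to exactly one subset.

```python
import json

# Made-up instance of the JSON structure above; values are illustrative only.
SAMPLE_INFO = """
{
  "format": "voc",
  "media type": "image",
  "length": 5,
  "categories": {
    "count": 2,
    "labels": [
      {"id": 0, "name": "background", "parent": "", "attributes": []},
      {"id": 1, "name": "aeroplane", "parent": "", "attributes": []}
    ]
  },
  "subsets": [
    {"name": "trainval", "length": 5}
  ]
}
"""

info = json.loads(SAMPLE_INFO)
label_names = [label["name"] for label in info["categories"]["labels"]]
subset_sizes = {subset["name"]: subset["length"] for subset in info["subsets"]}

# Assumption: each item is in exactly one subset, so the lengths add up.
assert sum(subset_sizes.values()) == info["length"]
print(info["format"], label_names, subset_sizes)
```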
5.14 - Log
This command prints the history of the current project revision.
Prints lines in the following format:
<short commit hash> <commit message>
Usage:
datum log [-h] [-n MAX_COUNT] [-p PROJECT_DIR]
Parameters:
- -n, --max-count (number, default: 10) - The maximum number of previous revisions in the output
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example output:
affbh33 Added COCO dataset
eeffa35 Added VOC dataset
5.15 - Merge Datasets
Consider the following task: there is a set of images (the original dataset) we want to annotate. Suppose we did this manually and/or automated it using models, and now we have a few sets of annotations for the same images. We want to merge them and produce a single set of high-precision annotations.
Another use case: there are a few datasets with different sets of images and labels, which we need to combine into a single dataset. If the labels were the same, we could just join the datasets. But in this case we need to merge labels and adjust the annotations in the resulting dataset.
In Datumaro, it can be done with the merge command. This command merges 2 or more datasets and checks annotations for errors.
In simple cases, when dataset images do not intersect and new labels are not added, the recommended way of merging is using the patch command. It will offer better performance and provide the same results.
Datasets are merged by items, and item annotations are merged by finding the
unique ones across datasets. Annotations are matched between matching dataset
items by distance. Spatial annotations are compared by the applicable distance
measure (IoU, OKS, PDJ etc.), labels and annotation attributes are selected
by voting. Each set of matching annotations produces a single annotation in
the resulting dataset. The score attribute (a number in the range [0; 1])
indicates the agreement between different sources in the produced annotation.
The working time of the function can be estimated as
O( (summary dataset length) * (dataset count) ^ 2 * (item annotations) ^ 2 )
This command also allows merging datasets with different or partially overlapping sets of labels (which is impossible with simple joining).
During the process, merge conflicts can appear. For example, dataset images having the same ids can mismatch, label voting can be unsuccessful if the quorum is not reached (the --quorum parameter), bboxes may be too close (the -iou parameter) etc. Found merge conflicts, missing items or annotations, and other errors are saved into an output .json file.
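Label voting with a quorum can be illustrated with a small sketch. It assumes one vote per source dataset for a cluster of matching annotations and computes the score as the share of sources agreeing with the winning label; both the helper name vote_label and this exact score formula are assumptions made for the example, not a description of Datumaro's internals.

```python
from collections import Counter


def vote_label(votes, quorum=0):
    """Pick the most frequent label among matched annotations.

    Returns (label, score), or (None, score) if the quorum is not reached.
    The score is the share of votes given to the winning label.
    """
    label, count = Counter(votes).most_common(1)[0]
    score = count / len(votes)
    if count < quorum:
        return None, score
    return label, score


# One cluster of matching annotations, one vote per source dataset.
cluster_votes = ["cat", "cat", "dog", "cat"]
print(vote_label(cluster_votes, quorum=3))
print(vote_label(["cat", "dog", "bird"], quorum=2))
```

In the second call no label collects two votes, so the sketch reports a failed vote, which corresponds to the kind of conflict that would be recorded in the output report.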
In Datumaro, annotations can be grouped. It can be useful to represent different parts of a single object, for example, different parts of a human body, parts of a vehicle etc. This command allows checking annotation groups for completeness with the -g/--groups option. If used, this parameter must specify a list of labels for annotations that must be in the same group. It can be particularly useful to check that separate keypoints are grouped and that all the necessary object components are in the same group.
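The group specification syntax, with ? marking optional labels, can be sketched as follows. The helpers parse_group_spec and check_group are hypothetical, and reporting labels that are not part of the specification is an assumption of this example rather than documented behavior.

```python
def parse_group_spec(spec):
    """Split 'person,hand?,head,foot?' into required and optional label sets."""
    required, optional = set(), set()
    for name in spec.split(","):
        name = name.strip()
        if name.endswith("?"):
            optional.add(name[:-1])
        else:
            required.add(name)
    return required, optional


def check_group(labels, spec):
    """Return (missing required labels, labels not expected in the group)."""
    required, optional = parse_group_spec(spec)
    present = set(labels)
    missing = sorted(required - present)
    unexpected = sorted(present - required - optional)
    return missing, unexpected


spec = "person,hand?,head,foot?"
print(check_group(["person", "head", "hand"], spec))
print(check_group(["person", "foot", "tail"], spec))
```

The first group is complete because only optional parts are absent; the second one lacks the required head label.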
This command has multiple forms:
1) datum merge <revpath>
2) datum merge <revpath> <revpath> ...
<revpath> - either a dataset path or a revision path.
1 - Merges the current project’s main target (“project”) in the working tree with the specified dataset.
2 - Merges the specified datasets. Note that the current project is not included in the list of merged sources automatically.
The command supports passing extra exporting options for the output dataset. The format can be specified with the -f/--format option. Extra options should be passed after the main arguments and after the -- separator. Particularly, this is useful to include images in the output dataset with --save-images.
Usage:
datum merge [-h] [-iou IOU_THRESH] [-oconf OUTPUT_CONF_THRESH]
[--quorum QUORUM] [-g GROUPS] [-o DST_DIR] [--overwrite]
[-p PROJECT_DIR] [-f FORMAT]
target [target ...] [-- EXTRA_FORMAT_ARGS]
Parameters:
- <target> (string) - Target dataset revpaths (repeatable)
- -iou, --iou-thresh (number) - IoU matching threshold for spatial annotations (both maximum inter-cluster and pairwise). Default is 0.25.
- --quorum (number) - Minimum count of votes for a label or attribute to be counted. Default is 0.
- -g, --groups (string) - A comma-separated list of label names in annotation groups to check. The ? postfix can be added to a label to make it optional in the group (repeatable)
- -oconf, --output-conf-thresh (number) - Confidence threshold for output annotations to be included in the resulting dataset. Default is 0.
- -o, --output-dir (string) - Output directory. By default, a new directory is created in the current directory.
- --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- -f, --format (string) - Output format. The default format is datumaro.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- -- <extra format args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.
Examples:
Merge 4 (partially-)intersecting projects,
- consider voting successful when there are no less than 3 same votes
- consider shapes intersecting when IoU >= 0.6
- check annotation groups to have person, hand, head and foot (? is used for optional parts)
datum merge project1/ project2/ project3/ project4/ \
--quorum 3 \
-iou 0.6 \
--groups 'person,hand?,head,foot?'
Merge images and annotations from 2 datasets in COCO format:
datum merge dataset1/:image_dir dataset2/:coco dataset3/:coco
Check groups of the merged dataset for consistency: look for groups consisting of person, hand, head, foot
datum merge project1/ project2/ -g 'person,hand?,head,foot?'
Merge two datasets, specify formats:
datum merge path/to/dataset1:voc path/to/dataset2:coco
Merge the current working tree and a dataset:
datum merge path/to/dataset2:coco
Merge a source from a previous revision and a dataset:
datum merge HEAD~2:source-2 path/to/dataset2:yolo
Merge datasets and save in different format:
datum merge -f voc dataset1/:yolo path2/:coco -- --save-images
5.16 - Models
Register model
Datumaro can execute deep learning models in various frameworks. Check the plugins section for more info.
Supported frameworks:
- OpenVINO
- Custom models via custom launchers
Models need to be added to the Datumaro project first. It can be done with the datum model add command.
Usage:
datum model add [-h] [-n NAME] -l LAUNCHER [--copy] [--no-check]
[-p PROJECT_DIR] [-- EXTRA_ARGS]
Parameters:
- -l, --launcher (string) - Model launcher name
- --copy - Copy model data into the project. By default, only the link is saved.
- --no-check - Don’t check that the model can be loaded
- -n, --name (string) - Name of the new model (default: generate automatically)
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- <extra args> - Additional arguments for the model launcher (use -- -h for help). Must be specified after the main command arguments.
Example: register an OpenVINO model
A model consists of a graph description and weights. There is also a script used to convert model outputs to internal data structures.
datum create
datum model add \
-n <model_name> -l openvino -- \
-d <path_to_xml> -w <path_to_bin> -i <path_to_interpretation_script>
Interpretation script for an OpenVINO detection model (convert.py):
You can find OpenVINO model interpreter samples in datumaro/plugins/openvino/samples (instruction).
import datumaro as dm

max_det = 10
conf_thresh = 0.1

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, shape = (N, 1, K, 7)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for input, output in zip(inputs, outputs):
        # NOTE: shape[:2] is (H, W) only for height-first arrays;
        # use shape[-2:] if each input is channel-first (C, H, W)
        input_height, input_width = input.shape[:2]
        detections = output[0]
        image_results = []
        for det in detections:
            label = int(det[1])
            conf = float(det[2])
            # Skip low-confidence detections
            if conf <= conf_thresh:
                continue
            x = max(int(det[3] * input_width), 0)
            y = max(int(det[4] * input_height), 0)
            w = min(int(det[5] * input_width - x), input_width)
            h = min(int(det[6] * input_height - y), input_height)
            image_results.append(dm.Bbox(x, y, w, h,
                label=label, attributes={'score': conf}))
        results.append(image_results[:max_det])
    return results

def get_categories():
    # Optionally, provide output categories - label map etc.
    # Example:
    label_categories = dm.LabelCategories()
    label_categories.add('person')
    label_categories.add('car')
    return { dm.AnnotationType.label: label_categories }
Remove Models
To remove a model from a project, use the datum model remove command.
Usage:
datum model remove [-h] [-p PROJECT_DIR] name
Parameters:
- <name> (string) - The name of the model to be removed
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example:
datum create
datum model add <...> -n model1
datum model remove model1
Run Model
This command applies a model to dataset images and produces a new dataset.
Usage:
datum model run
Parameters:
- <target> (string) - A project build target to be used. By default, uses the combined project target.
- -m, --model (string) - Model name
- -o, --output-dir (string) - Output directory. By default, results will be stored in an auto-generated directory in the current directory.
- --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example: launch inference on a dataset
datum create
datum import <...>
datum model add mymodel <...>
datum model run -m mymodel -o inference
5.17 - Patch Datasets
Updates items of the first dataset with items from the second one.
By default, datasets are updated in-place. The -o/--output-dir option can be used to specify another output directory. When updating in-place, use the --overwrite parameter along with the --save-images export option (in-place updates fail by default to prevent data loss).
Unlike the regular project data source joining, the datasets are not required to have the same labels. The labels from the “patch” dataset are projected onto the labels of the patched dataset, so only the annotations with the matching labels are used, i.e. all the annotations having unknown labels are ignored. Currently, this command doesn’t allow to update the label information in the patched dataset.
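The label projection described above can be pictured with a short sketch. It assumes labels are matched by name and represents annotations as plain dictionaries with a label_id field; the helper names and the sample data are hypothetical and do not reflect Datumaro's internal data structures.

```python
def project_labels(patch_labels, target_labels):
    """Map patch label indices to target label indices by label name."""
    target_index = {name: i for i, name in enumerate(target_labels)}
    return {
        i: target_index[name]
        for i, name in enumerate(patch_labels)
        if name in target_index
    }


def project_annotations(annotations, mapping):
    """Remap label ids and drop annotations whose label is unknown."""
    return [
        {**ann, "label_id": mapping[ann["label_id"]]}
        for ann in annotations
        if ann["label_id"] in mapping
    ]


target_labels = ["background", "cat", "dog"]
patch_labels = ["dog", "bird", "cat"]
patch_annotations = [
    {"type": "bbox", "label_id": 0},  # dog
    {"type": "bbox", "label_id": 1},  # bird: unknown in the target dataset
    {"type": "bbox", "label_id": 2},  # cat
]

mapping = project_labels(patch_labels, target_labels)
print(project_annotations(patch_annotations, mapping))
```

The bird annotation is dropped because the target dataset has no such label, and the target label list itself is left unchanged.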
The command supports passing extra exporting options for the output dataset. The extra options should be passed after the main arguments and after the -- separator. Particularly, this is useful to include images in the output dataset with --save-images.
This command can be applied to the current project targets or arbitrary datasets outside a project. Note that if the target dataset is read-only (e.g. if it is a project, stage or a cache entry), the output directory must be provided.
Usage:
datum patch [-h] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR]
target patch
[-- EXPORT_ARGS]
<revpath> - either a dataset path or a revision path.
The current project (-p/--project
) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project’s working tree is used.
Parameters:
- <target dataset> (string) - Target dataset revpath
- <patch dataset> (string) - Patch dataset revpath
- -o, --output-dir (string) - Output directory. By default, saves in-place
- --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- -- <export args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.
Examples:
- Update a VOC-like dataset with COCO-like annotations:
datum patch --overwrite dataset1/:voc dataset2/:coco -- --save-images
- Generate a patched dataset, based on a project:
datum patch -o patched_proj1/ proj1/ proj2/
- Update the “source1” source in the current project with a dataset:
datum patch -p proj/ --overwrite source1 path/to/dataset2:coco
- Generate a patched source from a previous revision and a dataset:
datum patch -o new_src2/ HEAD~2:source-2 path/to/dataset2:yolo
- Update a dataset in a custom format, described in a project plugin:
datum patch -p proj/ --overwrite dataset/:my_format dataset2/:coco
5.18 - Projects
Migrate project
Updates the project from an old version to the current one and saves the resulting project in the output directory. Projects cannot be updated in place.
The command tries to map the old source configuration to the new one.
This can fail in some cases, so the command will exit with an error,
unless -f/--force
is specified. With this flag, the command will
skip these errors and continue its work.
Usage:
datum project migrate [-h] -o DST_DIR [-f] [-p PROJECT_DIR] [--overwrite]
Parameters:
- -o, --output-dir (string) - Output directory for the updated project
- -f, --force - Ignore source import errors (default: False)
- --overwrite - Overwrite existing files in the save directory.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Examples:
- Migrate a project from v1 to v2, save the new project in another directory:
datum project migrate -o <output/dir>
Print project info
Prints project configuration info such as available plugins, registered models, imported sources and build tree.
Usage:
datum project info [-h] [-p PROJECT_DIR] [revision]
Parameters:
- <revision> (string) - Target project revision. By default, uses the working tree.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Examples:
- Print project info for the current working tree:
datum project info
- Print project info for the previous revision:
datum project info HEAD~1
Sample output:
Project:
  location: /test_proj
Plugins:
  extractors: ade20k2017, ade20k2020, camvid, cifar, cityscapes, coco, coco_captions, coco_image_info, coco_instances, coco_labels, coco_panoptic, coco_person_keypoints, coco_stuff, cvat, datumaro, icdar_text_localization, icdar_text_segmentation, icdar_word_recognition, image_dir, image_zip, imagenet, imagenet_txt, kitti, kitti_detection, kitti_raw, kitti_segmentation, label_me, lfw, market1501, mnist, mnist_csv, mot_seq, mots, mots_png, open_images, sly_pointcloud, tf_detection_api, vgg_face2, voc, voc_action, voc_classification, voc_detection, voc_layout, voc_segmentation, wider_face, yolo
  converters: camvid, mot_seq_gt, coco_captions, coco, coco_image_info, coco_instances, coco_labels, coco_panoptic, coco_person_keypoints, coco_stuff, kitti, kitti_detection, kitti_segmentation, icdar_text_localization, icdar_text_segmentation, icdar_word_recognition, lfw, datumaro, open_images, image_zip, cifar, yolo, voc_action, voc_classification, voc, voc_detection, voc_layout, voc_segmentation, tf_detection_api, label_me, mnist, cityscapes, mnist_csv, kitti_raw, wider_face, vgg_face2, sly_pointcloud, mots_png, image_dir, imagenet_txt, market1501, imagenet, cvat
  launchers:
Models:
Sources:
  'source-2':
    format: voc
    url: /datasets/pascal/VOC2012
    location: /test_proj/source-2/
    options: {}
    hash: 3eb282cdd7339d05b75bd932a1fd3201
    stages:
      'root':
        type: source
        hash: 3eb282cdd7339d05b75bd932a1fd3201
  'source-3':
    format: imagenet
    url: /datasets/imagenet/ILSVRC2012_img_val/train
    location: /test_proj/source-3/
    options: {}
    hash: e47804a3ec1a54c9b145e5f1007ec72f
    stages:
      'root':
        type: source
        hash: e47804a3ec1a54c9b145e5f1007ec72f
5.19 - Sources
These commands are specific for Data Sources. Read more about them here.
Import Dataset
Datasets can be added to a Datumaro project with the import
command,
which adds a dataset link into the project and downloads (or copies)
the dataset. If you need to add a dataset already copied into the project,
use the add
command.
Dataset format readers can provide some additional import options. To pass
such options, use the --
separator after the main command arguments.
The usage information can be printed with datum import -f <format> -- --help
.
The currently available formats are listed in the command help output.
A dataset is imported by its URL. Currently, only local filesystem
paths are supported. The URL can be a file or a directory path
to a dataset. When the dataset is read, it is read as a whole.
However, many formats can have multiple subsets like train
, val
, test
etc. If you want to limit reading only to a specific subset, use
the -r/--path
parameter. It can also be useful when subset files have
non-standard placement or names.
When a dataset is imported, the following things are done:
- URL is saved in the project config
- data is copied into the project
Each data source has a name assigned, which can be used in other commands. To
set a specific name, use the -n/--name
parameter.
The dataset is added into the working tree of the project. A new commit is not done automatically.
Usage:
datum import [-h] [-n NAME] -f FORMAT [-r PATH] [--no-check]
[-p PROJECT_DIR] url [-- EXTRA_FORMAT_ARGS]
Parameters:
- <url> (string) - A file or directory path to the dataset.
- -f, --format (string) - Dataset format
- -r, --path (string) - A path to the data source, relative to the source URL. Useful to specify a path to a subset, subtask, or a specific file in URL.
- --no-check - Don’t try to read the source after importing
- -n, --name (string) - Name of the new source (default: generate automatically)
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- -- <extra format args> - Additional arguments for the format reader (use -- -h for help). Must be specified after the main command arguments.
Example: create a project from images and annotations in different formats, export as TFRecord for TF Detection API for model training
# 'default' is the name of the subset below
datum create
datum import -f coco_instances -r annotations/instances_default.json path/to/coco
datum import -f cvat <path/to/cvat/default.xml>
datum import -f voc_detection -r custom_subset_dir/default.txt <path/to/voc>
datum import -f datumaro <path/to/datumaro/default.json>
datum import -f image_dir <path/to/images/dir>
datum export -f tf_detection_api -- --save-images
Add Dataset
Existing datasets can be added to a Datumaro project with the add
command.
The command adds a project-local directory as a data source in the project.
Unlike the import
command, it does not copy datasets and only works with local directories.
The source name is defined by the directory name.
Dataset format readers can provide some additional import options. To pass
such options, use the --
separator after the main command arguments.
The usage information can be printed with datum add -f <format> -- --help
.
The currently available formats are listed in the command help output.
A dataset is imported as a directory. When the dataset is read, it is read
as a whole. However, many formats can have multiple subsets like train
,
val
, test
etc. If you want to limit reading only to a specific subset,
use the -r/--path
parameter. It can also be useful when subset files have
non-standard placement or names.
The dataset is added into the working tree of the project. A new commit is not done automatically.
Usage:
datum add [-h] -f FORMAT [-r PATH] [--no-check]
[-p PROJECT_DIR] path [-- EXTRA_FORMAT_ARGS]
Parameters:
- <path> (string) - A file or directory path to the dataset.
- -f, --format (string) - Dataset format
- -r, --path (string) - A path to the data source, relative to the dataset path. Useful to specify a path to a subset, subtask, or a specific file in the dataset.
- --no-check - Don’t try to read the source after importing
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- -- <extra format args> - Additional arguments for the format reader (use -- -h for help). Must be specified after the main command arguments.
Example: create a project from images and annotations in different formats, export in YOLO for model training
datum create
datum add -f coco -r annotations/instances_train.json dataset1/
datum add -f cvat dataset2/train.xml
datum export -f yolo -- --save-images
Example: add an existing dataset into a project, avoid data copying
To add a dataset, we need to have it inside the project directory:
proj/
├─ .datumaro/
├─ .dvc/
├─ my_coco/
│  ├─ images/
│  │  ├─ image1.jpg
│  │  └─ ...
│  └─ annotations/
│     └─ coco_annotation.json
├─ .dvcignore
└─ .gitignore
datum create -o proj/
mv ~/my_coco/ proj/my_coco/ # move the dataset into the project directory
datum add -p proj/ -f coco proj/my_coco/
Remove Datasets
To remove a data source from a project, use the remove
command.
Usage:
datum remove [-h] [--force] [--keep-data] [-p PROJECT_DIR] name [name ...]
Parameters:
- <name> (string) - The name of the source to be removed (repeatable)
- -f, --force - Do not fail and stop on errors during removal
- --keep-data - Do not remove source data from the working directory, remove only project metainfo.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example:
datum create
datum import -f voc -n src1 <path/to/dataset/>
datum remove src1
5.20 - Get Project Statistics
This command computes various project statistics, such as:
- image mean and std. dev.
- class and attribute balance
- mask pixel balance
- segment area distribution
Usage:
datum stats [-h] [-p PROJECT_DIR] [target]
Parameters:
- <target> (string) - Target source revpath. By default, computes statistics of the merged dataset.
- -s, --subset (string) - Compute stats only for a specific subset
- --image-stats (bool) - Compute image mean and std (default: True)
- --ann-stats (bool) - Compute annotation statistics (default: True)
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example:
datum stats -p test_project
Sample output:
{
"annotations": {
"labels": {
"attributes": {
"gender": {
"count": 358,
"distribution": {
"female": [
149,
0.41620111731843573
],
"male": [
209,
0.5837988826815642
]
},
"values count": 2,
"values present": [
"female",
"male"
]
},
"view": {
"count": 340,
"distribution": {
"__undefined__": [
4,
0.011764705882352941
],
"front": [
54,
0.1588235294117647
],
"left": [
14,
0.041176470588235294
],
"rear": [
235,
0.6911764705882353
],
"right": [
33,
0.09705882352941177
]
},
"values count": 5,
"values present": [
"__undefined__",
"front",
"left",
"rear",
"right"
]
}
},
"count": 2038,
"distribution": {
"car": [
340,
0.16683022571148184
],
"cyclist": [
194,
0.09519136408243375
],
"head": [
354,
0.17369970559371933
],
"ignore": [
100,
0.04906771344455348
],
"left_hand": [
238,
0.11678115799803729
],
"person": [
358,
0.17566241413150147
],
"right_hand": [
77,
0.037782139352306184
],
"road_arrows": [
326,
0.15996074582924436
],
"traffic_sign": [
51,
0.025024533856722278
]
}
},
"segments": {
"area distribution": [
{
"count": 1318,
"max": 11425.1,
"min": 0.0,
"percent": 0.9627465303140978
},
{
"count": 1,
"max": 22850.2,
"min": 11425.1,
"percent": 0.0007304601899196494
},
{
"count": 0,
"max": 34275.3,
"min": 22850.2,
"percent": 0.0
},
{
"count": 0,
"max": 45700.4,
"min": 34275.3,
"percent": 0.0
},
{
"count": 0,
"max": 57125.5,
"min": 45700.4,
"percent": 0.0
},
{
"count": 0,
"max": 68550.6,
"min": 57125.5,
"percent": 0.0
},
{
"count": 0,
"max": 79975.7,
"min": 68550.6,
"percent": 0.0
},
{
"count": 0,
"max": 91400.8,
"min": 79975.7,
"percent": 0.0
},
{
"count": 0,
"max": 102825.90000000001,
"min": 91400.8,
"percent": 0.0
},
{
"count": 50,
"max": 114251.0,
"min": 102825.90000000001,
"percent": 0.036523009495982466
}
],
"avg. area": 5411.624543462382,
"pixel distribution": {
"car": [
13655,
0.0018431496518735067
],
"cyclist": [
939005,
0.12674674030446592
],
"head": [
0,
0.0
],
"ignore": [
5501200,
0.7425510702956085
],
"left_hand": [
0,
0.0
],
"person": [
954654,
0.12885903974805205
],
"right_hand": [
0,
0.0
],
"road_arrows": [
0,
0.0
],
"traffic_sign": [
0,
0.0
]
}
}
},
"annotations by type": {
"bbox": {
"count": 548
},
"caption": {
"count": 0
},
"label": {
"count": 0
},
"mask": {
"count": 0
},
"points": {
"count": 669
},
"polygon": {
"count": 821
},
"polyline": {
"count": 0
}
},
"annotations count": 2038,
"unannotated images": [
"img00051",
"img00052",
"img00053",
"img00054",
"img00055"
],
"unannotated images count": 5,
"dataset": {
"images count": 100,
"unique images count": 97,
"repeated images count": 3,
"repeated images": [
[["img00057", "default"], ["img00058", "default"]],
[["img00059", "default"], ["img00060", "default"]],
[["img00061", "default"], ["img00062", "default"]]
]
},
"subsets": {
"default": {
"images count": 100,
"image mean": [
107.06903686941979,
79.12831698580979,
52.95829558185416
],
"image std": [
49.40237673503467,
43.29600731496902,
35.47373007603151
]
}
}
}
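Each distribution entry in the sample output pairs an absolute count with its share of the total. A minimal Python sketch of that relationship, assuming the share is simply count / total (the helper name distribution is hypothetical and not part of Datumaro):

```python
def distribution(counts):
    """Map each name to [count, share of the total], as in the sample output."""
    total = sum(counts.values())
    return {name: [n, n / total if total else 0.0] for name, n in counts.items()}


# Label counts copied from the sample output above
label_counts = {
    "car": 340, "cyclist": 194, "head": 354, "ignore": 100, "left_hand": 238,
    "person": 358, "right_hand": 77, "road_arrows": 326, "traffic_sign": 51,
}

label_distribution = distribution(label_counts)
print(sum(label_counts.values()))   # matches "annotations count" in the sample
print(label_distribution["car"])
```

The same count-and-share pairing appears in the attribute and pixel distributions of the sample.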
5.21 - Status
This command prints the summary of the source changes between the working tree of a project and its HEAD revision.
Prints lines in the following format:
<status> <source name>
The list of possible status values:
- modified - the source data exists and it is changed
- foreign_modified - the source data exists and it is changed, but Datumaro does not know about the way the differences were made. If changes are committed, they will only be available for reproduction from the project cache.
- added - the source was added in the working tree
- removed - the source was removed from the working tree. This status won’t be reported if just the source data is removed in the working tree. In such situation the status will be missing.
- missing - the source data is removed from the working directory. The source still can be restored from the project cache or reproduced.
Usage:
datum status [-h] [-p PROJECT_DIR]
Parameters:
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
Example output:
added source-1
modified source-2
foreign_modified source-3
removed source-4
missing source-5
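If the status output needs to be consumed by a script, the `<status> <source name>` lines can be parsed with a few lines of Python. This is a sketch only: it assumes the two fields are separated by whitespace, and the helper name parse_status is hypothetical.

```python
KNOWN_STATUSES = {"modified", "foreign_modified", "added", "removed", "missing"}


def parse_status(text):
    """Parse '<status> <source name>' lines into a {source name: status} dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        status, name = line.split(maxsplit=1)
        if status not in KNOWN_STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        result[name] = status
    return result


# The example output shown above
sample = """\
added source-1
modified source-2
foreign_modified source-3
removed source-4
missing source-5
"""

statuses = parse_status(sample)
print(statuses)
```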
5.22 - Transform Dataset
Often datasets need to be modified during preparation for model training and
experimenting. In trivial cases it can be done manually - e.g. image renaming
or label renaming. However, in more complex cases even simple modifications
can require too much effort, distracting the user from the real work.
Datumaro provides the datum transform
command to help in such cases.
This command modifies dataset images or annotations all at once.
This command is designed for batch dataset processing, so if you only need to modify a few elements of a dataset, you might want to use other approaches for better performance. A possible solution can be a simple script, which uses the Datumaro API.
The command can be applied to a dataset or a project build target,
a stage or the combined project
target, in which case all the project
targets will be affected. A build tree stage will be recorded
if --stage
is enabled, and the resulting dataset(-s) will be
saved if --apply
is enabled.
By default, datasets are updated in-place. The -o/--output-dir
option can be used to specify another output directory. When
updating in-place, use the --overwrite
parameter (in-place
updates fail by default to prevent data loss), unless a project
target is modified.
The current project (-p/--project
) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project’s working tree is used.
Usage:
datum transform [-h] -t TRANSFORM [-o DST_DIR] [--overwrite]
[-p PROJECT_DIR] [--stage STAGE] [--apply APPLY] [target] [-- EXTRA_ARGS]
Parameters:
- <target> (string) - Target dataset revpath. By default, transforms all targets of the current project.
- -t, --transform (string) - Transform method name
- --stage (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, which allows reproducing the resulting dataset later. Applicable only to main project targets (i.e. data sources and the project target, but not intermediate stages). Enabled by default.
- --apply (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
- -o, --output-dir (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the project target, but not intermediate stages) and dataset targets. If not specified, the results will be saved in place.
- --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- <extra args> - The list of extra transformation parameters. Should be passed after the -- separator after the main command arguments. See transform descriptions for info about extra parameters. Use the --help option to print parameter info.
Examples:
- Split a VOC-like dataset randomly:
datum transform -t random_split --overwrite path/to/dataset:voc
- Rename images in a project data source by a regex from frame_XXX to XXX:
datum create <...>
datum import <...> -n source-1
datum transform -t rename source-1 -- -e '|^frame_||'
Built-in transforms
Basic dataset item manipulations:
- rename - Renames dataset items by regular expression
- id_from_image_name - Renames dataset items to their image filenames
- reindex - Renames dataset items with numbers
- ndr - Removes duplicated images from dataset
- relevancy_sampler - Leaves only the most important images (requires model inference results)
- random_sampler - Leaves no more than k items from the dataset randomly
- label_random_sampler - Leaves at least k images with annotations per class
- resize - Resizes images and annotations in the dataset
- remove_images - Removes specific images
- remove_annotations - Removes annotations
- remove_attributes - Removes attributes
Subset manipulations:
- random_split - Splits dataset into subsets randomly
- split - Splits dataset into subsets for classification, detection, segmentation or re-identification
- map_subsets - Renames and removes subsets
Annotation manipulations:
- remap_labels - Renames, adds or removes labels in dataset
- project_labels - Sets dataset labels to the requested sequence
- shapes_to_boxes - Replaces spatial annotations with bounding boxes
- boxes_to_masks - Converts bounding boxes to instance masks
- polygons_to_masks - Converts polygons to instance masks
- masks_to_polygons - Converts instance masks to polygons
- anns_to_labels - Replaces annotations having labels with label annotations
- merge_instance_segments - Merges grouped spatial annotations into a mask
- crop_covered_segments - Removes occluded segments of covered masks
- bbox_value_decrement - Subtracts 1 from bbox coordinates
rename
Renames items in the dataset. Supports regular expressions.
The first character in the expression is a delimiter for
the pattern and replacement parts. Replacement part can also
contain str.format
replacement fields with the item
(of type DatasetItem
) object available.
Usage:
rename [-h] [-e REGEX]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -e, --regex (string) - Regex for renaming in the form <sep><search><sep><replacement><sep>
Examples: Replace ‘pattern’ with ‘replacement’:
datum transform -t rename -- -e '|pattern|replacement|'
Remove the frame_
prefix from item ids:
datum transform -t rename -- -e '|^frame_||'
Collect images from subdirectories into the base image directory using regex:
datum transform -t rename -- -e '|^((.+[/\\])*)?(.+)$|\3|'
Add subset prefix to images:
datum transform -t rename -- -e '|(.*)|{item.subset}_\1|'
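To make the `<sep><search><sep><replacement><sep>` form more concrete, here is a minimal Python sketch of how such an expression could be interpreted. It is an illustration, not Datumaro's implementation: the Item class is a hypothetical stand-in for DatasetItem, and the sketch assumes the regex substitution runs first and the str.format fields are filled in afterwards.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for DatasetItem with only the fields used here."""
    id: str
    subset: str


def rename(expr, item):
    """Apply a '<sep><search><sep><replacement><sep>' expression to an item id."""
    sep = expr[0]  # the first character is the delimiter
    parts = expr.split(sep)
    if len(parts) != 4 or parts[0] or parts[3]:
        raise ValueError("expected <sep><search><sep><replacement><sep>")
    search, replacement = parts[1], parts[2]
    new_id = re.sub(search, replacement, item.id)
    # Fill str.format fields such as {item.subset} in the result
    return new_id.format(item=item)


print(rename("|^frame_||", Item("frame_001", "train")))
print(rename(r"|^(.+)$|{item.subset}_\1|", Item("img1", "train")))
```

Because the delimiter is taken from the expression itself, a character other than `|` can be chosen when the pattern needs to contain `|`. In this sketch, ids containing literal braces would need escaping before the str.format step.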
id_from_image_name
Renames items in the dataset using image file name (without extension).
Usage:
id_from_image_name [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
reindex
Replaces dataset item IDs with sequential indices.
Usage:
reindex [-h] [-s START]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -s, --start (int) - Start value for item ids (default: 1)
ndr
Removes near-duplicated images in a subset. Keeps at most -k/--num_cut resulting images.
Available oversampling policies (the -e parameter):
- random - sample from removed data randomly
- similarity - sample from removed data with ascending similarity score
Available undersampling policies (the -u parameter):
- uniform - sample data with uniform distribution
- inverse - sample data with reciprocal of the number of items with the same similarity
Usage:
ndr [-h] [-w WORKING_SUBSET] [-d DUPLICATED_SUBSET] [-a {gradient}]
[-k NUM_CUT] [-e {random,similarity}] [-u {uniform,inverse}] [-s SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -w, --working_subset (str) - Name of the subset to operate on (default: None)
- -d, --duplicated_subset (str) - Name of the subset for the removed data after NDR runs (default: duplicated)
- -a, --algorithm (one of: gradient) - Name of the algorithm to use (default: gradient)
- -k, --num_cut (int) - Maximum output dataset size
- -e, --over_sample (one of: random, similarity) - The policy to use when num_cut is bigger than result length (default: random)
- -u, --under_sample (one of: uniform, inverse) - The policy to use when num_cut is smaller than result length (default: uniform)
- -s, --seed (int) - Random seed
Example: apply NDR, return no more than 100 images
datum transform -t ndr -- \
  --working_subset train \
  --algorithm gradient \
  --num_cut 100 \
  --over_sample random \
  --under_sample uniform
relevancy_sampler
Sampler that analyzes model inference results on the dataset and picks the most relevant samples for training.
Creates a dataset from the -k/--count
hardest items for a model.
The whole dataset or a single subset will be split into the sampled
and unsampled
subsets based on the model confidence. The dataset
must contain model confidence values in the scores
attributes
of annotations.
There are five methods of sampling (the -m/--sampling_method option):
- topk - Return the k items with the highest uncertainty
- lowk - Return the k items with the lowest uncertainty
- randk - Return k random items
- mixk - Return one half using the topk method and the other half using the lowk method
- randtopk - Select 3*k items randomly, and return the topk among them
Notes:
- Each image’s inference result must contain the probability for all classes (in the scores attribute).
- Requesting a sample larger than the number of all images will return all images.
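The topk and lowk methods can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual implementation: uncertainty is taken to be the Shannon entropy of the per-class scores, items are plain (id, scores) tuples, and the helper names entropy and sample are hypothetical.

```python
import math


def entropy(scores):
    """Shannon entropy of a list of class probabilities (higher = less certain)."""
    return -sum(p * math.log(p) for p in scores if p > 0)


def sample(items, k, method="topk"):
    """Pick k items by uncertainty; items are (item_id, class_scores) tuples."""
    ranked = sorted(items, key=lambda item: entropy(item[1]), reverse=True)
    k = min(k, len(ranked))  # requesting more than available returns everything
    if method == "topk":
        return ranked[:k]
    if method == "lowk":
        return ranked[len(ranked) - k:][::-1]
    raise ValueError(f"unsupported method in this sketch: {method!r}")


# Hypothetical inference results for a two-class model
items = [
    ("a", [0.5, 0.5]),
    ("b", [0.9, 0.1]),
    ("c", [0.99, 0.01]),
    ("d", [0.6, 0.4]),
]

print([item_id for item_id, _ in sample(items, 2, "topk")])
print([item_id for item_id, _ in sample(items, 2, "lowk")])
```

Capping k at the dataset size reflects the note that an oversized request returns all images.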
Usage:
relevancy_sampler [-h] -k COUNT [-a {entropy}] [-i INPUT_SUBSET]
[-o SAMPLED_SUBSET] [-u UNSAMPLED_SUBSET]
[-m {topk,lowk,randk,mixk,randtopk}] [-d OUTPUT_FILE]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -k, --count (int) - Number of items to sample
- -a, --algorithm (one of: entropy) - Sampling algorithm (default: entropy)
- -i, --input_subset (str) - Subset name to select sample from (default: None)
- -o, --sampled_subset (str) - Subset name to put sampled data to (default: sample)
- -u, --unsampled_subset (str) - Subset name to put the rest data to (default: unsampled)
- -m, --sampling_method (one of: topk, lowk, randk, mixk, randtopk) - Sampling method (default: topk)
- -d, --output_file (path) - A .csv file path to dump sampling results
Examples:
Select the most relevant data subset of 20 images
based on model certainty, put the result into sample
subset
and put all the rest into unsampled
subset, use train
subset
as input. The dataset must contain model confidence values in the scores
attributes of annotations.
datum transform -t relevancy_sampler -- \
  --algorithm entropy \
  --input_subset train \
  --sampled_subset sample \
  --unsampled_subset unsampled \
  --sampling_method topk -k 20
random_sampler
Sampler that keeps no more than the required number of items in the dataset.
Notes:
- Items are selected uniformly (tries to keep original item distribution by subsets)
- Requesting a sample larger than the number of all images will return all images
Usage:
random_sampler [-h] -k COUNT [-s SUBSET] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -k, --count (int) - Maximum number of items to sample
- -s, --subset (str) - Limit changes to this subset (default: affect all dataset)
- --seed (int) - Initial value for random number generator
Examples: Select subset of 20 images randomly
datum transform -t random_sampler -- -k 20
Select subset of 20 images, modify only train
subset
datum transform -t random_sampler -- -k 20 -s train
label_random_sampler
Sampler that keeps at least the required number of annotations of each class in the dataset for each subset separately.
Consider using the “stats” command to get class distribution in the dataset.
Notes:
- Items can contain annotations of several selected classes (e.g. 3 bounding boxes per image). The number of annotations in the resulting dataset varies between max(class counts) and sum(class counts)
- If the input dataset does not have enough class annotations, the result will contain only what is available
- Items are selected uniformly
- For reasons above, the resulting class distribution in the dataset may not be the same as requested
- The resulting dataset will only keep annotations for classes with specified count > 0
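The notes above can be illustrated with a small Python sketch. It is not the actual algorithm: it assumes a simple greedy pass over shuffled items, represents each item as an (id, labels) tuple, and uses the hypothetical helper name sample_by_label.

```python
import random
from collections import Counter


def sample_by_label(items, default_count, label_counts=None, seed=None):
    """Greedily pick items until every class has the requested annotation count."""
    label_counts = dict(label_counts or {})
    order = list(items)
    random.Random(seed).shuffle(order)

    have = Counter()
    picked = []
    for item_id, labels in order:
        # Annotations of classes with a requested count of 0 are dropped
        wanted = [l for l in labels if label_counts.get(l, default_count) > 0]
        if any(have[l] < label_counts.get(l, default_count) for l in wanted):
            picked.append((item_id, wanted))
            have.update(wanted)
    return picked, have


# Hypothetical dataset: 5 cat items, 5 dog items, 3 car items
items = (
    [(f"cat{i}", ["cat"]) for i in range(5)]
    + [(f"dog{i}", ["dog"]) for i in range(5)]
    + [(f"car{i}", ["car"]) for i in range(3)]
)

picked, have = sample_by_label(items, default_count=2, label_counts={"car": 0}, seed=0)
print(dict(have))
```

With items carrying several labels, a class can end up with more annotations than requested, which is why the resulting distribution may differ from the request.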
Usage:
label_random_sampler [-h] -k COUNT [-l LABEL_COUNTS] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -k, --count (int) - Minimum number of annotations of each class
- -l, --label (str; repeatable) - Minimum number of annotations of a specific class. Overrides the -k/--count setting for the class. The format is <label_name>:<count>
- --seed (int) - Initial value for random number generator
Examples: Select a dataset with at least 10 images of each class:
datum transform -t label_random_sampler -- -k 10
Select a dataset with at least 20 cat
images, 5 dog
, 0 car
and 10 of each
unmentioned class:
# keep 20 images with cats and 5 images with dogs, remove car annotations,
# and keep 10 images for each of the remaining classes
datum transform -t label_random_sampler -- \
  -l cat:20 \
  -l dog:5 \
  -l car:0 \
  -k 10
resize
Resizes images and annotations in the dataset to the specified size. Supports upscaling, downscaling and mixed variants.
Usage:
resize [-h] [-dw WIDTH] [-dh HEIGHT]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -dw, --width (int) - Destination image width
- -dh, --height (int) - Destination image height
Examples: Resize all images to 256x256 size
datum transform -t resize -- -dw 256 -dh 256
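Resizing an image also means rescaling its annotations. The Python sketch below shows the idea for a bounding box, assuming boxes are stored as (x, y, width, height) in pixels and each axis is scaled independently; the helper name resize_bbox is hypothetical and not part of Datumaro.

```python
def resize_bbox(bbox, src_size, dst_size):
    """Scale an (x, y, w, h) box from src_size to dst_size, both (width, height)."""
    x, y, w, h = bbox
    scale_x = dst_size[0] / src_size[0]
    scale_y = dst_size[1] / src_size[1]
    return (x * scale_x, y * scale_y, w * scale_x, h * scale_y)


# A 512x256 image resized to 256x256: downscaled in width, unchanged in height
print(resize_bbox((100, 50, 200, 100), src_size=(512, 256), dst_size=(256, 256)))
```

Scaling the two axes separately is what allows the mixed case, where one dimension shrinks while the other grows or stays the same.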
remove_images
Removes specific dataset items by their ids.
Usage:
remove_images [-h] [--id IDs]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --id (str) - Item id to remove. Id is an ‘<id>:<subset>’ pair (repeatable)
Examples:
Remove specific images from the dataset
datum transform -t remove_images -- --id 'image1:train' --id 'image2:test'
remove_annotations
Removes annotations on specific dataset items.
Can be useful to clean the dataset from broken or unnecessary annotations.
Usage:
remove_annotations [-h] [--id IDs]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --id (str) - Item id to clean from annotations. Id is an ‘<id>:<subset>’ pair. If not specified, removes all annotations (repeatable)
Examples: Remove annotations from specific items in the dataset
datum transform -t remove_annotations -- --id 'image1:train' --id 'image2:test'
remove_attributes
Removes item and annotation attributes in a dataset.
Can be useful to clean the dataset from broken or unnecessary attributes.
Usage:
remove_attributes [-h] [--id IDs] [--attr ATTRIBUTE_NAME]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --id (str) - Item id to remove attributes from. Id is an ‘<id>:<subset>’ pair. If not specified, affects all items and annotations (repeatable)
- -a, --attr (str) - Attribute name to be removed. If not specified, removes all attributes (repeatable)
Examples:
Remove the is_crowd
attribute from dataset
datum transform -t remove_attributes -- \
--attr 'is_crowd'
Remove the occluded
attribute from annotations of
the 2010_001705
item in the train
subset
datum transform -t remove_attributes -- \
--id '2010_001705:train' --attr 'occluded'
random_split
Joins all subsets into one and splits the result into a few parts. It is expected that item ids are unique and subset ratios sum up to 1.
Usage:
random_split [-h] [-s SPLITS] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -s, --subset (str; repeatable) - Subsets in the form ‘<subset>:<ratio>’ (default: {train: 0.67, test: 0.33})
- --seed (int) - Random seed
Example:
Split a dataset randomly to train
and test
subsets, ratio is 2:1
datum transform -t random_split -- --subset train:.67 --subset test:.33
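The behavior of a ratio-based random split can be sketched in Python. This is an illustration only, with assumptions of its own: boundaries are computed by rounding the cumulative ratio, the last subset takes the remainder, and the helper name random_split is reused here purely for readability.

```python
import random


def random_split(item_ids, splits, seed=None):
    """Shuffle ids and cut them into named parts; splits is [(name, ratio), ...]."""
    total = sum(ratio for _, ratio in splits)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("subset ratios must sum up to 1")

    ids = list(item_ids)
    random.Random(seed).shuffle(ids)

    result, start, cumulative = {}, 0, 0.0
    for index, (name, ratio) in enumerate(splits):
        cumulative += ratio
        is_last = index == len(splits) - 1
        # The last subset takes whatever is left to avoid rounding gaps
        end = len(ids) if is_last else round(cumulative * len(ids))
        result[name] = ids[start:end]
        start = end
    return result


parts = random_split(range(100), [("train", 0.67), ("test", 0.33)], seed=42)
print({name: len(ids) for name, ids in parts.items()})
```

Fixing the seed makes the split reproducible, which is the role of the --seed option.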
split
Splits a dataset for model training, using task information:
- classification splits. Splits dataset into subsets (train/val/test) in class-wise manner. Splits dataset images in the specified ratio, keeping the initial class distribution.
- detection & segmentation splits. Each image can have multiple object annotations - bbox, mask, polygon. Since an image shouldn’t be included in multiple subsets at the same time, and image annotations shouldn’t be split, in general, dataset annotations are unlikely to be split exactly in the specified ratio. This split tries to split dataset images as close as possible to the specified ratio, keeping the initial class distribution.
- reidentification splits. In this task, the test set should consist of images of people or objects not seen during the training phase. This function splits a dataset in the following way:
  - Splits the dataset into train + val and test sets based on person or object ID.
  - Splits the test set into test-gallery and test-query sets in class-wise manner.
  - Splits the train + val set into train and val sets in the same way.
  The final subsets would be train, val, test-gallery and test-query.
Notes:
- Each image is expected to have only one Annotation. Unlabeled or multi-labeled images will be split into subsets randomly.
- If Labels also have attributes, also splits by attribute values.
- If there are not enough images in some class or attributes group, the split ratio can’t be guaranteed.
In reidentification task,
- Object ID can be described by Label, or by attribute (--attr parameter)
- The splits of the test set are controlled by the --query parameter. Gallery ratio would be 1.0 - query.
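The class-wise idea behind the classification split can be sketched in Python: items are grouped by label, and each group is split in the requested ratio, so every subset keeps roughly the original class distribution. This is a simplified illustration, not the actual implementation; it assumes one label per item and uses the hypothetical helper name split_by_class.

```python
import random
from collections import defaultdict


def split_by_class(labeled_ids, splits, seed=None):
    """Split (item_id, label) pairs so that each class follows the given ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in labeled_ids:
        by_label[label].append(item_id)

    result = {name: [] for name, _ in splits}
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        start, cumulative = 0, 0.0
        for index, (name, ratio) in enumerate(splits):
            cumulative += ratio
            is_last = index == len(splits) - 1
            end = len(ids) if is_last else round(cumulative * len(ids))
            result[name].extend(ids[start:end])
            start = end
    return result


# Hypothetical dataset: 10 cat images and 20 dog images
data = [(f"cat{i}", "cat") for i in range(10)] + [(f"dog{i}", "dog") for i in range(20)]
subsets = split_by_class(data, [("train", 0.5), ("val", 0.2), ("test", 0.3)], seed=1)
print({name: len(ids) for name, ids in subsets.items()})
```

With very small classes the rounding step cannot honor the ratio exactly, which matches the note that the split ratio can’t be guaranteed in that case.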
Usage:
split [-h] [-t {classification,detection,segmentation,reid}]
[-s SPLITS] [--query QUERY] [--attr ATTR_FOR_ID] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -t, --task (one of: classification, detection, segmentation, reid) - Dataset task (default: classification)
- -s, --subset (str; repeatable) - Subsets in the form ‘<subset>:<ratio>’ (default: {train: 0.5, val: 0.2, test: 0.3})
- --query (float) - Query ratio in the test set (default: 0.5)
- --attr (str) - Attribute name representing the ID (default: use label)
- --seed (int) - Random seed
Example:
datum transform -t split -- -t classification \
--subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- -t detection \
--subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- -t segmentation \
--subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- -t reid \
--subset train:.5 --subset val:.2 --subset test:.3 --query .5
Example: use the person_id attribute for splitting
datum transform -t split -- -t detection --attr person_id
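The class-wise idea behind the classification split can be illustrated with a minimal Python sketch. It is not Datumaro's implementation: `stratified_split` and its input format are hypothetical, and rounding is handled in the simplest possible way, by giving the remainder of each class to the last subset.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios, seed=None):
    """Assign each item index to a subset, splitting every class separately.

    labels: one class name per item (hypothetical input format).
    ratios: list of (subset_name, ratio) pairs whose ratios sum to 1.
    """
    rng = random.Random(seed)

    # Group item indices by class so that each class is split on its own.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    assignment = {}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        start = 0
        for position, (subset, ratio) in enumerate(ratios):
            if position == len(ratios) - 1:
                end = len(indices)  # the last subset takes the remainder
            else:
                end = start + round(len(indices) * ratio)
            for index in indices[start:end]:
                assignment[index] = subset
            start = end
    return assignment


labels = ["cat"] * 10 + ["dog"] * 20
ratios = [("train", 0.5), ("val", 0.2), ("test", 0.3)]
result = stratified_split(labels, ratios, seed=0)
```

Because every class is split separately, both `cat` and `dog` keep the 5:2:3 proportion across train, val and test.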
map_subsets
Renames subsets in the dataset.
Usage:
map_subsets [-h] [-s MAPPING]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -s, --subset (str; repeatable) - Subset mapping of the form: src:dst
remap_labels
Changes labels in the dataset.
A label can be:
- renamed (and joined with existing) - when --label <old_name>:<new_name> is specified
- deleted - when --label <name>: is specified, or default action is delete and the label is not mentioned in the list. When a label is deleted, all the associated annotations are removed
- kept unchanged - when --label <name>:<name> is specified, or default action is keep and the label is not mentioned in the list
Annotations with no label are managed by the default action policy.
Usage:
remap_labels [-h] [-l MAPPING] [--default {keep,delete}]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -l, --label (str; repeatable) - Label in the form of: <src>:<dst>
- --default (one of: keep, delete) - Action for unspecified labels (default: keep)
Examples:
Remove the person label (and corresponding annotations):
datum transform -t remap_labels -- -l person: --default keep
Rename person to pedestrian and human to pedestrian, join annotations that had different classes under the same class id for pedestrian, don’t touch other classes:
datum transform -t remap_labels -- \
-l person:pedestrian -l human:pedestrian --default keep
Rename person to car and cat to dog, keep bus, remove others:
datum transform -t remap_labels -- \
-l person:car -l bus:bus -l cat:dog --default delete
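The rename/delete/keep rules can be modelled with a short Python sketch. It works on plain `(label, payload)` pairs rather than Datumaro annotations, and `parse_mapping` and `remap_labels` are hypothetical helpers written for this illustration, not part of the Datumaro API.

```python
def parse_mapping(args):
    """Parse '<src>:<dst>' strings, as passed with -l/--label."""
    mapping = {}
    for arg in args:
        src, _, dst = arg.partition(":")
        mapping[src] = dst
    return mapping


def remap_labels(annotations, mapping, default="keep"):
    """Apply a label mapping to (label, payload) pairs.

    An empty destination name means "delete". Labels missing from the
    mapping follow the default action ("keep" or "delete").
    """
    if default not in ("keep", "delete"):
        raise ValueError("default must be 'keep' or 'delete'")

    remapped = []
    for label, payload in annotations:
        if label in mapping:
            new_label = mapping[label]
        elif default == "keep":
            new_label = label
        else:
            new_label = ""
        if new_label:  # an empty name drops the annotation
            remapped.append((new_label, payload))
    return remapped


annotations = [("person", 1), ("bus", 2), ("cat", 3), ("tree", 4)]
mapping = parse_mapping(["person:car", "bus:bus", "cat:dog"])
print(remap_labels(annotations, mapping, default="delete"))
# [('car', 1), ('bus', 2), ('dog', 3)]
```

The sample call mirrors the last example above: `tree` is not mentioned in the mapping, so the `delete` default removes it.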
project_labels
Changes the order of labels in the dataset from the existing to the desired one, removes unknown labels and adds new labels. Updates or removes the corresponding annotations.
Labels are matched by name (case-sensitive). Parent labels are only kept if they are present in the resulting set of labels. If new labels are added, and the dataset has mask colors defined, new labels will obtain generated colors.
Useful for merging similar datasets, whose labels need to be aligned.
Usage:
project_labels [-h] [-l DST_LABELS]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -l, --label (str; repeatable) - Label name (ordered)
Examples:
Set dataset labels to [person, cat, dog], remove others, add missing.
Original labels (for example): cat, dog, elephant, human.
New labels: person (added), cat (kept), dog (kept).
datum transform -t project_labels -- -l person -l cat -l dog
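The effect on label indices can be sketched as a simple index map. The helper below is hypothetical and only models name matching; it does not touch annotations, parents or mask colors.

```python
def project_labels(src_labels, dst_labels):
    """Map each old label index to its new index, or None if the label is removed."""
    dst_index = {name: index for index, name in enumerate(dst_labels)}
    # Matching is by exact, case-sensitive name.
    return {old: dst_index.get(name) for old, name in enumerate(src_labels)}


src = ["cat", "dog", "elephant", "human"]
dst = ["person", "cat", "dog"]

index_map = project_labels(src, dst)
print(index_map)  # {0: 1, 1: 2, 2: None, 3: None}

added = [name for name in dst if name not in src]
print(added)  # ['person']
```

Annotations whose old index maps to `None` correspond to the removed labels (`elephant`, `human`), and `person` is the only newly added label.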
shapes_to_boxes
Converts spatial annotations (masks, polygons, polylines, points) to enclosing bounding boxes.
Usage:
shapes_to_boxes [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
Example: Convert spatial annotations between each other
datum transform -t boxes_to_masks
datum transform -t masks_to_polygons
datum transform -t polygons_to_masks
datum transform -t shapes_to_boxes
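For intuition, an enclosing bounding box of a polygon or point set is just the minimum and maximum of its coordinates. The sketch below assumes a flat `[x0, y0, x1, y1, ...]` point list and an `(x, y, width, height)` box; both conventions and the `enclosing_box` name are assumptions of this example.

```python
def enclosing_box(points):
    """Return (x, y, width, height) enclosing a flat [x0, y0, x1, y1, ...] list."""
    if len(points) < 2 or len(points) % 2 != 0:
        raise ValueError("expected a non-empty flat list of x, y pairs")
    xs = points[0::2]
    ys = points[1::2]
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


print(enclosing_box([2, 1, 6, 3, 4, 7]))  # (2, 1, 4, 6)
```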
boxes_to_masks
Converts bounding boxes to masks.
Usage:
boxes_to_masks [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
polygons_to_masks
Converts polygons to masks.
Usage:
polygons_to_masks [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
masks_to_polygons
Converts masks to polygons.
Usage:
masks_to_polygons [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
anns_to_labels
Collects all labels from annotations (of all types) and transforms them into a set of annotations of type Label.
Usage:
anns_to_labels [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
merge_instance_segments
Replaces instance masks and, optionally, polygons with a single mask. A group of annotations with the same group id is considered an “instance”. The largest annotation in the group is considered the group “head”, so the resulting mask takes properties from that annotation.
Usage:
merge_instance_segments [-h] [--include-polygons]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --include-polygons (flag) - Include polygons
crop_covered_segments
Sorts polygons and masks (“segments”) according to z_order, crops covered areas of underlying segments. If a segment is split into several independent parts by the segments above, produces the corresponding number of separate annotations joined into a group.
Usage:
crop_covered_segments [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
bbox_values_decrement
Subtracts one from the coordinates of bounding boxes.
Usage:
bbox_values_decrement [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
5.23 - Utilities
Split video into frames
Splits a video into separate frames and saves them in a directory.
After splitting, the images can be added to a project using the import command and the image_dir format.
This command is useful for making a dataset from a video file. Reading a video directly during model training can produce different results if the system environment changes. Splitting the video into frames once and using the frames instead makes the dataset reproducible and stable.
This command provides options such as the frame step (the -s/--step option), the file name pattern (-n/--name-pattern), and the starting (-b/--start-frame) and finishing (-e/--end-frame) frames.
Note that this command is equivalent to the following commands:
datum create -o proj
datum import -p proj -f video_frames video.mp4 -- <params>
datum export -p proj -f image_dir -- <params>
Usage:
datum util split_video [-h] -i SRC_PATH [-o DST_DIR] [--overwrite]
[-n NAME_PATTERN] [-s STEP] [-b START_FRAME] [-e END_FRAME] [-x IMAGE_EXT]
Parameters:
- -i, --input-path (string) - Path to the video file
- -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used
- --overwrite - Allows overwriting existing files in the output directory, when it is not empty
- -n, --name-pattern (string) - Name pattern for the produced images (default: %06d)
- -s, --step (integer) - Frame step (default: 1)
- -b, --start-frame (integer) - Starting frame (default: 0)
- -e, --end-frame (integer) - Finishing frame (default: none)
- -x, --image-ext (string) - Output image extension (default: .jpg)
- -h, --help - Print the help message and exit
Example: split a video into frames, use every 30th frame:
datum util split_video -i video.mp4 -o video.mp4-frames --step 30
Example: split a video into frames, save as ‘frame_xxxxxx.png’ files:
datum util split_video -i video.mp4 --image-ext=.png --name-pattern='frame_%06d'
Example: split a video, add frames and annotations into dataset, export as YOLO:
datum util split_video -i video.avi -o video-frames
datum create -o proj
datum import -p proj -f image_dir video-frames
datum import -p proj -f coco_instances annotations.json
datum export -p proj -f yolo -- --save-images
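To see how the step, start/end frame, name pattern and extension options could interact, here is a small Python sketch that only computes file names. It assumes printf-style patterns applied to the frame index and an exclusive end frame; both are assumptions of this sketch, and `frame_names` is a hypothetical helper, not part of Datumaro.

```python
def frame_names(total_frames, step=1, start_frame=0, end_frame=None,
                name_pattern="%06d", image_ext=".jpg"):
    """List output file names for the selected frames.

    Assumes a printf-style pattern and an exclusive end frame.
    """
    stop = total_frames if end_frame is None else min(end_frame, total_frames)
    return [name_pattern % index + image_ext
            for index in range(start_frame, stop, step)]


print(frame_names(100, step=30, name_pattern="frame_%06d", image_ext=".png"))
# ['frame_000000.png', 'frame_000030.png', 'frame_000060.png', 'frame_000090.png']
```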
5.24 - Validate Dataset
This command inspects annotations with respect to the task type and stores the results in a JSON file.
The task types supported are classification, detection, and segmentation (the -t/--task-type parameter).
The validation result contains:
- annotation statistics based on the task type
- validation reports, such as
  - items not having annotations
  - items having undefined annotations
  - imbalanced distribution in class/attributes
  - too small or large values
- summary
Usage:
datum validate [-h] -t TASK [-s SUBSET_NAME] [-p PROJECT_DIR]
[target] [-- EXTRA_ARGS]
Parameters:
- <target> (string) - Target dataset revpath. By default, validates the current project.
- -t, --task-type (string) - Task type for validation
- -s, --subset (string) - Dataset subset to be validated
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- <extra args> - The list of extra validation parameters. Should be passed after the -- separator after the main command arguments:
  - -fs, --few-samples-thr (number) - The threshold for giving a warning about the minimum number of samples per class
  - -ir, --imbalance-ratio-thr (number) - The threshold for giving a data imbalance warning
  - -m, --far-from-mean-thr (number) - The threshold for giving a warning that data is far from the mean
  - -dr, --dominance-ratio-thr (number) - The threshold for giving a warning about bounding box imbalance
  - -k, --topk-bins (number) - The ratio of bins with the highest number of data to total bins in the histogram
Example: give a warning when the imbalance ratio of the data in a classification task exceeds 40
datum validate -p prj/ -t classification -- -ir 40
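As a rough illustration of what an imbalance ratio threshold means, the Python sketch below treats the ratio as the size of the largest class divided by the size of the smallest one. This definition is an assumption made for the example; the exact formula used by the validator is not stated in this section.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    return max(counts.values()) / min(counts.values())


labels = ["cat"] * 90 + ["dog"] * 2
ratio = imbalance_ratio(labels)
print(ratio)       # 45.0
print(ratio > 40)  # True
```

Under this assumed definition, a dataset with 90 `cat` and 2 `dog` samples would exceed a threshold of 40.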
Here is the list of validation items (a.k.a. anomaly types).
Anomaly Type | Description | Task Type |
---|---|---|
MissingLabelCategories | Metadata (ex. LabelCategories) should be defined | common |
MissingAnnotation | No annotation found for an Item | common |
MissingAttribute | An attribute key is missing for an Item | common |
MultiLabelAnnotations | Item needs a single label | classification |
UndefinedLabel | A label not defined in the metadata is found for an item | common |
UndefinedAttribute | An attribute not defined in the metadata is found for an item | common |
LabelDefinedButNotFound | A label is defined, but not actually found | common |
AttributeDefinedButNotFound | An attribute is defined, but not actually found | common |
OnlyOneLabel | The dataset consists of only one label | common |
OnlyOneAttributeValue | The dataset consists of only one attribute value | common |
FewSamplesInLabel | The number of samples in a label might be too low | common |
FewSamplesInAttribute | The number of samples in an attribute might be too low | common |
ImbalancedLabels | There is an imbalance in the label distribution | common |
ImbalancedAttribute | There is an imbalance in the attribute distribution | common |
ImbalancedDistInLabel | Values (ex. bbox width) are not evenly distributed for a label | detection, segmentation |
ImbalancedDistInAttribute | Values (ex. bbox width) are not evenly distributed for an attribute | detection, segmentation |
NegativeLength | The width or height of a bounding box is negative | detection |
InvalidValue | There’s an invalid (ex. inf, nan) value in the bounding box info | detection |
FarFromLabelMean | An annotation has a value that is too small or too large compared to the average for a label | detection, segmentation |
FarFromAttrMean | An annotation has a value that is too small or too large compared to the average for an attribute | detection, segmentation |
Validation Result Format:
{
'statistics': {
## common statistics
'label_distribution': {
'defined_labels': <dict>, # <label:str>: <count:int>
'undefined_labels': <dict>
# <label:str>: {
# 'count': <int>,
# 'items_with_undefined_label': [<item_key>, ]
# }
},
'attribute_distribution': {
'defined_attributes': <dict>,
# <label:str>: {
# <attribute:str>: {
# 'distribution': {<attr_value:str>: <count:int>, },
# 'items_missing_attribute': [<item_key>, ]
# }
# }
'undefined_attributes': <dict>
# <label:str>: {
# <attribute:str>: {
# 'distribution': {<attr_value:str>: <count:int>, },
# 'items_with_undefined_attr': [<item_key>, ]
# }
# }
},
'total_ann_count': <int>,
'items_missing_annotation': <list>, # [<item_key>, ]
## statistics for classification task
'items_with_multiple_labels': <list>, # [<item_key>, ]
## statistics for detection task
'items_with_invalid_value': <dict>,
# '<item_key>': {<ann_id:int>: [ <property:str>, ], }
# - properties: 'x', 'y', 'width', 'height',
# 'area(wxh)', 'ratio(w/h)', 'short', 'long'
# - 'short' is min(w,h) and 'long' is max(w,h).
'items_with_negative_length': <dict>,
# '<item_key>': { <ann_id:int>: { <'width'|'height'>: <value>, }, }
'bbox_distribution_in_label': <dict>, # <label:str>: <bbox_template>
'bbox_distribution_in_attribute': <dict>,
# <label:str>: {<attribute:str>: { <attr_value>: <bbox_template>, }, }
'bbox_distribution_in_dataset_item': <dict>,
# '<item_key>': <bbox count:int>
## statistics for segmentation task
'items_with_invalid_value': <dict>,
# '<item_key>': {<ann_id:int>: [ <property:str>, ], }
# - properties: 'area', 'width', 'height'
'mask_distribution_in_label': <dict>, # <label:str>: <mask_template>
'mask_distribution_in_attribute': <dict>,
# <label:str>: {
# <attribute:str>: { <attr_value>: <mask_template>, }
# }
'mask_distribution_in_dataset_item': <dict>,
# '<item_key>': <mask/polygon count: int>
},
'validation_reports': <list>, # [ <validation_error_format>, ]
# validation_error_format = {
# 'anomaly_type': <str>,
# 'description': <str>,
# 'severity': <str>, # 'warning' or 'error'
# 'item_id': <str>, # optional, when it is related to a DatasetItem
# 'subset': <str>, # optional, when it is related to a DatasetItem
# }
'summary': {
'errors': <count: int>,
'warnings': <count: int>
}
}
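Once a result in this format is loaded (for example with `json.load`), the reports can be aggregated with a few lines of Python. The sample dictionary below is hand-written to follow the format above and is not real tool output; `summarize_reports` is a hypothetical helper.

```python
from collections import Counter


def summarize_reports(result):
    """Count validation reports by severity and by anomaly type."""
    reports = result["validation_reports"]
    by_severity = Counter(report["severity"] for report in reports)
    by_type = Counter(report["anomaly_type"] for report in reports)
    return by_severity, by_type


# Hand-written sample following the format above; not real tool output.
result = {
    "statistics": {},
    "validation_reports": [
        {"anomaly_type": "MissingAnnotation", "description": "...",
         "severity": "warning", "item_id": "img_1", "subset": "train"},
        {"anomaly_type": "UndefinedLabel", "description": "...",
         "severity": "error", "item_id": "img_2", "subset": "train"},
        {"anomaly_type": "UndefinedLabel", "description": "...",
         "severity": "error", "item_id": "img_3", "subset": "val"},
    ],
    "summary": {"errors": 2, "warnings": 1},
}

by_severity, by_type = summarize_reports(result)
print(by_severity["error"], by_severity["warning"])  # 2 1
print(by_type.most_common(1))  # [('UndefinedLabel', 2)]
```

In this sample, the per-severity counts agree with the `summary` field.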
item_key is defined as:
item_key = (<DatasetItem.id:str>, <DatasetItem.subset:str>)
bbox_template and mask_template are defined as:
bbox_template = {
'width': <numerical_stat_template>,
'height': <numerical_stat_template>,
'area(wxh)': <numerical_stat_template>,
'ratio(w/h)': <numerical_stat_template>,
'short': <numerical_stat_template>, # short = min(w, h)
'long': <numerical_stat_template> # long = max(w, h)
}
mask_template = {
'area': <numerical_stat_template>,
'width': <numerical_stat_template>,
'height': <numerical_stat_template>
}
numerical_stat_template is defined as:
numerical_stat_template = {
'items_far_from_mean': <dict>,
# {'<item_key>': {<ann_id:int>: <value:float>, }, }
'mean': <float>,
'stddev': <float>,
'min': <float>,
'max': <float>,
'median': <float>,
'histogram': {
'bins': <list>, # [<float>, ]
'counts': <list>, # [<int>, ]
}
}
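The fields of numerical_stat_template can be illustrated with the standard library. The sketch below is an approximation under stated assumptions: it uses the population standard deviation, flags values further than `far_from_mean_thr` standard deviations from the mean, keys them by annotation index, and omits the histogram. None of these choices is confirmed by this section, and `numerical_stats` is a hypothetical helper.

```python
import statistics


def numerical_stats(values, far_from_mean_thr=2.0):
    """Summarize values in the spirit of numerical_stat_template (histogram omitted)."""
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)  # population stddev, an assumption
    far = {
        ann_id: value
        for ann_id, value in enumerate(values)
        if abs(value - mean) > far_from_mean_thr * stddev
    }
    return {
        "items_far_from_mean": far,
        "mean": mean,
        "stddev": stddev,
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }


widths = [10, 12, 11, 9, 10, 50]  # example bounding box widths
stats = numerical_stats(widths)
print(stats["items_far_from_mean"])  # {5: 50}
print(stats["median"])  # 10.5
```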
6 - Extending
There are a few ways to extend and customize Datumaro behavior, all of which are supported by plugins. Check our contribution guide for details on plugin implementation. In general, a plugin is a Python module. It must be put into a plugin directory:
- <project_dir>/.datumaro/plugins for project-specific plugins
- <datumaro_dir>/plugins for global plugins
Built-in plugins
Datumaro provides several built-in plugins. Plugins can have dependencies, which need to be installed separately.
TensorFlow
The plugin provides support for the TensorFlow Detection API format, which includes boxes and masks.
Dependencies
The plugin depends on TensorFlow, which can be installed with pip:
pip install tensorflow
or
pip install tensorflow-gpu
or
pip install datumaro[tf]
or
pip install datumaro[tf-gpu]
Accuracy Checker
This plugin lets you use Accuracy Checker to launch deep learning models from various frameworks (Caffe, MxNet, PyTorch, OpenVINO, …) through Accuracy Checker’s API.
Dependencies
The plugin depends on Accuracy Checker, which can be installed with pip:
pip install 'git+https://github.com/openvinotoolkit/open_model_zoo.git#subdirectory=tools/accuracy_checker'
To execute models with deep learning frameworks, they need to be installed too.
OpenVINO™
This plugin provides support for model inference with OpenVINO™.
Dependencies
The plugin depends on the OpenVINO™ Toolkit, which can be installed by following these instructions.
Dataset Formats
Dataset reading is supported by Extractors and Importers. An Extractor produces a list of dataset items corresponding to the dataset. An Importer creates a project from the data source location. It is possible to add custom Extractors and Importers. To do this, put the Extractor and Importer implementation scripts into a plugin directory.
Dataset writing is supported by Converters. A Converter produces a dataset of a specific format from dataset items. It is possible to add custom Converters. To do this, put a Converter implementation script into a plugin directory.
Dataset Conversions (“Transforms”)
A Transform is a function for altering a dataset and producing a new one. It can update dataset items, annotations, classes, and other properties. A list of available transforms for dataset conversions can be extended by adding a Transform implementation script into a plugin directory.
Model launchers
A list of available launchers for model execution can be extended by adding a Launcher implementation script into a plugin directory.
7 - Links
8 - How to control telemetry data collection
The OpenVINO™ telemetry library is used to collect basic information about Datumaro usage.
A short description of the information collected:
Event | Description |
---|---|
version | Datumaro version |
session start/end | Accessory event, there is no additional info here |
{cli_command}_result | Datumaro command result with arguments passed* |
error | Stack trace in case of exception* |
* All sensitive arguments, such as filesystem paths or names, are sanitized
To enable the collection of telemetry data, the ISIP consent file must exist and contain 1, otherwise telemetry will be disabled. The ISIP file can be created/modified by an OpenVINO installer or manually and used by other OpenVINO™ tools.
The location of the ISIP consent file depends on the OS:
- Windows: %localappdata%\Intel Corporation\isip,
- Linux, MacOS: $HOME/intel/isip.
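To check the current consent state from a script, something like the following Python sketch could be used. It only mirrors the rule described above (the file must exist and contain 1); ignoring surrounding whitespace is an assumption, and `isip_path` and `telemetry_enabled` are hypothetical helpers, not part of Datumaro or the telemetry library.

```python
import os
import sys
from pathlib import Path


def isip_path():
    """Return the consent file location described above for the current OS."""
    if sys.platform.startswith("win"):
        base = os.environ.get("LOCALAPPDATA", "")
        return Path(base) / "Intel Corporation" / "isip"
    return Path.home() / "intel" / "isip"


def telemetry_enabled(path):
    """True only if the file exists and contains 1 (whitespace ignored)."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        return False  # a missing or unreadable file means telemetry is disabled


print(telemetry_enabled(isip_path()))
```

The printed value depends on whether the consent file exists on the machine where the script runs.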