Dataset Management Framework Documentation

Welcome to the documentation for the Dataset Management Framework (Datumaro).

Datumaro is a free framework and CLI tool for building, transforming, and analyzing datasets. It is developed and used by Intel to work with annotations and datasets in a large number of supported formats.

Our documentation provides information for AI researchers, developers, and teams who work with datasets and annotations.

flowchart LR
    datasets[(VOC dataset<br/>+<br/>COCO dataset<br/>+<br/>CVAT annotation)]
    datumaro{Datumaro}
    dataset[dataset]
    annotation[Annotation tool]
    training[Model training]
    publication[Publication, statistics etc]
    datasets-->datumaro
    datumaro-->dataset
    dataset-->annotation & training & publication

Getting started

Basic information and sections needed for a quick start.

User Manual

This section contains documents for Datumaro users.

Developer Manual

Documentation for Datumaro developers.

1 - Getting started

To read about the design concept and features of Datumaro, go to the design section.

Installation

Dependencies

  • Python (3.7+)
  • Optional: OpenVINO, TensorFlow, PyTorch, MXNet, Caffe, Accuracy Checker, Git

Optionally, create a virtual environment:

python -m pip install virtualenv
python -m virtualenv venv
. venv/bin/activate

Install Datumaro package:

pip install datumaro[default]

Read the full installation instructions in the user manual.

Usage

There are several options available:

Standalone tool

As a standalone tool, Datumaro lets you perform various dataset operations from the command line:

datum --help
python -m datumaro --help

Python module

Datumaro can be used in custom scripts as a Python module. Used this way, it makes its features available from an existing codebase: dataset reading, exporting and iteration, simpler integration of custom formats, and high-performance operations:

import datumaro as dm

dataset = dm.Dataset.import_from('path/', 'voc')

# keep only annotated images
dataset.select(lambda item: len(item.annotations) != 0)

# change dataset labels and corresponding annotations
dataset.transform('remap_labels',
    mapping={
      'cat': 'dog', # rename cat to dog
      'truck': 'car', # rename truck to car
      'person': '', # remove this label
    },
    default='delete') # remove everything else

# iterate over the dataset elements
for item in dataset:
    print(item.id, item.annotations)

# export the resulting dataset in COCO format
dataset.export('dst/dir', 'coco', save_images=True)

See the list of components available for convenient importing.

Check our developer manual for additional information.

Examples

  • Convert a PASCAL VOC dataset to COCO format, keeping only images that contain the cat class:

    # Download VOC dataset:
    # http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
    datum convert --input-format voc --input-path <path/to/voc> \
                  --output-format coco \
                  --filter '/item[annotation/label="cat"]' \
                  -- --reindex 1 # avoid annotation id conflicts
    
  • Convert only non-occluded annotations from a CVAT project to TFRecord:

    # export Datumaro dataset in CVAT UI, extract somewhere, go to the project dir
    datum filter -e '/item/annotation[occluded="False"]' --mode items+anno
    datum export --format tf_detection_api -- --save-images
    
  • Annotate MS COCO dataset, extract image subset, re-annotate it in CVAT, update old dataset:

    # Download COCO dataset http://cocodataset.org/#download
    # Put images to coco/images/ and annotations to coco/annotations/
    datum create
    datum import --format coco <path/to/coco>
    datum export --filter '/image[images_I_dont_like]' --format cvat
    # import dataset and images to CVAT, re-annotate
    # export Datumaro project, extract to 'reannotation-upd'
    datum project update reannotation-upd
    datum export --format coco
    
  • Annotate instance polygons in CVAT, export as masks in COCO:

    datum convert --input-format cvat --input-path <path/to/cvat.xml> \
                  --output-format coco -- --segmentation-mode masks
    
  • Apply an OpenVINO detection model to some COCO-like dataset, then compare annotations with ground truth and visualize in TensorBoard:

    datum create
    datum import --format coco <path/to/coco>
    # create model results interpretation script
    datum model add -n mymodel openvino \
      --weights model.bin --description model.xml \
      --interpretation-script parse_results.py
    datum model run --model mymodel --output-dir mymodel_inference/
    datum diff mymodel_inference/ --format tensorboard --output-dir diff
    
  • Change colors in PASCAL VOC-like .png masks:

    datum create
    datum import --format voc <path/to/voc/dataset>
    
    # Create a color map file with desired colors:
    #
    # label : color_rgb : parts : actions
    # cat:0,0,255::
    # dog:255,0,0::
    #
    # Save as mycolormap.txt
    
    datum export --format voc_segmentation -- --label-map mycolormap.txt
    # add "--apply-colormap=0" to save grayscale (indexed) masks
    # check "--help" option for more info
    # use "datum --loglevel debug" for extra conversion info
    
  • Create a custom COCO-like dataset:

    import numpy as np
    import datumaro as dm
    
    dataset = dm.Dataset([
      dm.DatasetItem(id='image1', subset='train',
        image=np.ones((5, 5, 3)),
        annotations=[
          dm.Bbox(1, 2, 3, 4, label=0),
        ]
      ),
      # ...
    ], categories=['cat', 'dog'])
    dataset.export('test_dataset/', 'coco')
    

2 - Datumaro Design

Concept

Datumaro is:

  • a tool to build composite datasets and iterate over them
  • a tool to create and maintain datasets
    • Version control of annotations and images
    • Publication (with removal of sensitive information)
    • Editing
    • Joining and splitting
    • Exporting, format changing
    • Image preprocessing
  • a dataset storage
  • a tool to debug datasets
    • A network can be used to generate informative data subsets (e.g. with false-positives) to be analyzed further

Requirements

  • User interfaces
    • a library
    • a console tool with visualization means
  • Targets: single datasets, composite datasets, single images / videos
  • Built-in support for well-known annotation formats and datasets: CVAT, COCO, PASCAL VOC, Cityscapes, ImageNet
  • Extensibility with user-provided components
  • Lightweightness - it should be easy to start working with Datumaro
    • Minimal dependency on environment and configuration
    • It should be easier to use Datumaro than to write your own code for computing statistics or manipulating datasets

Functionality and ideas

  • Blur sensitive areas on dataset images
  • Dataset annotation filters, relabelling etc.
  • Dataset augmentation
  • Calculation of statistics:
    • Mean & std, custom stats
  • “Edit” command to modify annotations
  • Versioning (for images, annotations, subsets, sources etc., comparison)
  • Documentation generation
  • Provision of iterators for user code
  • Dataset downloading
  • Dataset generation
  • Dataset building (export in a specific format, indexation, statistics, documentation)
  • Dataset exporting to other formats
  • Dataset debugging (run inference, generate dataset slices, compute statistics)
  • “Explainable AI” - highlight network attention areas (paper)
    • Black-box approach
      • Classification, Detection, Segmentation, Captioning
    • White-box approach

Research topics

  • exploration of network prediction uncertainty (aka Bayesian approach). Use case: explanation of network “quality”, “stability”, “certainty”
  • adversarial attacks on networks
  • dataset minification / reduction. Use case: removal of redundant information to reach the same network quality with less training time
  • dataset expansion and filtration of additions. Use case: add only important data
  • guidance for key frame selection for tracking (paper). Use case: more effective annotation, better predictions

RC 1 vision

CVAT integration

Datumaro needs to be integrated with CVAT, extending CVAT UI capabilities regarding task and project operations. It should be capable of downloading and processing data from CVAT.

        User
          |
          v
 +------------------+
 |       CVAT       |
 +--------v---------+       +------------------+       +--------------+
 | Datumaro module  | ----> | Datumaro project | <---> | Datumaro CLI | <--- User
 +------------------+       +------------------+       +--------------+

Interfaces

  • Python API for user code
    • Installation as a package
    • Installation with pip by name
  • A command-line tool for dataset manipulations

Features

  • Dataset format support (reading, writing)

    • Own format
    • CVAT
    • COCO
    • PASCAL VOC
    • YOLO
    • TF Detection API
    • Cityscapes
    • ImageNet
  • Dataset visualization (show)

    • Ability to visualize a dataset
      • with TensorBoard
  • Calculation of statistics for datasets

    • Pixel mean, std
    • Object counts (detection scenario)
    • Image-Class distribution (classification scenario)
    • Pixel-Class distribution (segmentation scenario)
    • Image similarity clusters
    • Custom statistics
  • Dataset building

    • Composite dataset building
    • Class remapping
    • Subset splitting
    • Dataset filtering (extract)
    • Dataset merging (merge)
    • Dataset item editing (edit)
  • Dataset comparison (diff)

    • Annotation-annotation comparison
    • Annotation-inference comparison
    • Annotation quality estimation (for CVAT)
      • Provide a simple method to check annotation quality with a model and generate summary
  • Dataset and model debugging

    • Inference explanation (explain)
    • Black-box approach (RISE paper)
    • Ability to run a model on a dataset and read the results
  • CVAT-integration features

    • Task export
      • Datumaro project export
      • Dataset export
      • Original raw data (images, a video file) can be downloaded (exported) together with annotations, or referenced by links to the CVAT server (in the future, S3 etc. may be supported)
        • Be able to use local files instead of remote links
          • Specify cache directory
    • Use case “annotate for model training”
      • create a task
      • annotate
      • export the task
      • convert to a training format
      • train a DL model
    • Use case “annotate - reannotate problematic images - merge”
    • Use case “annotate and estimate quality”
      • create a task
      • annotate
      • estimate quality of annotations

Optional features

  • Dataset publishing

    • Versioning (for annotations, subsets, sources, etc.)
    • Blur sensitive areas on images
    • Tracking of legal information
    • Documentation generation
  • Dataset building

    • Dataset minification / Extraction of the most representative subset
      • Use case: generate low-precision calibration dataset
  • Dataset and model debugging

    • Training visualization
    • Inference explanation (explain)
      • White-box approach

Properties

  • Lightweightness
  • Modularity
  • Extensibility

3.1 - Installation

Dependencies

  • Python (3.7+)
  • Optional: OpenVINO, TensorFlow, PyTorch, MXNet, Caffe, Accuracy Checker, Git

Installation steps

Optionally, set up a virtual environment:

python -m pip install virtualenv
python -m virtualenv venv
. venv/bin/activate

Install:

# From PyPI:
pip install datumaro[default]
# From the GitHub repository:
pip install 'git+https://github.com/cvat-ai/datumaro[default]'

Read more about choosing between datumaro and datumaro[default] in the Customizing installation section.

Plugins

Datumaro has many plugins, which are responsible for dataset formats, model launchers and other optional components. Plugins that have their own dependencies may require additional installation steps. You can find the list of all the plugin dependencies in the plugins section.

Customizing installation

  • Datumaro has the following installation options:

    • pip install datumaro - for core library functionality
    • pip install datumaro[default] - for normal CLI experience

    In restricted installation environments, where some dependencies are not available, or if you need only the core library functionality, you can install Datumaro without extra plugins.

    The CLI variant (datumaro[default]) requires Git to be installed and available to work with Datumaro projects and dataset versioning features. You can find installation instructions for your platform here.

    In some cases, installing just the core library may not be enough, because the options for installing graphical libraries can be limited on the system (various Docker environments, servers etc.). You can select between opencv-python and opencv-python-headless by setting the DATUMARO_HEADLESS environment variable to 0 or 1 before installing the package. This requires installation from sources (using --no-binary):

    DATUMARO_HEADLESS=1 pip install datumaro --no-binary=datumaro
    

    This option can’t be covered by extras due to Python packaging system limitations.

  • When installing directly from the repository, you can change the installation branch with ...@<branch_name>. Use the --force-reinstall parameter as well in this case. This can be useful for testing unreleased versions from GitHub pull requests.

3.2 - How to use Datumaro

As a standalone tool or a Python module:

datum --help

python -m datumaro --help
python datumaro/ --help
python datum.py --help

As a Python library:

import datumaro as dm
...
dataset = dm.Dataset.import_from(path, format)
...

Glossary

  • Basic concepts:

    • Dataset - A collection of dataset items, which consist of media and associated annotations.
    • Dataset item - A basic single element of the dataset. Also known as “sample” or “entry”. In different datasets it can be an image, a video frame, a whole video, a 3D point cloud etc. Typically, it has corresponding annotations.
    • (Datumaro) Project - A combination of multiple datasets, plugins, models and metadata.
  • Project versioning concepts:

    • Data source - A link to a dataset or a copy of a dataset inside a project. Basically, a URL + dataset format name.
    • Project revision - A commit or a reference from Git (branch, tag, HEAD~3 etc.). A revision is referenced by data hash. The HEAD revision is the currently selected revision of the project.
    • Revision tree - A project build tree and plugins at a specified revision.
    • Working tree - The revision tree in the working directory of a project.
    • Data source revision - A state of a data source at a specific stage. A revision is referenced by the data hash.
    • Object - The data of a revision tree or a data source revision. An object is referenced by the data hash.
  • Dataset path concepts:

    • Dataset revpath - A path to a dataset in a special format. Revpaths are intended to specify paths to files, directories or data source revisions in a uniform way in the CLI.

      • Dataset path - A path to a dataset in the following format: <dataset path>:<format>

        • format is optional. If it is not specified, Datumaro will try to detect it automatically.
      • Revision path - A path to a data source revision in a project. The syntax is: <project path>@<revision>:<target name>, where any part can be omitted.

        • The default project is the current project (the -p/--project CLI argument). Local revpaths imply that the current project is used, so this part should be omitted.
        • The default revision is the working tree of the project.
        • The default build target is project.

        If a path refers to the project target (i.e. the target name is not set, or the project target is specified explicitly), the target dataset is the result of joining all the project data sources. Otherwise, if the path refers to a data source revision, the corresponding stage from the revision build tree will be used.

  • Dataset building concepts:

    • Stage - A revision of a dataset: the original dataset or its modification after a transformation, filtration or another operation. A build tree node. A stage is referred to by name.
    • Build tree - A directed graph (tree) with root nodes at data sources and a single top node called project, which represents a joined dataset. Each data source has a starting root node, which corresponds to the original dataset. The internal graph nodes are stages.
    • Build target - A data source or a stage name. Data source names correspond to the last stages of data sources.
    • Pipeline - A subgraph of a stage, which includes all the ancestors.
  • Other:

    • Transform - A transformation operation over dataset elements. Examples are image renaming, image flipping, image and subset renaming, label remapping etc. Corresponds to the transform command.
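The revision path syntax from the glossary can be illustrated with a small parser. This is a minimal sketch, not Datumaro's actual parser: `RevPath` and `parse_revpath` are hypothetical names, and the sketch assumes that the project path and the revision contain neither `@` nor `:` (so Windows drive letters are not handled). It also assumes that a bare name after `@` is a revision, and that a bare name without `@` is a build target.

```python
from typing import NamedTuple, Optional


class RevPath(NamedTuple):
    """Parts of '<project path>@<revision>:<target name>' (hypothetical type)."""
    project: Optional[str]   # None means the current project
    revision: Optional[str]  # None means the working tree
    target: str              # defaults to the joined 'project' target


def parse_revpath(text: str) -> RevPath:
    """Split a revision path into its parts; any part may be omitted."""
    project, at, rest = text.partition("@")
    if not at:
        # No '@': the string is '<revision>:<target>' or just '<target>'
        project, rest = "", text
    revision, colon, target = rest.partition(":")
    if not colon:
        # Assumption: a bare name after '@' is a revision, otherwise a target
        revision, target = (rest, "") if at else ("", rest)
    return RevPath(project or None, revision or None, target or "project")


print(parse_revpath("source1"))
print(parse_revpath("HEAD~1:source1"))
print(parse_revpath("proj/dir@v1:source1.stage3"))
```

Omitted parts fall back to the defaults listed in the glossary: the current project, the working tree, and the project build target.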

Command-line workflow

In Datumaro, most command-line commands operate on projects, but there are also a few commands that operate on datasets directly. There are 2 basic ways to use Datumaro from the command line:

  • Use the convert, diff, merge commands directly on existing datasets

  • Create a Datumaro project and operate on it:

Basically, a project is a combination of datasets, models and environment.

A project can contain an arbitrary number of datasets (data sources). A project acts as a manager for them and lets you manipulate them separately or as a whole, in which case it combines dataset items from all the sources into one composite dataset. You can manage separate datasets in a project with the commands in the datum source command-line context.

Note that modifying operations (transform, filter, patch) are applied in-place to the datasets by default.

If you want to interact with models, you need to add them to the project first using the model add command.

A typical way to obtain Datumaro projects is to export tasks in CVAT UI.

Project data model

project model

Datumaro tries to combine a “Git for datasets” and a build system like make or CMake for datasets in a single solution. Currently, Project represents a Version Control System for datasets, which is based on Git and DVC projects. Each project Revision describes a build tree of a dataset with all the related metadata. A build tree consists of a number of data sources and transformation stages. Each data source has its own set of build steps (stages). By default, Datumaro copies datasets into the project and modifies them in place. Modifying operations are recorded in the project, so any of the dataset revisions can be reproduced when needed. Multiple dataset versions can be stored in different branches with the common data shared.

Let’s consider an example of a build tree:

build tree

There are 2 data sources in the example project. The resulting dataset is obtained by simply merging (joining) the results of the input datasets. “Source 1” and “Source 2” are the names of data sources in the project. Each source has several stages with their own names. The first stage (called “root”) represents the original contents of a data source - the data at the user-provided URL. The following stages represent operations that need to be performed on the data source to prepare the resulting dataset.

Roughly, such a build tree can be created by the following commands (arguments are omitted for simplicity):

datum create

# describe the first source
datum import <...> -n source1
datum filter <...> source1
datum transform <...> source1
datum transform <...> source1

# describe the second source
datum import <...> -n source2
datum model add <...>
datum transform <...> source2
datum transform <...> source2

Now, the resulting dataset can be built with:

datum export <...>
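To make the build tree and pipeline terms concrete, here is a minimal sketch that models the example as a plain dictionary mapping each stage to the stages it is built from. The stage names and the `pipeline` helper are illustrative assumptions, not Datumaro's internal data structures.

```python
# Each node lists the stages it is built from (its parents).
# Stage names such as 'source1.filter' are made up for illustration.
build_tree = {
    "source1.root": [],
    "source1.filter": ["source1.root"],
    "source1.transform1": ["source1.filter"],
    "source1.transform2": ["source1.transform1"],
    "source2.root": [],
    "source2.transform1": ["source2.root"],
    "source2.transform2": ["source2.transform1"],
    # The top node joins the last stages of all the data sources
    "project": ["source1.transform2", "source2.transform2"],
}


def pipeline(tree, stage):
    """Return the stage and all its ancestors, with ancestors listed first."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for parent in tree[name]:
            visit(parent)
        ordered.append(name)

    visit(stage)
    return ordered


print(pipeline(build_tree, "source2.transform2"))
# ['source2.root', 'source2.transform1', 'source2.transform2']
```

Requesting the pipeline of the project node returns every stage of both sources, which corresponds to building the joined dataset.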

Project layout

project/
├── .dvc/
├── .dvcignore
├── .git/
├── .gitignore
├── .datumaro/
│   ├── cache/ # object cache
│   │   └── <2 leading symbols of obj hash>/
│   │       └── <remaining symbols of obj hash>/
│   │           └── <object data>
│   │
│   ├── models/ # project-specific models
│   │
│   ├── plugins/ # project-specific plugins
│   │   ├── plugin1/ # composite plugin, a directory
│   │   │   ├── __init__.py
│   │   │   └── file2.py
│   │   ├── plugin2.py # simple plugin, a file
│   │   └── ...
│   │
│   ├── tmp/ # temp files
│   └── tree/ # working tree metadata
│       ├── config.yml
│       └── sources/
│           ├── <source name 1>.dvc
│           ├── <source name 2>.dvc
│           └── ...
│
├── <source name 1>/ # working directory for the source 1
│   └── <source data>
└── <source name 2>/ # working directory for the source 2
    └── <source data>
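The cache part of the layout above splits each object hash into a 2-character directory and a directory named after the remaining characters. A minimal sketch of that mapping follows; `cache_object_dir` is a hypothetical helper, and the example hash is an arbitrary string, since the hashing scheme itself is not described here.

```python
from pathlib import PurePosixPath


def cache_object_dir(project_dir: str, obj_hash: str) -> PurePosixPath:
    """Map an object hash to '.datumaro/cache/<2 leading symbols>/<remaining symbols>'."""
    if len(obj_hash) < 3:
        raise ValueError("hash is too short to split")
    return PurePosixPath(
        project_dir, ".datumaro", "cache", obj_hash[:2], obj_hash[2:]
    )


print(cache_object_dir("project", "0a1b2c3d4e5f"))
# project/.datumaro/cache/0a/1b2c3d4e5f
```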

Datasets and Data Sources

A project can contain an arbitrary number of Data Sources. Each Data Source describes a dataset in a specific format. A project acts as a manager for the data sources and lets you manipulate them separately or as a whole, in which case it combines dataset items from all the sources into one composite dataset. You can manage separate sources in a project with the commands in the datum source command-line context.

Datasets come in a wide variety of formats. Each dataset format defines its own data structure and rules on how to interpret the data. For example, the following data structure is used in COCO format:

/dataset/
- /images/<id>.jpg
- /annotations/

Datumaro supports complete datasets, having both image data and annotations, or incomplete ones, having annotations only. Incomplete datasets can be used to prepare images and annotations independently of each other, or to analyze or modify just the lightweight annotations without the need to download the whole dataset.

Check supported formats for more info about format specifications, supported import and export options and other details. The list of formats can be extended by custom plugins, check extending tips for information on this topic.

Use cases

Let’s consider a few examples describing what Datumaro does for you behind the scenes.

The first example explains how working trees, working directories and the cache interact. Suppose there is a dataset that we want to modify and export in some other format. To do it with Datumaro, we need to create a project and register the dataset as a data source:

datum create
datum import <...> -n source1

The dataset will be copied to the working directory inside the project. It will be added to the project working tree.

After the dataset is added, we want to transform it and filter out some irrelevant samples, so we run the following commands:

datum transform <...> source1
datum filter <...> source1

The commands modify the data source inside the working directory, in place. The operations performed are recorded in the working tree.

Now, we want to make a new version of the dataset and make a snapshot in the project cache. So we commit the working tree:

datum commit <...>

cache interaction diagram 1

At this point, the data source is copied into the project cache and a new project revision is created. The dataset operation history is saved, so the dataset can be reproduced even if it is removed from the cache and the working directory. Note, however, that the original dataset hash was not computed, so Datumaro won’t be able to compare the dataset hash when re-downloading. If this check is needed, consider making a commit with an unmodified data source.

After this, we make some other modifications to the dataset and create a new commit. Note that the dataset is not cached until a commit is made.

When the dataset is ready and all the required operations are done, we can export it to the required format. We can export the resulting dataset, or any previous stage.

datum export <...> source1
datum export <...> source1.stage3

Let’s extend the example. Imagine we have a project with 2 data sources. Roughly, it corresponds to the following set of commands:

datum create
datum import <...> -n source1
datum import <...> -n source2
datum transform <...> source1 # used 3 times
datum transform <...> source2 # used 5 times

Then, for some reason, the source1 revisions were removed from the project cache. We also don’t have anything in the project working directories - suppose the user removed them to save disk space.

Let’s see what happens if we call the diff command with 2 different revisions now.

cache interaction diagram 2

Datumaro needs to reproduce the 2 requested dataset revisions so that they can be read and compared. Let’s see how the first dataset is reproduced step by step:

  1. source1.stage2 will be looked for in the project cache. It won’t be found, since the cache was cleaned.
  2. Then, Datumaro will look for previous source revisions in the cache and won’t find any.
  3. The project can be marked read-only if we are not working with the “current” project (which is specified by the -p/--project command parameter). In the example, the command is datum diff rev1:... rev2:..., which means there is a project in the current directory, so the project we are working with is not read-only. If a command target was specified as datum diff <project>@<rev>:<source>, the project would be loaded as read-only. If a project is read-only, we can’t do anything more to reproduce the dataset and can only exit with an error (3a). The reason for this behavior is that dataset downloading can be quite expensive (in terms of time, disk space etc.), so such side effects are expected to be controlled manually.
  4. If the project is not read-only (3b), Datumaro will try to download the original dataset and reproduce the resulting dataset. The data hash will be computed and hashes will be compared (if the data source had hash computed on addition). On success, the data will be put into the cache.
  5. The downloaded dataset will be read and the remaining operations from the source history will be re-applied.
  6. The resulting dataset might be cached in some cases.
  7. The resulting dataset is returned.

source2 will be looked up the same way. In our case, it will be found in the cache and returned. Once both datasets are restored and read, they are compared.
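The lookup order from the numbered steps can be summarized in code. This is a simplified simulation under stated assumptions, not Datumaro's implementation: the cache is a plain dictionary, `download` and the stage operations are caller-supplied callables, all names are hypothetical, and the hash comparison from step 4 is omitted.

```python
class ReadOnlyProjectError(Exception):
    """Raised when data is missing and downloading is not allowed (3a)."""


def reproduce(stages, cache, download, read_only=False):
    """Rebuild the dataset for the last stage in `stages`.

    stages: non-empty list of (stage_name, operation) pairs, oldest first;
            the first one is the root stage, whose operation is ignored.
    cache: dict mapping stage names to cached datasets.
    download: callable that returns the original dataset.
    """
    # Steps 1-2: look for the requested stage, then for earlier ones
    for index in range(len(stages) - 1, -1, -1):
        name, _ = stages[index]
        if name in cache:
            dataset = cache[name]
            break
    else:
        # Step 3: nothing is cached; a read-only project cannot download
        if read_only:
            raise ReadOnlyProjectError("cannot download in a read-only project")
        # Step 4: download the original data and put it into the cache
        index = 0
        dataset = download()
        cache[stages[0][0]] = dataset

    # Step 5: re-apply the remaining operations from the source history
    for _, operation in stages[index + 1:]:
        dataset = operation(dataset)
    return dataset  # Step 7


stages = [
    ("source1.root", None),
    ("source1.stage1", lambda items: [i for i in items if i % 2 == 0]),
    ("source1.stage2", lambda items: [i * 10 for i in items]),
]
cache = {}
print(reproduce(stages, cache, download=lambda: [1, 2, 3, 4]))
# [20, 40]
```

In this toy run the cache is empty, so the root data is downloaded, cached, and both recorded operations are re-applied.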

Consider another situation. Let’s try to export source1. Suppose we have an empty project cache and source1 has a copy in the working directory.

cache interaction diagram 3

Again, Datumaro needs to reproduce the requested dataset revision (stage).

  1. It looks for the dataset in the working directory and finds some data. If there is no source working directory, Datumaro will try to reproduce the source using the approach described above (1b).
  2. The data hash is computed and compared with the one saved in the history. If the hashes match, the dataset is read and returned (4). Note: we can’t use the cached hash stored in the working tree info - it can be outdated, so we need to compute it again.
  3. Otherwise, Datumaro tries to detect the stage by the data hash. If the current stage is not cached, the tree is the working tree and the working directory is not empty, the working copy is hashed and matched against the source stage list. If there is a matching stage, it will be read and the missing stages will be added. The result might be cached in some cases. If there is no matching stage in the source history, the situation can be contradictory. Currently, an error is raised (3b).
  4. The resulting dataset is returned.

After the requested dataset is obtained, it is exported in the requested format.
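The hash matching in steps 2 and 3 can be sketched as follows. This is an illustration only: SHA-256 over relative file names and contents stands in for whatever hashing Datumaro actually uses, and `hash_directory` and `detect_stage` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def hash_directory(root):
    """Hash the relative file names and contents of a directory tree."""
    digest = hashlib.sha256()
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def detect_stage(working_dir, stage_hashes):
    """Return the name of the stage whose recorded hash matches the working copy.

    stage_hashes: list of (stage_name, data_hash) pairs from the source
    history, oldest first. Raises ValueError when nothing matches (3b).
    """
    # The hash is always recomputed, because a stored one can be outdated
    current = hash_directory(working_dir)
    # Check the latest stage first: it is the expected state of the source
    for name, data_hash in reversed(stage_hashes):
        if data_hash == current:
            return name
    raise ValueError("working directory does not match any recorded stage")
```

If the latest stage matches, the working copy can be read directly; if an earlier stage matches, the missing stages would have to be re-applied on top of it.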

To sum up, Datumaro tries to restore a dataset from the project cache or reproduce it from sources. It can be done as long as the source operations are recorded and any step data is available. Note that cache objects share common files, so if there are only annotation differences between datasets, or data sources contain the same images, there will only be a single copy of the related media files. This helps to keep storage use reasonable and avoid unnecessary data copies.

Examples

Example: create a project, add dataset, modify, restore an old version

datum create
datum import <path/to/dataset> -f coco -n source1
datum commit -m "Added a dataset"
datum transform -t shapes_to_boxes
datum filter -e '/item/annotation[label="cat" or label="dog"]' -m i+a
datum commit -m "Transformed"
datum checkout HEAD~1 -- source1 # restore a previous revision
datum status # prints "modified source1"
datum checkout source1 # restore the last revision
datum export -f voc -- --save-images

3.3 - Supported Formats

List of supported formats:

Supported annotation types

  • Labels
  • Bounding boxes
  • Polygons
  • Polylines
  • (Segmentation) Masks
  • (Key-)Points
  • Captions
  • 3D cuboids
  • Super Resolution Annotation
  • Depth Annotation

Datumaro does not separate datasets by tasks like classification, detection etc. Instead, datasets can have any annotations. When a dataset is exported in a specific format, only relevant annotations are exported.

Dataset meta info file

It is possible to use classes that are not original to the format. To do this, use dataset_meta.json.

{
  "label_map": {"0": "background", "1": "car", "2": "person"},
  "segmentation_colors": [[0, 0, 0], [255, 0, 0], [0, 0, 255]],
  "background_label": "0"
}

  • label_map is a dictionary where the class ID is the key and the class name is the value.
  • segmentation_colors is a list of channel-wise values for each class. This is only necessary for the segmentation task.
  • background_label is a background label ID in the dataset.
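As a quick illustration of how these three fields relate, the sketch below parses the example with the standard json module. It does not use the Datumaro API, and it assumes that segmentation_colors is listed in ascending class ID order, which the description above does not state explicitly.

```python
import json

meta_text = """
{
  "label_map": {"0": "background", "1": "car", "2": "person"},
  "segmentation_colors": [[0, 0, 0], [255, 0, 0], [0, 0, 255]],
  "background_label": "0"
}
"""

meta = json.loads(meta_text)

# JSON object keys are always strings, so convert class IDs to integers
labels = {int(class_id): name for class_id, name in meta["label_map"].items()}

# Assumption: colors are listed in ascending class ID order
colors = {
    class_id: tuple(color)
    for class_id, color in zip(sorted(labels), meta.get("segmentation_colors", []))
}
background = labels[int(meta["background_label"])]

print(labels)      # {0: 'background', 1: 'car', 2: 'person'}
print(colors[1])   # (255, 0, 0)
print(background)  # background
```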

3.4 - Media formats

Datumaro supports the following media types:

  • 2D RGB(A) images
  • KITTI Point Clouds

To create an unlabelled dataset from an arbitrary directory with images, use the image_dir and image_zip formats:

datum create -o <project/dir>
datum import -p <project/dir> -f image_dir <directory/path/>

or, if you work with Datumaro API:

  • for using with a project:

    from datumaro.project import Project
    
    project = Project.init()
    project.import_source('source1', format='image_dir', url='directory/path/')
    dataset = project.working_tree.make_dataset()
    
  • for using as a dataset:

    from datumaro import Dataset
    
    dataset = Dataset.import_from('directory/path/', 'image_dir')
    

This will search for images in the directory recursively and add them as dataset entries with names like <subdir1>/<subsubdir1>/<image_name1>. The list of formats matches the list of supported image formats in OpenCV:

.jpg, .jpeg, .jpe, .jp2, .png, .bmp, .dib, .tif, .tiff, .tga, .webp, .pfm,
.sr, .ras, .exr, .hdr, .pic, .pbm, .pgm, .ppm, .pxm, .pnm
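The recursive search can be pictured with a short standard-library sketch. It is not the image_dir implementation: `find_images` is a hypothetical helper, only a subset of the extensions is listed, and dropping the extension from item names and matching extensions case-insensitively are assumptions.

```python
from pathlib import Path

# A subset of the extensions listed above
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".webp"}


def find_images(root):
    """Map item names to image paths found under `root`, recursively."""
    root = Path(root)
    items = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            # Item name: the path relative to the root, without the extension
            name = path.relative_to(root).with_suffix("").as_posix()
            items[name] = path
    return items
```

For a file stored at subdir1/subsubdir1/image_name1.jpg under the root, the resulting item name is subdir1/subsubdir1/image_name1.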

Once there is a Dataset instance, its items can be split into subsets, renamed, filtered, joined with annotations, exported in various formats etc.

To import frames from a video, you can split the video into frames with the split_video command and then use the image_dir format described above. In more complex cases, consider using FFmpeg and other tools for video processing.

Alternatively, you can use the video_frames format directly:

Note, however, that it can produce different results if the system environment changes. If you want to obtain reproducible results, consider splitting the video into frames by any method.

datum create -o <project/dir>
datum import -p <project/dir> -f video_frames <video/path.avi>

or, if you work with Datumaro API:

from datumaro import Dataset

dataset = Dataset.import_from('video.mp4', 'video_frames')

Datumaro supports the following video formats:

.3gp, .3g2, .asf, .wmv, .avi, .divx, .evo, .f4v, .flv, .mkv, .mk3d,
.mp4, .mpg, .mpeg, .m2p, .ps, .ts, .m2ts, .mxf, .ogg, .ogv, .ogx,
.mov, .qt, .rmvb, .vob, .webm

3.5 - Command reference

%%{init { 'theme':'neutral' }}%%
flowchart LR
  d(("#0009; datum #0009;")):::mainclass
  m(model):::nofillclass
  p(project):::nofillclass
  s(source):::nofillclass

  d===m
    m===m_add[add]:::hideclass
    m===m_info[info]:::hideclass
    m===m_remove[remove]:::hideclass
    m===m_run[run]:::hideclass
  d===p
    p===p_info[info]:::hideclass
    p===p_migrate[migrate]:::hideclass
  d===s
    s===s_add[add]:::hideclass
    s===s_info[info]:::hideclass
    s===s_remove[remove]:::hideclass
  d====_add[add]:::filloneclass
  d====_create[create]:::filloneclass
  d====_describe_downloads[describe-downloads]:::filloneclass
  d====_detect_format[detect-format]:::filloneclass
  d====_download[download]:::filloneclass
  d====_export[export]:::filloneclass
  d====_import[import]:::filloneclass
  d====_info[info]:::filloneclass
  d====_remove[remove]:::filloneclass
  d====_generate[generate]:::filloneclass
  d====_filter[filter]:::filltwoclass
  d====_transform[transform]:::filltwoclass
  d====_diff[diff]:::fillthreeclass
  d====_explain[explain]:::fillthreeclass
  d====_merge[merge]:::fillthreeclass
  d====_patch[patch]:::fillthreeclass
  d====_stats[stats]:::fillthreeclass
  d====_validate[validate]:::fillthreeclass
  d====_checkout[checkout]:::fillfourclass
  d====_commit[commit]:::fillfourclass
  d====_log[log]:::fillfourclass
  d====_status[status]:::fillfourclass

  classDef nofillclass fill-opacity:0;
  classDef hideclass fill-opacity:0,stroke-opacity:0;
  classDef filloneclass fill:#CCCCFF,stroke-opacity:0;
  classDef filltwoclass fill:#FFFF99,stroke-opacity:0;
  classDef fillthreeclass fill:#CCFFFF,stroke-opacity:0;
  classDef fillfourclass fill:#CCFFCC,stroke-opacity:0;

The command line is split into separate commands and command contexts. Contexts group multiple commands related to a specific topic, e.g. project operations, data source operations etc. Almost all the commands operate on projects, so the project context and commands without a context are mostly the same. By default, commands look for a project in the current directory. If the project you’re working on is located somewhere else, you can pass the -p/--project <path> argument to the command.

Note: command behavior is subject to change, so this text might be outdated. Always check the --help output of the specific command.

Note: command parameters must be passed prior to the positional arguments.

Datumaro functionality is available with the datum command.

Usage:

datum [-h] [--version] [--loglevel LOGLEVEL] [command] [command args]

Parameters:

  • --loglevel (string) - Logging level, one of debug, info, warning, error, critical (default: info)
  • --version - Print the version number and exit.
  • -h, --help - Print the help message and exit.

3.5.1 - Checkout

This command restores a specific project revision in the project tree or restores separate revisions of sources. A revision can be a commit hash, branch, tag, or any relative reference in the Git format.

This command has multiple forms:

1) datum checkout <revision>
2) datum checkout [--] <source1> ...
3) datum checkout <revision> [--] <source1> <source2> ...

1 - Restores a revision and all the corresponding sources in the working directory. If there are conflicts between modified files in the working directory and the target revision, an error is raised, unless --force is used.

2, 3 - Restores only selected sources from the specified revision. If no revision is specified, the current revision is used.

“--” can be used to separate source names and revisions:

  • datum checkout name - will look for revision “name”
  • datum checkout -- name - will look for source “name” in the current revision

Usage:

datum checkout [-h] [-f] [-p PROJECT_DIR] [rev] [--] [sources [sources ...]]

Parameters:

  • -f, --force - Allows overwriting unsaved changes in case of conflicts
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Examples:

  • Restore the previous revision: datum checkout HEAD~1

  • Restore the saved version of a source in the working tree: datum checkout -- source-1

  • Restore a previous version of a source: datum checkout 33fbfbe my-source

3.5.2 - Commit

This command records the current state of a project and creates a new revision from the working tree.

By default, this command checks sources in the working tree for changes. If there are unknown changes found, an error will be raised, unless --allow-foreign is used. If such changes are committed, the source will only be available for reproduction from the project cache, because Datumaro will not know how to repeat them.

The command will add the sources into the project cache. If you only need to record revision metadata, you can use the --no-cache parameter. This can be useful if you want to save disk space and/or have a backup copy of datasets used in the project.

If there are no changes found, the command will stop. To allow empty commits, use --allow-empty.

Usage:

datum commit [-h] -m MESSAGE [--allow-empty] [--allow-foreign]
  [--no-cache] [-p PROJECT_DIR]

Parameters:

  • --allow-empty - Allow commits with no changes
  • --allow-foreign - Allow commits with changes made not by Datumaro
  • --no-cache - Don’t put committed datasets into cache, save only metadata
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example:

datum create
datum import -f coco <path/to/coco/>
datum commit -m "Added COCO"

3.5.3 - Convert datasets

This command converts a dataset from one format to another. It is a convenience alias for create, add and export that provides a simpler way to obtain the same results in simple cases. A list of supported formats can be found in the --help output of this command.

Usage:

datum convert [-h] [-i SOURCE] [-if INPUT_FORMAT] -f OUTPUT_FORMAT
  [-o DST_DIR] [--overwrite] [-e FILTER] [--filter-mode FILTER_MODE]
  [-- EXTRA_EXPORT_ARGS]

Parameters:

  • -i, --input-path (string) - Input dataset path. The current directory is used by default.
  • -if, --input-format (string) - Input dataset format. Will try to detect, if not specified.
  • -f, --output-format (string) - Output format
  • -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used.
  • --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
  • -e, --filter (string) - XML XPath filter expression for dataset items
  • --filter-mode (string) - The filtering mode. Default is the i mode.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • -- <extra export args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.

Example: convert a VOC-like dataset to a COCO-like one:

datum convert --input-format voc --input-path <path/to/voc/> \
              --output-format coco \
              -- --save-images

3.5.4 - Create project

The command creates an empty project. A project is required for most of Datumaro functionality.

By default, the project is created in the current directory. To specify another output directory, pass the -o/--output-dir parameter. If the output directory already contains a Datumaro project, an error is raised, unless --overwrite is used.

Usage:

datum create [-h] [-o DST_DIR] [--overwrite]

Parameters:

  • -o, --output-dir (string) - Output directory. The current directory is used by default.
  • --overwrite - Allows overwriting existing project files in the output directory. Any other files are not touched.
  • -h, --help - Print the help message and exit.

Examples:

Example: create an empty project in the my_dataset directory

datum create -o my_dataset/

Example: create a new empty project in the current directory, remove the existing one

datum create
...
datum create --overwrite

3.5.5 - Describe downloadable datasets

This command reports various information about datasets that can be downloaded with the download command. The information is reported either as human-readable text (the default) or as a JSON object. The format can be selected with the --report-format option.

When the JSON output format is selected, the output document has the following schema:

{
    "<dataset name>": {
        "default_output_format": "<Datumaro format name>",
        "description": "<human-readable description>",
        "download_size": <total size of the downloaded files in bytes>,
        "home_url": "<URL of a web page describing the dataset>",
        "human_name": "<human-readable dataset name>",
        "num_classes": <number of classes in the dataset>,
        "subsets": {
            "<subset name>": {
                "num_items": <number of items in the subset>
            },
            ...
        },
        "version": "<version number>"
    },
    ...
}

home_url may be null if there is no suitable web page for the dataset.

num_classes may be null if the dataset does not involve classification.

version currently contains the version number supplied by TFDS. In future versions of Datumaro, datasets might come from other sources; the way version numbers will be set for those is to be determined.

New object members may be added in future versions of Datumaro.
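
A report in this schema can be consumed with the standard json module. The sketch below is illustrative: the dataset entry and all of its values are made up, and summarize is a hypothetical helper. It reads only the members it needs, so members added in future versions are ignored, and it treats home_url and num_classes as possibly null.

```python
import json

# Made-up report that follows the schema above; the dataset entry and
# all of its values are illustrative only.
report_text = """
{
    "example_dataset": {
        "default_output_format": "imagenet_txt",
        "description": "An illustrative entry",
        "download_size": 1048576,
        "home_url": null,
        "human_name": "Example Dataset",
        "num_classes": 10,
        "subsets": {
            "train": {"num_items": 600},
            "test": {"num_items": 100}
        },
        "version": "1.0.0"
    }
}
"""

def summarize(report):
    """Map each dataset name to its item count, class count, URL and size."""
    summary = {}
    for name, info in report.items():
        total_items = sum(s["num_items"] for s in info["subsets"].values())
        summary[name] = {
            "items": total_items,
            # num_classes and home_url may be null (None in Python)
            "classes": info.get("num_classes"),
            "home_url": info.get("home_url") or "n/a",
            "size_mib": info["download_size"] / 2**20,
        }
    return summary

summary = summarize(json.loads(report_text))
print(summary)
```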

Usage:

datum describe-downloads [-h] [--report-format {text,json}]
                         [--report-file REPORT_FILE]

Parameters:

  • -h, --help - Print the help message and exit.
  • --report-format (text or json) - Format in which to report the information. By default, text is used.
  • --report-file (string) - File to which to write the report. By default, the report is written to the standard output stream.

3.5.6 - Detect dataset format

This command attempts to detect the format of a dataset in a directory. Currently, only local directories are supported.

The detection result may be one of:

  • a single format being detected;
  • no formats being detected (if the dataset doesn’t match any known format);
  • multiple formats being detected (if the dataset is ambiguous).

The command outputs this result in a human-readable form and optionally as a machine-readable JSON report (see --json-report).

The format of the machine-readable report is as follows:

{
    "detected_formats": [
        "detected-format-name-1", "detected-format-name-2", ...
    ],
    "rejected_formats": {
        "rejected-format-name-1": {
            "reason": <reason-code>,
            "message": "line 1\nline 2\n...\nline N"
        },
        "rejected-format-name-2": ...,
        ...
    }
}

The <reason-code> can be one of:

  • "detection_unsupported": the corresponding format does not support detection.

  • "insufficient_confidence": the dataset matched the corresponding format, but it matched at least one other format better.

  • "unmet_requirements": the dataset didn’t meet at least one requirement of the corresponding format.

Other reason codes may be defined in the future.
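
The three possible outcomes and the rejection reasons can be handled in a script along these lines. This is a sketch under assumptions: the report content is a placeholder rather than real command output, and pick_format and rejections_by_reason are hypothetical helpers. Unknown reason codes are grouped like any other, since new codes may be added.

```python
import json

# Illustrative report following the layout above; format names and
# messages are placeholders, not real command output.
report_text = """
{
    "detected_formats": ["format_a"],
    "rejected_formats": {
        "format_b": {
            "reason": "unmet_requirements",
            "message": "line 1\\nline 2"
        },
        "format_c": {
            "reason": "insufficient_confidence",
            "message": "matched, but another format matched better"
        }
    }
}
"""

def pick_format(report):
    """Return the single detected format or raise if there is none or several."""
    detected = report["detected_formats"]
    if len(detected) == 1:
        return detected[0]
    if not detected:
        raise ValueError("no format detected")
    raise ValueError("ambiguous dataset: " + ", ".join(detected))

def rejections_by_reason(report):
    """Group rejected format names by their reason code."""
    grouped = {}
    for name, info in report["rejected_formats"].items():
        grouped.setdefault(info["reason"], []).append(name)
    return grouped

report = json.loads(report_text)
chosen = pick_format(report)
grouped = rejections_by_reason(report)
print(chosen, grouped)
```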

Usage:

datum detect-format [-h] [-p PROJECT_DIR] [--show-rejections]
                    [--json-report JSON_REPORT]
                    url

Parameters:

  • <url> - Path to the dataset to analyse.
  • -h, --help - Print the help message and exit.
  • -p, --project (string) - Directory of the project to use as the context (default: current directory). The project might contain local plugins with custom formats, which will be used for detection.
  • --show-rejections - Describe why each supported format that wasn’t detected was rejected. This only affects the human-readable output; the machine-readable report always includes rejection information.
  • --json-report (string) - Path to which to save a JSON report describing detected and rejected formats. By default, no report is saved.

Example: detect the format of a dataset in a given directory, showing rejection information:

datum detect-format --show-rejections path/to/dataset

3.5.7 - Compare datasets

The command compares two datasets and saves the results in the specified directory. The current project is considered to be “ground truth”.

Datasets can be compared using different methods:

  • equality - Annotations are checked for equality
  • distance - A distance metric is used

This command has multiple forms:

1) datum diff <revpath>
2) datum diff <revpath> <revpath>

1 - Compares the current project’s main target (project) in the working tree with the specified dataset.

2 - Compares two specified datasets.

<revpath> - a dataset path or a revision path.

Usage:

datum diff [-h] [-o DST_DIR] [-m METHOD] [--overwrite] [-p PROJECT_DIR]
  [--iou-thresh IOU_THRESH] [-f FORMAT]
  [-iia IGNORE_ITEM_ATTR] [-ia IGNORE_ATTR] [-if IGNORE_FIELD]
  [--match-images] [--all]
  first_target [second_target]

Parameters:

  • <target> (string) - Target dataset revpaths

  • -m, --method (string) - Comparison method.

  • -o, --output-dir (string) - Output directory. By default, a new directory is created in the current directory.

  • --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.

  • -p, --project (string) - Directory of the project to operate on (default: current directory).

  • -h, --help - Print the help message and exit.

  • Distance comparison options:

    • --iou-thresh (number) - The IoU threshold for spatial annotations (default is 0.5).
    • -f, --format (string) - Output format, one of simple (text files and images) and tensorboard (a TB log directory)
  • Equality comparison options:

    • -iia, --ignore-item-attr (string) - Ignore an item attribute (repeatable)
    • -ia, --ignore-attr (string) - Ignore an annotation attribute (repeatable)
    • -if, --ignore-field (string) - Ignore an annotation field (repeatable). Default is id and group.
    • --match-images - Match dataset items by image pixels instead of ids
    • --all - Include matches in the output. By default, only differences are printed.
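
For reference, the IoU (intersection over union) measure behind --iou-thresh can be computed as follows for boxes stored as (x, y, w, h). This is a generic sketch, not Datumaro's code: the function names are hypothetical, and the strict comparison against the threshold is an assumption based on the "IoU > 0.7" wording in the examples.

```python
def bbox_iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h) tuples."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (0 if boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def boxes_match(a, b, iou_thresh=0.5):
    """Assumed matching rule: boxes match if their IoU exceeds the threshold."""
    return bbox_iou(a, b) > iou_thresh

print(bbox_iou((0, 0, 10, 10), (5, 0, 10, 10)))
```

Two 10x10 boxes shifted by half their width overlap in a 5x10 area, so they would not be matched at the default threshold of 0.5.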

Examples:

  • Compare two projects by distance, match boxes if IoU > 0.7, save results to TensorBoard: datum diff other/project -o diff/ -f tensorboard --iou-thresh 0.7

  • Compare two projects for equality, exclude annotation groups and the is_crowd attribute from comparison: datum diff other/project/ -if group -ia is_crowd

  • Compare two datasets, specify formats: datum diff path/to/dataset1:voc path/to/dataset2:coco

  • Compare the current working tree and a dataset: datum diff path/to/dataset2:coco

  • Compare a source from a previous revision and a dataset: datum diff HEAD~2:source-2 path/to/dataset2:yolo

  • Compare a dataset with model inference:

datum create
datum import <...>
datum model add mymodel <...>
datum transform <...> -o inference
datum diff inference -o diff

3.5.8 - Download datasets

This command downloads a publicly available dataset and saves it to a local directory. In terms of syntax, this command is similar to convert, but instead of taking a local directory as the source, it takes a dataset ID. A list of supported datasets and output formats can be found in the --help output of this command.

Currently, the only source of datasets is the TensorFlow Datasets library. Therefore, to use this command you must install TensorFlow & TFDS, which you can do as follows:

pip install datumaro[tf,tfds]

To use a proxy for downloading, configure it with the conventional curl environment variables.

Usage:

datum download [-h] -i DATASET_ID [-f OUTPUT_FORMAT] [-o DST_DIR]
               [--overwrite] [-s SUBSET] [-- EXTRA_EXPORT_ARGS]

Parameters:

  • -h, --help - Print the help message and exit.
  • -i, --dataset-id (string) - ID of the dataset to download.
  • -f, --output-format (string) - Output format. By default, the format of the original dataset is used.
  • -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used.
  • --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
  • -s, --subset (string) - Which subset of the dataset to save. By default, all subsets are saved. Note that due to limitations of TFDS, all subsets are downloaded even if this option is specified.
  • -- <extra export args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.

Example: download the MNIST dataset, saving it in the ImageNet text format:

datum download -i tfds:mnist -f imagenet_txt -- --save-images

3.5.9 - Run model inference explanation (explain)

Runs an explainable AI algorithm for a model.

This tool is intended to help an AI developer debug a model and a dataset. Basically, it executes model inference and tries to find the relation between inputs and outputs of the trained model, i.e. determine decision boundaries and belief intervals for the classifier.

Currently, the only available algorithm is RISE (article). It runs the model once on the original image and then re-runs it multiple times on the same image, masking a different part of the input each time, to produce a heatmap of activations for each output of the first inference. As a result, we obtain a number of heatmaps that show how specific image pixels affected the inference result. This algorithm doesn’t require any special information about the model, but it requires the model to return all the outputs and confidences. The original algorithm supports only the classification scenario, but Datumaro extends it to detection models.
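
The masking idea can be illustrated with a toy, pure-Python sketch. It is a simplification and not Datumaro's implementation: it handles a single model output, uses coarse binary grid masks upsampled by pixel repetition (without the random shifts and smoothing of the original algorithm), and both rise_saliency and toy_model are hypothetical names introduced here.

```python
import random

def rise_saliency(image, model, n_samples=200, grid=2, prob=0.5, seed=0):
    """Estimate a saliency map for a single-output model, RISE-style.

    image: 2D list of floats (H x W); model: callable returning a score.
    Each sample masks random grid cells, and the score obtained on the
    masked image is credited to the pixels that stayed visible.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for _ in range(n_samples):
        # 1 keeps a grid cell, 0 masks it out
        cells = [[1.0 if rng.random() < prob else 0.0 for _ in range(grid)]
                 for _ in range(grid)]
        mask = [[cells[y * grid // height][x * grid // width]
                 for x in range(width)] for y in range(height)]
        masked = [[image[y][x] * mask[y][x] for x in range(width)]
                  for y in range(height)]
        score = model(masked)
        for y in range(height):
            for x in range(width):
                saliency[y][x] += score * mask[y][x]
    # Normalize by the expected number of times a pixel is visible
    norm = n_samples * prob
    return [[value / norm for value in row] for row in saliency]

def toy_model(img):
    """Hypothetical 'classifier': mean brightness of the top-left 4x4 block."""
    return sum(img[y][x] for y in range(4) for x in range(4)) / 16.0

image = [[1.0] * 8 for _ in range(8)]
heatmap = rise_saliency(image, toy_model)
print(heatmap[0][0], heatmap[7][7])
```

Because the toy model only looks at the top-left block, pixels in that block accumulate more weight than pixels elsewhere, which is the kind of pattern the produced heatmaps are meant to reveal.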

The following use cases are available:

  • RISE for classification
  • RISE for object detection

Usage:

datum explain [-h] -m MODEL [-o SAVE_DIR] [-p PROJECT_DIR]
  [target] {rise} [RISE_ARGS]

Parameters:

  • <target> (string) - Target dataset revpath. By default, uses the whole current project. An image path can be specified instead. <image path> - a path to the file. <revpath> - a dataset path or a revision path.

  • <method> (string) - The algorithm to use. Currently, only rise is supported.

  • -m, --model (string) - The model to use for inference

  • -o, --output-dir (string) - Directory to save results to (default: display only)

  • -p, --project (string) - Directory of the project to operate on (default: current directory).

  • -h, --help - Print the help message and exit.

  • RISE options:

    • -s, --max-samples (number) - Number of algorithm model runs per image (default: mask size ^ 2).
    • --mw, --mask-width (number) - Mask width in pixels (default: 7)
    • --mh, --mask-height (number) - Mask height in pixels (default: 7)
    • --prob (number) - Mask pixel inclusion probability, controls mask density (default: 0.5)
    • --iou, --iou-thresh (number) - IoU match threshold for detections (default: 0.9)
    • --nms, --nms-iou-thresh (number) - IoU match threshold for detections for non-maxima suppression (default: no NMS)
    • --conf, --det-conf-thresh (number) - Confidence threshold for detections (default: include all)
    • -b, --batch-size (number) - Batch size for inference (default: 1)
    • --display - Visualize results during computations

Examples:

  • Run RISE on an image, display results: datum explain path/to/image.jpg -m mymodel rise --max-samples 50

  • Run RISE on a source revision: datum explain HEAD~1:source-1 -m model rise

  • Run inference explanation on a single image with online visualization

datum create <...>
datum model add mymodel <...>
datum explain image.png -m mymodel \
    rise --max-samples 1000 --display

Note: this algorithm requires the model to return all (or a reasonable number of) the outputs and confidences unfiltered, i.e. all the Label annotations for classification models and all the Bboxes for detection models. You can find examples of the expected model outputs in tests/test_RISE.py

For OpenVINO models the output processing script would look like this:

Classification scenario:

import datumaro as dm
from datumaro.util.annotation_util import softmax

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, logits, shape = (N, n_classes)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for output in outputs:
        # output = logits for a single image, shape = (n_classes,)
        image_results = []
        confs = softmax(output)
        for label, conf in enumerate(confs):
            image_results.append(dm.Label(int(label),
                attributes={'score': float(conf)}))

        # one list of annotations per input image
        results.append(image_results)

    return results

Object Detection scenario:

import datumaro as dm

# return a significant number of output boxes to make multiple runs
# statistically correct and meaningful
max_det = 1000

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, shape = (N, 1, K, 7)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for image, output in zip(inputs, outputs):
        # each image has the shape (C, H, W), so H and W are the last two dims
        input_height, input_width = image.shape[-2:]
        detections = output[0]
        image_results = []
        for det in detections:
            label = int(det[1])
            conf = float(det[2])
            # det[3:7] are treated as normalized box corners (x1, y1, x2, y2)
            x = max(int(det[3] * input_width), 0)
            y = max(int(det[4] * input_height), 0)
            w = min(int(det[5] * input_width - x), input_width)
            h = min(int(det[6] * input_height - y), input_height)
            image_results.append(dm.Bbox(x, y, w, h,
                label=label, attributes={'score': conf} ))

        # append once per image, after all its detections are converted
        results.append(image_results[:max_det])

    return results

3.5.10 - Export Datasets

This command exports a project or a source as a dataset in some format.

Check supported formats for more info about format specifications, supported options and other details. The list of formats can be extended by custom plugins, check extending tips for information on this topic.

Available formats are listed in the command help output.

Dataset format writers support additional export options. To pass such options, use the -- separator after the main command arguments. The usage information can be printed with datum export -f <format> -- --help.

Common export options:

  • Most formats (where applicable) support the --save-images option, which exports dataset images along with annotations. The option is disabled by default.
  • If --save-images is used, the --image-ext option can be passed to specify the output image file extension (.jpg, .png etc.). By default, Datumaro tries to keep the original image extension. This option allows converting all the images from one format into another.

This command supports the -e/--filter parameter to select the dataset elements needed for exporting. Read the filter command description for more info about this functionality.

The command can only be applied to a project build target, a stage or the combined project target, in which case all the targets will be affected.

Usage:

datum export [-h] [-e FILTER] [--filter-mode FILTER_MODE] [-o DST_DIR]
  [--overwrite] [-p PROJECT_DIR] -f FORMAT [target] [-- EXTRA_FORMAT_ARGS]

Parameters:

  • <target> (string) - A project build target to be exported. By default, all project targets are affected.
  • -f, --format (string) - Output format.
  • -e, --filter (string) - XML XPath filter expression for dataset items
  • --filter-mode (string) - The filtering mode. Default is the i mode.
  • -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used.
  • --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • -- <extra format args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.

Example: save a project as a VOC-like dataset, include images, convert images to PNG from other formats.

datum export \
  -p test_project \
  -o test_project-export \
  -f voc \
  -- --save-images --image-ext='.png'

3.5.11 - Filter datasets

This command extracts a sub-dataset from a dataset. The new dataset includes only items satisfying some condition. The XML XPath is used as a query format.

The command can be applied to a dataset or a project build target, a stage or the combined project target, in which case all the project targets will be affected. A build tree stage will be recorded if --stage is enabled, and the resulting dataset(-s) will be saved if --apply is enabled.

By default, datasets are updated in-place. The -o/--output-dir option can be used to specify another output directory. When updating in-place, use the --overwrite parameter (in-place updates fail by default to prevent data loss), unless a project target is modified.

The current project (-p/--project) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project’s working tree is used.

There are several filtering modes available (the -m/--mode parameter). Supported modes:

  • i, items
  • a, annotations
  • i+a, a+i, items+annotations, annotations+items

When filtering annotations, use the items+annotations mode to indicate that annotation-less dataset items should be removed, otherwise they will be kept in the resulting dataset. To select an annotation, write an XPath that returns annotation elements (see examples).

Item representations can be printed with the --dry-run parameter:

<item>
  <id>290768</id>
  <subset>minival2014</subset>
  <image>
    <width>612</width>
    <height>612</height>
    <depth>3</depth>
  </image>
  <annotation>
    <id>80154</id>
    <type>bbox</type>
    <label_id>39</label_id>
    <x>264.59</x>
    <y>150.25</y>
    <w>11.19</w>
    <h>42.31</h>
    <area>473.87</area>
  </annotation>
  <annotation>
    <id>669839</id>
    <type>bbox</type>
    <label_id>41</label_id>
    <x>163.58</x>
    <y>191.75</y>
    <w>76.98</w>
    <h>73.63</h>
    <area>5668.77</area>
  </annotation>
  ...
</item>
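
To get a feeling for how a filter acts on this representation, the item XML can be processed with the standard library. This is only a sketch: filter_annotations is a hypothetical helper, the item is a shortened copy of the one above, and the condition is written as a Python predicate because xml.etree.ElementTree supports only a subset of XPath (no numeric comparisons), unlike the expressions accepted by the command.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the item representation shown above
item_xml = """
<item>
  <id>290768</id>
  <subset>minival2014</subset>
  <image><width>612</width><height>612</height><depth>3</depth></image>
  <annotation>
    <id>80154</id><type>bbox</type><label_id>39</label_id>
    <area>473.87</area>
  </annotation>
  <annotation>
    <id>669839</id><type>bbox</type><label_id>41</label_id>
    <area>5668.77</area>
  </annotation>
</item>
"""

def filter_annotations(item, predicate, remove_empty=False):
    """Drop annotations failing `predicate`; return None if the item is removed.

    remove_empty=False mimics the 'annotations' mode (the item is kept),
    remove_empty=True mimics 'items+annotations' (empty items are dropped).
    """
    for ann in list(item.findall("annotation")):
        if not predicate(ann):
            item.remove(ann)
    if remove_empty and item.find("annotation") is None:
        return None
    return item

def is_large(ann):
    # Rough Python counterpart of '/item/annotation[area > 1000]'
    return float(ann.findtext("area")) > 1000

def never(ann):
    return False

kept = filter_annotations(ET.fromstring(item_xml), is_large)
kept_ids = [ann.findtext("id") for ann in kept.findall("annotation")]
emptied = filter_annotations(ET.fromstring(item_xml), never)
dropped = filter_annotations(ET.fromstring(item_xml), never, remove_empty=True)
print(kept_ids, emptied is not None, dropped)
```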

Usage:

datum filter [-h] [-e FILTER] [-m MODE] [--dry-run] [--stage STAGE]
  [--apply APPLY] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR] [target]

Parameters:

  • <target> (string) - Target dataset revpath. By default, filters all targets of the current project.
  • -e, --filter (string) - XML XPath filter expression for dataset items
  • -m, --mode (string) - The filtering mode. Default is the i mode.
  • --dry-run - Print XML representations of the filtered dataset and exit.
  • --stage (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, allowing the resulting dataset to be reproduced later. Applicable only to main project targets (i.e. data sources and the project target, but not intermediate stages). Enabled by default.
  • --apply (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
  • -o, --output-dir (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the project target, but not intermediate stages) and dataset targets. If not specified, the results will be saved in place.
  • --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example: extract a dataset with images with width < height

datum filter \
  -p test_project \
  -e '/item[image/width < image/height]'

Example: extract a dataset with images of the train subset

datum filter \
  -p test_project \
  -e '/item[subset="train"]'

Example: extract a dataset with only large annotations of the cat class and any non-persons

datum filter \
  -p test_project \
  --mode annotations \
  -e '/item/annotation[(label="cat" and area > 99.5) or label!="person"]'

Example: extract a dataset with non-occluded annotations, remove empty images. Use data only from the “s1” source of the project.

datum create
datum import --format voc -i <path/to/dataset1/> --name s1
datum import --format voc -i <path/to/dataset2/> --name s2
datum filter s1 \
  -m i+a -e '/item/annotation[occluded="False"]'

3.5.12 - Generate Datasets

Creates a synthetic dataset with elements of the specified type and shape, and saves it in the provided directory.

Currently, it can only generate fractal images, which are useful for network compression. To create 3-channel images, you should provide the number of images, height and width. The images are colorized with a model, which will be downloaded automatically. Uses the algorithm from the article: https://arxiv.org/abs/2103.13023

Usage:

datum generate [-h] -o OUTPUT_DIR -k COUNT --shape SHAPE [SHAPE ...]
  [-t {image}] [--overwrite] [--model-dir MODEL_PATH]

Parameters:

  • -o, --output-dir (string) - Output directory
  • -k, --count (integer) - Number of images to be generated
  • --shape (integer, repeatable) - Dimensions of data to be generated (H, W)
  • -t, --type (one of: image) - Specify the type of data to generate (default: image)
  • --model-dir (path) - Path to load the colorization model from. If no model is found, the model will be downloaded (default: current dir)
  • --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
  • -h, --help - Print the help message and exit.

Example: generate 300 3-channel fractal images with H=224, W=256 and store them in the images/ dir:

datum generate -o images/ --count 300 --shape 224 256

3.5.13 - Print dataset info

This command outputs high level dataset information such as sample count, categories and subsets.

Usage:

datum info [-h] [--json] [-p PROJECT_DIR] [revpath]

Parameters:

  • <target> (string) - Target dataset revpath. By default, prints info about the joined project dataset.
  • --json - Print output data in JSON format
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Examples:

  • Print info about a project dataset: datum info -p test_project/

  • Print info about a COCO-like dataset: datum info path/to/dataset:coco

Sample output:

format: voc
media type: image
length: 5
categories:
    labels: background, aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair (and 12 more)
subsets:
    trainval:
        length: 5

JSON output format:

{
  "format": string,
  "media type": string,
  "length": integer,
  "categories": {
    "count": integer,
    "labels": [
      {
        "id": integer,
        "name": string,
        "parent": string,
        "attributes": [ string, ... ]
      },
      ...
    ]
  },
  "subsets": [
    {
      "name": string,
      "length": integer
    },
    ...
  ]
}
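
A parsed JSON document of this shape can be turned back into a compact label line similar to the sample text output. The sketch below is an assumption-laden illustration: label_names and format_labels are hypothetical helpers, the input data is made up, and the limit of 10 displayed labels is only inferred from the sample output above.

```python
def label_names(info):
    """Extract label names from a parsed JSON info document."""
    return [label["name"] for label in info["categories"]["labels"]]

def format_labels(names, limit=10):
    """Join label names, abbreviating the tail as '(and N more)'."""
    shown = ", ".join(names[:limit])
    hidden = len(names) - limit
    if hidden > 0:
        shown += " (and %d more)" % hidden
    return shown

# Made-up document with 12 labels that follows the schema above
info = {
    "categories": {
        "count": 12,
        "labels": [
            {"id": i, "name": "label_%d" % i, "parent": "", "attributes": []}
            for i in range(12)
        ],
    },
}
line = format_labels(label_names(info))
print("labels:", line)
```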

3.5.14 - Log

This command prints the history of the current project revision.

Prints lines in the following format: <short commit hash> <commit message>

Usage:

datum log [-h] [-n MAX_COUNT] [-p PROJECT_DIR]

Parameters:

  • -n, --max-count (number, default: 10) - The maximum number of previous revisions in the output
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example output:

affbh33 Added COCO dataset
eeffa35 Added VOC dataset

3.5.15 - Merge Datasets

Consider the following task: there is a set of images (the original dataset) we want to annotate. Suppose we did this manually and/or automated it using models, and now we have a few sets of annotations for the same images. We want to merge them and produce a single set of high-precision annotations.

Another use case: there are a few datasets with different sets of images and labels, which we need to combine into a single dataset. If the labels were the same, we could just join the datasets. But in this case we need to merge labels and adjust the annotations in the resulting dataset.

In Datumaro, it can be done with the merge command. This command merges 2 or more datasets and checks annotations for errors.

In simple cases, when dataset images do not intersect and new labels are not added, the recommended way of merging is using the patch command. It will offer better performance and provide the same results.

Datasets are merged by items, and item annotations are merged by finding the unique ones across datasets. Annotations are matched between matching dataset items by distance: spatial annotations are compared using the applicable distance measure (IoU, OKS, PDJ etc.), while labels and annotation attributes are selected by voting. Each set of matching annotations produces a single annotation in the resulting dataset. The score attribute (a number in the range [0; 1]) indicates the agreement between different sources in the produced annotation. The running time of the command can be estimated as O( (total dataset length) * (dataset count) ^ 2 * (item annotations) ^ 2 ).
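
To make the matching and voting steps more concrete, here is a minimal sketch of the two building blocks: IoU between two bounding boxes and majority voting over labels with a quorum. It is not Datumaro's implementation. The function names are hypothetical, and the score is assumed to be the share of sources that agree with the winning label.

```python
from collections import Counter


def bbox_iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def vote_label(labels, source_count, quorum=0):
    """Pick the most voted label among matched annotations.

    Returns (label, score). The label is None when the winner has fewer
    votes than the quorum. The score is votes / source_count (an assumption
    made for this sketch).
    """
    if not labels:
        return None, 0.0
    winner, votes = Counter(labels).most_common(1)[0]
    score = votes / source_count
    if votes < quorum:
        return None, score
    return winner, score


if __name__ == "__main__":
    # Two boxes shifted by half their width overlap with IoU = 1/3.
    print(bbox_iou((0, 0, 10, 10), (5, 0, 10, 10)))
    print(vote_label(["cat", "cat", "dog"], source_count=3, quorum=2))
```

With an IoU threshold of 0.25, the two boxes in the example would be treated as the same object, and the label vote would succeed because two of the three sources agree.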

This command also allows merging datasets with different or partially overlapping sets of labels (which is impossible with simple joining).

During the process, some merge conflicts can appear. For example, dataset images with the same ids may not match, label voting may fail if the quorum is not reached (the --quorum parameter), bboxes may be too close (the -iou parameter) etc. Found merge conflicts, missing items or annotations, and other errors are saved into an output .json file.

In Datumaro, annotations can be grouped. Groups are useful for representing different parts of a single object, for example, parts of a human body or parts of a vehicle. This command can check annotation groups for completeness with the -g/--groups option. If used, this parameter must specify a list of labels for annotations that must be in the same group. It can be particularly useful to check that separate keypoints are grouped and that all the necessary object components are in the same group.
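
The group specification can be pictured as a set of required labels plus a set of optional ones (marked with the ? postfix described in the parameter list). The sketch below is a hypothetical illustration of such a check, not Datumaro's code; reporting labels that are not mentioned in the specification is an assumption of this example.

```python
def parse_group_spec(spec):
    """Parse a spec such as 'person,hand?,head,foot?' into (required, optional) sets."""
    required, optional = set(), set()
    for name in spec.split(","):
        name = name.strip()
        if not name:
            continue
        if name.endswith("?"):
            optional.add(name[:-1])  # '?' marks an optional group member
        else:
            required.add(name)
    return required, optional


def check_group(labels_in_group, spec):
    """Return (missing required labels, labels not mentioned in the spec)."""
    required, optional = parse_group_spec(spec)
    present = set(labels_in_group)
    missing = required - present
    unexpected = present - required - optional
    return sorted(missing), sorted(unexpected)


if __name__ == "__main__":
    spec = "person,hand?,head,foot?"
    print(check_group(["person", "head", "hand"], spec))  # complete group
    print(check_group(["person", "foot"], spec))          # 'head' is missing
```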

This command has multiple forms:

1) datum merge <revpath>
2) datum merge <revpath> <revpath> ...

<revpath> - either a dataset path or a revision path.

1 - Merges the current project’s main target (“project”) in the working tree with the specified dataset.

2 - Merges the specified datasets. Note that the current project is not included in the list of merged sources automatically.

The command supports passing extra exporting options for the output dataset. The format can be specified with the -f/--format option. Extra options should be passed after the main arguments and after the -- separator. Particularly, this is useful to include images in the output dataset with --save-images.

Usage:

datum merge [-h] [-iou IOU_THRESH] [-oconf OUTPUT_CONF_THRESH]
  [--quorum QUORUM] [-g GROUPS] [-o DST_DIR] [--overwrite]
  [-p PROJECT_DIR] [-f FORMAT]
  target [target ...] [-- EXTRA_FORMAT_ARGS]

Parameters:

  • <target> (string) - Target dataset revpaths (repeatable)
  • -iou, --iou-thresh (number) - IoU matching threshold for spatial annotations (both maximum inter-cluster and pairwise). Default is 0.25.
  • --quorum (number) - Minimum count of votes for a label or attribute to be counted. Default is 0.
  • -g, --groups (string) - A comma-separated list of label names in annotation groups to check. The ? postfix can be added to a label to make it optional in the group (repeatable)
  • -oconf, --output-conf-thresh (number) - Confidence threshold for output annotations to be included in the resulting dataset. Default is 0.
  • -o, --output-dir (string) - Output directory. By default, a new directory is created in the current directory.
  • --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
  • -f, --format (string) - Output format. The default format is datumaro.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • -- <extra format args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.

Examples:

Merge 4 (partially-)intersecting projects,

  • consider voting successful when there are no less than 3 same votes
  • consider shapes intersecting when IoU >= 0.6
  • check annotation groups to have person, hand, head and foot (? is used for optional parts)
datum merge project1/ project2/ project3/ project4/ \
  --quorum 3 \
  -iou 0.6 \
  --groups 'person,hand?,head,foot?'

Merge images and annotations from 2 datasets in COCO format: datum merge dataset1/:image_dir dataset2/:coco dataset3/:coco

Check groups of the merged dataset for consistency: look for groups consisting of person, hand, head, foot: datum merge project1/ project2/ -g 'person,hand?,head,foot?'

Merge two datasets, specify formats: datum merge path/to/dataset1:voc path/to/dataset2:coco

Merge the current working tree and a dataset: datum merge path/to/dataset2:coco

Merge a source from a previous revision and a dataset: datum merge HEAD~2:source-2 path/to/dataset2:yolo

Merge datasets and save in different format: datum merge -f voc dataset1/:yolo path2/:coco -- --save-images

3.5.16 - Models

Register model

Datumaro can execute deep learning models in various frameworks. Check the plugins section for more info.

Supported frameworks:

  • OpenVINO
  • Custom models via custom launchers

Models need to be added to the Datumaro project first. It can be done with the datum model add command.

Usage:

datum model add [-h] [-n NAME] -l LAUNCHER [--copy] [--no-check]
  [-p PROJECT_DIR] [-- EXTRA_ARGS]

Parameters:

  • -l, --launcher (string) - Model launcher name
  • --copy - Copy model data into project. By default, only the link is saved.
  • --no-check - Don’t check that the model can be loaded
  • -n, --name (string) - Name of the new model (default: generate automatically)
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • <extra args> - Additional arguments for the model launcher (use -- -h for help). Must be specified after the main command arguments.

Example: register an OpenVINO model

A model consists of a graph description and weights. There is also a script used to convert model outputs to internal data structures.

datum create
datum model add \
  -n <model_name> -l openvino -- \
  -d <path_to_xml> -w <path_to_bin> -i <path_to_interpretation_script>

Interpretation script for an OpenVINO detection model (convert.py): You can find OpenVINO model interpreter samples in datumaro/plugins/openvino/samples (instruction).

import datumaro as dm

max_det = 10
conf_thresh = 0.1

def process_outputs(inputs, outputs):
    # inputs = model input, array or images, shape = (N, C, H, W)
    # outputs = model output, shape = (N, 1, K, 7)
    # results = conversion result, [ [ Annotation, ... ], ... ]
    results = []
    for input, output in zip(inputs, outputs):
        input_height, input_width = input.shape[:2]
        detections = output[0]
        image_results = []
        for det in detections:
            label = int(det[1])
            conf = float(det[2])
            if conf <= conf_thresh:
                continue

            x = max(int(det[3] * input_width), 0)
            y = max(int(det[4] * input_height), 0)
            w = min(int(det[5] * input_width - x), input_width)
            h = min(int(det[6] * input_height - y), input_height)
            image_results.append(dm.Bbox(x, y, w, h,
                label=label, attributes={'score': conf}))

        # Append once per image, keeping at most max_det detections
        results.append(image_results[:max_det])

    return results

def get_categories():
    # Optionally, provide output categories - label map etc.
    # Example:
    label_categories = dm.LabelCategories()
    label_categories.add('person')
    label_categories.add('car')
    return { dm.AnnotationType.label: label_categories }
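
The coordinate arithmetic in process_outputs can be checked in isolation, without a model. The sketch below assumes, as the script above does, that each detection row holds the label at index 1, the confidence at index 2 and normalized corner coordinates (x_min, y_min, x_max, y_max) at indices 3 to 6. The `det_to_xywh` helper is hypothetical and is not part of Datumaro.

```python
def det_to_xywh(det, input_width, input_height):
    """Convert normalized corner coordinates to a pixel (x, y, w, h) box."""
    x = max(int(det[3] * input_width), 0)
    y = max(int(det[4] * input_height), 0)
    w = min(int(det[5] * input_width - x), input_width)
    h = min(int(det[6] * input_height - y), input_height)
    return x, y, w, h


if __name__ == "__main__":
    # A made-up detection covering the lower-middle part of a 100x200 image.
    detection = [0, 1, 0.9, 0.25, 0.5, 0.75, 1.0]
    print(det_to_xywh(detection, input_width=100, input_height=200))
```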

Remove Models

To remove a model from a project, use the datum model remove command.

Usage:

datum model remove [-h] [-p PROJECT_DIR] name

Parameters:

  • <name> (string) - The name of the model to be removed
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example:

datum create
datum model add <...> -n model1
datum model remove model1

Run Model

This command applies a model to dataset images and produces a new dataset.

Usage:

datum model run [-h] -m MODEL [-o DST_DIR] [--overwrite]
  [-p PROJECT_DIR] [target]

Parameters:

  • <target> (string) - A project build target to be used. By default, uses the combined project target.
  • -m, --model (string) - Model name
  • -o, --output-dir (string) - Output directory. By default, results will be stored in an auto-generated directory in the current directory.
  • --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example: launch inference on a dataset

datum create
datum import <...>
datum model add -n mymodel <...>
datum model run -m mymodel -o inference

3.5.17 - Patch Datasets

Updates items of the first dataset with items from the second one.

By default, datasets are updated in-place. The -o/--output-dir option can be used to specify another output directory. When updating in-place, use the --overwrite parameter along with the --save-images export option (in-place updates fail by default to prevent data loss).

Unlike regular project data source joining, the datasets are not required to have the same labels. The labels from the “patch” dataset are projected onto the labels of the patched dataset, so only the annotations with matching labels are used, i.e. all the annotations having unknown labels are ignored. Currently, this command doesn’t allow updating the label information in the patched dataset.
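
The label projection can be sketched as a name-based mapping from patch label indices to target label indices, where annotations without a counterpart are dropped. The code below is a simplified illustration under that assumption; the helper names and the dictionary-based annotation representation are hypothetical and do not reflect Datumaro's internal data structures.

```python
def project_labels(patch_labels, target_labels):
    """Map patch label indices to target label indices by name.

    Labels that are unknown to the target dataset get no entry.
    """
    target_index = {name: idx for idx, name in enumerate(target_labels)}
    return {
        idx: target_index[name]
        for idx, name in enumerate(patch_labels)
        if name in target_index
    }


def project_annotations(annotations, mapping):
    """Keep only annotations whose label has a counterpart, remapping the index."""
    return [
        {**ann, "label": mapping[ann["label"]]}
        for ann in annotations
        if ann["label"] in mapping
    ]


if __name__ == "__main__":
    target = ["cat", "dog"]
    patch = ["dog", "bird", "cat"]
    mapping = project_labels(patch, target)
    anns = [{"label": 0}, {"label": 1}, {"label": 2}]
    # The 'bird' annotation is ignored, the others are re-indexed.
    print(mapping, project_annotations(anns, mapping))
```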

The command supports passing extra exporting options for the output dataset. The extra options should be passed after the main arguments and after the -- separator. Particularly, this is useful to include images in the output dataset with --save-images.

This command can be applied to the current project targets or arbitrary datasets outside a project. Note that if the target dataset is read-only (e.g. if it is a project, stage or a cache entry), the output directory must be provided.

Usage:

datum patch [-h] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR]
  target patch
  [-- EXPORT_ARGS]

<revpath> - either a dataset path or a revision path.

The current project (-p/--project) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project’s working tree is used.

Parameters:

  • <target dataset> (string) - Target dataset revpath
  • <patch dataset> (string) - Patch dataset revpath
  • -o, --output-dir (string) - Output directory. By default, saves in-place
  • --overwrite - Allows overwriting existing files in the output directory, when it is not empty.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • -- <export args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.

Examples:

  • Update a VOC-like dataset with COCO-like annotations:
datum patch --overwrite dataset1/:voc dataset2/:coco -- --save-images
  • Generate a patched dataset, based on a project:
datum patch -o patched_proj1/ proj1/ proj2/
  • Update the “source1” source in the current project with a dataset:
datum patch -p proj/ --overwrite source1 path/to/dataset2:coco
  • Generate a patched source from a previous revision and a dataset:
datum patch -o new_src2/ HEAD~2:source-2 path/to/dataset2:yolo
  • Update a dataset in a custom format, described in a project plugin:
datum patch -p proj/ --overwrite dataset/:my_format dataset2/:coco

3.5.18 - Projects

Migrate project

Updates the project from an old version to the current one and saves the resulting project in the output directory. Projects cannot be updated in-place.

The command tries to map the old source configuration to the new one. This can fail in some cases, so the command will exit with an error, unless -f/--force is specified. With this flag, the command will skip these errors and continue its work.

Usage:

datum project migrate [-h] -o DST_DIR [-f] [-p PROJECT_DIR] [--overwrite]

Parameters:

  • -o, --output-dir (string) - Output directory for the updated project
  • -f, --force - Ignore source import errors (default: False)
  • --overwrite - Overwrite existing files in the save directory.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Examples:

  • Migrate a project from v1 to v2, save the new project in other dir: datum project migrate -o <output/dir>

Print project info

Prints project configuration info such as available plugins, registered models, imported sources and build tree.

Usage:

datum project info [-h] [-p PROJECT_DIR] [revision]

Parameters:

  • <revision> (string) - Target project revision. By default, uses the working tree.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Examples:

  • Print project info for the current working tree: datum project info

  • Print project info for the previous revision: datum project info HEAD~1

Sample output:

Project:
  location: /test_proj

Plugins:
  extractors: ade20k2017, ade20k2020, camvid, cifar, cityscapes, coco, coco_captions, coco_image_info, coco_instances, coco_labels, coco_panoptic, coco_person_keypoints, coco_stuff, cvat, datumaro, icdar_text_localization, icdar_text_segmentation, icdar_word_recognition, image_dir, image_zip, imagenet, imagenet_txt, kitti, kitti_detection, kitti_raw, kitti_segmentation, label_me, lfw, market1501, mnist, mnist_csv, mot_seq, mots, mots_png, open_images, sly_pointcloud, tf_detection_api, vgg_face2, voc, voc_action, voc_classification, voc_detection, voc_layout, voc_segmentation, wider_face, yolo

  converters: camvid, mot_seq_gt, coco_captions, coco, coco_image_info, coco_instances, coco_labels, coco_panoptic, coco_person_keypoints, coco_stuff, kitti, kitti_detection, kitti_segmentation, icdar_text_localization, icdar_text_segmentation, icdar_word_recognition, lfw, datumaro, open_images, image_zip, cifar, yolo, voc_action, voc_classification, voc, voc_detection, voc_layout, voc_segmentation, tf_detection_api, label_me, mnist, cityscapes, mnist_csv, kitti_raw, wider_face, vgg_face2, sly_pointcloud, mots_png, image_dir, imagenet_txt, market1501, imagenet, cvat

  launchers:

Models:

Sources:
  'source-2':
    format: voc
    url: /datasets/pascal/VOC2012
    location: /test_proj/source-2/
    options: {}
    hash: 3eb282cdd7339d05b75bd932a1fd3201
    stages:
      'root':
        type: source
        hash: 3eb282cdd7339d05b75bd932a1fd3201
  'source-3':
    format: imagenet
    url: /datasets/imagenet/ILSVRC2012_img_val/train
    location: /test_proj/source-3/
    options: {}
    hash: e47804a3ec1a54c9b145e5f1007ec72f
    stages:
      'root':
        type: source
        hash: e47804a3ec1a54c9b145e5f1007ec72f

3.5.19 - Sources

These commands are specific for Data Sources. Read more about them here.

Import Dataset

Datasets can be added to a Datumaro project with the import command, which adds a dataset link into the project and downloads (or copies) the dataset. If you need to add a dataset already copied into the project, use the add command.

Dataset format readers can provide some additional import options. To pass such options, use the -- separator after the main command arguments. The usage information can be printed with datum import -f <format> -- --help.

Currently available formats are listed in the command help output.

A dataset is imported by its URL. Currently, only local filesystem paths are supported. The URL can be a file or a directory path to a dataset. When the dataset is read, it is read as a whole. However, many formats can have multiple subsets like train, val, test etc. If you want to limit reading only to a specific subset, use the -r/--path parameter. It can also be useful when subset files have non-standard placement or names.

When a dataset is imported, the following things are done:

  • URL is saved in the project config
  • data is copied into the project

Each data source has a name assigned, which can be used in other commands. To set a specific name, use the -n/--name parameter.

The dataset is added into the working tree of the project. A new commit is not done automatically.

Usage:

datum import [-h] [-n NAME] -f FORMAT [-r PATH] [--no-check]
  [-p PROJECT_DIR] url [-- EXTRA_FORMAT_ARGS]

Parameters:

  • <url> (string) - A file or directory path to the dataset.
  • -f, --format (string) - Dataset format
  • -r, --path (string) - A path to the data source, relative to the source URL. Useful to specify a path to a subset, subtask, or a specific file in URL.
  • --no-check - Don’t try to read the source after importing
  • -n, --name (string) - Name of the new source (default: generate automatically)
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • -- <extra format args> - Additional arguments for the format reader (use -- -h for help). Must be specified after the main command arguments.

Example: create a project from images and annotations in different formats, export as TFRecord for TF Detection API for model training

# 'default' is the name of the subset below
datum create
datum import -f coco_instances -r annotations/instances_default.json path/to/coco
datum import -f cvat <path/to/cvat/default.xml>
datum import -f voc_detection -r custom_subset_dir/default.txt <path/to/voc>
datum import -f datumaro <path/to/datumaro/default.json>
datum import -f image_dir <path/to/images/dir>
datum export -f tf_detection_api -- --save-images

Add Dataset

Existing datasets can be added to a Datumaro project with the add command. The command adds a project-local directory as a data source in the project. Unlike the import command, it does not copy datasets and only works with local directories. The source name is defined by the directory name.

Dataset format readers can provide some additional import options. To pass such options, use the -- separator after the main command arguments. The usage information can be printed with datum add -f <format> -- --help.

Currently available formats are listed in the command help output.

A dataset is imported as a directory. When the dataset is read, it is read as a whole. However, many formats can have multiple subsets like train, val, test etc. If you want to limit reading only to a specific subset, use the -r/--path parameter. It can also be useful when subset files have non-standard placement or names.

The dataset is added into the working tree of the project. A new commit is not done automatically.

Usage:

datum add [-h] -f FORMAT [-r PATH] [--no-check]
  [-p PROJECT_DIR] path [-- EXTRA_FORMAT_ARGS]

Parameters:

  • <path> (string) - A file or directory path to the dataset.
  • -f, --format (string) - Dataset format
  • -r, --path (string) - A path to the data source, relative to the source path. Useful to specify a path to a subset, subtask, or a specific file in the dataset directory.
  • --no-check - Don’t try to read the source after importing
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • -- <extra format args> - Additional arguments for the format reader (use -- -h for help). Must be specified after the main command arguments.

Example: create a project from images and annotations in different formats, export in YOLO for model training

datum create
datum add -f coco -r annotations/instances_train.json dataset1/
datum add -f cvat dataset2/train.xml
datum export -f yolo -- --save-images

Example: add an existing dataset into a project, avoid data copying

To add a dataset, we need to have it inside the project directory:

proj/
├─ .datumaro/
├─ .dvc/
├─ my_coco/
│  ├─ images/
│  │  ├─ image1.jpg
│  │  └─ ...
│  └─ annotations/
│     └─ coco_annotation.json
├─ .dvcignore
└─ .gitignore

datum create -o proj/
mv ~/my_coco/ proj/my_coco/ # move the dataset into the project directory
datum add -p proj/ -f coco proj/my_coco/

Remove Datasets

To remove a data source from a project, use the remove command.

Usage:

datum remove [-h] [--force] [--keep-data] [-p PROJECT_DIR] name [name ...]

Parameters:

  • <name> (string) - The name of the source to be removed (repeatable)
  • -f, --force - Do not fail and stop on errors during removal
  • --keep-data - Do not remove source data from the working directory, remove only project metainfo.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example:

datum create
datum import -f voc -n src1 <path/to/dataset/>
datum remove src1

3.5.20 - Get Project Statistics

This command computes various project statistics, such as:

  • image mean and std. dev.
  • class and attribute balance
  • mask pixel balance
  • segment area distribution

Usage:

datum stats [-h] [-p PROJECT_DIR] [target]

Parameters:

  • <target> (string) - Target source revpath. By default, computes statistics of the merged dataset.
  • -s, --subset (string) - Compute stats only for a specific subset
  • --image-stats (bool) - Compute image mean and std (default: True)
  • --ann-stats (bool) - Compute annotation statistics (default: True)
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example:

datum stats -p test_project

Sample output:

{
    "annotations": {
        "labels": {
            "attributes": {
                "gender": {
                    "count": 358,
                    "distribution": {
                        "female": [
                            149,
                            0.41620111731843573
                        ],
                        "male": [
                            209,
                            0.5837988826815642
                        ]
                    },
                    "values count": 2,
                    "values present": [
                        "female",
                        "male"
                    ]
                },
                "view": {
                    "count": 340,
                    "distribution": {
                        "__undefined__": [
                            4,
                            0.011764705882352941
                        ],
                        "front": [
                            54,
                            0.1588235294117647
                        ],
                        "left": [
                            14,
                            0.041176470588235294
                        ],
                        "rear": [
                            235,
                            0.6911764705882353
                        ],
                        "right": [
                            33,
                            0.09705882352941177
                        ]
                    },
                    "values count": 5,
                    "values present": [
                        "__undefined__",
                        "front",
                        "left",
                        "rear",
                        "right"
                    ]
                }
            },
            "count": 2038,
            "distribution": {
                "car": [
                    340,
                    0.16683022571148184
                ],
                "cyclist": [
                    194,
                    0.09519136408243375
                ],
                "head": [
                    354,
                    0.17369970559371933
                ],
                "ignore": [
                    100,
                    0.04906771344455348
                ],
                "left_hand": [
                    238,
                    0.11678115799803729
                ],
                "person": [
                    358,
                    0.17566241413150147
                ],
                "right_hand": [
                    77,
                    0.037782139352306184
                ],
                "road_arrows": [
                    326,
                    0.15996074582924436
                ],
                "traffic_sign": [
                    51,
                    0.025024533856722278
                ]
            }
        },
        "segments": {
            "area distribution": [
                {
                    "count": 1318,
                    "max": 11425.1,
                    "min": 0.0,
                    "percent": 0.9627465303140978
                },
                {
                    "count": 1,
                    "max": 22850.2,
                    "min": 11425.1,
                    "percent": 0.0007304601899196494
                },
                {
                    "count": 0,
                    "max": 34275.3,
                    "min": 22850.2,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 45700.4,
                    "min": 34275.3,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 57125.5,
                    "min": 45700.4,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 68550.6,
                    "min": 57125.5,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 79975.7,
                    "min": 68550.6,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 91400.8,
                    "min": 79975.7,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 102825.90000000001,
                    "min": 91400.8,
                    "percent": 0.0
                },
                {
                    "count": 50,
                    "max": 114251.0,
                    "min": 102825.90000000001,
                    "percent": 0.036523009495982466
                }
            ],
            "avg. area": 5411.624543462382,
            "pixel distribution": {
                "car": [
                    13655,
                    0.0018431496518735067
                ],
                "cyclist": [
                    939005,
                    0.12674674030446592
                ],
                "head": [
                    0,
                    0.0
                ],
                "ignore": [
                    5501200,
                    0.7425510702956085
                ],
                "left_hand": [
                    0,
                    0.0
                ],
                "person": [
                    954654,
                    0.12885903974805205
                ],
                "right_hand": [
                    0,
                    0.0
                ],
                "road_arrows": [
                    0,
                    0.0
                ],
                "traffic_sign": [
                    0,
                    0.0
                ]
            }
        }
    },
    "annotations by type": {
        "bbox": {
            "count": 548
        },
        "caption": {
            "count": 0
        },
        "label": {
            "count": 0
        },
        "mask": {
            "count": 0
        },
        "points": {
            "count": 669
        },
        "polygon": {
            "count": 821
        },
        "polyline": {
            "count": 0
        }
    },
    "annotations count": 2038,
    "unannotated images": [
        "img00051",
        "img00052",
        "img00053",
        "img00054",
        "img00055"
    ],
    "unannotated images count": 5,
    "dataset": {
        "images count": 100,
        "unique images count": 97,
        "repeated images count": 3,
        "repeated images": [
            [["img00057", "default"], ["img00058", "default"]],
            [["img00059", "default"], ["img00060", "default"]],
            [["img00061", "default"], ["img00062", "default"]]
        ]
    },
    "subsets": {
        "default": {
            "images count": 100,
            "image mean": [
                107.06903686941979,
                79.12831698580979,
                52.95829558185416
            ],
            "image std": [
                49.40237673503467,
                43.29600731496902,
                35.47373007603151
            ]
        }
    }
}
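
Each entry in the distribution fields of the sample output pairs an absolute count with a fraction. The sketch below shows one way such pairs could be derived from raw counts. The `label_distribution` helper is hypothetical, and the fraction is assumed to be the count divided by the total; the input counts are taken from the "gender" attribute in the sample.

```python
def label_distribution(counts):
    """Turn {name: count} into {name: [count, fraction of total]}, sorted by name."""
    total = sum(counts.values())
    return {
        name: [count, count / total if total else 0.0]
        for name, count in sorted(counts.items())
    }


if __name__ == "__main__":
    # Counts of the "gender" attribute values from the sample output.
    dist = label_distribution({"male": 209, "female": 149})
    for name, (count, fraction) in dist.items():
        print(name, count, round(fraction, 4))
```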

3.5.21 - Status

This command prints the summary of the source changes between the working tree of a project and its HEAD revision.

Prints lines in the following format: <status> <source name>

The list of possible status values:

  • modified - the source data exists and it is changed
  • foreign_modified - the source data exists and it is changed, but Datumaro does not know about the way the differences were made. If changes are committed, they will only be available for reproduction from the project cache.
  • added - the source was added in the working tree
  • removed - the source was removed from the working tree. This status won’t be reported if only the source data is removed in the working tree. In that case, the status will be missing.
  • missing - the source data is removed from the working directory. The source still can be restored from the project cache or reproduced.

Usage:

datum status [-h] [-p PROJECT_DIR]

Parameters:

  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.

Example output:

added source-1
modified source-2
foreign_modified source-3
removed source-4
missing source-5
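
When the status output is post-processed in a script, the lines can be grouped by status. The following sketch assumes the `<status> <source name>` line format described above; the `group_by_status` helper is hypothetical and only recognizes the status values from the list in this section.

```python
from collections import defaultdict

# Status values listed in this section.
KNOWN_STATUSES = {"modified", "foreign_modified", "added", "removed", "missing"}


def group_by_status(text):
    """Group source names by status from '<status> <source name>' lines."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip empty or malformed lines
        status, source = parts
        if status not in KNOWN_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        grouped[status].append(source)
    return dict(grouped)


if __name__ == "__main__":
    output = "added source-1\nmodified source-2\nmissing source-5\n"
    print(group_by_status(output))
```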

3.5.22 - Transform Dataset

Often datasets need to be modified during preparation for model training and experimenting. In trivial cases it can be done manually - e.g. image renaming or label renaming. However, in more complex cases even simple modifications can require too much effort, distracting the user from the real work. Datumaro provides the datum transform command to help in such cases.

This command allows modifying dataset images or annotations all at once.

This command is designed for batch dataset processing, so if you only need to modify a few elements of a dataset, you might want to use other approaches for better performance. A possible solution can be a simple script that uses the Datumaro API.

The command can be applied to a dataset or a project build target, a stage or the combined project target, in which case all the project targets will be affected. A build tree stage will be recorded if --stage is enabled, and the resulting dataset(-s) will be saved if --apply is enabled.

By default, datasets are updated in-place. The -o/--output-dir option can be used to specify another output directory. When updating in-place, use the --overwrite parameter (in-place updates fail by default to prevent data loss), unless a project target is modified.

The current project (-p/--project) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project’s working tree is used.

Usage:

datum transform [-h] -t TRANSFORM [-o DST_DIR] [--overwrite]
  [-p PROJECT_DIR] [--stage STAGE] [--apply APPLY] [target] [-- EXTRA_ARGS]

Parameters:

  • <target> (string) - Target dataset revpath. By default, transforms all targets of the current project.
  • -t, --transform (string) - Transform method name
  • --stage (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, so that the resulting dataset can be reproduced later. Applicable only to main project targets (i.e. data sources and the project target, but not intermediate stages). Enabled by default.
  • --apply (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
  • -o, --output-dir (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the project target, but not intermediate stages) and dataset targets. If not specified, the results will be saved in place.
  • --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • <extra args> - The list of extra transformation parameters. Should be passed after the -- separator after the main command arguments. See transform descriptions for info about extra parameters. Use the --help option to print parameter info.

Examples:

  • Split a VOC-like dataset randomly:
datum transform -t random_split --overwrite path/to/dataset:voc
  • Rename images in a project data source by a regex from frame_XXX to XXX:
datum create <...>
datum import <...> -n source-1
datum transform -t rename source-1 -- -e '|^frame_||'

Built-in transforms

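Basic dataset item manipulations:

  • rename - Renames dataset items by regular expression
  • id_from_image_name - Renames dataset items using their image file names
  • reindex - Replaces dataset item IDs with sequential indices
  • ndr - Removes near-duplicate images
  • relevancy_sampler - Picks the most relevant samples for training (requires model inference results)
  • random_sampler - Keeps no more than the required number of items
  • label_random_sampler - Keeps at least the required number of annotations of each class
  • resize - Resizes images and annotations in the dataset
  • remove_images - Removes specific dataset items by their ids
  • remove_annotations - Removes annotations on specific dataset items
  • remove_attributes - Removes item and annotation attributes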

Subset manipulations:

  • random_split - Splits dataset into subsets randomly
  • split - Splits dataset into subsets for classification, detection, segmentation or re-identification
  • map_subsets - Renames and removes subsets

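Annotation manipulations:

  • remap_labels - Renames, joins and removes labels
  • project_labels - Sets the dataset labels to the requested sequence
  • shapes_to_boxes - Converts spatial annotations to enclosing bounding boxes
  • boxes_to_masks - Converts bounding boxes to masks
  • polygons_to_masks - Converts polygons to masks
  • masks_to_polygons - Converts masks to polygons
  • anns_to_labels - Replaces annotations of all types with Label annotations
  • merge_instance_segments - Merges grouped spatial annotations into a single mask
  • crop_covered_segments - Removes occluded parts of segments
  • bbox_values_decrement - Subtracts one from bounding box coordinates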

rename

Renames items in the dataset. Supports regular expressions. The first character in the expression is a delimiter for the pattern and replacement parts. Replacement part can also contain str.format replacement fields with the item (of type DatasetItem) object available.

Usage:

rename [-h] [-e REGEX]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -e, --regex (string) - Regex for renaming in the form <sep><search><sep><replacement><sep>

Examples: Replace ‘pattern’ with ‘replacement’:

datum transform -t rename -- -e '|pattern|replacement|'

Remove the frame_ prefix from item ids:

datum transform -t rename -- -e '|^frame_||'

Collect images from subdirectories into the base image directory using regex:

datum transform -t rename -- -e '|^((.+[/\\])*)?(.+)$|\3|'

Add subset prefix to images:

datum transform -t rename -- -e '|^(.+)$|{item.subset}_\1|'
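The expression format can be illustrated with a small, self-contained Python sketch. This is not Datumaro's implementation: the apply_rename helper is hypothetical and only assumes that the expression is split on its first character, applied with re.sub, and then passed through str.format with the item object available. SimpleNamespace stands in for a DatasetItem.

```python
import re
from types import SimpleNamespace


def apply_rename(expr, item):
    """Apply a '<sep><search><sep><replacement><sep>' expression to item.id.

    Hypothetical helper, for illustration only.
    """
    sep = expr[0]
    parts = expr.split(sep)
    # A well-formed expression splits into ['', search, replacement, ''].
    if len(parts) != 4 or parts[0] or parts[3]:
        raise ValueError("expected <sep><search><sep><replacement><sep>")
    search, replacement = parts[1], parts[2]
    renamed = re.sub(search, replacement, item.id)
    # Replacement fields such as {item.subset} are filled in afterwards.
    return renamed.format(item=item)


item = SimpleNamespace(id="frame_001", subset="train")
print(apply_rename("|^frame_||", item))                   # 001
print(apply_rename(r"|^(.+)$|{item.subset}_\1|", item))   # train_frame_001
```

Anchoring the search pattern with ^ and $ keeps the replacement from being applied a second time to the empty match at the end of the string.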

id_from_image_name

Renames items in the dataset using image file name (without extension).

Usage:

id_from_image_name [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

reindex

Replaces dataset item IDs with sequential indices.

Usage:

reindex [-h] [-s START]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -s, --start (int) - Start value for item ids (default: 1)

ndr

Removes near-duplicate images in a subset, keeping at most -k/--num_cut resulting images.

Available oversampling policies (the -e parameter):

  • random - sample from removed data randomly
  • similarity - sample from removed data with ascending similarity score

Available undersampling policies (the -u parameter):

  • uniform - sample data with uniform distribution
  • inverse - sample data with the reciprocal of the number of items with the same similarity

Usage:

ndr [-h] [-w WORKING_SUBSET] [-d DUPLICATED_SUBSET] [-a {gradient}]
  [-k NUM_CUT] [-e {random,similarity}] [-u {uniform,inverse}] [-s SEED]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -w, --working_subset (str) - Name of the subset to operate on (default: None)
  • -d, --duplicated_subset (str) - Name of the subset for the removed data after NDR runs (default: duplicated)
  • -a, --algorithm (one of: gradient) - Name of the algorithm to use (default: gradient)
  • -k, --num_cut (int) - Maximum output dataset size
  • -e, --over_sample (one of: random, similarity) - The policy to use when num_cut is bigger than result length (default: random)
  • -u, --under_sample (one of: uniform, inverse) - The policy to use when num_cut is smaller than result length (default: uniform)
  • -s, --seed (int) - Random seed

Example: apply NDR, return no more than 100 images

datum transform -t ndr -- \
  --working_subset train \
  --algorithm gradient \
  --num_cut 100 \
  --over_sample random \
  --under_sample uniform

relevancy_sampler

Sampler that analyzes model inference results on the dataset and picks the most relevant samples for training.

Creates a dataset from the -k/--count hardest items for a model. The whole dataset or a single subset will be split into the sampled and unsampled subsets based on the model confidence. The dataset must contain model confidence values in the scores attributes of annotations.

There are five sampling methods (the -m/--sampling_method option):

  • topk - Return the k items with the highest uncertainty
  • lowk - Return the k items with the lowest uncertainty
  • randk - Return k random items
  • mixk - Return half of the items using the topk method, and the other half using lowk
  • randtopk - Select 3*k items randomly, and return the topk among them

Notes:

  • Each image’s inference result must contain the probability for all classes (in the scores attribute).
  • Requesting a sample larger than the number of all images will return all images.

Usage:

relevancy_sampler [-h] -k COUNT [-a {entropy}] [-i INPUT_SUBSET]
  [-o SAMPLED_SUBSET] [-u UNSAMPLED_SUBSET]
  [-m {topk,lowk,randk,mixk,randtopk}] [-d OUTPUT_FILE]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -k, --count (int) - Number of items to sample
  • -a, --algorithm (one of: entropy) - Sampling algorithm (default: entropy)
  • -i, --input_subset (str) - Subset name to select sample from (default: None)
  • -o, --sampled_subset (str) - Subset name to put sampled data to (default: sample)
  • -u, --unsampled_subset (str) - Subset name to put the rest data to (default: unsampled)
  • -m, --sampling_method (one of: topk, lowk, randk, mixk, randtopk) - Sampling method (default: topk)
  • -d, --output_file (path) - A .csv file path to dump sampling results

Example: select the 20 most relevant images from the train subset based on model certainty, put the result into the sample subset and all the rest into the unsampled subset. The dataset must contain model confidence values in the scores attributes of annotations.

datum transform -t relevancy_sampler -- \
  --algorithm entropy \
  --input_subset train \
  --sampled_subset sample \
  --unsampled_subset unsampled \
  --sampling_method topk -k 20
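The sampling methods can be sketched in plain Python as follows. This is only an illustration of entropy-based ranking, not the sampler's actual code: the pick and entropy helpers, the dictionary layout and the sample scores are assumptions made for the example.

```python
import math
import random


def entropy(scores):
    """Shannon entropy of a list of class probabilities."""
    return -sum(p * math.log(p) for p in scores if p > 0)


def pick(items, k, method="topk", seed=None):
    """Return k item ids from {item_id: scores} (illustrative only)."""
    # Most uncertain (highest entropy) items come first.
    ranked = sorted(items, key=lambda i: entropy(items[i]), reverse=True)
    if k >= len(ranked):
        return ranked  # requesting more than available returns everything
    if method == "topk":
        return ranked[:k]
    if method == "lowk":
        return ranked[::-1][:k]
    if method == "randk":
        return random.Random(seed).sample(ranked, k)
    if method == "mixk":
        top = k // 2
        return ranked[:top] + ranked[::-1][: k - top]
    if method == "randtopk":
        pool = random.Random(seed).sample(ranked, min(3 * k, len(ranked)))
        return sorted(pool, key=lambda i: entropy(items[i]), reverse=True)[:k]
    raise ValueError(f"unknown method: {method}")


# Made-up per-image class probabilities.
scores = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [1.0, 0.0], "d": [0.7, 0.3]}
print(pick(scores, 2, "topk"))  # ['a', 'd']
print(pick(scores, 1, "lowk"))  # ['c']
```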

random_sampler

Sampler that keeps no more than required number of items in the dataset.

Notes:

  • Items are selected uniformly (tries to keep original item distribution by subsets)
  • Requesting a sample larger than the number of all images will return all images

Usage:

random_sampler [-h] -k COUNT [-s SUBSET] [--seed SEED]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -k, --count (int) - Maximum number of items to sample
  • -s, --subset (str) - Limit changes to this subset (default: affect all dataset)
  • --seed (int) - Initial value for random number generator

Examples: Select subset of 20 images randomly

datum transform -t random_sampler -- -k 20

Select subset of 20 images, modify only train subset

datum transform -t random_sampler -- -k 20 -s train

label_random_sampler

Sampler that keeps at least the required number of annotations of each class in the dataset for each subset separately.

Consider using the “stats” command to get class distribution in the dataset.

Notes:

  • Items can contain annotations of several selected classes (e.g. 3 bounding boxes per image). The number of annotations in the resulting dataset varies between max(class counts) and sum(class counts)
  • If the input dataset does not have enough class annotations, the result will contain only what is available
  • Items are selected uniformly
  • For reasons above, the resulting class distribution in the dataset may not be the same as requested
  • The resulting dataset will only keep annotations for classes with specified count > 0

Usage:

label_random_sampler [-h] -k COUNT [-l LABEL_COUNTS] [--seed SEED]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -k, --count (int) - Minimum number of annotations of each class
  • -l, --label (str; repeatable) - Minimum number of annotations of a specific class. Overrides the -k/--count setting for the class. The format is <label_name>:<count>
  • --seed (int) - Initial value for random number generator

Examples: Select a dataset with at least 10 images of each class:

datum transform -t label_random_sampler -- -k 10

Select a dataset with at least 20 cat images, 5 dog images, no car annotations and 10 images of each unmentioned class:

# keep 20 images with cats and 5 images with dogs, remove car annotations,
# keep 10 images for each of the remaining classes
datum transform -t label_random_sampler -- \
  -l cat:20 \
  -l dog:5 \
  -l car:0 \
  -k 10

resize

Resizes images and annotations in the dataset to the specified size. Supports upscaling, downscaling and mixed variants.

Usage:

resize [-h] [-dw WIDTH] [-dh HEIGHT]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -dw, --width (int) - Destination image width
  • -dh, --height (int) - Destination image height

Examples: Resize all images to 256x256 size

datum transform -t resize -- -dw 256 -dh 256
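When an image is resized, its annotations have to be scaled by the same factors. A minimal sketch for bounding boxes, assuming an (x, y, width, height) box layout and a hypothetical resize_bbox helper (the transform's own code may handle this differently):

```python
def resize_bbox(bbox, src_size, dst_size):
    """Scale an (x, y, w, h) box from src_size to dst_size, both (width, height)."""
    x, y, w, h = bbox
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    return (x * sx, y * sy, w * sx, h * sy)


# A 512x128 image resized to 256x256: downscaled horizontally, upscaled vertically.
print(resize_bbox((64, 32, 128, 48), (512, 128), (256, 256)))
# (32.0, 64.0, 64.0, 96.0)
```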

remove_images

Removes specific dataset items by their ids.

Usage:

remove_images [-h] [--id IDs]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • --id (str) - Item id to remove. Id is an ‘<id>:<subset>’ pair (repeatable)

Examples:

Remove specific images from the dataset

datum transform -t remove_images -- --id 'image1:train' --id 'image2:test'

remove_annotations

Removes annotations on specific dataset items.

Can be useful to clean the dataset from broken or unnecessary annotations.

Usage:

remove_annotations [-h] [--id IDs]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • --id (str) - Item id to clean from annotations. Id is an ‘<id>:<subset>’ pair. If not specified, removes all annotations (repeatable)

Examples: Remove annotations from specific items in the dataset

datum transform -t remove_annotations -- --id 'image1:train' --id 'image2:test'

remove_attributes

Removes item and annotation attributes in a dataset.

Can be useful to clean the dataset from broken or unnecessary attributes.

Usage:

remove_attributes [-h] [--id IDs] [--attr ATTRIBUTE_NAME]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • --id (str) - Item id to clean from attributes. Id is an ‘<id>:<subset>’ pair. If not specified, affects all items and annotations (repeatable)
  • -a, --attr (str) - Attribute name to be removed. If not specified, removes all attributes (repeatable)

Examples: Remove the is_crowd attribute from dataset

datum transform -t remove_attributes -- \
  --attr 'is_crowd'

Remove the occluded attribute from annotations of the 2010_001705 item in the train subset

datum transform -t remove_attributes -- \
  --id '2010_001705:train' --attr 'occluded'

random_split

Joins all subsets into one and splits the result into a few parts. It is expected that item ids are unique and subset ratios sum up to 1.

Usage:

random_split [-h] [-s SPLITS] [--seed SEED]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -s, --subset (str; repeatable) - Subsets in the form: ‘<subset>:<ratio>’ (default: {train: 0.67, test: 0.33})
  • --seed (int) - Random seed

Example: Split a dataset randomly to train and test subsets, ratio is 2:1

datum transform -t random_split -- --subset train:.67 --subset test:.33
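The idea of a ratio-based random split can be sketched in a few lines of Python. The random_split function below is a hypothetical stand-in that works on a list of item ids; it shows why the ratios are expected to sum up to 1, and it is not the transform's actual implementation.

```python
import random


def random_split(ids, splits, seed=None):
    """Shuffle ids and cut them into named parts by ratio (illustrative only)."""
    if abs(sum(ratio for _, ratio in splits) - 1.0) > 1e-6:
        raise ValueError("subset ratios must sum up to 1")
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    result, start, acc = {}, 0, 0.0
    for i, (name, ratio) in enumerate(splits):
        acc += ratio
        # The last subset takes everything that is left, to avoid rounding gaps.
        end = len(ids) if i == len(splits) - 1 else round(acc * len(ids))
        result[name] = ids[start:end]
        start = end
    return result


parts = random_split([str(i) for i in range(9)],
                     [("train", 0.67), ("test", 0.33)], seed=42)
print(len(parts["train"]), len(parts["test"]))  # 6 3
```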

split

Splits a dataset for model training, using task information:

  • classification splits - Splits the dataset into subsets (train/val/test) in a class-wise manner. Dataset images are split in the specified ratio, keeping the initial class distribution.

  • detection & segmentation splits - Each image can have multiple object annotations (bbox, mask, polygon). Since an image shouldn’t be included in multiple subsets at the same time, and image annotations shouldn’t be split, dataset annotations are, in general, unlikely to be split exactly in the specified ratio. This split tries to divide dataset images as close as possible to the specified ratio, keeping the initial class distribution.

  • reidentification splits - In this task, the test set should consist of images of people or objects that were not seen during the training phase. The dataset is split in the following way:

  1. Splits the dataset into train + val and test sets based on person or object ID.
  2. Splits the test set into test-gallery and test-query sets in a class-wise manner.
  3. Splits the train + val set into train and val sets in the same way.

The final subsets are train, val, test-gallery and test-query.

Notes:

  • Each image is expected to have only one Annotation. Unlabeled or multi-labeled images will be split into subsets randomly.
  • If Labels also have attributes, also splits by attribute values.
  • If there are not enough images in some class or attribute group, the split ratio can’t be guaranteed.

In the reidentification task:

  • Object ID can be described by Label, or by attribute (--attr parameter)
  • The splits of the test set are controlled by the --query parameter. The gallery ratio is 1.0 - query.

Usage:

split [-h] [-t {classification,detection,segmentation,reid}]
  [-s SPLITS] [--query QUERY] [--attr ATTR_FOR_ID] [--seed SEED]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -t, --task (one of: classification, detection, segmentation, reid) - Dataset task (default: classification)
  • -s, --subset (str; repeatable) - Subsets in the form: ‘<subset>:<ratio>’ (default: {train: 0.5, val: 0.2, test: 0.3})
  • --query (float) - Query ratio in the test set (default: 0.5)
  • --attr (str) - Attribute name representing the ID (default: use label)
  • --seed (int) - Random seed

Example:

datum transform -t split -- -t classification \
  --subset train:.5 --subset val:.2 --subset test:.3

datum transform -t split -- -t detection \
  --subset train:.5 --subset val:.2 --subset test:.3

datum transform -t split -- -t segmentation \
  --subset train:.5 --subset val:.2 --subset test:.3

datum transform -t split -- -t reid \
  --subset train:.5 --subset val:.2 --subset test:.3 --query .5

Example: use the person_id attribute as the object ID in a reidentification split

datum transform -t split -- -t reid --attr person_id
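The class-wise idea behind the classification split can be sketched as follows: items are grouped by label and each group is divided in the requested ratio, so every subset keeps the original class distribution. The stratified_split helper and the {item_id: label} input are assumptions made for this example; the real transform also handles attributes, unlabeled items and the other task types.

```python
import random
from collections import defaultdict


def stratified_split(labels, splits, seed=None):
    """Split {item_id: label} into named subsets, class by class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in labels.items():
        by_label[label].append(item_id)

    result = {name: [] for name, _ in splits}
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        start, acc = 0, 0.0
        for i, (name, ratio) in enumerate(splits):
            acc += ratio
            # The last subset takes the remainder of the class.
            end = len(ids) if i == len(splits) - 1 else round(acc * len(ids))
            result[name].extend(ids[start:end])
            start = end
    return result


labels = {f"cat_{i}": "cat" for i in range(10)}
labels.update({f"dog_{i}": "dog" for i in range(10)})
subsets = stratified_split(
    labels, [("train", 0.5), ("val", 0.2), ("test", 0.3)], seed=0)
print({name: len(ids) for name, ids in subsets.items()})
# {'train': 10, 'val': 4, 'test': 6}
```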

map_subsets

Renames subsets in the dataset.

Usage:

map_subsets [-h] [-s MAPPING]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -s, --subset (str; repeatable) - Subset mapping of the form: src:dst

remap_labels

Changes labels in the dataset.

A label can be:

  • renamed (and joined with existing) - when --label <old_name>:<new_name> is specified
  • deleted - when --label <name>: is specified, or default action is delete and the label is not mentioned in the list. When a label is deleted, all the associated annotations are removed
  • kept unchanged - when --label <name>:<name> is specified, or default action is keep and the label is not mentioned in the list

Annotations with no label are managed by the default action policy.

Usage:

remap_labels [-h] [-l MAPPING] [--default {keep,delete}]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -l, --label (str; repeatable) - Label in the form of: <src>:<dst>
  • --default (one of: keep, delete) - Action for unspecified labels (default: keep)

Examples: Remove the person label (and corresponding annotations):

datum transform -t remap_labels -- -l person: --default keep

Rename person to pedestrian and human to pedestrian, join annotations that had different classes under the same class id for pedestrian, don’t touch other classes:

datum transform -t remap_labels -- \
  -l person:pedestrian -l human:pedestrian --default keep

Rename person to car and cat to dog, keep bus, remove others:

datum transform -t remap_labels -- \
  -l person:car -l bus:bus -l cat:dog --default delete
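The mapping rules can be summarized with a short Python sketch. It works on a plain list of label names rather than on real annotations, and the parse_mapping and remap_labels helpers are hypothetical names used only for this illustration.

```python
def parse_mapping(args):
    """Turn ['person:pedestrian', 'car:'] into {'person': 'pedestrian', 'car': ''}."""
    mapping = {}
    for arg in args:
        src, _, dst = arg.partition(":")
        mapping[src] = dst
    return mapping


def remap_labels(labels, mapping, default="keep"):
    """Apply rename/delete/keep rules to a list of label names."""
    result = []
    for label in labels:
        if label in mapping:
            if mapping[label]:            # '<src>:<dst>' renames (or keeps) the label
                result.append(mapping[label])
            # '<src>:' with an empty destination deletes the label
        elif default == "keep":
            result.append(label)
        # default == 'delete': unmentioned labels are dropped
    return result


labels = ["person", "human", "cat", "bus", "car"]
print(remap_labels(labels, parse_mapping(["person:"]), "keep"))
# ['human', 'cat', 'bus', 'car']
print(remap_labels(labels, parse_mapping(["person:car", "bus:bus", "cat:dog"]), "delete"))
# ['car', 'dog', 'bus']
```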

project_labels

Changes the order of labels in the dataset from the existing to the desired one, removes unknown labels and adds new labels. Updates or removes the corresponding annotations.

Labels are matched by name (case-sensitive). Parent labels are only kept if they are present in the resulting set of labels. If new labels are added, and the dataset has mask colors defined, new labels will obtain generated colors.

Useful for merging similar datasets, whose labels need to be aligned.

Usage:

project_labels [-h] [-l DST_LABELS]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • -l, --label (str; repeatable) - Label name (ordered)

Examples: Set dataset labels to [person, cat, dog], remove others, add missing. Original labels (for example): cat, dog, elephant, human. New labels: person (added), cat (kept), dog (kept).

datum transform -t project_labels -- -l person -l cat -l dog
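The effect on annotations can be sketched with label indices. The example below assumes annotations refer to labels by their position in the label list; the project_labels function is a hypothetical illustration, not the transform itself.

```python
def project_labels(src_labels, dst_labels, annotations):
    """Re-index annotation label ids from src_labels to dst_labels.

    Annotations whose label is absent from dst_labels are dropped.
    """
    index_map = {
        src_idx: dst_labels.index(name)
        for src_idx, name in enumerate(src_labels)
        if name in dst_labels
    }
    return [index_map[a] for a in annotations if a in index_map]


src = ["cat", "dog", "elephant", "human"]
dst = ["person", "cat", "dog"]
# One annotation per source label: cat, dog, elephant, human.
print(project_labels(src, dst, [0, 1, 2, 3]))  # [1, 2]
```

In this run, cat and dog keep their annotations under new indices, while elephant and human annotations are removed because those labels are not in the target list.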

shapes_to_boxes

Converts spatial annotations (masks, polygons, polylines, points) to enclosing bounding boxes.

Usage:

shapes_to_boxes [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

Example: convert spatial annotations between different types

datum transform -t boxes_to_masks
datum transform -t masks_to_polygons
datum transform -t polygons_to_masks
datum transform -t shapes_to_boxes
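For a polygon, polyline or set of points, the enclosing box is simply the extent of its coordinates. A minimal sketch, assuming points are stored as a flat [x0, y0, x1, y1, ...] list and the box as (x, y, width, height); the enclosing_box name is hypothetical:

```python
def enclosing_box(points):
    """Return the (x, y, w, h) box enclosing a flat [x0, y0, x1, y1, ...] list."""
    xs, ys = points[0::2], points[1::2]
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0, max(ys) - y0)


# A triangle with vertices (1, 2), (5, 2) and (3, 8).
print(enclosing_box([1, 2, 5, 2, 3, 8]))  # (1, 2, 4, 6)
```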

boxes_to_masks

Converts bounding boxes to masks.

Usage:

boxes_to_masks [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

polygons_to_masks

Converts polygons to masks.

Usage:

polygons_to_masks [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

masks_to_polygons

Converts masks to polygons.

Usage:

masks_to_polygons [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

anns_to_labels

Collects all labels from annotations (of all types) and transforms them into a set of annotations of type Label.

Usage:

anns_to_labels [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

merge_instance_segments

Replaces instance masks and, optionally, polygons with a single mask. A group of annotations with the same group id is considered an “instance”. The largest annotation in the group is considered the group “head”, so the resulting mask takes properties from that annotation.

Usage:

merge_instance_segments [-h] [--include-polygons]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit
  • --include-polygons (flag) - Include polygons

crop_covered_segments

Sorts polygons and masks (“segments”) according to z_order, crops covered areas of underlying segments. If a segment is split into several independent parts by the segments above, produces the corresponding number of separate annotations joined into a group.

Usage:

crop_covered_segments [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

bbox_values_decrement

Subtracts one from the coordinates of bounding boxes.

Usage:

bbox_values_decrement [-h]

Optional arguments:

  • -h, --help (flag) - Show this help message and exit

3.5.23 - Utilities

Split video into frames

Splits a video into separate frames and saves them in a directory. After the splitting, the images can be added into a project using the import command and the image_dir format.

This command is useful for making a dataset from a video file. Direct video reading during model training can produce different results if the system environment changes. Splitting the video into frames and using them instead makes the dataset reproducible and stable.

This command provides different options like setting the frame step (the -s/--step option), file name pattern (-n/--name-pattern), starting (-b/--start-frame) and finishing (-e/--end-frame) frame etc.

Note that this command is equivalent to the following commands:

datum create -o proj
datum import -p proj -f video_frames video.mp4 -- <params>
datum export -p proj -f image_dir -- <params>

Usage:

datum util split_video [-h] -i SRC_PATH [-o DST_DIR] [--overwrite]
  [-n NAME_PATTERN] [-s STEP] [-b START_FRAME] [-e END_FRAME] [-x IMAGE_EXT]

Parameters:

  • -i, --input-path (string) - Path to the video file
  • -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used
  • --overwrite - Allows overwriting existing files in the output directory, when it is not empty
  • -n, --name-pattern (string) - Name pattern for the produced images (default: %06d)
  • -s, --step (integer) - Frame step (default: 1)
  • -b, --start-frame (integer) - Starting frame (default: 0)
  • -e, --end-frame (integer) - Finishing frame (default: none)
  • -x, --image-ext (string) - Output image extension (default: .jpg)
  • -h, --help - Print the help message and exit

Example: split a video into frames, use every 30th frame:

datum util split_video -i video.mp4 -o video.mp4-frames --step 30

Example: split a video into frames, save as ‘frame_xxxxxx.png’ files:

datum util split_video -i video.mp4 --image-ext=.png --name-pattern='frame_%06d'

Example: split a video, add frames and annotations into dataset, export as YOLO:

datum util split_video -i video.avi -o video-frames
datum create -o proj
datum import -p proj -f image_dir video-frames
datum import -p proj -f coco_instances annotations.json
datum export -p proj -f yolo -- --save-images
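The interplay of the step, start frame, end frame, name pattern and extension options can be previewed with a short Python sketch. It rests on assumptions that are not confirmed by this page: frames are numbered from 0, the end frame is treated as exclusive, and the frame index is substituted into the printf-style name pattern. The frame_names helper is hypothetical.

```python
def frame_names(total_frames, step=1, start=0, end=None,
                pattern="%06d", ext=".jpg"):
    """List the file names a frame split could produce (illustrative only)."""
    # Assumption: 'end' is exclusive and never exceeds the video length.
    stop = total_frames if end is None else min(end, total_frames)
    return [pattern % index + ext for index in range(start, stop, step)]


print(frame_names(100, step=30))
# ['000000.jpg', '000030.jpg', '000060.jpg', '000090.jpg']
print(frame_names(10, step=5, pattern="frame_%06d", ext=".png"))
# ['frame_000000.png', 'frame_000005.png']
```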

3.5.24 - Validate Dataset

This command inspects annotations with respect to the task type and stores the results in a JSON file.

The task types supported are classification, detection, and segmentation (the -t/--task-type parameter).

The validation result contains

  • annotation statistics based on the task type
  • validation reports, such as
    • items not having annotations
    • items having undefined annotations
    • imbalanced distribution in class/attributes
    • too small or large values
  • summary

Usage:

datum validate [-h] -t TASK [-s SUBSET_NAME] [-p PROJECT_DIR]
  [target] [-- EXTRA_ARGS]

Parameters:

  • <target> (string) - Target dataset revpath. By default, validates the current project.
  • -t, --task-type (string) - Task type for validation
  • -s, --subset (string) - Dataset subset to be validated
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • <extra args> - The list of extra validation parameters. Should be passed after the -- separator after the main command arguments:
    • -fs, --few-samples-thr (number) - The threshold for giving a warning for minimum number of samples per class
    • -ir, --imbalance-ratio-thr (number) - The threshold for giving imbalance data warning
    • -m, --far-from-mean-thr (number) - The threshold for giving a warning that data is far from mean
    • -dr, --dominance-ratio-thr (number) - The threshold for giving a bounding box imbalance warning
    • -k, --topk-bins (number) - The ratio of bins with the highest number of data to total bins in the histogram

Example: give a warning when the imbalance ratio of the data in a classification task is over 40

datum validate -p prj/ -t classification -- -ir 40

Here is the list of validation items (a.k.a. anomaly types):

| Anomaly Type | Description | Task Type |
|---|---|---|
| MissingLabelCategories | Metadata (e.g. LabelCategories) should be defined | common |
| MissingAnnotation | No annotation found for an item | common |
| MissingAttribute | An attribute key is missing for an item | common |
| MultiLabelAnnotations | Item needs a single label | classification |
| UndefinedLabel | A label not defined in the metadata is found for an item | common |
| UndefinedAttribute | An attribute not defined in the metadata is found for an item | common |
| LabelDefinedButNotFound | A label is defined, but not actually found | common |
| AttributeDefinedButNotFound | An attribute is defined, but not actually found | common |
| OnlyOneLabel | The dataset consists of only one label | common |
| OnlyOneAttributeValue | The dataset consists of only one attribute value | common |
| FewSamplesInLabel | The number of samples in a label might be too low | common |
| FewSamplesInAttribute | The number of samples in an attribute might be too low | common |
| ImbalancedLabels | There is an imbalance in the label distribution | common |
| ImbalancedAttribute | There is an imbalance in the attribute distribution | common |
| ImbalancedDistInLabel | Values (e.g. bbox width) are not evenly distributed for a label | detection, segmentation |
| ImbalancedDistInAttribute | Values (e.g. bbox width) are not evenly distributed for an attribute | detection, segmentation |
| NegativeLength | The width or height of a bounding box is negative | detection |
| InvalidValue | There is an invalid (e.g. inf, nan) value in the bounding box info | detection |
| FarFromLabelMean | An annotation has a value much smaller or larger than the average for a label | detection, segmentation |
| FarFromAttrMean | An annotation has a value much smaller or larger than the average for an attribute | detection, segmentation |

Validation Result Format:

{
    'statistics': {
        ## common statistics
        'label_distribution': {
            'defined_labels': <dict>,   # <label:str>: <count:int>
            'undefined_labels': <dict>
            # <label:str>: {
            #     'count': <int>,
            #     'items_with_undefined_label': [<item_key>, ]
            # }
        },
        'attribute_distribution': {
            'defined_attributes': <dict>,
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_missing_attribute': [<item_key>, ]
            #     }
            # }
            'undefined_attributes': <dict>
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_with_undefined_attr': [<item_key>, ]
            #     }
            # }
        },
        'total_ann_count': <int>,
        'items_missing_annotation': <list>, # [<item_key>, ]

        ## statistics for classification task
        'items_with_multiple_labels': <list>, # [<item_key>, ]

        ## statistics for detection task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'x', 'y', 'width', 'height',
        #               'area(wxh)', 'ratio(w/h)', 'short', 'long'
        # - 'short' is min(w,h) and 'long' is max(w,h).
        'items_with_negative_length': <dict>,
        # '<item_key>': { <ann_id:int>: { <'width'|'height'>: <value>, }, }
        'bbox_distribution_in_label': <dict>, # <label:str>: <bbox_template>
        'bbox_distribution_in_attribute': <dict>,
        # <label:str>: {<attribute:str>: { <attr_value>: <bbox_template>, }, }
        'bbox_distribution_in_dataset_item': <dict>,
        # '<item_key>': <bbox count:int>

        ## statistics for segmentation task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'area', 'width', 'height'
        'mask_distribution_in_label': <dict>, # <label:str>: <mask_template>
        'mask_distribution_in_attribute': <dict>,
        # <label:str>: {
        #     <attribute:str>: { <attr_value>: <mask_template>, }
        # }
        'mask_distribution_in_dataset_item': <dict>,
        # '<item_key>': <mask/polygon count: int>
    },
    'validation_reports': <list>, # [ <validation_error_format>, ]
    # validation_error_format = {
    #     'anomaly_type': <str>,
    #     'description': <str>,
    #     'severity': <str>, # 'warning' or 'error'
    #     'item_id': <str>,  # optional, when it is related to a DatasetItem
    #     'subset': <str>,   # optional, when it is related to a DatasetItem
    # }
    'summary': {
        'errors': <count: int>,
        'warnings': <count: int>
    }
}

item_key is defined as,

item_key = (<DatasetItem.id:str>, <DatasetItem.subset:str>)

bbox_template and mask_template are defined as,

bbox_template = {
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>,
    'area(wxh)': <numerical_stat_template>,
    'ratio(w/h)': <numerical_stat_template>,
    'short': <numerical_stat_template>, # short = min(w, h)
    'long': <numerical_stat_template>   # long = max(w, h)
}
mask_template = {
    'area': <numerical_stat_template>,
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>
}

numerical_stat_template is defined as,

numerical_stat_template = {
    'items_far_from_mean': <dict>,
    # {'<item_key>': {<ann_id:int>: <value:float>, }, }
    'mean': <float>,
    'stddev': <float>,
    'min': <float>,
    'max': <float>,
    'median': <float>,
    'histogram': {
        'bins': <list>,   # [<float>, ]
        'counts': <list>, # [<int>, ]
    }
}
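A validation result in this format can be post-processed with a few lines of Python. The sketch below uses a small made-up report that follows the structure described above; in practice the dictionary would be loaded from the JSON file written by the command, for example with json.load. The summarize helper is hypothetical.

```python
from collections import Counter

# Made-up sample that follows the documented result format.
report = {
    "validation_reports": [
        {"anomaly_type": "MissingAnnotation", "description": "sample text",
         "severity": "warning", "item_id": "img1", "subset": "train"},
        {"anomaly_type": "UndefinedLabel", "description": "sample text",
         "severity": "error", "item_id": "img2", "subset": "train"},
        {"anomaly_type": "ImbalancedLabels", "description": "sample text",
         "severity": "warning"},  # dataset-level report: no item_id / subset
    ],
    "summary": {"errors": 1, "warnings": 2},
}


def summarize(report):
    """Count reports by severity and type, and list the affected items."""
    entries = report["validation_reports"]
    by_severity = Counter(entry["severity"] for entry in entries)
    by_type = Counter(entry["anomaly_type"] for entry in entries)
    # item_id and subset are optional, so only item-level reports are listed.
    items = sorted({(entry["item_id"], entry["subset"])
                    for entry in entries if "item_id" in entry})
    return by_severity, by_type, items


by_severity, by_type, items = summarize(report)
print(dict(by_severity))  # {'warning': 2, 'error': 1}
print(items)              # [('img1', 'train'), ('img2', 'train')]
```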

3.6 - Extending

There are a few ways to extend and customize Datumaro behavior, all of which are supported by plugins. Check our contribution guide for details on plugin implementation. In general, a plugin is a Python module. It must be put into a plugin directory:

  • <project_dir>/.datumaro/plugins for project-specific plugins
  • <datumaro_dir>/plugins for global plugins

Built-in plugins

Datumaro provides several builtin plugins. Plugins can have dependencies, which need to be installed separately.

TensorFlow

The plugin provides support of TensorFlow Detection API format, which includes boxes and masks.

Dependencies

The plugin depends on TensorFlow, which can be installed with pip:

pip install tensorflow

or

pip install tensorflow-gpu

or

pip install datumaro[tf]

or

pip install datumaro[tf-gpu]

Accuracy Checker

This plugin uses Accuracy Checker’s API to launch deep learning models from various frameworks (Caffe, MxNet, PyTorch, OpenVINO, …).

Dependencies

The plugin depends on Accuracy Checker, which can be installed with pip:

pip install 'git+https://github.com/openvinotoolkit/open_model_zoo.git#subdirectory=tools/accuracy_checker'

To execute models with deep learning frameworks, they need to be installed too.

OpenVINO™

This plugin provides support for model inference with OpenVINO™.

Dependencies

The plugin depends on the OpenVINO™ Toolkit, which can be installed by following these instructions.

Dataset Formats

Dataset reading is supported by Extractors and Importers. An Extractor produces a list of dataset items corresponding to the dataset. An Importer creates a project from the data source location. It is possible to add custom Extractors and Importers. To do this, put the Extractor and Importer implementation scripts into a plugin directory.

Dataset writing is supported by Converters. A Converter produces a dataset of a specific format from dataset items. It is possible to add custom Converters. To do this, put a Converter implementation script into a plugin directory.

Dataset Conversions (“Transforms”)

A Transform is a function for altering a dataset and producing a new one. It can update dataset items, annotations, classes, and other properties. A list of available transforms for dataset conversions can be extended by adding a Transform implementation script into a plugin directory.

Model launchers

A list of available launchers for model execution can be extended by adding a Launcher implementation script into a plugin directory.

3.8 - How to control telemetry data collection

The OpenVINO™ telemetry library is used to collect basic information about Datumaro usage.

A short description of the information collected:

Event                 | Description
version               | Datumaro version
session start/end     | Accessory event, there is no additional info here
{cli_command}_result  | Datumaro command result with arguments passed*
error                 | Stack trace in case of exception*

* All sensitive arguments, such as filesystem paths or names, are sanitized

To enable the collection of telemetry data, the ISIP consent file must exist and contain 1; otherwise, telemetry is disabled. The ISIP file can be created or modified by an OpenVINO installer or manually, and it is also used by other OpenVINO™ tools.

The location of the ISIP consent file depends on the OS:

  • Windows: %localappdata%\Intel Corporation\isip,
  • Linux, MacOS: $HOME/intel/isip.
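
The consent check described above can be sketched in Python. This is only an illustration and not the code of the OpenVINO™ telemetry library: the function names isip_consent_path and telemetry_enabled are hypothetical, and the sketch assumes that surrounding whitespace in the file is ignored.

```python
import os
import sys
from pathlib import Path


def isip_consent_path() -> Path:
    """Return the expected ISIP consent file location for the current OS."""
    if sys.platform.startswith("win"):
        # Windows: %localappdata%\Intel Corporation\isip
        base = Path(os.environ.get("LOCALAPPDATA", ""))
        return base / "Intel Corporation" / "isip"
    # Linux, MacOS: $HOME/intel/isip
    return Path.home() / "intel" / "isip"


def telemetry_enabled(consent_file: Path) -> bool:
    """Telemetry is enabled only if the file exists and contains 1."""
    try:
        return consent_file.read_text().strip() == "1"
    except (OSError, ValueError):
        # A missing or unreadable file means telemetry stays disabled.
        return False


path = isip_consent_path()
print(path, "->", "enabled" if telemetry_enabled(path) else "disabled")
```

Writing 0 to the file, or removing it, makes the check report that telemetry is disabled.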

4 - Formats

List of dataset formats supported by Datumaro

4.1 - NYU Depth Dataset V2

Format specification

The original NYU Depth Dataset V2 is available here.

Supported annotation types:

  • DepthAnnotation

Import NYU Depth Dataset V2

The NYU Depth Dataset V2 is available for free download.

A Datumaro project with an NYU Depth Dataset V2 source can be created in the following way:

datum create
datum import --format nyu_depth_v2 <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'nyu_depth_v2')

NYU Depth Dataset V2 directory should have the following structure:

Dataset/
    ├── 1.h5
    ├── 2.h5
    ├── 3.h5
    └── ...

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project information.

Examples

Examples of using this format from the code can be found in the format tests

4.2 - ADE20k (v2017)

Format specification

The original ADE20K 2017 dataset is available here.

The consistency set (for checking the annotation consistency) is available here.

Supported annotation types:

  • Masks

Supported annotation attributes:

  • occluded (boolean): whether the object is occluded by another object
  • other arbitrary boolean attributes, which can be specified in the annotation file <image_name>_atr.txt

Import ADE20K 2017 dataset

A Datumaro project with an ADE20k source can be created in the following way:

datum create
datum import --format ade20k2017 <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

ade20k_dataset = dm.Dataset.import_from('<path/to/dataset>', 'ade20k2017')

ADE20K dataset directory should have the following structure:

dataset/
├── dataset_meta.json # a list of non-format labels (optional)
├── subset1/
│   └── super_label_1/
│       ├── img1.jpg
│       ├── img1_atr.txt
│       ├── img1_parts_1.png
│       ├── img1_seg.png
│       ├── img2.jpg
│       ├── img2_atr.txt
│       └── ...
└── subset2/
    ├── img3.jpg
    ├── img3_atr.txt
    ├── img3_parts_1.png
    ├── img3_parts_2.png
    ├── img4.jpg
    ├── img4_atr.txt
    ├── img4_seg.png
    └── ...

The mask images <image_name>_seg.png contain information about the object class segmentation masks and also separate each class into instances. The channels R and G encode the object class masks. The channel B encodes the object instance masks.
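
To illustrate this channel layout, the sketch below decodes single pixels of a _seg.png mask. The class index formula (R // 10) * 256 + G is an assumption based on the loading code of the ADE20K toolkit, not something stated in this document, so check it against the toolkit before relying on it. The function name is hypothetical.

```python
from typing import Tuple


def decode_ade20k_pixel(r: int, g: int, b: int) -> Tuple[int, int]:
    """Split one RGB pixel of a _seg.png mask into (class index, instance value).

    Assumption: the class index is (R // 10) * 256 + G, as in the ADE20K toolkit.
    Distinct B values distinguish instances of the same class.
    """
    class_index = (r // 10) * 256 + g
    instance_value = b
    return class_index, instance_value


# Two pixels of the same class that belong to different instances,
# followed by an unlabeled pixel.
pixels = [(10, 5, 40), (10, 5, 80), (0, 0, 0)]
decoded = [decode_ade20k_pixel(*p) for p in pixels]
print(decoded)
```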

The mask images <image_name>_parts_N.png contain segmentation masks for parts of objects, where N is a number indicating the level in the part hierarchy.

The annotation files <image_name>_atr.txt describe the content of each image. Each line in the text file contains:

  • column 1: instance number,
  • column 2: part level (0 for objects),
  • column 3: occluded (1 for true),
  • column 4: original raw name (might provide a more detailed categorization),
  • column 5: class name (parsed using wordnet),
  • column 6: double-quoted list of attributes, separated by commas.

Columns are separated by a # character. See an example of the dataset here.
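
The line layout described above can be parsed with a few lines of Python. This is a minimal sketch, not Datumaro's importer: the AtrRecord type, the parse_atr_line function and the sample line are hypothetical, and the sketch assumes that whitespace around the # separators is not significant.

```python
from typing import List, NamedTuple


class AtrRecord(NamedTuple):
    instance: int
    part_level: int
    occluded: bool
    raw_name: str
    class_name: str
    attributes: List[str]


def parse_atr_line(line: str) -> AtrRecord:
    """Parse one line of an <image_name>_atr.txt file into its six columns."""
    columns = [col.strip() for col in line.split("#")]
    if len(columns) != 6:
        raise ValueError(f"expected 6 columns, got {len(columns)}")
    instance, part_level, occluded, raw_name, class_name, attrs = columns
    # Column 6 is a double-quoted, comma-separated list (possibly empty).
    attributes = [a.strip() for a in attrs.strip('"').split(",") if a.strip()]
    return AtrRecord(int(instance), int(part_level), occluded == "1",
                     raw_name, class_name, attributes)


# Hypothetical sample line, written to match the column description.
record = parse_atr_line('002 # 0 # 1 # double door # door # "double, closed"')
print(record)
```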

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert an ADE20K dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports segmentation masks.

There are several ways to convert an ADE20k 2017 dataset to other dataset formats using CLI:

datum create
datum import -f ade20k2017 <path/to/dataset>
datum export -f coco -o <output/dir> -- --save-media

or

datum convert -if ade20k2017 -i <path/to/dataset> \
    -f coco -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'ade20k2017')
dataset.export('save_dir', 'coco')

Examples

Examples of using this format from the code can be found in the format tests

4.3 - ADE20k (v2020)

Format specification

The original ADE20K 2020 dataset is available here.

The consistency set (for checking the annotation consistency) is available here.

Supported annotation types:

  • Masks

Supported annotation attributes:

  • occluded (boolean): whether the object is occluded by another object
  • other arbitrary boolean attributes, which can be specified in the annotation file <image_name>.json

Import ADE20K dataset

A Datumaro project with an ADE20k source can be created in the following way:

datum create
datum import --format ade20k2020 <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

ade20k_dataset = dm.Dataset.import_from('<path/to/dataset>', 'ade20k2020')

ADE20K dataset directory should have the following structure:

dataset/
├── dataset_meta.json # a list of non-format labels (optional)
├── subset1/
│   ├── img1/  # directory with instance masks for img1
│   |    ├── instance_001_img1.png
│   |    ├── instance_002_img1.png
│   |    └── ...
│   ├── img1.jpg
│   ├── img1.json
│   ├── img1_seg.png
│   ├── img1_parts_1.png
│   |
│   ├── img2/  # directory with instance masks for img2
│   |    ├── instance_001_img2.png
│   |    ├── instance_002_img2.png
│   |    └── ...
│   ├── img2.jpg
│   ├── img2.json
│   └── ...
│
└── subset2/
    ├── super_label_1/
    |   ├── img3/  # directory with instance masks for img3
    |   |    ├── instance_001_img3.png
    |   |    ├── instance_002_img3.png
    |   |    └── ...
    |   ├── img3.jpg
    |   ├── img3.json
    |   ├── img3_seg.png
    |   ├── img3_parts_1.png
    |   └── ...
    |
    ├── img4/  # directory with instance masks for img4
    |   ├── instance_001_img4.png
    |   ├── instance_002_img4.png
    |   └── ...
    ├── img4.jpg
    ├── img4.json
    ├── img4_seg.png
    └── ...

The mask images <image_name>_seg.png contain information about the object class segmentation masks and also separate each class into instances. The channels R and G encode the object class masks. The channel B encodes the object instance masks.

The mask images <image_name>_parts_N.png contain segmentation masks for parts of objects, where N is a number indicating the level in the part hierarchy.

The <image_name> directory contains instance masks for each object in the image. These masks are one-channel images, in which each pixel indicates an affinity to a specific object.

The annotation files <image_name>.json describe the content of each image. See our tests asset for an example of this file, or check the ADE20K toolkit for it.

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert an ADE20K dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports segmentation masks.

There are several ways to convert an ADE20k dataset to other dataset formats using CLI:

datum create
datum import -f ade20k2020 <path/to/dataset>
datum export -f coco -o ./save_dir -- --save-media

or

datum convert -if ade20k2020 -i <path/to/dataset> \
    -f coco -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'ade20k2020')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests

4.4 - Align CelebA

Format specification

The original CelebA dataset is available here.

Supported annotation types:

  • Label
  • Points (landmarks)

Supported attributes:

  • 5_o_Clock_Shadow, Arched_Eyebrows, Attractive, Bags_Under_Eyes, Bald, Bangs, Big_Lips, Big_Nose, Black_Hair, Blond_Hair, Blurry, Brown_Hair, Bushy_Eyebrows, Chubby, Double_Chin, Eyeglasses, Goatee, Gray_Hair, Heavy_Makeup, High_Cheekbones, Male, Mouth_Slightly_Open, Mustache, Narrow_Eyes, No_Beard, Oval_Face, Pale_Skin, Pointy_Nose, Receding_Hairline, Rosy_Cheeks, Sideburns, Smiling, Straight_Hair, Wavy_Hair, Wearing_Earrings, Wearing_Hat, Wearing_Lipstick, Wearing_Necklace, Wearing_Necktie, Young (boolean)

Import align CelebA dataset

A Datumaro project with an align CelebA source can be created in the following way:

datum create
datum import --format align_celeba <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

align_celeba_dataset = dm.Dataset.import_from('<path/to/dataset>', 'align_celeba')

Align CelebA dataset directory should have the following structure:

dataset/
├── dataset_meta.json # a list of non-format labels (optional)
├── Anno/
│   ├── identity_CelebA.txt
│   ├── list_attr_celeba.txt
│   └── list_landmarks_align_celeba.txt
├── Eval/
│   └── list_eval_partition.txt
└── Img/
    └── img_align_celeba/
        ├── 000001.jpg
        ├── 000002.jpg
        └── ...

The identity_CelebA.txt file contains labels (required). The list_attr_celeba.txt, list_landmarks_align_celeba.txt, list_eval_partition.txt files contain attributes, landmarks and subsets respectively (optional).

The original CelebA dataset stores images in a .7z archive. The archive needs to be unpacked before importing.

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert an align CelebA dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports labels or landmarks.

There are several ways to convert an align CelebA dataset to other dataset formats using CLI:

datum create
datum import -f align_celeba <path/to/dataset>
datum export -f imagenet_txt -o ./save_dir -- --save-media

or

datum convert -if align_celeba -i <path/to/dataset> \
    -f imagenet_txt -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'align_celeba')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests

4.5 - BraTS

Format specification

The original BraTS dataset is available here. The BraTS data provided since BraTS'17 differs significantly from the data provided during the previous BraTS challenges (i.e., 2016 and backwards). Datumaro supports BraTS'17-20.

Supported annotation types:

  • Mask

Import BraTS dataset

A Datumaro project with a BraTS source can be created in the following way:

datum create
datum import --format brats <path/to/dataset>

It is also possible to import the dataset using Python API:

from datumaro.components.dataset import Dataset

brats_dataset = Dataset.import_from('<path/to/dataset>', 'brats')

BraTS dataset directory should have the following structure:

dataset/
├── imagesTr
│   ├── <img1>.nii.gz
│   ├── <img2>.nii.gz
│   └── ...
├── imagesTs
│   ├── <img3>.nii.gz
│   ├── <img4>.nii.gz
│   └── ...
├── labels
└── labelsTr
    ├── <img1>.nii.gz
    ├── <img2>.nii.gz
    └── ...

The data in Datumaro is stored as multi-frame images (a set of 2D images). Annotated images are stored as masks for each 2D image separately with an image_id attribute.

Export to other formats

Datumaro can convert a BraTS dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports segmentation masks.

There are several ways to convert a BraTS dataset to other dataset formats using CLI:

datum create
datum import -f brats <path/to/dataset>
datum export -f voc -o <output/dir> -- --save-media

or

datum convert -if brats -i <path/to/dataset> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

from datumaro.components.dataset import Dataset

dataset = Dataset.import_from('<path/to/dataset>', 'brats')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests

4.6 - BraTS Numpy

Format specification

The original BraTS dataset is available here.

Supported annotation types:

  • Mask
  • Cuboid3d

Import BraTS Numpy dataset

A Datumaro project with a BraTS Numpy source can be created in the following way:

datum create
datum import --format brats_numpy <path/to/dataset>

It is also possible to import the dataset using Python API:

from datumaro.components.dataset import Dataset

brats_dataset = Dataset.import_from('<path/to/dataset>', 'brats_numpy')

BraTS Numpy dataset directory should have the following structure:

dataset/
├── <img1>_data_cropped.npy
├── <img1>_label_cropped.npy
├── <img2>_data_cropped.npy
├── <img2>_label_cropped.npy
├── ...
├── labels
├── val_brain_bbox.p
└── val_ids.p

The data in Datumaro is stored as multi-frame images (a set of 2D images). Annotated images are stored as masks for each 2D image separately with an image_id attribute.

Export to other formats

Datumaro can convert a BraTS Numpy dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports segmentation masks or cuboids.

There are several ways to convert a BraTS Numpy dataset to other dataset formats using CLI:

datum create
datum import -f brats_numpy <path/to/dataset>
datum export -f voc -o <output/dir> -- --save-media

or

datum convert -if brats_numpy -i <path/to/dataset> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

from datumaro.components.dataset import Dataset

dataset = Dataset.import_from('<path/to/dataset>', 'brats_numpy')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests

4.7 - CelebA

Format specification

The original CelebA dataset is available here.

Supported annotation types:

  • Label
  • Bbox
  • Points (landmarks)

Supported attributes:

  • 5_o_Clock_Shadow, Arched_Eyebrows, Attractive, Bags_Under_Eyes, Bald, Bangs, Big_Lips, Big_Nose, Black_Hair, Blond_Hair, Blurry, Brown_Hair, Bushy_Eyebrows, Chubby, Double_Chin, Eyeglasses, Goatee, Gray_Hair, Heavy_Makeup, High_Cheekbones, Male, Mouth_Slightly_Open, Mustache, Narrow_Eyes, No_Beard, Oval_Face, Pale_Skin, Pointy_Nose, Receding_Hairline, Rosy_Cheeks, Sideburns, Smiling, Straight_Hair, Wavy_Hair, Wearing_Earrings, Wearing_Hat, Wearing_Lipstick, Wearing_Necklace, Wearing_Necktie, Young (boolean)

Import CelebA dataset

A Datumaro project with a CelebA source can be created in the following way:

datum create
datum import --format celeba <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

celeba_dataset = dm.Dataset.import_from('<path/to/dataset>', 'celeba')

CelebA dataset directory should have the following structure:

dataset/
├── dataset_meta.json # a list of non-format labels (optional)
├── Anno/
│   ├── identity_CelebA.txt
│   ├── list_attr_celeba.txt
│   ├── list_bbox_celeba.txt
│   └── list_landmarks_celeba.txt
├── Eval/
│   └── list_eval_partition.txt
└── Img/
    └── img_celeba/
        ├── 000001.jpg
        ├── 000002.jpg
        └── ...

The identity_CelebA.txt file contains labels (required). The list_attr_celeba.txt, list_bbox_celeba.txt, list_landmarks_celeba.txt, list_eval_partition.txt files contain attributes, bounding boxes, landmarks and subsets respectively (optional).

The original CelebA dataset stores images in a .7z archive. The archive needs to be unpacked before importing.

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert a CelebA dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports labels, bounding boxes or landmarks.

There are several ways to convert a CelebA dataset to other dataset formats using CLI:

datum create
datum import -f celeba <path/to/dataset>
datum export -f imagenet_txt -o ./save_dir -- --save-media

or

datum convert -if celeba -i <path/to/dataset> \
    -f imagenet_txt -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'celeba')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests

4.8 - CIFAR

Format specification

CIFAR format specification is available here.

Supported annotation types:

  • Label

Datumaro supports the Python version of CIFAR-10/100. The difference between CIFAR-10 and CIFAR-100 is how labels are stored in the meta files (batches.meta or meta) and in the annotation files. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). In CIFAR-10 there are no superclasses.

CIFAR formats contain 32 x 32 images. As an extension, Datumaro supports reading and writing of arbitrary-sized images.

Import CIFAR dataset

The CIFAR dataset is available for free download:

A Datumaro project with a CIFAR source can be created in the following way:

datum create
datum import --format cifar <path/to/dataset>

It is possible to specify project name and project directory. Run datum create --help for more information.

CIFAR-10 dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-format labels (optional)
    ├── batches.meta
    ├── <subset_name1>
    ├── <subset_name2>
    └── ...

CIFAR-100 dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-format labels (optional)
    ├── meta
    ├── <subset_name1>
    ├── <subset_name2>
    └── ...

Dataset files use the Pickle data format.

Meta files:

CIFAR-10:
    num_cases_per_batch: 1000
    label_names: list of strings (['airplane', 'automobile', 'bird', ...])
    num_vis: 3072

CIFAR-100:
    fine_label_names: list of strings (['apple', 'aquarium_fish', ...])
    coarse_label_names: list of strings (['aquatic_mammals', 'fish', ...])

Annotation files:

Common:
    'batch_label': 'training batch 1 of <N>'
    'data': numpy.ndarray of uint8, layout N x C x H x W
    'filenames': list of strings

    If images have a size other than the default 32x32 (Datumaro extension):
        'image_sizes': list of (H, W) tuples

CIFAR-10:
    'labels': list of strings

CIFAR-100:
    'fine_labels': list of integers
    'coarse_labels': list of integers
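
As an illustration of how the meta and annotation files relate, the sketch below pickles a tiny CIFAR-100-style meta dictionary and annotation dictionary, loads them back, and resolves label indices to names. The contents are hypothetical and heavily shortened, the resolve_labels function is not part of Datumaro, and the 'data' array is left out to keep the example free of extra dependencies.

```python
import pickle

# Hypothetical, shortened contents of a CIFAR-100 "meta" file.
meta = {
    "fine_label_names": ["apple", "aquarium_fish"],
    "coarse_label_names": ["aquatic_mammals", "fish"],
}

# Hypothetical annotation file with a single item.
# Real files also hold a 'data' array with the image pixels.
batch = {
    "batch_label": "training batch 1 of 1",
    "filenames": ["img_0.png"],
    "fine_labels": [1],
    "coarse_labels": [1],
}


def resolve_labels(meta_bytes: bytes, batch_bytes: bytes):
    """Return (filename, fine label name, coarse label name) for each item."""
    # Only unpickle data from trusted sources: pickle can execute arbitrary code.
    loaded_meta = pickle.loads(meta_bytes)
    loaded_batch = pickle.loads(batch_bytes)
    fine_names = loaded_meta["fine_label_names"]
    coarse_names = loaded_meta["coarse_label_names"]
    return [
        (name, fine_names[fine], coarse_names[coarse])
        for name, fine, coarse in zip(
            loaded_batch["filenames"],
            loaded_batch["fine_labels"],
            loaded_batch["coarse_labels"],
        )
    ]


items = resolve_labels(pickle.dumps(meta), pickle.dumps(batch))
print(items)
```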

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert a CIFAR dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports the classification task (e.g. MNIST, ImageNet, PascalVOC, etc.)

There are several ways to convert a CIFAR dataset to other dataset formats using CLI:

datum create
datum import -f cifar <path/to/cifar>
datum export -f imagenet -o <output/dir>

or

datum convert -if cifar -i <path/to/dataset> \
    -f imagenet -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'cifar')
dataset.export('save_dir', 'imagenet', save_media=True)

Export to CIFAR

There are several ways to convert a dataset to CIFAR format:

# export dataset into CIFAR format from existing project
datum export -p <path/to/project> -f cifar -o <output/dir> \
    -- --save-media
# converting to CIFAR format from other format
datum convert -if imagenet -i <path/to/dataset> \
    -f cifar -o <output/dir> -- --save-media

Extra options for exporting to CIFAR format:

  • --save-media: save media files when exporting the dataset (by default False)
  • --image-ext <IMAGE_EXT>: the image extension to use when exporting the dataset (by default .png)
  • --save-dataset-meta: save the dataset meta file when exporting the dataset (by default False)

The format (CIFAR-10 or CIFAR-100) in which the dataset will be exported depends on the presence of superclasses in the LabelCategories.

Examples

Datumaro supports filtering, transformation, merging etc. for all formats and for the CIFAR format in particular. Follow the user manual to get more information about these operations.

There are several examples of using Datumaro operations to solve particular problems with CIFAR dataset:

Example 1. How to create a custom CIFAR-like dataset

import numpy as np
import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id=0, image=np.ones((32, 32, 3)),
        annotations=[dm.Label(3)]
    ),
    dm.DatasetItem(id=1, image=np.ones((32, 32, 3)),
        annotations=[dm.Label(8)]
    )
], categories=['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck'])

dataset.export('./dataset', format='cifar')

Example 2. How to filter and convert a CIFAR dataset to ImageNet

Convert a CIFAR dataset to ImageNet format, keep only images with the dog class present:

# Download CIFAR-10 dataset:
# https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
datum convert --input-format cifar --input-path <path/to/cifar> \
              --output-format imagenet \
              --filter '/item[annotation/label="dog"]'

Examples of using this format from the code can be found in the format tests

4.9 - Cityscapes

Format specification

Cityscapes format overview is available here.

Cityscapes format specification is available here.

Supported annotation types:

  • Masks

Supported annotation attributes:

  • is_crowd (boolean). Specifies if the annotation label can distinguish between different instances. If False, the annotation id field encodes the instance id.

Import Cityscapes dataset

The Cityscapes dataset is available for free download.

A Datumaro project with a Cityscapes source can be created in the following way:

datum create
datum import --format cityscapes <path/to/dataset>

Cityscapes dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-Cityscapes labels (optional)
    ├── label_colors.txt # a list of non-Cityscapes labels in other format (optional)
    ├── imgsFine/
    │   ├── leftImg8bit
    │   │   ├── <split: train,val, ...>
    │   │   |   ├── {city1}
    │   │   │   |   ├── {city1}_{seq:[0...6]}_{frame:[0...6]}_leftImg8bit.png
    │   │   │   │   └── ...
    │   │   |   ├── {city2}
    │   │   │   └── ...
    │   │   └── ...
    └── gtFine/
        ├── <split: train,val, ...>
        │   ├── {city1}
        │   |   ├── {city1}_{seq:[0...6]}_{frame:[0...6]}_gtFine_color.png
        │   |   ├── {city1}_{seq:[0...6]}_{frame:[0...6]}_gtFine_instanceIds.png
        │   |   ├── {city1}_{seq:[0...6]}_{frame:[0...6]}_gtFine_labelIds.png
        │   │   └── ...
        │   ├── {city2}
        │   └── ...
        └── ...

Annotated files description:

  1. *_leftImg8bit.png - left images in 8-bit LDR format
  2. *_color.png - class labels encoded by their color
  3. *_labelIds.png - class labels encoded by their index
  4. *_instanceIds.png - class and instance labels encoded by an instance ID. The pixel values encode both the class and the individual instance: the integer part of the ID divided by 1000 is the class ID, and the remainder is the instance ID. If an annotation describes multiple instances, the pixels have the regular ID of that class.
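
The instance ID encoding from the last item can be sketched as follows. The function name is hypothetical, and the sketch assumes that any value below 1000 is a plain class ID without an instance number.

```python
from typing import Optional, Tuple


def decode_instance_id(pixel_value: int) -> Tuple[int, Optional[int]]:
    """Split an *_instanceIds.png pixel value into (class ID, instance ID).

    Values below 1000 are plain class IDs (no individual instance),
    so the instance ID is None for them.
    """
    if pixel_value < 1000:
        return pixel_value, None
    return pixel_value // 1000, pixel_value % 1000


print(decode_instance_id(26001))  # class 26, instance 1
print(decode_instance_id(26))     # class 26, no individual instance
```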

To add custom classes, you can use dataset_meta.json and label_colors.txt. If dataset_meta.json is not present in the dataset, label_colors.txt will be imported if possible.

In label_colors.txt you can define a custom color map and non-Cityscapes labels, for example:

# label_colors [color_rgb name]
0 124 134 elephant
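
A file in this layout can be read with a short parser. The sketch below is an illustration only, not Datumaro's implementation: the function name is hypothetical, and it assumes that lines starting with # are comments and that the label name is everything after the three color components.

```python
from typing import Dict, Tuple


def parse_label_colors(text: str) -> Dict[str, Tuple[int, int, int]]:
    """Map each label name to its (R, G, B) color."""
    colors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        r, g, b, name = line.split(maxsplit=3)
        colors[name] = (int(r), int(g), int(b))
    return colors


sample = "# label_colors [color_rgb name]\n0 124 134 elephant\n"
print(parse_label_colors(sample))
```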

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project information.

Export to other formats

Datumaro can convert a Cityscapes dataset into any other format Datumaro supports. To get the expected result, convert the dataset to formats that support the segmentation task (e.g. PascalVOC, CamVID, etc.)

There are several ways to convert a Cityscapes dataset to other dataset formats using CLI:

datum create
datum import -f cityscapes <path/to/cityscapes>
datum export -f voc -o <output/dir>

or

datum convert -if cityscapes -i <path/to/cityscapes> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'cityscapes')
dataset.export('save_dir', 'voc', save_media=True)

Export to Cityscapes

There are several ways to convert a dataset to Cityscapes format:

# export dataset into Cityscapes format from existing project
datum export -p <path/to/project> -f cityscapes -o <output/dir> \
    -- --save-media
# converting to Cityscapes format from other format
datum convert -if voc -i <path/to/dataset> \
    -f cityscapes -o <output/dir> -- --save-media

Extra options for exporting to Cityscapes format:

  • --save-media: save media files when exporting the dataset (by default False)
  • --image-ext IMAGE_EXT: the image extension to use when exporting the dataset (by default, keep the original or use .png if there is none)
  • --save-dataset-meta: save the dataset meta file when exporting the dataset (by default False)
  • --label-map: define a custom colormap. Example:
# mycolormap.txt :
# 0 0 255 sky
# 255 0 0 person
#...
datum export -f cityscapes -- --label-map mycolormap.txt

or you can use the original Cityscapes colormap:

datum export -f cityscapes -- --label-map cityscapes

Examples

Datumaro supports filtering, transformation, merging etc. for all formats and for the Cityscapes format in particular. Follow the user manual to get more information about these operations.

There are several examples of using Datumaro operations to solve particular problems with a Cityscapes dataset:

Example 1. Load the original Cityscapes dataset and convert to Pascal VOC

datum create -o project
datum import -p project -f cityscapes ./Cityscapes/
datum stats -p project
datum export -p project -o dataset/ -f voc -- --save-media

Example 2. Create a custom Cityscapes-like dataset

from collections import OrderedDict

import numpy as np
import datumaro as dm
import datumaro.plugins.cityscapes_format as Cityscapes

label_map = OrderedDict()
label_map['background'] = (0, 0, 0)
label_map['label_1'] = (1, 2, 3)
label_map['label_2'] = (3, 2, 1)
categories = Cityscapes.make_cityscapes_categories(label_map)

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id=1,
        image=np.ones((1, 5, 3)),
        annotations=[
            dm.Mask(image=np.array([[1, 0, 0, 1, 1]]), label=1),
            dm.Mask(image=np.array([[0, 1, 1, 0, 0]]), label=2, id=2,
                attributes={'is_crowd': False}),
        ]
    ),
], categories=categories)

dataset.export('./dataset', format='cityscapes')

Examples of using this format from the code can be found in the format tests

4.10 - COCO

Format specification

COCO format specification is available here.

The dataset has annotations for multiple tasks. Each task has its own format in Datumaro, and there is also a combined coco format, which includes all the available tasks. The sub-formats have the same options as the “main” format and only limit the set of annotation files they work with. To work with multiple formats, use the corresponding option of the coco format.

Supported tasks / formats:

Supported annotation types (depending on the task):

  • Caption (captions)
  • Label (label, Datumaro extension)
  • Bbox (instances, person keypoints)
  • Polygon (instances, person keypoints)
  • Mask (instances, person keypoints, panoptic, stuff)
  • Points (person keypoints)

Supported annotation attributes:

  • is_crowd (boolean; on bbox, polygon and mask annotations) - Indicates that the annotation covers multiple instances of the same class.
  • score (number; range [0; 1]) - Indicates the confidence in this annotation. Ground truth annotations always have 1.
  • arbitrary attributes (string/number) - A Datumaro extension. Stored in the attributes section of the annotation descriptor.

Import COCO dataset

The COCO dataset is available for free download:

Images:

Annotations:

A Datumaro project with a COCO source can be created in the following way:

datum create
datum import --format coco <path/to/dataset>

It is possible to specify project name and project directory. Run datum create --help for more information.

Extra options for adding a source in the COCO format:

  • --keep-original-category-ids: Add dummy label categories so that category indexes in the imported data source correspond to the category IDs in the original annotation file.
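
The idea behind this option can be sketched as follows: gaps in the original category IDs are filled with placeholder labels, so that the index of each label equals its original ID. This is an illustration under assumptions, not Datumaro's implementation; the function name and the dummy-<N> placeholder names are hypothetical.

```python
from typing import Dict, List


def build_labels_keeping_ids(categories: Dict[int, str]) -> List[str]:
    """Return a label list in which list index == original category ID."""
    if not categories:
        return []
    labels = []
    for idx in range(max(categories) + 1):
        # Unused IDs get a placeholder (hypothetical naming scheme).
        labels.append(categories.get(idx, f"dummy-{idx}"))
    return labels


# Category IDs 1 and 3 are used in the annotation file; 0 and 2 are not.
labels = build_labels_keeping_ids({1: "person", 3: "car"})
print(labels)
```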

A COCO dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of custom labels (optional)
    ├── images/
    │   ├── train/
    │   │   ├── <image_name1.ext>
    │   │   ├── <image_name2.ext>
    │   │   └── ...
    │   └── val/
    │       ├── <image_name1.ext>
    │       ├── <image_name2.ext>
    │       └── ...
    └── annotations/
        ├── <task>_<subset_name>.json
        └── ...

For the panoptic task, a dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of custom labels (optional)
    ├── images/
    │   ├── train/
    │   │   ├── <image_name1.ext>
    │   │   ├── <image_name2.ext>
    │   │   └── ...
    │   ├── val/
    │   │   ├── <image_name1.ext>
    │   │   ├── <image_name2.ext>
    │   │   └── ...
    └── annotations/
        ├── panoptic_train/
        │   ├── <image_name1.ext>
        │   ├── <image_name2.ext>
        │   └── ...
        ├── panoptic_train.json
        ├── panoptic_val/
        │   ├── <image_name1.ext>
        │   ├── <image_name2.ext>
        │   └── ...
        └── panoptic_val.json

Annotation files must have names like <task_name>_<subset_name>.json. The year is treated as a part of the subset name. If the annotation file name doesn’t match this pattern, use one of the task-specific formats instead of plain coco: coco_captions, coco_image_info, coco_instances, coco_labels, coco_panoptic, coco_person_keypoints, coco_stuff. In this case all items of the dataset will be added to the default subset.
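
The naming pattern can be checked with a small helper. This is a sketch, not the matching logic Datumaro uses: the function name is hypothetical, and the task names are assumed to be the task-specific format names listed above without the coco_ prefix.

```python
from typing import Optional, Tuple

# Assumption: task names are the sub-format names without the "coco_" prefix.
COCO_TASKS = ("captions", "image_info", "instances", "labels",
              "panoptic", "person_keypoints", "stuff")


def split_annotation_name(filename: str) -> Optional[Tuple[str, str]]:
    """Split '<task_name>_<subset_name>.json' into (task, subset), or return None."""
    if not filename.endswith(".json"):
        return None
    stem = filename[:-len(".json")]
    # Task names may contain underscores, so match against the known names.
    for task in sorted(COCO_TASKS, key=len, reverse=True):
        prefix = task + "_"
        if stem.startswith(prefix) and len(stem) > len(prefix):
            return task, stem[len(prefix):]
    return None


print(split_annotation_name("instances_train2017.json"))
print(split_annotation_name("person_keypoints_val2017.json"))
print(split_annotation_name("stuff.json"))  # does not match the pattern
```

A name such as stuff.json does not match, which is the case where a task-specific format has to be chosen explicitly.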

To add custom classes, you can use dataset_meta.json.

You can import a dataset for one or several tasks instead of the whole dataset. This option also allows importing annotation files with non-default names. For example:

datum create
datum import --format coco_stuff -r <relpath/to/stuff.json> <path/to/dataset>

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project information.

Notes:

  • COCO categories can have any integer ids, however, Datumaro will count annotation category id 0 as “not specified”. This does not contradict the original annotations, because they have category indices starting from 1.

Export to other formats

Datumaro can convert a COCO dataset into any other format Datumaro supports. To get the expected result, convert the dataset to formats that support the specified task (e.g. for panoptic segmentation - VOC, CamVID).

There are several ways to convert a COCO dataset to other dataset formats using CLI:

datum create
datum import -f coco <path/to/coco>
datum export -f voc -o <output/dir>

or

datum convert -if coco -i <path/to/coco> -f voc -o <output/dir>

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'coco')
dataset.export('save_dir', 'voc', save_media=True)

Export to COCO

There are several ways to convert a dataset to COCO format:

# export dataset into COCO format from existing project
datum export -p <path/to/project> -f coco -o <output/dir> \
    -- --save-media
# converting to COCO format from other format
datum convert -if voc -i <path/to/dataset> \
    -f coco -o <output/dir> -- --save-media

Extra options for exporting to COCO format:

  • --save-media saves media files along with the annotations (by default False)
  • --image-ext IMAGE_EXT sets the image extension for the exported dataset (by default, keep the original one or use .jpg, if none)
  • --save-dataset-meta saves the dataset meta file (by default False)
  • --segmentation-mode MODE sets the save mode for instance segmentation (by default guess):
    • ‘guess’: guess the mode for each instance (using the ‘is_crowd’ attribute as a hint)
    • ‘polygons’: save polygons (merge and convert masks, prefer polygons)
    • ‘mask’: save masks (merge and convert polygons, prefer masks)
  • --crop-covered crops covered segments so that the segmentation of background objects is more accurate (by default False)
  • --allow-attributes ALLOW_ATTRIBUTES enables or disables writing custom annotation attributes to the “attributes” annotation field (by default True). This field is an extension to the original COCO format
  • --reindex REINDEX controls whether images and annotations are given new indices (by default False). Keep it disabled to preserve the original indices in the produced dataset. Consider enabling it when converting from other formats or merging datasets, to avoid conflicts
  • --merge-images controls the output directory for images (by default False). When enabled, the dataset images are saved into a single directory; otherwise, they are saved in separate directories by subsets
  • --tasks TASKS sets the tasks to export; by default Datumaro uses all tasks. Example:
datum create
datum import -f coco <path/to/dataset>
datum export -f coco -- --tasks instances,stuff
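
The three --segmentation-mode values can be read as a per-instance decision. The sketch below shows one plausible reading of the option description and is not taken from Datumaro's code: the function name is hypothetical, and mapping crowd instances to masks in ‘guess’ mode follows the usual COCO convention, which is an assumption here.

```python
def resolve_segmentation_kind(mode, is_crowd=False):
    """Pick 'polygons' or 'mask' as the output kind for one instance."""
    if mode == "polygons":
        return "polygons"
    if mode == "mask":
        return "mask"
    if mode == "guess":
        # Assumption: crowd instances are stored as masks, the rest as polygons.
        return "mask" if is_crowd else "polygons"
    raise ValueError(f"unknown segmentation mode: {mode!r}")


for crowd in (False, True):
    print(crowd, resolve_segmentation_kind("guess", is_crowd=crowd))
```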

Examples

Datumaro supports filtering, transformation, merging etc. for all formats and for the COCO format in particular. Follow the user manual to get more information about these operations.

There are several examples of using Datumaro operations to solve particular problems with a COCO dataset:

Example 1. How to load an original panoptic COCO dataset and convert to Pascal VOC

datum create -o project
datum import -p project -f coco_panoptic ./COCO/annotations/panoptic_val2017.json
datum stats -p project
datum export -p project -f voc -- --save-media

Example 2. How to create a custom COCO-like dataset

import numpy as np
import datumaro as dm

dataset = dm.Dataset.from_iterable([
  dm.DatasetItem(id='000000000001',
    image=np.ones((1, 5, 3)),
    subset='val',
    attributes={'id': 40},
    annotations=[
      dm.Mask(image=np.array([[0, 0, 1, 1, 0]]), label=3,
        id=7, group=7, attributes={'is_crowd': False}),
      dm.Mask(image=np.array([[0, 1, 0, 0, 1]]), label=1,
        id=20, group=20, attributes={'is_crowd': True}),
    ]
  ),
], categories=['a', 'b', 'c', 'd'])

dataset.export('./dataset', format='coco_panoptic')

Examples of using this format from the code can be found in the format tests

4.11 - Common Semantic Segmentation

Format specification

CSS format specification is available here.

Supported annotation types:

  • Masks

Import Common Semantic Segmentation dataset

A Datumaro project with a CSS source can be created in the following way:

datum create
datum import --format common_semantic_segmentation <path/to/dataset>

Extra import options:

  • --image-prefix IMAGE_PREFIX sets a custom image prefix for the imported dataset (by default '')
  • --mask-prefix MASK_PREFIX sets a custom mask prefix for the imported dataset (by default '')

CSS dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of labels
    ├── images/
    │   ├── <img1>.png
    │   ├── <img2>.png
    │   └── ...
    └── masks/
        ├── <img1>.png
        ├── <img2>.png
        └── ...

To describe classes and colors, you should use dataset_meta.json.
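
Before importing, it can be useful to check that every image has a mask. The sketch below pairs files from images/ and masks/ by name. It is only an illustration: the helper name is hypothetical, and interpreting the image and mask prefixes as file name prefixes that are stripped before matching is an assumption.

```python
import tempfile
from pathlib import Path


def pair_images_with_masks(root, image_prefix="", mask_prefix=""):
    """Map each image name (prefix stripped) to (image_path, mask_path or None)."""
    root = Path(root)
    masks = {}
    for path in sorted((root / "masks").glob("*.png")):
        if path.stem.startswith(mask_prefix):
            masks[path.stem[len(mask_prefix):]] = path
    pairs = {}
    for path in sorted((root / "images").glob("*.png")):
        if path.stem.startswith(image_prefix):
            name = path.stem[len(image_prefix):]
            pairs[name] = (path, masks.get(name))
    return pairs


# Tiny demo on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for sub, names in (("images", ["img1.png", "img2.png"]), ("masks", ["img1.png"])):
        (Path(tmp) / sub).mkdir()
        for name in names:
            (Path(tmp) / sub / name).touch()
    for name, (image, mask) in pair_images_with_masks(tmp).items():
        print(name, "has mask" if mask else "no mask")
```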

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project information.

Export to other formats

Datumaro can convert a CSS dataset into any other format Datumaro supports. To get the expected result, convert the dataset to formats that support the segmentation task (e.g. PASCAL VOC, CamVid, Cityscapes, etc.)

There are several ways to convert a CSS dataset to other dataset formats using CLI:

datum create
datum import -f common_semantic_segmentation <path/to/dataset>
datum export -f voc -o <output/dir>

or

datum convert -if common_semantic_segmentation -i <path/to/dataset> \
    -f cityscapes -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'common_semantic_segmentation')
dataset.export('save_dir', 'camvid', save_media=True)

Examples

Examples of using this format from the code can be found in the format tests

4.12 - Common Super Resolution

Format specification

CSR format specification is available here.

Supported annotation types:

  • SuperResolutionAnnotation

Supported attributes:

  • upsampled (Image): upsampled image

Import Common Super Resolution dataset

A Datumaro project with a CSR source can be created in the following way:

datum create
datum import --format common_super_resolution <path/to/dataset>

CSR dataset directory should have the following structure:

└─ Dataset/
    ├── HR/
    │   ├── <img1>.png
    │   ├── <img2>.png
    │   └── ...
    ├── LR/
    │   ├── <img1>.png
    │   ├── <img2>.png
    │   └── ...
    └── upsampled/ # optional
        ├── <img1>.png
        ├── <img2>.png
        └── ...

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project information.

Examples

Examples of using this format from the code can be found in the format tests

4.13 - ICDAR

Format specification

ICDAR is a dataset for the text recognition task; it is available for download here. The two most popular versions of this dataset are ICDAR13 and ICDAR15; Datumaro supports both of them.

The original dataset contains the following subformats:

  • ICDAR word recognition;
  • ICDAR text localization;
  • ICDAR text segmentation.

Supported types of annotations:

  • ICDAR word recognition
    • Caption
  • ICDAR text localization
    • Polygon, Bbox
  • ICDAR text segmentation
    • Mask

Supported attributes:

  • ICDAR text localization
    • text: transcription of the text inside a Polygon/Bbox.
  • ICDAR text segmentation
    • index: identifier of the annotation object, which is encoded in the mask and coincides with the number of the line that describes this object;
    • text: transcription of the text inside a Mask;
    • color: RGB values of the color corresponding to the text in the mask image (three numbers separated by spaces);
    • center: coordinates of the center of the text (two numbers separated by a space).
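
Since color and center are stored as space-separated numbers, custom scripts usually need to decode them. The following is a minimal sketch with hypothetical helper names; it assumes that the attribute values are strings such as '108 225 132' and that all numbers are integers.

```python
def parse_space_separated_ints(value, expected_count):
    """Split 'a b c' into a tuple of ints and check the number of values."""
    parts = value.split()
    if len(parts) != expected_count:
        raise ValueError(f"expected {expected_count} numbers, got {value!r}")
    return tuple(int(part) for part in parts)


def decode_segmentation_attributes(attributes):
    """Return a copy with 'color' and 'center' decoded into integer tuples."""
    decoded = dict(attributes)
    if "color" in decoded:
        decoded["color"] = parse_space_separated_ints(decoded["color"], 3)
    if "center" in decoded:
        decoded["center"] = parse_space_separated_ints(decoded["center"], 2)
    return decoded


print(decode_segmentation_attributes(
    {"index": 0, "text": "F", "color": "108 225 132", "center": "0 1"}
))
```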

Import ICDAR dataset

There are a few ways to import an ICDAR dataset with Datumaro:

  • Through the Datumaro project
datum create
datum import -f icdar_text_localization <text_localization_dataset>
datum import -f icdar_text_segmentation <text_segmentation_dataset>
datum import -f icdar_word_recognition <word_recognition_dataset>
  • With Python API
import datumaro as dm
data1 = dm.Dataset.import_from('text_localization_path', 'icdar_text_localization')
data2 = dm.Dataset.import_from('text_segmentation_path', 'icdar_text_segmentation')
data3 = dm.Dataset.import_from('word_recognition_path', 'icdar_word_recognition')

The directory with an ICDAR dataset should have the following structure:

For icdar_word_recognition

<dataset_path>/
├── <subset_name_1>
│   ├── gt.txt
│   └── images
│       ├── word_1.png
│       ├── word_2.png
│       ├── ...
├── <subset_name_2>
├── ...

For icdar_text_localization

<dataset_path>/
├── <subset_name_1>
│   ├── gt_img_1.txt
│   ├── gt_img_2.txt
│   ├── ...
│   └── images
│       ├── img_1.png
│       ├── img_2.png
│       ├── ...
├── <subset_name_2>
│   ├── ...
├── ...

For icdar_text_segmentation

<dataset_path>/
├── <subset_name_1>
│   ├── image_1_GT.bmp # mask for image_1
│   ├── image_1_GT.txt # description of mask objects on the image_1
│   ├── image_2_GT.bmp
│   ├── image_2_GT.txt
│   ├── ...
│   └── images
│       ├── image_1.png
│       ├── image_2.png
│       ├── ...
├── <subset_name_2>
│   ├── ...
├── ...

See more information about adding datasets to the project in the docs.

Export to other formats

Datumaro can convert an ICDAR dataset into any other format Datumaro supports. Examples:

# converting ICDAR text segmentation dataset into the VOC with `convert` command
datum convert -if icdar_text_segmentation -i source_dataset \
    -f voc -o export_dir -- --save-media
# converting ICDAR text localization into the LabelMe through Datumaro project
datum create
datum import -f icdar_text_localization source_dataset
datum export -f label_me -o ./export_dir -- --save-media

Note: some formats have extra export options. See the docs for a particular format to learn about them.

With Datumaro you can also convert your dataset to one of the ICDAR formats. To get the expected result, the source dataset should contain the required attributes described above.

Note: with the icdar_text_segmentation format, if your dataset contains masks without the color attribute, the color will be generated automatically.

Available extra export options for ICDAR dataset formats:

  • --save-media saves media files along with the annotations (by default False)
  • --image-ext IMAGE_EXT sets the image extension for the exported dataset (by default, keep the original one)

4.14 - Image zip

Format specification

The image zip format allows exporting/importing unannotated image datasets to/from a zip archive. The format doesn’t support any annotations or attributes.

Import Image zip dataset

There are several ways to import unannotated datasets to your Datumaro project:

  • From an existing archive:
datum create
datum import -f image_zip ./images.zip
  • From a directory with zip archives. Datumaro will import images from all zip files in the directory:
datum create
datum import -f image_zip ./foo

The directory with zip archives must have the following structure:

└── foo/
    ├── archive1.zip/
    |   ├── image_1.jpg
    |   ├── image_2.png
    |   ├── subdir/
    |   |   ├── image_3.jpg
    |   |   └── ...
    |   └── ...
    ├── archive2.zip/
    |   ├── image_101.jpg
    |   ├── image_102.jpg
    |   └── ...
    ...

Images in the archives must have a supported extension; see the user manual for the list of supported extensions.

Export to other formats

Datumaro can convert an image zip dataset into any other format Datumaro supports. For example:

datum create -o project
datum import -p project -f image_zip ./images.zip
datum export -p project -f coco -o ./new_dir -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'image_zip')
dataset.export('save_dir', 'coco', save_media=True)

Export an unannotated dataset to a zip archive

Example: exporting images from a VOC dataset to zip archives:

datum create -o project
datum import -p project -f voc ./VOC2012
datum export -p project -f image_zip -- --name voc_images.zip

Extra options for exporting to image_zip format:

  • --save-media saves media files along with the annotations (default: False)
  • --image-ext <IMAGE_EXT> sets the image extension for the exported dataset (default: use the original one or .jpg, if none)
  • --name sets the name of the output zip file (default: default.zip)
  • --compression sets the archive compression method. Available methods: ZIP_STORED, ZIP_DEFLATED, ZIP_BZIP2, ZIP_LZMA (default: ZIP_STORED). See the zip documentation for more information.
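
The compression method names match the constants of Python's standard zipfile module. The sketch below shows how such names could be mapped to those constants when writing an archive by hand. It only illustrates the compression methods and is not Datumaro's exporter; the helper name write_images_zip is hypothetical.

```python
import io
import zipfile

# Option values mapped to the constants of the standard zipfile module.
COMPRESSION = {
    "ZIP_STORED": zipfile.ZIP_STORED,      # no compression
    "ZIP_DEFLATED": zipfile.ZIP_DEFLATED,  # requires zlib
    "ZIP_BZIP2": zipfile.ZIP_BZIP2,        # requires bz2
    "ZIP_LZMA": zipfile.ZIP_LZMA,          # requires lzma
}


def write_images_zip(images, compression="ZIP_STORED"):
    """Pack {archive_name: bytes} into a zip archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=COMPRESSION[compression]) as zf:
        for name, data in images.items():
            zf.writestr(name, data)
    return buffer.getvalue()


images = {"image_1.jpg": b"a" * 1000, "subdir/image_3.jpg": b"b" * 1000}
for method in ("ZIP_STORED", "ZIP_DEFLATED"):
    print(method, len(write_images_zip(images, method)), "bytes")
```

ZIP_STORED keeps the files uncompressed, which is usually a reasonable default for already compressed image formats such as JPEG or PNG.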

Examples

Examples of using this format from the code can be found in the format tests

4.15 - ImageNet

Format specification

ImageNet is one of the most popular datasets for the image classification task. The dataset is available for download here.

Supported types of annotations:

  • Label

The format doesn’t support any attributes for annotation objects.

The original ImageNet dataset contains about 1.2M images and the class name for each image. Datumaro supports two versions of the ImageNet format: imagenet and imagenet_txt. The imagenet_txt format stores the class of each image in *.txt files, while the imagenet format stores it in the name of the directory that contains the image.

Import ImageNet dataset

A Datumaro project with an ImageNet dataset can be created in the following way:

datum create
datum import -f imagenet <path_to_dataset>
# or
datum import -f imagenet_txt <path_to_dataset>

Note: if you use datum import, then <path_to_dataset> should not be a subdirectory of the directory with the Datumaro project. See more information about it in the docs.

Load ImageNet dataset through the Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path_to_dataset>', format='imagenet_txt')

To import an ImageNet dataset successfully, the input directory with the dataset should have the following structure:

imagenet_dataset/
├── label_0
│   ├── <image_name_0>.jpg
│   ├── <image_name_1>.jpg
│   ├── <image_name_2>.jpg
│   ├── ...
├── label_1
│    ├── <image_name_0>.jpg
│    ├── <image_name_1>.jpg
│    ├── <image_name_2>.jpg
│    ├── ...
├── ...
  
imagenet_txt_dataset/
├── images # directory with images
│   ├── <image_name_0>.jpg
│   ├── <image_name_1>.jpg
│   ├── <image_name_2>.jpg
│   ├── ...
├── synsets.txt # optional, list of labels
└── train.txt   # list of pairs (image_name, label)
  

Note: if you don’t have synsets file then Datumaro will automatically generate classes with a name pattern class-<i>.
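
The relation between the subset list and the label names can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the exact syntax of train.txt (one '<image_name> <label_index>' pair per line, separated by whitespace) is an assumption based on the description above.

```python
def parse_subset_list(lines):
    """Parse lines of the assumed form '<image_name> <label_index>'."""
    items = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace so image names may contain spaces.
        name, label = line.rsplit(maxsplit=1)
        items.append((name, int(label)))
    return items


def load_label_names(synsets_lines, label_count):
    """Use the names from synsets.txt if given, else generate 'class-<i>'."""
    if synsets_lines is not None:
        return [line.strip() for line in synsets_lines if line.strip()]
    return [f"class-{i}" for i in range(label_count)]


items = parse_subset_list(["img_0.jpg 0", "img_1.jpg 2"])
names = load_label_names(None, 1 + max(label for _, label in items))
for image, label in items:
    print(image, "->", names[label])
```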

Datumaro has a few import options for the imagenet_txt format. To apply them, use -- after the main command arguments.

imagenet_txt import options:

  • --labels {file, generate}: specifies where to get label descriptions from (use file to load them from the file specified by --labels-file; use generate to create generic ones)
  • --labels-file: specifies the path to the file with label descriptions (“synsets.txt”)

Export ImageNet dataset

Datumaro can convert ImageNet into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports Label annotation objects.

# Using `convert` command
datum convert -if imagenet -i <path_to_imagenet> \
    -f voc -o <output_dir> -- --save-media

# Using Datumaro project
datum create
datum import -f imagenet_txt <path_to_imagenet> -- --labels generate
datum export -f open_images -o <output_dir>

You can also convert your ImageNet dataset using the Python API:

import datumaro as dm

imagenet_dataset = dm.Dataset.import_from('<path_to_dataset>', format='imagenet')

imagenet_dataset.export('<output_dir>', format='vgg_face2', save_media=True)

Note: some formats have extra export options. See the docs for a particular format to learn about them.

Export dataset to the ImageNet format

If your dataset contains Label annotations for images and you want to convert it into the ImageNet format, you can use Datumaro for that:

# Using convert command
datum convert -if open_images -i <path_to_oid> \
    -f imagenet_txt -o <output_dir> -- --save-media --save-dataset-meta

# Using Datumaro project
datum create
datum import -f open_images <path_to_oid>
datum export -f imagenet -o <output_dir>

Extra options for exporting to ImageNet formats:

  • --save-media saves media files along with the annotations (by default False)
  • --image-ext <IMAGE_EXT> sets the image extension for the exported dataset (by default .png)
  • --save-dataset-meta saves the dataset meta file (by default False)

4.16 - Kinetics

Format specification

Kinetics 400/600/700 is a family of video datasets for the action recognition task. The dataset is available for download here.

Supported media type:

  • Video

Supported type of annotations:

  • Label

Supported attributes for labels:

  • time_start (integer) - time (in seconds) of the start of recognized action
  • time_end (integer) - time (in seconds) of the end of recognized action

Import Kinetics dataset

A Datumaro project with a Kinetics dataset can be created in the following way using CLI:

datum create
datum import -f kinetics <path_to_dataset>

Or using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path_to_dataset>', format='kinetics')

The directory with a Kinetics dataset should have the following structure:

<path_to_dataset>/
├── test.csv
├── train.json
├── train
│   ├── <name_of_video_1_with_yt_id>.avi # extension of video could be other
│   ├── <name_of_video_2_with_yt_id>.avi
│   ├── ...
└── test
    ├── <name_of_video_100_with_yt_id>.avi # extension of video could be other
    ├── <name_of_video_101_with_yt_id>.avi
    ├── ...

The Kinetics dataset has two equivalent annotation file formats: .csv and .json. Datumaro supports both, but when two annotation files have the same name and different extensions, Datumaro will use the .csv one.

Note: the name of each video file must contain the youtube_id of this video, as specified in the annotation file. To speed up the import, you can leave only the youtube_id in the video file name.
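
The two rules above (preferring .csv and matching videos by youtube_id) can be sketched in plain Python. This is a simplified illustration with hypothetical helper names, not the importer's actual code. Matching by substring is an assumption and may give false matches when one id is a part of another file name.

```python
from pathlib import Path


def pick_annotation_files(file_names):
    """Keep one annotation file per subset name, preferring .csv over .json."""
    chosen = {}
    for name in sorted(file_names):
        path = Path(name)
        if path.suffix not in (".csv", ".json"):
            continue
        if path.stem not in chosen or path.suffix == ".csv":
            chosen[path.stem] = name
    return chosen


def find_video(youtube_id, video_names):
    """Return the first video file whose name contains the given youtube_id."""
    for name in sorted(video_names):
        if youtube_id in Path(name).stem:
            return name
    return None


print(pick_annotation_files(["test.csv", "test.json", "train.json"]))
print(find_video("abc123XYZ_0", ["clip_abc123XYZ_0.avi", "other.avi"]))
```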

See the full list of supported video extensions here.

4.17 - KITTI

Format specification

The KITTI dataset has many annotations for different tasks. Datumaro supports only a few of them.

Supported tasks / formats:

  • Object Detection - kitti_detection. The format specification is available in README.md here.
  • Segmentation - kitti_segmentation. The format specification is available in README.md here.
  • Raw 3D / Velodyne Points - described here

Supported annotation types:

  • Bbox (object detection)
  • Mask (segmentation)

Supported annotation attributes:

  • truncated (boolean) - indicates that the bounding box specified for the object does not correspond to the full extent of the object
  • occluded (boolean) - indicates that a significant portion of the object within the bounding box is occluded by another object
  • score (float) - indicates confidence in detection

Import KITTI dataset

The KITTI left color images for object detection are available here. The KITTI object detection labels are available here. The KITTI segmentation dataset is available here.

A Datumaro project with a KITTI source can be created in the following way:

datum create
datum import --format kitti <path/to/dataset>

It is possible to specify project name and project directory. Run datum create --help for more information.

KITTI detection dataset directory should have the following structure:

└─ Dataset/
    ├── testing/
    │   └── image_2/
    │       ├── <name_1>.<img_ext>
    │       ├── <name_2>.<img_ext>
    │       └── ...
    └── training/
        ├── image_2/ # left color camera images
        │   ├── <name_1>.<img_ext>
        │   ├── <name_2>.<img_ext>
        │   └── ...
        └─── label_2/ # left color camera label files
            ├── <name_1>.txt
            ├── <name_2>.txt
            └── ...

KITTI segmentation dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-format labels (optional)
    ├── label_colors.txt # optional, color map for non-original segmentation labels
    ├── testing/
    │   └── image_2/
    │       ├── <name_1>.<img_ext>
    │       ├── <name_2>.<img_ext>
    │       └── ...
    └── training/
        ├── image_2/ # left color camera images
        │   ├── <name_1>.<img_ext>
        │   ├── <name_2>.<img_ext>
        │   └── ...
        ├── label_2/ # left color camera label files
        │   ├── <name_1>.txt
        │   ├── <name_2>.txt
        │   └── ...
        ├── instance/ # instance segmentation masks
        │   ├── <name_1>.png
        │   ├── <name_2>.png
        │   └── ...
        ├── semantic/ # semantic segmentation masks (labels are encoded by its id)
        │   ├── <name_1>.png
        │   ├── <name_2>.png
        │   └── ...
        └── semantic_rgb/ # semantic segmentation masks (labels are encoded by its color)
            ├── <name_1>.png
            ├── <name_2>.png
            └── ...

To add custom classes, you can use dataset_meta.json and label_colors.txt. If dataset_meta.json is not present in the dataset, label_colors.txt will be imported, if possible.

You can import a dataset for specific tasks of KITTI dataset instead of the whole dataset, for example:

datum import --format kitti_detection <path/to/dataset>

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project information.

Export to other formats

Datumaro can convert a KITTI dataset into any other format Datumaro supports.

Such conversion will only be successful if the output format can represent the type of dataset you want to convert, e.g. segmentation annotations can be saved in Cityscapes format, but not as COCO keypoints.

There are several ways to convert a KITTI dataset to other dataset formats:

datum create
datum import -f kitti <path/to/kitti>
datum export -f cityscapes -o <output/dir>

or

datum convert -if kitti -i <path/to/kitti> -f cityscapes -o <output/dir>

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'kitti')
dataset.export('save_dir', 'cityscapes', save_media=True)

Export to KITTI

There are several ways to convert a dataset to KITTI format:

# export dataset into KITTI format from existing project
datum export -p <path/to/project> -f kitti -o <output/dir> \
    -- --save-media
# converting to KITTI format from other format
datum convert -if cityscapes -i <path/to/dataset> \
    -f kitti -o <output/dir> -- --save-media

Extra options for exporting to KITTI format:

  • --save-media saves media files along with the annotations (by default False)
  • --image-ext IMAGE_EXT sets the image extension for the exported dataset (by default, keep the original one or use .png, if none)
  • --save-dataset-meta saves the dataset meta file (by default False)
  • --apply-colormap APPLY_COLORMAP applies the colormap to class masks (in the semantic_rgb folder, by default True)
  • --label-map defines a custom colormap. Example:
# mycolormap.txt :
# 0 0 255 sky
# 255 0 0 person
#...
datum export -f kitti -- --label-map mycolormap.txt

or you can use the original KITTI colormap:

datum export -f kitti -- --label-map kitti
  • --tasks TASKS sets the tasks to export; by default Datumaro uses all tasks. Example:
datum export -f kitti -- --tasks detection
  • --allow-attributes ALLOW_ATTRIBUTES enables or disables the export of attributes (by default True).
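
A colormap file like mycolormap.txt from the --label-map example can be read into a dictionary of the same shape as the label_map used in the Python examples for this format. The sketch below assumes one 'R G B name' entry per line; skipping blank lines and lines starting with # is an assumption of the example, and the helper name is hypothetical.

```python
def parse_label_map(lines):
    """Parse 'R G B name' lines into {name: (r, g, b)}."""
    label_map = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        r, g, b, name = line.split(maxsplit=3)
        label_map[name] = (int(r), int(g), int(b))
    return label_map


print(parse_label_map(["0 0 255 sky", "255 0 0 person"]))
```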

Examples

Datumaro supports filtering, transformation, merging etc. for all formats and for the KITTI format in particular. Follow the user manual to get more information about these operations.

There are several examples of using Datumaro operations to solve particular problems with KITTI dataset:

Example 1. How to load an original KITTI dataset and convert to Cityscapes

datum create -o project
datum import -p project -f kitti ./KITTI/
datum stats -p project
datum export -p project -f cityscapes -- --save-media

Example 2. How to create a custom KITTI-like dataset

import numpy as np
import datumaro as dm

import datumaro.plugins.kitti_format as KITTI

label_map = {}
label_map['background'] = (0, 0, 0)
label_map['label_1'] = (1, 2, 3)
label_map['label_2'] = (3, 2, 1)
categories = KITTI.make_kitti_categories(label_map)

dataset = dm.Dataset.from_iterable([
  dm.DatasetItem(id=1,
    image=np.ones((1, 5, 3)),
    annotations=[
      dm.Mask(image=np.array([[1, 0, 0, 1, 1]]), label=1, id=0,
        attributes={'is_crowd': False}),
      dm.Mask(image=np.array([[0, 1, 1, 0, 0]]), label=2, id=0,
        attributes={'is_crowd': False}),
    ]
  ),
], categories=categories)

dataset.export('./dataset', format='kitti')

Examples of using this format from the code can be found in the format tests

4.18 - LFW

Format specification

LFW (Labeled Faces in the Wild Home) is a dataset for the face identification task; the specification for this format is available here. You can also download the original LFW dataset here.

The original dataset contains images of people’s faces. For each image, it provides the person’s name, as well as information about the images that match and do not match this person. LFW also contains additional information about landmark points on the face.

Supported annotation types:

  • Label
  • Points (face landmark points)

Supported attributes:

  • negative_pairs: list with names of mismatched persons;
  • positive_pairs: list with names of matched persons;

Import LFW dataset

Importing LFW dataset into the Datumaro project:

datum create
datum import -f lfw <path_to_lfw_dataset>

See more information about adding datasets to the project in the docs.

Also you can import LFW dataset from Python API:

import datumaro as dm

lfw_dataset = dm.Dataset.import_from('<path_to_lfw_dataset>', 'lfw')

To import the LFW dataset successfully, the directory with it should have the following structure:

<path_to_lfw_dataset>/
├── subset_1
│    ├── annotations
│    │   ├── landmarks.txt # list with landmark points for each image
│    │   ├── pairs.txt # list of matched and mismatched pairs of person
│    │   └── people.txt # optional file with a list of persons name
│    └── images
│        ├── name0
│        │   ├── name0_0001.jpg
│        │   ├── name0_0002.jpg
│        │   ├── ...
│        ├── name1
│        │   ├── name1_0001.jpg
│        │   ├── name1_0002.jpg
│        │   ├── ...
├── subset_2
│    ├── ...
├── ...

The full description of the annotation *.txt files is available here.
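
As an illustration of how pair annotations relate to the image layout above, the sketch below parses one line of a pairs file. It assumes the line layout of the original LFW pairs.txt (three fields for a matched pair, four fields for a mismatched pair); this layout and the helper names are assumptions that should be checked against the linked description.

```python
def image_path(person, number):
    """Build '<name>/<name>_<NNNN>.jpg' as in the directory tree above."""
    return f"{person}/{person}_{int(number):04d}.jpg"


def parse_pair_line(line):
    """Return (kind, first_image, second_image), or None for other lines."""
    fields = line.split()
    if len(fields) == 3:
        # Assumed matched pair: <name> <n1> <n2>
        person, first, second = fields
        return "matched", image_path(person, first), image_path(person, second)
    if len(fields) == 4:
        # Assumed mismatched pair: <name1> <n1> <name2> <n2>
        person1, first, person2, second = fields
        return "mismatched", image_path(person1, first), image_path(person2, second)
    return None  # header or unrecognised line


print(parse_pair_line("name0 1 2"))
print(parse_pair_line("name0 1 name1 2"))
```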

Export LFW dataset

With Datumaro you can convert an LFW dataset into any other format Datumaro supports. Note that the target format should also support the Label and/or Points annotation types.

There are a few ways to convert an LFW dataset into another format:


# Converting to ImageNet with `convert` command:
datum convert -if lfw -i ./lfw_dataset \
    -f imagenet -o ./output_dir -- --save-media


# Converting to VggFace2 through the Datumaro project:
datum create
datum add -f lfw ./lfw_dataset
datum export -f vgg_face2 -o ./output_dir2

Note: some formats have extra export options. See the docs for a particular format to learn about them.

Export dataset to the LFW format

With Datumaro you can export a dataset that has Label and/or Points annotations to the LFW format, for example:

# Converting VGG Face2 dataset into the LFW format
datum convert -if vgg_face2 -i ./vgg_dataset \
    -f lfw -o ./output_dir


# Export dataset to the LFW format through the Datumaro project:
datum create
datum import -f voc_classification ../voc_dataset
datum export -f lfw -o ./output_dir -- --save-media --image-ext png

Available extra export options for LFW dataset format:

  • --save-media saves media files along with the annotations (by default False)
  • --image-ext IMAGE_EXT sets the image extension for the exported dataset (by default, keep the original one)

4.19 - Mapillary Vistas

Format specification

The Mapillary Vistas dataset homepage is available here. After registration, the dataset will be available for download. The specification for this format is contained in the root directory of the original dataset.

Supported annotation types:

  • Mask (class, instances, panoptic)
  • Polygon

Supported attributes:

  • is_crowd (boolean; on panoptic mask): indicates that the annotation covers multiple instances of the same class.

Import Mapillary Vistas dataset

Use these instructions to import a Mapillary Vistas dataset into a Datumaro project:

datum create
datum add -f mapillary_vistas ./dataset

Note: the directory with the dataset should be a subdirectory of the project directory.

Note: it is not possible to import both instance and panoptic masks for one dataset.

Use one of the subformats (mapillary_vistas_instances, mapillary_vistas_panoptic) if your dataset contains both panoptic and instance masks:

datum add -f mapillary_vistas_instances ./dataset

or

datum add -f mapillary_vistas_panoptic ./dataset

Extra options for adding a source in the Mapillary Vistas format:

  • --use-original-config: use the original config_*.json file for your version of the Mapillary Vistas dataset. This option can help to import a dataset when you don’t have a config_*.json file, but your dataset uses the original categories of the Mapillary Vistas dataset. The version of the dataset is detected by the name of the annotation directory in your dataset (v1.2 or v2.0).
  • --keep-original-category-ids: Add dummy label categories so that category indexes in the imported data source correspond to the category IDs in the original annotation file.

Example of using extra options:

datum add -f mapillary_vistas ./dataset -- --use-original-config

The Mapillary Vistas dataset has two versions: v1.2 and v2.0. They differ in the number of classes, the names of the classes, the supported types of annotations, and the name of the directory with annotations. So, the directory with the dataset should have one of these structures:

dataset
├── dataset_meta.json # a list of custom labels (optional)
├── config_v1.2.json # config file with description of classes (id, color, name)
├── <subset_name1>
│   ├── images
│   │   ├── <image_name1>.jpg
│   │   ├── <image_name2>.jpg
│   │   ├── ...
│   └── v1.2
│       ├── instances # directory with instance masks
│       │   ├── <image_name1>.png
│       │   ├── <image_name2>.png
│       │   ├── ...
│       └── labels # directory with class masks
│           ├── <image_name1>.png
│           ├── <image_name2>.png
│           ├── ...
├── <subset_name2>
│   ├── ...
├── ...
  
dataset
├── config_v2.0.json # config file with description of classes (id, color, name)
├── <subset_name1>
│   ├── images
│   │   ├── <image_name1>.jpg
│   │   ├── <image_name2>.jpg
│   │   ├── ...
│   └── v2.0
│       ├── instances # directory with instance masks
│       │   ├── <image_name1>.png
│       │   ├── <image_name2>.png
│       │   ├── ...
│       ├── labels # directory with class masks
│       │   ├── <image_name1>.png
│       │   ├── <image_name2>.png
│       │   ├── ...
│       ├── panoptic # directory with panoptic masks and panoptic config file
│       │   ├── panoptic_2020.json # description of classes and annotations
│       │   ├── <image_name1>.png
│       │   ├── <image_name2>.png
│       │   ├── ...
│       └── polygons # directory with description of polygons
│           ├── <image_name1>.json
│           ├── <image_name2>.json
│           ├── ...
├── <subset_name2>
│   ├── ...
├── ...
  
dataset
├── config_v1.2.json # config file with description of classes (id, color, name)
├── images
│   ├── <image_name1>.jpg
│   ├── <image_name2>.jpg
│   ├── ...
└── v1.2
    ├── instances # directory with instance masks
    │   ├── <image_name1>.png
    │   ├── <image_name2>.png
    │   ├── ...
    └── labels # directory with class masks
        ├── <image_name1>.png
        ├── <image_name2>.png
        ├── ...
  
dataset
├── config_v2.0.json
├── images
│   ├── <image_name1>.jpg
│   ├── <image_name2>.jpg
│   ├── ...
└── v2.0
    ├── instances # directory with instance masks
    │   ├── <image_name1>.png
    │   ├── <image_name2>.png
    │   ├── ...
    ├── labels # directory with class masks
    │   ├── <image_name1>.png
    │   ├── <image_name2>.png
    │   ├── ...
    ├── panoptic # directory with panoptic masks and panoptic config file
    │   ├── panoptic_2020.json # description of classes and annotation objects
    │   ├── <image_name1>.png
    │   ├── <image_name2>.png
    │   ├── ...
    └── polygons # directory with description of polygons
        ├── <image_name1>.json
        ├── <image_name2>.json
        ├── ...
  

To add custom classes, you can use dataset_meta.json.

See examples of annotation files in test assets.
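
As a rough illustration of the layouts above, the following sketch guesses the dataset version from the config file that is present in the dataset root. The helper name detect_vistas_version is hypothetical and is not part of Datumaro; it relies only on the config file names shown in the directory trees.

```python
from pathlib import Path
from typing import Optional

# Hypothetical helper, not part of Datumaro: it only checks for the
# config file names shown in the directory layouts above.
CONFIG_FILES = {"v1.2": "config_v1.2.json", "v2.0": "config_v2.0.json"}


def detect_vistas_version(dataset_dir: str) -> Optional[str]:
    """Return 'v1.2' or 'v2.0' if a matching config file exists, else None."""
    root = Path(dataset_dir)
    for version, config_name in CONFIG_FILES.items():
        if (root / config_name).is_file():
            return version
    return None
```

A None result means that neither config file was found in the given directory.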

4.20 - Market-1501

Format specification

Market-1501 is a dataset for the person re-identification task. A download link for the dataset is available here.

Supported item attributes:

  • person_id (str): a four-digit number that represents the ID of the pedestrian;
  • camera_id (int): a one-digit number that represents the ID of the camera that took the image (the original dataset has 6 cameras in total);
  • track_id (int): a one-digit number that represents the ID of the track with the particular pedestrian; this attribute corresponds to sequence_id in the original dataset;
  • frame_id (int): a six-digit number that represents the number of the frame within the track. For the tracks, their names are accumulated for each ID, but for frames, they start from “0001” in each track;
  • bbox_id (int): a two-digit number that represents the number of the bounding box that was selected for the image (see the original docs for more info).

These item attributes are encoded in the image name using the following convention:

0000_c1s1_000000_00.jpg
  • the first four digits indicate the person_id;
  • the digit after c indicates the camera_id;
  • the digit after s indicates the track_id;
  • the six digits after the first underscore that follows the track_id (after s1_ in this example) indicate the frame_id;
  • the last two digits before .jpg indicate the bbox_id.
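
The naming convention above can be sketched as a regular expression. This is a minimal illustration rather than Datumaro's own parser: the function name parse_market1501_name is hypothetical, and the pattern covers only names that follow the convention exactly as described here.

```python
import re
from typing import Dict, Optional, Union

# Pattern for names such as 0000_c1s1_000000_00.jpg, following the
# convention described above. Illustrative only, not Datumaro's parser.
MARKET1501_NAME = re.compile(
    r"^(?P<person_id>\d{4})_c(?P<camera_id>\d)s(?P<track_id>\d)"
    r"_(?P<frame_id>\d{6})_(?P<bbox_id>\d{2})\.jpg$"
)


def parse_market1501_name(name: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split an image name into item attributes, or return None on mismatch."""
    match = MARKET1501_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "person_id": fields["person_id"],  # kept as a string, e.g. "0003"
        "camera_id": int(fields["camera_id"]),
        "track_id": int(fields["track_id"]),
        "frame_id": int(fields["frame_id"]),
        "bbox_id": int(fields["bbox_id"]),
    }
```

For instance, parse_market1501_name('0003_c2s1_001054_01.jpg') yields person_id '0003', camera_id 2, track_id 1, frame_id 1054 and bbox_id 1.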

Import Market-1501 dataset

To import a Market-1501 dataset into a Datumaro project:

datum create
datum import -f market1501 <path_to_market1501>

See more information about adding datasets to the project in the docs.

Alternatively, you can import Market-1501 using the Python API:

import datumaro as dm
dataset = dm.Dataset.import_from('<path_to_dataset>', 'market1501')

To import the Market-1501 dataset successfully, its directory should have the following structure:

market1501_dataset/
├── query # optional directory with query images
│   ├── 0001_c1s1_001051_00.jpg
│   ├── 0002_c1s1_001051_00.jpg
│   ├── ...
├── bounding_box_<subset_name1>
│   ├── 0003_c1s1_001051_00.jpg
│   ├── 0003_c2s1_001054_01.jpg
│   ├── 0004_c1s1_001051_00.jpg
│   ├── ...
├── bounding_box_<subset_name2>
│   ├── 0005_c1s1_001051_00.jpg
│   ├── 0006_c1s1_001051_00.jpg
│   ├── ...
├── ...

Export dataset to the Market-1501 format

With Datumaro you can export any dataset that has the person_id item attribute to the Market-1501 format. For example:

# Converting MARS dataset into the Market-1501
datum convert -if mars -i ./mars_dataset \
    -f market1501 -o ./output_dir
# Export a dataset to the Market-1501 format through a Datumaro project:
datum create
datum add -f mars ../mars
datum export -f market1501 -o ./output_dir -- --save-media --image-ext png

Note: if your dataset contains only person_id attributes, Datumaro will assign default values to the other attributes (camera_id, track_id, bbox_id) and increment frame_id on collisions.

Available extra export options for Market-1501 dataset format:

  • --save-media - save media files when exporting the dataset (by default, False)
  • --image-ext IMAGE_EXT - save images with the specified extension when exporting the dataset (by default, keeps the original extension)

4.21 - MARS

Format specification

MARS is a dataset for the motion analysis and person identification task. The MARS dataset is available for download here.

Supported types of annotations:

  • Bbox

Required attributes:

  • person_id (str): a four-digit number that represents the ID of the pedestrian;
  • camera_id (int): a one-digit number that represents the ID of the camera that took the image (the original dataset has 6 cameras in total);
  • track_id (int): a four-digit number that represents the ID of the track with the particular pedestrian;
  • frame_id (int): a three-digit number that represents the number of the frame within the track. For the tracks, their names are accumulated for each ID, but for frames, they start from “0001” in each track.

Import MARS dataset

Use these instructions to import a MARS dataset into a Datumaro project:

datum create
datum add -f mars ./dataset

Note: the dataset directory should be a subdirectory of the project directory.

The MARS dataset directory should have the following structure:

mars_dataset
├── <bbox_subset_name1>
│   ├── 0001 # directory with images of pedestrian with id 0001
│   │   ├── 0001C1T0001F001.jpg
│   │   ├── 0001C1T0001F002.jpg
│   │   ├── ...
│   ├── 0002 # directory with images of pedestrian with id 0002
│   │   ├── 0002C1T0001F001.jpg
│   │   ├── 0002C1T0001F002.jpg
│   │   ├── ...
│   ├── 0000 # distractor images, which negatively affect retrieval accuracy
│   │   ├── 0000C1T0001F001.jpg
│   │   ├── 0000C1T0001F002.jpg
│   │   ├── ...
│   ├── 00-1 # junk images, which do not affect retrieval accuracy
│   │   ├── 00-1C1T0001F001.jpg
│   │   ├── 00-1C1T0001F002.jpg
│   │   ├── ...
├── <bbox_subset_name2>
│   ├── ...
├── ...

All images in the MARS dataset follow a strict naming convention:

xxxxCxTxxxxFxxx.jpg
  • the first four digits indicate the pedestrian’s number;
  • the digit after C indicates the camera id;
  • the four digits after T indicate the track id for this pedestrian;
  • the three digits after F indicate the frame id within this track.

Note: there are two specific pedestrian IDs 0000 and 00-1 which indicate distracting images and unwanted images respectively.
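
The MARS naming convention, including the two special pedestrian IDs, can be sketched in the same way. The function name parse_mars_name and the 'kind' field are hypothetical and added only for illustration; this is not Datumaro's own parser.

```python
import re
from typing import Dict, Optional, Union

# Pattern for names such as 0001C1T0001F001.jpg. The pedestrian ID is
# either four digits or the special junk ID "00-1" mentioned above.
MARS_NAME = re.compile(
    r"^(?P<person_id>\d{4}|00-1)C(?P<camera_id>\d)"
    r"T(?P<track_id>\d{4})F(?P<frame_id>\d{3})\.jpg$"
)

# Special pedestrian IDs described in the note above.
SPECIAL_IDS = {"0000": "distractor", "00-1": "junk"}


def parse_mars_name(name: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split a MARS image name into attributes, or return None on mismatch."""
    match = MARS_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "person_id": fields["person_id"],  # kept as a string
        "camera_id": int(fields["camera_id"]),
        "track_id": int(fields["track_id"]),
        "frame_id": int(fields["frame_id"]),
        # Hypothetical convenience field, not a Datumaro attribute.
        "kind": SPECIAL_IDS.get(fields["person_id"], "pedestrian"),
    }
```

The 'kind' field makes it easy to skip distractor and junk images when iterating over file names.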

4.22 - MNIST

Format specification

MNIST format specification is available here.

Fashion MNIST format specification is available here.

MNIST in CSV format specification is available here.

The dataset has several data formats available. Datumaro supports the binary (IDX) format and the CSV variant. Each data format is covered by a separate Datumaro format.

Supported formats:

  • Binary (IDX) - mnist
  • CSV - mnist_csv

Supported annotation types:

  • Label

The format only supports single channel 28 x 28 images.

Import MNIST dataset

The MNIST dataset is available for free download:

The Fashion MNIST dataset is available for free download:

The MNIST in CSV dataset is available for free download:

A Datumaro project with an MNIST source can be created in the following way:

datum create
datum import --format mnist <path/to/dataset>
datum import --format mnist_csv <path/to/dataset>

MNIST dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-format labels (optional)
    ├── labels.txt # a list of non-digit labels in other format (optional)
    ├── t10k-images-idx3-ubyte.gz
    ├── t10k-labels-idx1-ubyte.gz
    ├── train-images-idx3-ubyte.gz
    └── train-labels-idx1-ubyte.gz

MNIST in CSV dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of non-format labels (optional)
    ├── labels.txt # a list of non-digit labels in other format (optional)
    ├── mnist_test.csv
    └── mnist_train.csv

To add custom classes, you can use dataset_meta.json and labels.txt. If dataset_meta.json is not present in the dataset, labels.txt will be imported if possible.

For example, labels.txt for Fashion MNIST has the following contents:

T-shirt/top
Trouser
Pullover
Dress
Coat
Sandal
Shirt
Sneaker
Bag
Ankle boot
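
A file like this can be written with a few lines of Python. This is a minimal sketch that assumes one label per line, with the line order matching the class indices 0 to 9; the helper name write_labels_file is hypothetical and not part of Datumaro.

```python
from pathlib import Path
from typing import List

# Fashion MNIST class names, in the order listed above.
FASHION_MNIST_LABELS = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def write_labels_file(dataset_dir: str, labels: List[str]) -> Path:
    """Write labels.txt into the dataset directory, one label per line."""
    path = Path(dataset_dir) / "labels.txt"
    path.write_text("\n".join(labels) + "\n", encoding="utf-8")
    return path
```

Calling write_labels_file('<path/to/dataset>', FASHION_MNIST_LABELS) places the file next to the image and label archives.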

Export to other formats

Datumaro can convert an MNIST dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports the classification task (e.g. CIFAR-10/100, ImageNet, PascalVOC, etc.).

There are several ways to convert an MNIST dataset to other dataset formats:

datum create
datum import -f mnist <path/to/mnist>
datum export -f imagenet -o <output/dir>

or

datum convert -if mnist -i <path/to/mnist> -f imagenet -o <output/dir>

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'mnist')
dataset.export('save_dir', 'imagenet', save_media=True)

These steps also work for MNIST in CSV if you use mnist_csv instead of mnist.

Export to MNIST

There are several ways to convert a dataset to MNIST format:

# export dataset into MNIST format from existing project
datum export -p <path/to/project> -f mnist -o <output/dir> \
    -- --save-media
# converting to MNIST format from other format
datum convert -if imagenet -i <path/to/dataset> \
    -f mnist -o <output/dir> -- --save-media

Extra options for exporting to MNIST format:

  • --save-media - save media files when exporting the dataset (by default, False)
  • --image-ext <IMAGE_EXT> - save images with the specified extension when exporting the dataset (by default, .png)
  • --save-dataset-meta - save the dataset meta file when exporting the dataset (by default, False)

These commands also work for MNIST in CSV if you use mnist_csv instead of mnist.

Examples

Datumaro supports filtering, transformation, merging etc. for all formats and for the MNIST format in particular. Follow the user manual to get more information about these operations.

Here are a few examples of using Datumaro operations to solve particular problems with the MNIST dataset:

Example 1. How to create a custom MNIST-like dataset

import numpy as np
import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id=0, image=np.ones((28, 28)),
        annotations=[dm.Label(2)]
    ),
    dm.DatasetItem(id=1, image=np.ones((28, 28)),
        annotations=[dm.Label(7)]
    )
], categories=[str(label) for label in range(10)])

dataset.export('./dataset', format='mnist')

Example 2. How to filter and convert an MNIST dataset to ImageNet

Convert an MNIST dataset to the ImageNet format, keeping only images labeled with class 3:

# Download MNIST dataset:
# https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
# https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
datum convert --input-format mnist --input-path <path/to/mnist> \
              --output-format imagenet \
              --filter '/item[annotation/label="3"]'

Examples of using this format from the code can be found in the binary format tests and the CSV format tests.

4.23 - MPII Human Pose Dataset

Format specification

The original MPII Human Pose Dataset is available here.

Supported annotation types:

  • Bbox
  • Points

Supported attributes:

  • center (a list with two coordinates of the center point of the object)
  • scale (float)

Import MPII Human Pose Dataset

A Datumaro project with an MPII Human Pose Dataset source can be created in the following way:

datum create
datum import --format mpii <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

mpii_dataset = dm.Dataset.import_from('<path/to/dataset>', 'mpii')

MPII Human Pose Dataset directory should have the following structure:

dataset/
├── mpii_human_pose_v1_u12_1.mat
├── 000000001.jpg
├── 000000002.jpg
├── 000000003.jpg
└── ...

Export to other formats

Datumaro can convert an MPII Human Pose Dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports bounding boxes or points.

There are several ways to convert an MPII Human Pose Dataset to other dataset formats using CLI:

datum create
datum import -f mpii <path/to/dataset>
datum export -f voc -o ./save_dir -- --save-media

or

datum convert -if mpii -i <path/to/dataset> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'mpii')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests.

4.24 - MPII Human Pose Dataset (JSON)

Format specification

The original MPII Human Pose Dataset is available here.

Supported annotation types:

  • Bbox
  • Points

Supported attributes:

  • center (a list with two coordinates of the center point of the object)
  • scale (float)

Import MPII Human Pose Dataset (JSON)

A Datumaro project with an MPII Human Pose Dataset (JSON) source can be created in the following way:

datum create
datum import --format mpii_json <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

mpii_dataset = dm.Dataset.import_from('<path/to/dataset>', 'mpii_json')

MPII Human Pose Dataset (JSON) directory should have the following structure:

dataset/
├── jnt_visible.npy # optional
├── mpii_annotations.json
├── mpii_headboxes.npy # optional
├── mpii_pos_gt.npy # optional
├── 000000001.jpg
├── 000000002.jpg
├── 000000003.jpg
└── ...

Export to other formats

Datumaro can convert an MPII Human Pose Dataset (JSON) into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports bounding boxes or points.

There are several ways to convert an MPII Human Pose Dataset (JSON) to other dataset formats using CLI:

datum create
datum import -f mpii_json <path/to/dataset>
datum export -f voc -o ./save_dir -- --save-media

or

datum convert -if mpii_json -i <path/to/dataset> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'mpii_json')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests.

4.25 - Open Images

Format specification

A description of the Open Images Dataset (OID) format is available here. Datumaro supports versions 4, 5 and 6.

Supported annotation types:

  • Label (human-verified image-level labels)
  • Bbox (bounding boxes)
  • Mask (segmentation masks)

Supported annotation attributes:

  • Labels

    • score (read/write, float). The confidence level from 0 to 1. A score of 0 indicates that the image does not contain objects of the corresponding class.
  • Bounding boxes

    • score (read/write, float). The confidence level from 0 to 1. In the original dataset this is always equal to 1, but custom datasets may be created with arbitrary values.
    • occluded (read/write, boolean). Whether the object is occluded by another object.
    • truncated (read/write, boolean). Whether the object extends beyond the boundary of the image.
    • is_group_of (read/write, boolean). Whether the object represents a group of objects of the same class.
    • is_depiction (read/write, boolean). Whether the object is a depiction (such as a drawing) rather than a real object.
    • is_inside (read/write, boolean). Whether the object is seen from the inside.
  • Masks

    • box_id (read/write, string). An identifier for the bounding box associated with the mask.
    • predicted_iou (read/write, float). Predicted IoU value with respect to the ground truth.

Import Open Images dataset

The Open Images dataset is available for free download.

See the open-images-dataset GitHub repository for information on how to download the images.

Datumaro also requires the image description files, which can be downloaded from the following URLs:

In addition, the following metadata file must be present in the annotations directory:

You can optionally download the following additional metadata file:

Annotations can be downloaded from the following URLs:

All annotation files are optional, except that if the mask metadata files for a given subset are downloaded, all corresponding images must be downloaded as well, and vice versa.

A Datumaro project with an OID source can be created in the following way:

datum create
datum import --format open_images <path/to/dataset>

It is possible to specify the project name and project directory. Run datum create --help for more information.

Open Images dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of custom labels (optional)
    ├── annotations/
    │   ├── bbox_labels_600_hierarchy.json
    │   ├── image_ids_and_rotation.csv  # optional
    │   ├── oidv6-class-descriptions.csv
    │   ├── *-annotations-bbox.csv
    │   ├── *-annotations-human-imagelabels.csv
    │   └── *-annotations-object-segmentation.csv
    ├── images/
    │   ├── test/
    │   │   ├── <image_name1.jpg>
    │   │   ├── <image_name2.jpg>
    │   │   └── ...
    │   ├── train/
    │   │   ├── <image_name1.jpg>
    │   │   ├── <image_name2.jpg>
    │   │   └── ...
    │   └── validation/
    │       ├── <image_name1.jpg>
    │       ├── <image_name2.jpg>
    │       └── ...
    └── masks/
        ├── test/
        │   ├── <mask_name1.png>
        │   ├── <mask_name2.png>
        │   └── ...
        ├── train/
        │   ├── <mask_name1.png>
        │   ├── <mask_name2.png>
        │   └── ...
        └── validation/
            ├── <mask_name1.png>
            ├── <mask_name2.png>
            └── ...

The mask images must be extracted from the ZIP archives linked above.

To use per-subset image description files instead of image_ids_and_rotation.csv, place them in the annotations subdirectory. The annotations directory is optional: you can also store all annotation files in the root of the input path.

To add custom classes, you can use dataset_meta.json.

Creating an image metadata file

To load bounding box and segmentation mask annotations, Datumaro needs to know the sizes of the corresponding images. By default, it will determine these sizes by loading each image from disk, which requires the images to be present and makes the loading process slow.

If you want to load the aforementioned annotations on a machine where the images are not available, or just to speed up the dataset loading process, you can extract the image size information in advance and record it in an image metadata file. This file must be placed at annotations/images.meta, and must contain one line per image, with the following structure:

<ID> <height> <width>

Where <ID> is the file name of the image without the extension, and <height> and <width> are the dimensions of that image. <ID> may be quoted with either single or double quotes.

The image metadata file, if present, will be used to determine the image sizes without loading the images themselves.
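
To make the line format concrete, here is a minimal sketch of a reader for such a file. It is an illustration, not Datumaro's own loader: the function name parse_image_meta is hypothetical, and shlex from the Python standard library is used as one convenient way to handle the optional single or double quotes around the ID.

```python
import shlex
from typing import Dict, Tuple


def parse_image_meta(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each image ID to its (height, width), one image per line."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # shlex.split removes single or double quotes around the ID
        image_id, height, width = shlex.split(line)
        sizes[image_id] = (int(height), int(width))
    return sizes
```

Running it on the contents of annotations/images.meta is a quick way to check that every line has exactly three fields and that the sizes are integers.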

Here’s one way to create the images.meta file using ImageMagick, assuming that the images are present on the current machine:

# run this from the dataset directory
find images -name '*.jpg' -exec \
    identify -format '"%[basename]" %[height] %[width]\n' {} + \
    > annotations/images.meta

Export to other formats

Datumaro can convert OID into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports image-level labels. There are several ways to convert OID to other dataset formats:

datum create
datum import -f open_images <path/to/open_images>
datum export -f cvat -o <output/dir>

or

datum convert -if open_images -i <path/to/open_images> -f cvat -o <output/dir>

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'open_images')
dataset.export('save_dir', 'cvat', save_media=True)

Export to Open Images

There are several ways to convert an existing dataset to the Open Images format:

# export dataset into Open Images format from existing project
datum export -p <path/to/project> -f open_images -o <output/dir> \
    -- --save-media
# convert a dataset in another format to the Open Images format
datum convert -if imagenet -i <path/to/dataset> \
    -f open_images -o <output/dir> \
    -- --save-media

Extra options for exporting to the Open Images format:

  • --save-media - save media files when exporting the dataset (by default, False)
  • --image-ext IMAGE_EXT - save image files with the specified extension when exporting the dataset (by default, uses the original extension or .jpg if there isn’t one)
  • --save-dataset-meta - save the dataset meta file when exporting the dataset (by default, False)

Examples

Datumaro supports filtering, transformation, merging etc. for all formats and for the Open Images format in particular. Follow the user manual to get more information about these operations.

Here are a few examples of using Datumaro operations to solve particular problems with the Open Images dataset:

Example 1. Load the Open Images dataset and convert to the CVAT format

datum create -o project
datum import -p project -f open_images ./open-images-dataset/
datum stats -p project
datum export -p project -f cvat -- --save-media

Example 2. Create a custom OID-like dataset

import numpy as np
import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(
        id='0000000000000001',
        image=np.ones((1, 5, 3)),
        subset='validation',
        annotations=[
            dm.Label(0, attributes={'score': 1}),
            dm.Label(1, attributes={'score': 0}),
        ],
    ),
], categories=['/m/0', '/m/1'])

dataset.export('./dataset', format='open_images')

Examples of using this format from the code can be found in the format tests.

4.26 - Pascal VOC

Format specification

Pascal VOC format specification is available here.

The dataset has annotations for multiple tasks. Each task has its own format in Datumaro, and there is also a combined voc format, which includes all the available tasks. The sub-formats have the same options as the “main” format and only limit the set of annotation files they work with. To work with multiple formats, use the corresponding option of the voc format.

Supported tasks / formats:

  • The combined format - voc
  • Image classification - voc_classification
  • Object detection - voc_detection
  • Action classification - voc_action
  • Class and instance segmentation - voc_segmentation
  • Person layout detection - voc_layout

Supported annotation types:

  • Label (classification)
  • Bbox (detection, action detection and person layout)
  • Mask (segmentation)

Supported annotation attributes:

  • occluded (boolean) - indicates that a significant portion of the object within the bounding box is occluded by another object
  • truncated (boolean) - indicates that the bounding box specified for the object does not correspond to the full extent of the object
  • difficult (boolean) - indicates that the object is considered difficult to recognize
  • action attributes (boolean) - jumping, reading and others. Indicate that the object does the corresponding action.
  • arbitrary attributes (string/number) - A Datumaro extension. Stored in the attributes section of the annotation xml file. Available for bbox annotations only.

Import Pascal VOC dataset

The Pascal VOC dataset is available for free download here.

A Datumaro project with a Pascal VOC source can be created in the following way:

datum create
datum import --format voc <path/to/dataset>

It is possible to specify the project name and project directory. Run datum create --help for more information.

Pascal VOC dataset directory should have the following structure:

└─ Dataset/
   ├── dataset_meta.json # a list of non-Pascal labels (optional)
   ├── labelmap.txt # or a list of non-Pascal labels in other format (optional)
   │
   ├── Annotations/
   │     ├── ann1.xml # Pascal VOC format annotation file
   │     ├── ann2.xml
   │     └── ...
   ├── JPEGImages/
   │    ├── img1.jpg
   │    ├── img2.jpg
   │    └── ...
   ├── SegmentationClass/ # directory with semantic segmentation masks
   │    ├── img1.png
   │    ├── img2.png
   │    └── ...
   ├── SegmentationObject/ # directory with instance segmentation masks
   │    ├── img1.png
   │    ├── img2.png
   │    └── ...
   │
   └── ImageSets/
        ├── Main/ # directory with list of images for detection and classification task
        │   ├── test.txt  # list of image names in test subset  (without extension)
        │   ├── train.txt # list of image names in train subset (without extension)
        │   └── ...
        ├── Layout/ # directory with list of images for person layout task
        │   ├── test.txt
        │   ├── train.txt
        │   └── ...
        ├── Action/ # directory with list of images for action classification task
        │   ├── test.txt
        │   ├── train.txt
        │   └── ...
        └── Segmentation/ # directory with list of images for segmentation task
            ├── test.txt
            ├── train.txt
            └── ...

The ImageSets directory should contain at least one of the directories: Main, Layout, Action, Segmentation. These directories contain .txt files, each listing the images of one subset; the subset name is the same as the .txt file name without the extension. Subset names can be arbitrary.
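
The relationship between list files and subsets can be sketched as follows. This is an illustration only, not Datumaro's importer; the function name list_voc_subsets is hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def list_voc_subsets(dataset_dir: str, task: str = "Main") -> Dict[str, List[str]]:
    """Map subset names to image names for one ImageSets/<task> directory."""
    task_dir = Path(dataset_dir) / "ImageSets" / task
    subsets = {}
    for list_file in sorted(task_dir.glob("*.txt")):
        names = []
        for line in list_file.read_text(encoding="utf-8").splitlines():
            if line.strip():
                # Keep only the first token: some list files carry
                # extra columns after the image name.
                names.append(line.split()[0])
        # The subset name is the file name without the .txt extension
        subsets[list_file.stem] = names
    return subsets
```

For the layout above, list_voc_subsets('<path/to/dataset>', 'Segmentation') would return one entry per .txt file in ImageSets/Segmentation.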

To add custom classes, you can use dataset_meta.json and labelmap.txt. If dataset_meta.json is not present in the dataset, labelmap.txt will be imported if possible.

In labelmap.txt you can define a custom color map and non-Pascal labels, for example:

# label_map [label : color_rgb : parts : actions]
helicopter:::
elephant:0,124,134:head,ear,foot:
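
The four colon-separated fields of a label map line can be read with a short parser. This is a minimal sketch, not Datumaro's own parser: the names LabelEntry and parse_labelmap are hypothetical, the color is assumed to be written as comma-separated R,G,B values, and only whole-line comments starting with # are skipped.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class LabelEntry(NamedTuple):
    color: Optional[Tuple[int, ...]]  # (r, g, b), or None if omitted
    parts: List[str]
    actions: List[str]


def parse_labelmap(text: str) -> Dict[str, LabelEntry]:
    """Parse lines of the form label:color_rgb:parts:actions."""
    entries = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        name, color, parts, actions = line.split(":")
        entries[name] = LabelEntry(
            color=tuple(int(c) for c in color.split(",")) if color else None,
            parts=[p for p in parts.split(",") if p],
            actions=[a for a in actions.split(",") if a],
        )
    return entries
```

A line such as helicopter::: produces an entry with no color, no parts and no actions, which matches the idea of a label that only adds a name.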

It is also possible to import grayscale (1-channel) PNG masks. For grayscale masks, provide a list of labels with one line for every color index from 0 up to the maximum color index used in the images. The lines must be in the right order, so that the line index is equal to the color index. Lines can have arbitrary, but different, colors. If there are gaps in the color indices used in the annotations, they must be filled with arbitrary dummy labels. Example:

car:0,128,0:: # color index 0
aeroplane:10,10,128:: # color index 1
_dummy2:2,2,2:: # filler for color index 2
_dummy3:3,3,3:: # filler for color index 3
boat:108,0,100:: # color index 4
...
_dummy198:198,198,198:: # filler for color index 198
_dummy199:199,199,199:: # filler for color index 199
the_last_label:12,28,0:: # color index 200
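
Writing the filler lines by hand is error-prone, so a list like this can be generated. The sketch below assumes the same _dummyN naming and N,N,N filler colors as in the example above; the function name build_grayscale_label_list is hypothetical and not part of Datumaro.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]


def build_grayscale_label_list(labels: Dict[int, Tuple[str, Color]]) -> List[str]:
    """Return one label map line per color index, filling gaps with dummies."""
    lines = []
    for index in range(max(labels) + 1):
        if index in labels:
            name, (r, g, b) = labels[index]
        else:
            # Filler label for an unused color index
            name, (r, g, b) = f"_dummy{index}", (index, index, index)
        lines.append(f"{name}:{r},{g},{b}::")

    # Colors must all differ, so reject accidental duplicates
    colors = [line.split(":")[1] for line in lines]
    if len(set(colors)) != len(colors):
        raise ValueError("label colors must be unique")
    return lines
```

The position of each returned line equals its color index, so the result can be written to the label list file line by line.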

You can import a dataset for a specific Pascal VOC task instead of the whole dataset, for example:

datum import -f voc_detection -r ImageSets/Main/train.txt <path/to/dataset>

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project information.

Export to other formats

Datumaro can convert a Pascal VOC dataset into any other format Datumaro supports.

Such conversion will only be successful if the output format can represent the type of dataset you want to convert, e.g. image classification annotations can be saved in ImageNet format, but not as COCO keypoints.

There are several ways to convert a Pascal VOC dataset to other dataset formats:

datum create
datum import -f voc <path/to/voc>
datum export -f coco -o <output/dir>

or

datum convert -if voc -i <path/to/voc> -f coco -o <output/dir>

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'voc')
dataset.export('save_dir', 'coco', save_media=True)

Export to Pascal VOC

There are several ways to convert an existing dataset to Pascal VOC format:

# export dataset into Pascal VOC format (classification) from existing project
datum export -p <path/to/project> -f voc -o <output/dir> -- --tasks classification
# converting to Pascal VOC format from other format
datum convert -if imagenet -i <path/to/dataset> \
    -f voc -o <output/dir> \
    -- --label-map voc --save-media

Extra options for exporting to Pascal VOC format:

  • --save-media - save media files when exporting the dataset (by default, False)
  • --image-ext IMAGE_EXT - save images with the specified extension when exporting the dataset (by default, uses the original extension or .jpg if there isn’t one)
  • --save-dataset-meta - save the dataset meta file when exporting the dataset (by default, False)
  • --apply-colormap APPLY_COLORMAP - use a colormap for class and instance masks (by default, True)
  • --allow-attributes ALLOW_ATTRIBUTES - allow export of attributes (by default, True)
  • --keep-empty KEEP_EMPTY - write subset lists even if they are empty (by default, False)
  • --tasks TASKS - specify the tasks to export; by default, Datumaro uses all tasks. Example:
datum export -f voc -- --tasks detection,classification
  • --label-map PATH - define a custom colormap. Example:
# mycolormap.txt [label : color_rgb : parts : actions]:
# cat:0,0,255::
# person:255,0,0:head:
datum export -f voc_segmentation -- --label-map mycolormap.txt

or you can use the original VOC colormap:

datum export -f voc_segmentation -- --label-map voc

Examples

Datumaro supports filtering, transformation, merging etc. for all formats and for the Pascal VOC format in particular. Follow the user manual to get more information about these operations.

Here are a few examples of using Datumaro operations to solve particular problems with the Pascal VOC dataset:

Example 1. How to prepare an original dataset for training.

In this example, preparing the original dataset for training a semantic segmentation model includes loading the dataset, checking for duplicate images, limiting the number of images, splitting into subsets, and exporting the result to the Pascal VOC format.

datum create -o project
datum import -p project -f voc_segmentation ./VOC2012/ImageSets/Segmentation/trainval.txt
datum stats -p project # check statistics.json -> repeated images
datum transform -p project -t ndr -- -w trainval -k 2500
datum filter -p project -e '/item[subset="trainval"]'
datum transform -p project -t random_split -- -s train:.8 -s val:.2
datum export -p project -f voc -- --label-map voc --save-media

Example 2. How to create a custom dataset

import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id='image1', image=dm.Image(path='image1.jpg', size=(10, 20)),
        annotations=[
            dm.Label(3),
            dm.Bbox(1.0, 1.0, 10.0, 8.0, label=0, attributes={'difficult': True, 'running': True}),
            dm.Polygon([1, 2, 3, 2, 4, 4], label=2, attributes={'occluded': True}),
            dm.Polygon([6, 7, 8, 8, 9, 7, 9, 6], label=2),
        ]
    ),
], categories=['person', 'sky', 'water', 'lion'])

dataset.transform('polygons_to_masks')
dataset.export('./mydataset', format='voc', label_map='my_labelmap.txt')

my_labelmap.txt has the following contents:

# label:color_rgb:parts:actions
person:0,0,255:hand,foot:jumping,running
sky:128,0,0::
water:0,128,0::
lion:255,128,0::

Example 3. Load, filter and convert from code

Load the Pascal VOC dataset and export the train subset, keeping only items that have the jumping attribute:

import datumaro as dm

dataset = dm.Dataset.import_from('./VOC2012', format='voc')

train_dataset = dataset.get_subset('train').as_dataset()

def only_jumping(item):
    for ann in item.annotations:
        if ann.attributes.get('jumping'):
            return True
    return False

train_dataset.select(only_jumping)

train_dataset.export('./jumping_label_me', format='label_me', save_media=True)

Example 4. Get information about items in the Pascal VOC 2012 dataset for the segmentation task:

import datumaro as dm

dataset = dm.Dataset.import_from('./VOC2012', format='voc')

def has_mask(item):
    for ann in item.annotations:
        if ann.type == dm.AnnotationType.mask:
            return True
    return False

dataset.select(has_mask)

print("Pascal VOC 2012 has %s images for segmentation task:" % len(dataset))
for subset_name, subset in dataset.subsets().items():
    for item in subset:
        print(item.id, subset_name, end=";")

After executing this code, we can see that there are 5826 images in Pascal VOC 2012 for the segmentation task, which matches the official documentation.

Examples of using this format from the code can be found in tests.

4.27 - Supervisely Point Cloud

Format specification

Specification for the Point Cloud data format is available here.

You can also find examples of working with the dataset here.

Supported annotation types:

  • cuboid_3d

Supported annotation attributes:

  • track_id (read/write, integer), responsible for object field
  • createdAt (write, string),
  • updatedAt (write, string),
  • labelerLogin (write, string), responsible for the corresponding fields in the annotation file.
  • arbitrary attributes

Supported image attributes:

  • description (read/write, string),
  • createdAt (write, string),
  • updatedAt (write, string),
  • labelerLogin (write, string), responsible for the corresponding fields in the annotation file.
  • frame (read/write, integer). Indicates frame number of the image.
  • arbitrary attributes

Import Supervisely Point Cloud dataset

An example dataset in Supervisely Point Cloud format is available for download:

https://drive.google.com/u/0/uc?id=1BtZyffWtWNR-mk_PHNPMnGgSlAkkQpBl&export=download

Point Cloud dataset directory should have the following structure:

└─ Dataset/
    ├── ds0/
    │   ├── ann/
    │   │   ├── <pcdname1.pcd.json>
    │   │   ├── <pcdname2.pcd.json>
    │   │   └── ...
    │   ├── pointcloud/
    │   │   ├── <pcdname1.pcd>
    │   │   ├── <pcdname2.pcd>
    │   │   └── ...
    │   ├── related_images/
    │   │   ├── <pcdname1_pcd>/
    │   │   │  ├── <image_name.ext>
    │   │   │  ├── <image_name.ext.json>
    │   │   └── ...
    ├── key_id_map.json
    └── meta.json
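
The naming relationships in this layout can be sketched in a few lines of Python. The `sly_paths` helper below is hypothetical (it is not a Datumaro function). It assumes, based only on the tree above, that the annotation file is the point cloud file name plus `.json`, and that the related images directory replaces the `.pcd` suffix with `_pcd`.

```python
from pathlib import PurePosixPath


def sly_paths(dataset_root, subset, pcd_name):
    """Return the expected locations for one point cloud (hypothetical helper)."""
    base = PurePosixPath(dataset_root) / subset
    # Assumption: '<name>.pcd' maps to the directory '<name>_pcd'.
    stem = pcd_name[:-len('.pcd')] if pcd_name.endswith('.pcd') else pcd_name
    return {
        'pointcloud': base / 'pointcloud' / pcd_name,
        # Annotation file: the point cloud file name plus '.json'.
        'annotation': base / 'ann' / (pcd_name + '.json'),
        'related_images': base / 'related_images' / (stem + '_pcd'),
    }


paths = sly_paths('Dataset', 'ds0', 'frame1.pcd')
for kind, path in paths.items():
    print(kind, path)
```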

There are two ways to import a Supervisely Point Cloud dataset:

datum create
datum import --format sly_pointcloud --input-path <path/to/dataset>

or

datum create
datum import -f sly_pointcloud <path/to/dataset>

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project and dataset information.

Export to other formats

Datumaro can convert a Supervisely Point Cloud dataset into any other format Datumaro supports.

Such conversion will only be successful if the output format can represent the type of dataset you want to convert, e.g. 3D point clouds can be saved in KITTI Raw format, but not in COCO keypoints.

There are several ways to convert a Supervisely Point Cloud dataset to other dataset formats:

datum create
datum import -f sly_pointcloud <path/to/sly_pcd/>
datum export -f kitti_raw -o <output/dir>

or

datum convert -if sly_pointcloud -i <path/to/sly_pcd/> -f kitti_raw

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'sly_pointcloud')
dataset.export('save_dir', 'kitti_raw', save_media=True)

Export to Supervisely Point Cloud

There are several ways to convert a dataset to Supervisely Point Cloud format:

# export dataset into Supervisely Point Cloud format from existing project
datum export -p <path/to/project> -f sly_pointcloud -o <output/dir> \
    -- --save-media
# converting to Supervisely Point Cloud format from other format
datum convert -if kitti_raw -i <path/to/dataset> \
    -f sly_pointcloud -o <output/dir> -- --save-media

Extra options for exporting in Supervisely Point Cloud format:

  • --save-media allows exporting the dataset together with its media files. This includes point clouds and related images (by default False)
  • --image-ext IMAGE_EXT allows specifying the image extension for the exported dataset (by default, keep the original or use .png if none)
  • --reindex assigns new indices to frames and annotations.
  • --allow-undeclared-attrs allows writing arbitrary annotation attributes. By default, only attributes specified in the input dataset metainfo will be written.

Examples

Example 1. Import dataset, compute statistics

datum create -o project
datum import -p project -f sly_pointcloud ../sly_dataset/
datum stats -p project

Example 2. Convert Supervisely Point Clouds to KITTI Raw

datum convert -if sly_pointcloud -i ../sly_pcd/ \
    -f kitti_raw -o my_kitti/ -- --save-media --reindex --allow-attrs

Example 3. Create a custom dataset

import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id='frame_1',
        annotations=[
            dm.Cuboid3d(id=206, label=0,
                position=[320.86, 979.18, 1.04],
                attributes={'occluded': False, 'track_id': 1, 'x': 1}),

            dm.Cuboid3d(id=207, label=1,
                position=[318.19, 974.65, 1.29],
                attributes={'occluded': True, 'track_id': 2}),
        ],
        pcd='path/to/pcd1.pcd',
        attributes={'frame': 0, 'description': 'zzz'}
    ),

    dm.DatasetItem(id='frm2',
        annotations=[
            dm.Cuboid3d(id=208, label=1,
                position=[23.04, 8.75, -0.78],
                attributes={'occluded': False, 'track_id': 2})
        ],
        pcd='path/to/pcd2.pcd', related_images=['image2.png'],
        attributes={'frame': 1}
    ),
], categories=['cat', 'dog'])

dataset.export('my_dataset/', format='sly_pointcloud', save_media=True,
    allow_undeclared_attrs=True)

Examples of using this format from the code can be found in the format tests

4.28 - SYNTHIA

Format specification

The original SYNTHIA dataset is available here.

Datumaro supports all SYNTHIA formats except SYNTHIA-AL.

Supported annotation types:

  • Mask

Supported annotation attributes:

  • dynamic_object (boolean): whether the object is moving

Import SYNTHIA dataset

A Datumaro project with a SYNTHIA source can be created in the following way:

datum create
datum import --format synthia <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

synthia_dataset = dm.Dataset.import_from('<path/to/dataset>', 'synthia')

SYNTHIA dataset directory should have the following structure:

dataset/
├── dataset_meta.json # a list of non-format labels (optional)
├── GT/
│   ├── COLOR/
│   │   ├── Stereo_Left/
│   │   │   ├── Omni_B
│   │   │   │   ├── 000000.png
│   │   │   │   ├── 000001.png
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── Stereo_Right
│   │       ├── Omni_B
│   │       │   ├── 000000.png
│   │       │   ├── 000001.png
│   │       │   └── ...
│   │       └── ...
│   └── LABELS
│       ├── Stereo_Left
│       │   ├── Omni_B
│       │   │   ├── 000000.png
│       │   │   ├── 000001.png
│       │   │   └── ...
│       │   └── ...
│       └── Stereo_Right
│           ├── Omni_B
│           │   ├── 000000.png
│           │   ├── 000001.png
│           │   └── ...
│           └── ...
└── RGB
    ├── Stereo_Left
    │   ├── Omni_B
    │   │   ├── 000000.png
    │   │   ├── 000001.png
    │   │   └── ...
    │   └── ...
    └── Stereo_Right
        ├── Omni_B
        │   ├── 000000.png
        │   ├── 000001.png
        │   └── ...
        └── ...

  • RGB folder containing standard RGB images used for training.
  • GT/LABELS folder containing PNG files (one per image). Annotations are given in three channels. The red channel contains the class of that pixel. The green channel contains the class only for those objects that are dynamic (cars, pedestrians, etc.), otherwise it contains 0.
  • GT/COLOR folder containing PNG files (one per image). Annotations are given using a color representation.
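
As an illustration of the GT/LABELS channel layout, the sketch below decodes single pixels in plain Python. The `decode_label_pixel` function is hypothetical and does not reproduce Datumaro's importer. It assumes that a non-zero green value is what marks an object as dynamic, and the class ids in the sample row are made up.

```python
def decode_label_pixel(pixel):
    """Decode one (R, G, B) pixel of a GT/LABELS image (hypothetical helper)."""
    red, green, _blue = pixel
    return {
        # The red channel holds the class of the pixel.
        'class_id': red,
        # Assumption: a non-zero green value marks a dynamic object.
        'dynamic_object': green != 0,
    }


# A tiny 1x3 "image" given as a list of (R, G, B) tuples.
# The class ids 3, 8 and 10 are invented for this example.
row = [(3, 0, 0), (8, 8, 0), (10, 10, 0)]
decoded = [decode_label_pixel(p) for p in row]
for entry in decoded:
    print(entry)
```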

When importing a dataset, only the GT/LABELS folder is used. If it is missing, the GT/COLOR folder is used instead.

The original dataset also contains depth information, but Datumaro does not currently support it.

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert a SYNTHIA dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports segmentation masks.

There are several ways to convert a SYNTHIA dataset to other dataset formats using CLI:

datum create
datum import -f synthia <path/to/dataset>
datum export -f voc -o <output/dir> -- --save-media

or

datum convert -if synthia -i <path/to/dataset> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'synthia')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in the format tests

4.29 - Velodyne Points / KITTI Raw 3D

Format specification

Velodyne Points / KITTI Raw 3D data format homepage is available here.

Velodyne Points / KITTI Raw 3D data format specification is available here.

Supported annotation types:

  • Cuboid3d (represent tracks)

Supported annotation attributes:

  • truncation (write, string), possible values: truncation_unset, in_image, truncated, out_image, behind_image (case-independent).
  • occlusion (write, string), possible values: occlusion_unset, visible, partly, fully (case-independent). This attribute has priority over occluded.
  • occluded (read/write, boolean)
  • keyframe (read/write, boolean). Responsible for occlusion_kf field.
  • track_id (read/write, integer). Groups annotations across frames into a track.

Supported image attributes:

  • frame (read/write, integer). Indicates frame number of the image.

Import KITTI Raw dataset

The velodyne points/KITTI Raw dataset is available for download here and here.

KITTI Raw dataset directory should have the following structure:

└─ Dataset/
    ├── dataset_meta.json # a list of custom labels (optional)
    ├── image_00/ # optional, aligned images from different cameras
    │   └── data/
    │       ├── <name1.ext>
    │       └── <name2.ext>
    ├── image_01/
    │   └── data/
    │       ├── <name1.ext>
    │       └── <name2.ext>
    ...
    │
    ├── velodyne_points/ # optional, 3d point clouds
    │   └── data/
    │       ├── <name1.pcd>
    │       └── <name2.pcd>
    ├── tracklet_labels.xml
    └── frame_list.txt # optional, required for custom image names

The format does not support arbitrary image names and paths, but Datumaro provides an option to use a special index file to allow this.

frame_list.txt contents:

12345 relative/path/to/name1/from/data
46 relative/path/to/name2/from/data
...
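
A minimal sketch of reading such an index file is shown below. The `parse_frame_list` helper is hypothetical. It assumes, as the sample suggests, that each line holds an integer frame number followed by a path relative to the data directory.

```python
def parse_frame_list(text):
    """Map frame numbers to relative item paths (hypothetical helper)."""
    frames = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on the first whitespace run only, so paths may contain spaces.
        frame_id, path = line.split(maxsplit=1)
        frames[int(frame_id)] = path
    return frames


sample = """\
12345 relative/path/to/name1/from/data
46 relative/path/to/name2/from/data
"""

frames = parse_frame_list(sample)
print(frames)
```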

To add custom classes, you can use dataset_meta.json.

A Datumaro project with a KITTI source can be created in the following way:

datum create
datum import --format kitti_raw <path/to/dataset>

To make sure that the selected dataset has been added to the project, you can run datum project info, which will display the project and dataset information.

Export to other formats

Datumaro can convert a KITTI Raw dataset into any other format Datumaro supports.

Such conversion will only be successful if the output format can represent the type of dataset you want to convert, e.g. 3D point clouds can be saved in Supervisely Point Clouds format, but not in COCO keypoints.

There are several ways to convert a KITTI Raw dataset to other dataset formats:

datum create
datum import -f kitti_raw <path/to/kitti_raw>
datum export -f sly_pointcloud -o <output/dir>

or

datum convert -if kitti_raw -i <path/to/kitti_raw> -f sly_pointcloud

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'kitti_raw')
dataset.export('save_dir', 'sly_pointcloud', save_media=True)

Export to KITTI Raw

There are several ways to convert a dataset to KITTI Raw format:

# export dataset into KITTI Raw format from existing project
datum export -p <path/to/project> -f kitti_raw -o <output/dir> \
    -- --save-media
# converting to KITTI Raw format from other format
datum convert -if sly_pointcloud -i <path/to/dataset> \
    -f kitti_raw -o <output/dir> -- --save-media --reindex

Extra options for exporting to KITTI Raw format:

  • --save-media allows exporting the dataset together with its media files. This includes point clouds and related images (by default False)
  • --image-ext IMAGE_EXT allows specifying the image extension for the exported dataset (by default, keep the original or use .png if none)
  • --reindex assigns new indices to frames and tracks. Allows annotations without track_id attribute (they will be exported as single-frame tracks).
  • --allow-attrs allows writing arbitrary annotation attributes. They will be written in <annotations> section of <poses><item> (disabled by default)

Examples

Example 1. Import dataset, compute statistics

datum create -o project
datum import -p project -f kitti_raw ../kitti_raw/
datum stats -p project

Example 2. Convert Supervisely Point Clouds to KITTI Raw

datum convert -if sly_pointcloud -i ../sly_pcd/ \
    -f kitti_raw -o my_kitti/ -- --save-media --allow-attrs

Example 3. Create a custom dataset

import numpy as np
import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id='some/name/qq',
        annotations=[
            dm.Cuboid3d(position=[13.54, -9.41, 0.24], label=0,
                attributes={'occluded': False, 'track_id': 1}),

            dm.Cuboid3d(position=[3.4, -2.11, 4.4], label=1,
                attributes={'occluded': True, 'track_id': 2})
        ],
        pcd='path/to/pcd1.pcd',
        related_images=[np.ones((10, 10)), 'path/to/image2.png', 'image3.jpg'],
        attributes={'frame': 0}
    ),
], categories=['cat', 'dog'])

dataset.export('my_dataset/', format='kitti_raw', save_media=True)

Examples of using this format from the code can be found in the format tests

4.30 - Vgg Face2 CSV

Format specification

Vgg Face 2 is a dataset for the face recognition task. The repository with some information and sample data of Vgg Face 2 is available here.

Supported types of annotations:

  • Bbox
  • Points
  • Label

The format doesn’t support any attributes for annotation objects.

Import Vgg Face2 dataset

A Datumaro project with a Vgg Face 2 dataset can be created in the following way:

datum create
datum import -f vgg_face2 <path_to_dataset>

Note: if you use datum import, then <path_to_dataset> should not be a subdirectory of the directory with the Datumaro project. See the docs for more information.

And you can also load Vgg Face 2 through the Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path_to_dataset>', format='vgg_face2')

For a successful import of a Vgg Face2 dataset, the input directory should have the following structure:

vgg_face2_dataset/
├── labels.txt # labels mapping
├── bb_landmark
│   ├── loose_bb_test.csv  # information about bounding boxes for test subset
│   ├── loose_bb_train.csv
│   ├── loose_bb_<any_other_subset_name>.csv
│   ├── loose_landmark_test.csv # landmark points information for test subset
│   ├── loose_landmark_train.csv
│   └── loose_landmark_<any_other_subset_name>.csv
├── test
│   ├── n000001 # directory with images for n000001 label
│   │   ├── 0001_01.jpg
│   │   ├── 0001_02.jpg
│   │   ├── ...
│   ├── n000002 # directory with images for n000002 label
│   │   ├── 0002_01.jpg
│   │   ├── 0003_01.jpg
│   │   ├── ...
│   ├── ...
├── train
│   ├── n000004
│   │   ├── 0004_01.jpg
│   │   ├── 0004_02.jpg
│   │   ├── ...
│   ├── ...
└── <any_other_subset_name>
    ├── ...

Export Vgg Face2 dataset

Datumaro can convert a Vgg Face2 dataset into any other format Datumaro supports. Here are a few examples of how to do it:

# Using `convert` command
datum convert -if vgg_face2 -i <path_to_vgg_face2> \
    -f voc -o <output_dir> -- --save-media

# Using Datumaro project
datum create
datum import -f vgg_face2 <path_to_vgg_face2>
datum export -f yolo -o <output_dir>

Note: to get the expected result from the conversion, the output format should support the same types of annotations (one or more) as Vgg Face2 (Bbox, Points, Label)

You can also convert your Vgg Face2 dataset using the Python API:

import datumaro as dm

vgg_face2_dataset = dm.Dataset.import_from('<path_to_dataset>', format='vgg_face2')

vgg_face2_dataset.export('<output_dir>', format='open_images', save_media=True)

Note: some formats have extra export options. See the docs for a particular format to learn about them.

Export dataset to the Vgg Face2 format

If you have a dataset in another format and want to convert it into the Vgg Face2 format, make sure that the dataset contains Bbox and/or Points and/or Label annotations, then use Datumaro to perform the conversion. Here are a few examples:

# Using convert command
datum convert -if wider_face -i <path_to_wider> \
    -f vgg_face2 -o <output_dir>

# Using Datumaro project
datum create
datum import -f wider_face <path_to_wider>
datum export -f vgg_face2 -o <output_dir> -- --save-media --image-ext '.png'

Note: vgg_face2 format supports only one Bbox per image

Extra options for exporting to Vgg Face2 format:

  • --save-media allows exporting the dataset together with its media files (by default False)
  • --image-ext <IMAGE_EXT> allows specifying the image extension for the exported dataset (by default .png)
  • --save-dataset-meta allows exporting the dataset together with the dataset meta file (by default False)

4.31 - VoTT CSV

Format specification

VoTT (Visual Object Tagging Tool) is an open source annotation tool released by Microsoft. VoTT CSV is the format used by VoTT when the user exports a project and selects “CSV” as the export format.

Supported annotation types:

  • Bbox

Import VoTT dataset

A Datumaro project with a VoTT CSV source can be created in the following way:

datum create
datum import --format vott_csv <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

vott_csv_dataset = dm.Dataset.import_from('<path/to/dataset>', 'vott_csv')

VoTT CSV dataset directory should have the following structure:

dataset/
├── dataset_meta.json # a list of custom labels (optional)
├── img0001.jpg
├── img0002.jpg
├── img0003.jpg
├── img0004.jpg
├── ...
├── test-export.csv
├── train-export.csv
└── ...

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert a VoTT CSV dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports bounding boxes.

There are several ways to convert a VoTT CSV dataset to other dataset formats using CLI:

datum create
datum import -f vott_csv <path/to/dataset>
datum export -f voc -o ./save_dir -- --save-media

or

datum convert -if vott_csv -i <path/to/dataset> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'vott_csv')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in VoTT CSV tests.

4.32 - VoTT JSON

Format specification

VoTT (Visual Object Tagging Tool) is an open source annotation tool released by Microsoft. VoTT JSON is the format used by VoTT when the user exports a project and selects “VoTT JSON” as the export format.

Supported annotation types:

  • Bbox

Import VoTT dataset

A Datumaro project with a VoTT JSON source can be created in the following way:

datum create
datum import --format vott_json <path/to/dataset>

It is also possible to import the dataset using Python API:

import datumaro as dm

vott_json_dataset = dm.Dataset.import_from('<path/to/dataset>', 'vott_json')

VoTT JSON dataset directory should have the following structure:

dataset/
├── dataset_meta.json # a list of custom labels (optional)
├── img0001.jpg
├── img0002.jpg
├── img0003.jpg
├── img0004.jpg
├── ...
├── test-export.json
├── train-export.json
└── ...

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert a VoTT JSON dataset into any other format Datumaro supports. To get the expected result, convert the dataset to a format that supports bounding boxes.

There are several ways to convert a VoTT JSON dataset to other dataset formats using CLI:

datum create
datum import -f vott_json <path/to/dataset>
datum export -f voc -o ./save_dir -- --save-media

or

datum convert -if vott_json -i <path/to/dataset> \
    -f voc -o <output/dir> -- --save-media

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'vott_json')
dataset.export('save_dir', 'voc')

Examples

Examples of using this format from the code can be found in VoTT JSON tests.

4.33 - WIDER Face

Format specification

The WIDER Face dataset is a face detection benchmark dataset that is available for download here.

Supported types of annotation:

  • Bbox
  • Label

Supported attributes for bboxes:

  • blur:
    • 0 face without blur;
    • 1 face with normal blur;
    • 2 face with heavy blur.
  • expression:
    • 0 face with typical expression;
    • 1 face with exaggerated expression.
  • illumination:
    • 0 image contains normal illumination;
    • 1 image contains extreme illumination.
  • pose:
    • 0 pose is typical;
    • 1 pose is atypical.
  • invalid:
    • 0 image is valid;
    • 1 image is invalid.
  • occluded:
    • 0 face without occlusion;
    • 1 face with partial occlusion;
    • 2 face with heavy occlusion.
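
The attribute values listed above can be captured in a small lookup table. The sketch below is purely illustrative: `WIDER_FACE_ATTRIBUTES` and `describe_attributes` are hypothetical names, not part of Datumaro, and the descriptions are shortened versions of the list entries.

```python
# Numeric attribute values and their meanings, as listed above.
WIDER_FACE_ATTRIBUTES = {
    'blur': {0: 'no blur', 1: 'normal blur', 2: 'heavy blur'},
    'expression': {0: 'typical', 1: 'exaggerated'},
    'illumination': {0: 'normal', 1: 'extreme'},
    'pose': {0: 'typical', 1: 'atypical'},
    'invalid': {0: 'valid', 1: 'invalid'},
    'occluded': {0: 'no occlusion', 1: 'partial occlusion', 2: 'heavy occlusion'},
}


def describe_attributes(attributes):
    """Translate numeric attribute values into readable descriptions."""
    described = {}
    for name, value in attributes.items():
        meanings = WIDER_FACE_ATTRIBUTES.get(name)
        # Reject attribute names or values that are not in the table.
        if meanings is None or value not in meanings:
            raise ValueError('unexpected attribute %s=%r' % (name, value))
        described[name] = meanings[value]
    return described


description = describe_attributes({'blur': 2, 'pose': 0, 'occluded': 1})
print(description)
```

Rejecting unknown names and out-of-range values early is a design choice of this sketch; a real importer may prefer to pass them through.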

Import WIDER Face dataset

Importing of WIDER Face dataset into the Datumaro project:

datum create
datum import -f wider_face <path_to_wider_face>

The directory with a WIDER Face dataset should have the following structure:

<path_to_wider_face>
├── labels.txt  # optional file with list of classes
├── wider_face_split # directory with description of bboxes for each image
│   ├── wider_face_subset1_bbx_gt.txt
│   ├── wider_face_subset2_bbx_gt.txt
│   ├── ...
├── WIDER_subset1 # instead of 'subset1' you can use any other subset name
│   └── images
│       ├── 0--label_0 # instead of 'label_<n>' you can use any other class name
│       │   ├──  0_label_0_image_01.jpg
│       │   ├──  0_label_0_image_02.jpg
│       │   ├──  ...
│       ├── 1--label_1
│       │   ├──  1_label_1_image_01.jpg
│       │   ├──  1_label_1_image_02.jpg
│       │   ├──  ...
│       ├── ...
├── WIDER_subset2
│  └── images
│      ├── ...
├── ...
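
The class directory names in this tree follow an `<index>--<class name>` pattern. As a rough sketch of how such a name can be split, the hypothetical `parse_class_dir` helper below assumes exactly that pattern; how Datumaro itself handles other names is not covered here.

```python
def parse_class_dir(dir_name):
    """Split '<index>--<class name>' into (index, class name) (hypothetical helper)."""
    index, separator, label = dir_name.partition('--')
    # Require the '--' separator, a numeric index and a non-empty class name.
    if not separator or not index.isdigit() or not label:
        raise ValueError('unexpected directory name: %r' % dir_name)
    return int(index), label


print(parse_class_dir('0--label_0'))
print(parse_class_dir('1--label_1'))
```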

Check the README file of the original WIDER Face dataset for more information about the structure of the .txt annotation files. An example of a WIDER Face dataset is also available in our test assets.

Export WIDER Face dataset

With Datumaro you can convert a WIDER Face dataset into any other format Datumaro supports. Note that the target format should also support the Label and/or Bbox annotation types.

A few ways to export a WIDER Face dataset using CLI:

# Using `convert` command
datum convert -if wider_face -i <path_to_wider_face> \
    -f voc -o <output_dir> -- --save-media

# Through the Datumaro project
datum create
datum import -f wider_face <path_to_wider_face>
datum export -f voc -o <output_dir> -- --save-media

Export WIDER Face dataset using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path_to_wider_face>', 'wider_face')

# Here you can perform some transformation using dataset.transform or
# dataset.filter

dataset.export('output_dir', 'open_images', save_media=True)

Note: some formats have extra export options. See the docs for a particular format to learn about them.

Export to WIDER Face dataset

Using Datumaro you can convert your dataset into the WIDER Face format, but for a successful export your dataset should contain Label and/or Bbox annotations.

Here is an example of exporting a VOC dataset (object detection task) into the WIDER Face format:

datum create
datum import -f voc_detection <path_to_voc>
datum export -f wider_face -o <output_dir> -- --save-media --image-ext='.png'

Available extra export options for WIDER Face dataset format:

  • --save-media allows exporting the dataset together with its media files (by default False)
  • --image-ext IMAGE_EXT allows specifying the image extension for the exported dataset (by default, keep the original)

4.34 - YOLO

Format specification

The YOLO dataset format is for training and validating object detection models. Specification for this format is available here.

You can also find official examples of working with YOLO dataset here.

Supported annotation types:

  • Bounding boxes

YOLO format doesn’t support attributes for annotations.

The format supports arbitrary subset names, except classes, names and backup.

Note that, by default, the YOLO framework does not expect any subset names except train and valid. Datumaro supports other names as an extension. If there is no subset separation in a project, the data will be saved in the train subset.
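
The subset naming rules above can be summarized in a short sketch. The `resolve_subset_name` helper is hypothetical and only mirrors the rules stated here: classes, names and backup are rejected, and items without a subset go to train. It is not Datumaro's actual implementation.

```python
# Names that the format reserves and that cannot be used for subsets.
RESERVED_NAMES = {'classes', 'names', 'backup'}


def resolve_subset_name(name):
    """Return the subset name to export under (hypothetical helper)."""
    if not name:
        # No subset separation: the data goes to the train subset.
        return 'train'
    if name in RESERVED_NAMES:
        raise ValueError('subset name %r is reserved by the format' % name)
    return name


print(resolve_subset_name(None))
print(resolve_subset_name('valid'))
print(resolve_subset_name('my_extra_subset'))
```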

Import YOLO dataset

A Datumaro project with a YOLO source can be created in the following way:

datum create
datum import --format yolo <path/to/dataset>

YOLO dataset directory should have the following structure:

└─ yolo_dataset/
   │
   ├── dataset_meta.json # a list of non-format labels (optional)
   ├── obj.names  # file with list of classes
   ├── obj.data   # file with dataset information
   ├── train.txt  # list of image paths in train subset
   ├── valid.txt  # list of image paths in valid subset
   │
   ├── obj_train_data/  # directory with annotations and images for train subset
   │    ├── image1.txt  # list of labeled bounding boxes for image1
   │    ├── image1.jpg
   │    ├── image2.txt
   │    ├── image2.jpg
   │    └── ...
   │
   └── obj_valid_data/  # directory with annotations and images for valid subset
        ├── image101.txt
        ├── image101.jpg
        ├── image102.txt
        ├── image102.jpg
        └── ...

  • obj.data should have the following content. It is not necessary to have both subsets, but at least one of them is required:

classes = 5 # optional
names = <path/to/obj.names>
train = <path/to/train.txt>
valid = <path/to/valid.txt>
backup = backup/ # optional

  • obj.names contains a list of classes. The line number for the class is the same as its index:

label1  # label1 has index 0
label2  # label2 has index 1
label3  # label3 has index 2
...

  • Files train.txt and valid.txt should have the following structure:

<path/to/image1.jpg>
<path/to/image2.jpg>
...

  • Files in directories obj_train_data/ and obj_valid_data/ should contain information about labeled bounding boxes for images:

# image1.txt:
# <label_index> <x_center> <y_center> <width> <height>
0 0.250000 0.400000 0.300000 0.400000
3 0.600000 0.400000 0.400000 0.266667

Here x_center, y_center, width, and height are relative to the image’s width and height. x_center and y_center are the coordinates of the rectangle’s center (not its top-left corner).
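
The coordinate convention can be illustrated with a small conversion sketch. The helper names `to_yolo`, `from_yolo` and `format_line` are hypothetical and not part of Datumaro. The input box is assumed to be given in pixels as top-left x, y plus width and height, and the 100x100 image size is chosen only so that the numbers match the first sample line above.

```python
def to_yolo(x, y, w, h, img_w, img_h):
    """Pixel box (top-left x, y, width, height) -> relative (x_center, y_center, width, height)."""
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def from_yolo(x_center, y_center, w, h, img_w, img_h):
    """Relative center-based box -> pixel box (top-left x, y, width, height)."""
    abs_w, abs_h = w * img_w, h * img_h
    return (x_center * img_w - abs_w / 2, y_center * img_h - abs_h / 2, abs_w, abs_h)


def format_line(label_index, box):
    """Format one annotation line with six decimal places."""
    return '%d %s' % (label_index, ' '.join('%.6f' % v for v in box))


# A 30x40 box whose top-left corner is at (10, 20) in a 100x100 image.
box = to_yolo(10, 20, 30, 40, img_w=100, img_h=100)
line = format_line(0, box)
print(line)
```

Converting back with `from_yolo(*box, img_w=100, img_h=100)` recovers the original pixel box up to floating-point rounding.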

To add custom classes, you can use dataset_meta.json.

Export to other formats

Datumaro can convert a YOLO dataset into any other format Datumaro supports. For a successful conversion, the output format should support the object detection task (e.g. Pascal VOC, COCO, TF Detection API etc.)

There are several ways to convert a YOLO dataset to other dataset formats:

datum create
datum import -f yolo <path/to/yolo/>
datum export -f voc -o <output/dir>

or

datum convert -if yolo -i <path/to/dataset> \
              -f coco_instances -o <output/dir>

Or, using Python API:

import datumaro as dm

dataset = dm.Dataset.import_from('<path/to/dataset>', 'yolo')
dataset.export('save_dir', 'coco_instances', save_media=True)

Export to YOLO format

Datumaro can convert an existing dataset to YOLO format, if the dataset supports object detection task.

Example:

datum create
datum import -f coco_instances <path/to/dataset>
datum export -f yolo -o <path/to/dataset> -- --save-media

Extra options for exporting to YOLO format:

  • --save-media allows exporting the dataset together with its media files (default: False)
  • --image-ext <IMAGE_EXT> allows specifying the image extension for the exported dataset (default: use the original or .jpg if none)
  • --add-path-prefix specifies whether to include the data/ path prefix in the annotation files (default: True)

Examples

Example 1. Prepare PASCAL VOC dataset for exporting to YOLO format dataset

datum create -o project
datum import -p project -f voc ./VOC2012
datum filter -p project -e '/item[subset="train" or subset="val"]'
datum transform -p project -t map_subsets -- -s train:train -s val:valid
datum export -p project -f yolo -- --save-media

Example 2. Remove a class from YOLO dataset

Delete all items that contain cat objects and remove cat from the list of classes:

datum create -o project
datum import -p project -f yolo ./yolo_dataset
datum filter -p project -m i+a -e '/item/annotation[label!="cat"]'
datum transform -p project -t remap_labels -- -l cat:
datum export -p project -f yolo -o ./yolo_without_cats

Example 3. Create a custom dataset in YOLO format

import numpy as np
import datumaro as dm

dataset = dm.Dataset.from_iterable([
    dm.DatasetItem(id='image_001', subset='train',
        image=np.ones((20, 20, 3)),
        annotations=[
            dm.Bbox(3.0, 1.0, 8.0, 5.0, label=1),
            dm.Bbox(1.0, 1.0, 10.0, 1.0, label=2)
        ]
    ),
    dm.DatasetItem(id='image_002', subset='train',
        image=np.ones((15, 10, 3)),
        annotations=[
            dm.Bbox(4.0, 4.0, 4.0, 4.0, label=3)
        ]
    )
], categories=['house', 'bridge', 'crosswalk', 'traffic_light'])

dataset.export('../yolo_dataset', format='yolo', save_media=True)

Example 4. Get information about objects on each image

If you only want information about label names for each image, then you can get it from code:

import datumaro as dm

dataset = dm.Dataset.import_from('./yolo_dataset', format='yolo')
cats = dataset.categories()[dm.AnnotationType.label]

for item in dataset:
    for ann in item.annotations:
        print(item.id, cats[ann.label].name)

And if you want complete information about each item, you can run:

datum create -o project
datum import -p project -f yolo ./yolo_dataset
datum filter -p project --dry-run -e '/item'

5 - Plugins

5.1 - OpenVINO™ Inference Interpreter

Interpreter samples to parse OpenVINO™ inference outputs. This section on GitHub

Models supported from interpreter samples

There are detection and image classification examples.

You can find more OpenVINO™ Trained Models here. To run the inference with OpenVINO™, the model format should be Intermediate Representation (IR). For the Caffe/TensorFlow/MXNet/Kaldi/ONNX models, please see the Model Conversion Instruction.

You need to implement your own interpreter samples to support the other OpenVINO™ Trained Models.

Model download

Prerequisites:

Open Model Zoo models can be downloaded with the Model Downloader tool from OpenVINO™ distribution:

cd <openvino_dir>/deployment_tools/open_model_zoo/tools/downloader
./downloader.py --name <model_name>

Example: download the “face-detection-0200” model

cd /opt/intel/openvino/deployment_tools/open_model_zoo/tools/downloader
./downloader.py --name face-detection-0200

Model inference

Prerequisites:

Examples

To run the inference with OpenVINO™ models and the interpreter samples, please follow the instructions below.

source <openvino_dir>/bin/setupvars.sh
datum create -o <proj_dir>
datum model add -l <launcher> -p <proj_dir> --copy -- \
  -d <path/to/xml> -w <path/to/bin> -i <path/to/interpreter/script>
datum import -p <proj_dir> -f <format> <path_to_dataset>
datum model run -p <proj_dir> -m model-0

Detection: ssd_mobilenet_v2_coco

source /opt/intel/openvino/bin/setupvars.sh
cd datumaro/plugins/openvino_plugin
datum create -o proj
datum model add -l openvino -p proj --copy -- \
    --output-layers=do_ExpandDims_conf/sigmoid \
    -d model/ssd_mobilenet_v2_coco.xml \
    -w model/ssd_mobilenet_v2_coco.bin \
    -i samples/ssd_mobilenet_coco_detection_interp.py
datum import -p proj -f voc VOCdevkit/
datum model run -p proj -m model-0

Classification: mobilenet-v2-pytorch

source /opt/intel/openvino/bin/setupvars.sh
cd datumaro/plugins/openvino_plugin
datum create -o proj
datum model add -l openvino -p proj --copy -- \
    -d model/mobilenet-v2-pytorch.xml \
    -w model/mobilenet-v2-pytorch.bin \
    -i samples/mobilenet_v2_pytorch_interp.py
datum import -p proj -f voc VOCdevkit/
datum model run -p proj -m model-0

6 - Contribution Guide

Installation

Prerequisites

  • Python (3.7+)

git clone https://github.com/cvat-ai/datumaro

Optionally, install a virtual environment (recommended):

python -m pip install virtualenv
python -m virtualenv venv
. venv/bin/activate

Then install all dependencies:

pip install -r requirements.txt

Install Datumaro:

pip install -e /path/to/the/cloned/repo/

Optional dependencies

These components are only required for plugins and not installed by default:

  • OpenVINO
  • Accuracy Checker
  • TensorFlow
  • PyTorch
  • MxNet
  • Caffe

Usage

datum --help
python -m datumaro --help
python datumaro/ --help
python datum.py --help

import datumaro

Code style

Try to be readable and consistent with the existing codebase.

The project uses Black for code formatting and isort for sorting import statements. You can find the corresponding configurations in pyproject.toml in the repository root. Lines must have no trailing whitespace and be at most 100 characters long.

Datumaro includes a Git pre-commit hook, dev/pre-commit.py, that can help you follow the style requirements. See the comment at the top of that file for more information.

Environment

The recommended editor is VS Code with the Python language plugin.

Testing

It is expected that all Datumaro functionality is covered and checked by unit tests. Tests are placed in the tests/ directory. Additional pre-generated files for tests can be stored in the tests/assets/ directory. CLI tests are kept separate from the core tests and are stored in the tests/cli/ directory.

Currently, we use pytest for testing.

To run tests use:

pytest -v

or

python -m pytest -v

Test cases

Test marking

For better integration with CI and requirements tracking, we use special annotations for tests.

A test needs to be linked with the requirement it is related to. To link a test, use:

from unittest import TestCase
from .requirements import Requirements, mark_requirement

class MyTests(TestCase):
    @mark_requirement(Requirements.DATUM_GENERAL_REQ)
    def test_my_requirement(self):
        ...  # do stuff

Such marking will apply markings from the requirement specified. They can be overridden for a specific test:

from unittest import TestCase

import pytest

from .requirements import Requirements, mark_requirement

class MyTests(TestCase):
    @pytest.mark.priority_low
    @mark_requirement(Requirements.DATUM_GENERAL_REQ)
    def test_my_requirement(self):
        ...  # do stuff
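The override behaviour can be pictured with a small standard-library model. This is only an illustration of the idea (defaults coming from the requirement, a more specific decorator winning); it is not how Datumaro or pytest implement markers. The names DEFAULT_MARKS, mark_requirement_sketch and priority, and the default values, are all hypothetical.

```python
# Hypothetical default markings attached to each requirement.
DEFAULT_MARKS = {
    "DATUM_GENERAL_REQ": {"priority": "medium", "component": "Datumaro"},
}


def _marks(func):
    """Return the marks dictionary stored on a function, creating it if needed."""
    if not hasattr(func, "marks"):
        func.marks = {}
    return func.marks


def mark_requirement_sketch(requirement):
    """Link a test to a requirement and apply that requirement's defaults."""
    def decorator(func):
        marks = _marks(func)
        marks["requirement"] = requirement
        for key, value in DEFAULT_MARKS.get(requirement, {}).items():
            marks.setdefault(key, value)  # never overwrite an explicit mark
        return func
    return decorator


def priority(level):
    """Explicitly set the priority, overriding any default."""
    def decorator(func):
        _marks(func)["priority"] = level
        return func
    return decorator


@priority("low")
@mark_requirement_sketch("DATUM_GENERAL_REQ")
def test_with_override():
    pass


@mark_requirement_sketch("DATUM_GENERAL_REQ")
def test_with_defaults():
    pass


if __name__ == "__main__":
    print(test_with_override.marks)
    print(test_with_defaults.marks)
```

In this model the explicit priority decorator wins regardless of decorator order, because the requirement defaults are only filled in where no value is set yet.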

Requirements

Requirements and other links need to be added to tests/requirements.py:

DATUM_244 = "Add Snyk integration"
DATUM_BUG_219 = "Return format is not uniform"

# Fully defined in GitHub issues:
@pytest.mark.reqids(Requirements.DATUM_244, Requirements.DATUM_333)

# And defined any other way:
@pytest.mark.reqids(Requirements.DATUM_GENERAL_REQ)

Available annotations for tests and requirements

Markings are defined in tests/conftest.py.

A list of requirements and bugs

@pytest.mark.reqids(Requirements.DATUM_123)
@pytest.mark.bugs(Requirements.DATUM_BUG_456)

A priority

@pytest.mark.priority_low
@pytest.mark.priority_medium
@pytest.mark.priority_high

A component: the marking used to indicate different system components

@pytest.mark.components(DatumaroComponent.Datumaro)

Skipping tests

@pytest.mark.skip(SkipMessages.NOT_IMPLEMENTED)

Parametrized runs

Parametrization runs the same test with different arguments, e.g.:

@pytest.mark.parametrize("numpy_array, batch_size", [
    (np.zeros([2]), 0),
    (np.zeros([2]), 1),
    (np.zeros([2]), 2),
    (np.zeros([2]), 5),
    (np.zeros([5]), 2),
])
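For context, a complete parametrized test might look like the following minimal sketch. The helper split_into_batches and the test name are hypothetical and exist only to give the decorator something to exercise; plain lists are used instead of NumPy arrays to keep the example dependency-free apart from pytest.

```python
import pytest


def split_into_batches(items, batch_size):
    """Hypothetical helper under test: split a list into consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


@pytest.mark.parametrize("items, batch_size, expected_count", [
    ([0, 0], 1, 2),
    ([0, 0], 2, 1),
    ([0, 0], 5, 1),
    ([0, 0, 0, 0, 0], 2, 3),
])
def test_batch_count(items, batch_size, expected_count):
    batches = split_into_batches(items, batch_size)
    assert len(batches) == expected_count
    # Batching must not lose or duplicate elements.
    assert sum(len(batch) for batch in batches) == len(items)
```

pytest runs test_batch_count once per tuple in the list, reporting each case separately.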

Test documentation

Tests are documented with docstrings. Test descriptions must contain the following sections: Description, Expected results and Steps.

def test_can_convert_polygons_to_mask(self):
    """
    <b>Description:</b>
    Ensure that the dataset polygon annotation can be properly converted
    into dataset segmentation mask.

    <b>Expected results:</b>
    Dataset segmentation mask converted from dataset polygon annotation
    is equal to an expected mask.

    <b>Steps:</b>
    1. Prepare dataset with polygon annotation
    2. Prepare dataset with expected mask segmentation mode
    3. Convert source dataset to target, with conversion of annotation
      from polygon to mask.
    4. Verify that resulting segmentation mask is equal to the expected mask.
    """
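A quick way to check this convention is to look for the three section titles in a test's docstring. The following is a minimal sketch; the helper name missing_doc_sections is hypothetical and is not part of Datumaro.

```python
import inspect

# Section titles required by the convention described above.
REQUIRED_SECTIONS = ("Description:", "Expected results:", "Steps:")


def missing_doc_sections(func):
    """Return the required section titles that are absent from func's docstring."""
    doc = inspect.getdoc(func) or ""
    return [title for title in REQUIRED_SECTIONS if title not in doc]


def test_documented():
    """
    <b>Description:</b>
    Example of a fully documented test.

    <b>Expected results:</b>
    Nothing is reported as missing.

    <b>Steps:</b>
    1. Call the checker on this function.
    """


def test_undocumented():
    pass


if __name__ == "__main__":
    print(missing_doc_sections(test_documented))
    print(missing_doc_sections(test_undocumented))
```

The check is a plain substring match, so it accepts the titles with or without the surrounding <b> tags.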

7 - Release notes

Release notes for the version under development can be found in CHANGELOG.md on the develop branch.