Transform Dataset

Often datasets need to be modified during preparation for model training and experimenting. In trivial cases it can be done manually - e.g. image renaming or label renaming. However, in more complex cases even simple modifications can require too much efforts, distracting the user from the real work. Datumaro provides the datum transform command to help in such cases.

This command allows to modify dataset images or annotations all at once.

This command is designed for batch dataset processing, so if you only need to modify few elements of a dataset, you might want to use other approaches for better performance. A possible solution can be a simple script, which uses Datumaro API.

The command can be applied to a dataset or a project build target, a stage or the combined project target, in which case all the project targets will be affected. A build tree stage will be recorded if --stage is enabled, and the resulting dataset(-s) will be saved if --apply is enabled.

By default, datasets are updated in-place. The -o/--output-dir option can be used to specify another output directory. When updating in-place, use the --overwrite parameter (in-place updates fail by default to prevent data loss), unless a project target is modified.

The current project (-p/--project) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project’s working tree is used.

Usage:

datum transform [-h] -t TRANSFORM [-o DST_DIR] [--overwrite]
  [-p PROJECT_DIR] [--stage STAGE] [--apply APPLY] [target] [-- EXTRA_ARGS]

Parameters:

<target> (string) - Target dataset revpath. By default, transforms all targets of the current project.
-t, --transform (string) - Transform method name
--stage (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, allowing to reproduce the resulting dataset later. Applicable only to main project targets (i.e. data sources and the project target, but not intermediate stages). Enabled by default.
--apply (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
-o, --output-dir (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the project target, but not intermediate stages) and dataset targets. If not specified, the results will be saved inplace.
--overwrite - Allows to overwrite existing files in the output directory, when it is specified and is not empty.
-p, --project (string) - Directory of the project to operate on (default: current directory).
-h, --help - Print the help message and exit.
<extra args> - The list of extra transformation parameters. Should be passed after the -- separator after the main command arguments. See transform descriptions for info about extra parameters. Use the --help option to print parameter info.

Examples:

Split a VOC-like dataset randomly:

datum transform -t random_split --overwrite path/to/dataset:voc

Rename images in a project data source by a regex from frame_XXX to XXX:

datum create <...>
datum import <...> -n source-1
datum transform -t rename source-1 -- -e '|frame_(\d+)|\\1|'

Built-in transforms

Basic dataset item manipulations:

rename - Renames dataset items by regular expression
id_from_image_name - Renames dataset items to their image filenames
reindex - Renames dataset items with numbers
ndr - Removes duplicated images from dataset
sampler - Runs inference and leaves only the most representative images

Subset manipulations:

random_split - Splits dataset into subsets randomly
split - Splits dataset into subsets for classification, detection, segmentation or re-identification
map_subsets - Renames and removes subsets

Annotation manipulations:

remap_labels - Renames, adds or removes labels in dataset
project_labels - Sets dataset labels to the requested sequence
shapes_to_boxes - Replaces spatial annotations with bounding boxes
boxes_to_masks - Converts bounding boxes to instance masks
polygons_to_masks - Converts polygons to instance masks
masks_to_polygons - Converts instance masks to polygons
anns_to_labels - Replaces annotations having labels with label annotations
merge_instance_segments - Merges grouped spatial annotations into a mask
crop_covered_segments - Removes occluded segments of covered masks
bbox_value_decrement - Subtracts 1 from bbox coordinates

Examples:

Split a dataset randomly to train and test subsets, ratio is 2:1

datum transform -t random_split -- --subset train:.67 --subset test:.33

Split a dataset for a specific task. The tasks supported are classification, detection, segmentation and re-identification.

datum transform -t split -- \
  -t classification --subset train:.5 --subset val:.2 --subset test:.3

datum transform -t split -- \
  -t detection --subset train:.5 --subset val:.2 --subset test:.3

datum transform -t split -- \
  -t segmentation --subset train:.5 --subset val:.2 --subset test:.3

datum transform -t split -- \
  -t reid --subset train:.5 --subset val:.2 --subset test:.3 \
  --query .5

Convert spatial annotations between each other

datum transform -t boxes_to_masks
datum transform -t masks_to_polygons
datum transform -t polygons_to_masks
datum transform -t shapes_to_boxes

Set dataset labels to {person, cat, dog}, remove others, add missing. Original labels (can be any): cat, dog, elephant, human New labels: person (added), cat (kept), dog (kept)

datum transform -t project_labels -- -l person -l cat -l dog

Remap dataset labels, person to car and cat to dog, keep bus, remove others

datum transform -t remap_labels -- \
  -l person:car -l bus:bus -l cat:dog \
  --default delete

Rename dataset items by a regular expression
- Replace pattern with replacement
- Remove frame_ from item ids

datum transform -t rename -- -e '|pattern|replacement|'
datum transform -t rename -- -e '|frame_(\d+)|\\1|'

Create a dataset from K the most hard items for a model. The dataset will be split into the sampled and unsampled subsets, based on the model confidence, which is stored in the scores annotation attribute.

There are five methods of sampling (the -m/--method option):

topk - Return the k with high uncertainty data
lowk - Return the k with low uncertainty data
randk - Return the random k data
mixk - Return half to topk method and the rest to lowk method
randtopk - First, select 3 times the number of k randomly, and return the topk among them.

datum transform -t sampler -- \
  -a entropy \
  -i train \
  -o sampled \
  -u unsampled \
  -m topk \
  -k 20

Remove duplicated images from a dataset. Keep at most N resulting images.
- Available sampling options (the -e parameter):
  - random - sample from removed data randomly
  - similarity - sample from removed data with ascending
- Available sampling methods (the -u parameter):
  - uniform - sample data with uniform distribution
  - inverse - sample data with reciprocal of the number

datum transform -t ndr -- \
  -w train \
  -a gradient \
  -k 100 \
  -e random \
  -u uniform