Transform Dataset
Datasets often need to be modified during preparation for model training and
experimentation. In trivial cases this can be done manually, e.g. by renaming
images or labels. However, in more complex cases even simple modifications
can require too much effort, distracting the user from the real work.
Datumaro provides the datum transform command to help in such cases.
This command allows you to modify dataset images or annotations all at once.
This command is designed for batch dataset processing, so if you only need to modify a few elements of a dataset, you might want to use other approaches for better performance. A possible solution is a simple script that uses the Datumaro API.
The command can be applied to a dataset or a project build target,
a stage or the combined project target, in which case all the project
targets will be affected. A build tree stage will be recorded
if --stage is enabled, and the resulting dataset(s) will be
saved if --apply is enabled.
By default, datasets are updated in-place. The -o/--output-dir
option can be used to specify another output directory. When
updating in-place, use the --overwrite parameter (in-place
updates fail by default to prevent data loss), unless a project
target is modified.
The current project (-p/--project) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project's working tree is used.
Usage:
datum transform [-h] -t TRANSFORM [-o DST_DIR] [--overwrite]
[-p PROJECT_DIR] [--stage STAGE] [--apply APPLY] [target] [-- EXTRA_ARGS]
Parameters:
- <target> (string) - Target dataset revpath. By default, transforms all targets of the current project.
- -t, --transform (string) - Transform method name.
- --stage (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, allowing the resulting dataset to be reproduced later. Applicable only to main project targets (i.e. data sources and the project target, but not intermediate stages). Enabled by default.
- --apply (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
- -o, --output-dir (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the project target, but not intermediate stages) and dataset targets. If not specified, the results will be saved in-place.
- --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- <extra args> - The list of extra transformation parameters. Should be passed after the -- separator after the main command arguments. See transform descriptions for info about extra parameters. Use the --help option to print parameter info.
Examples:
- Split a VOC-like dataset randomly:
datum transform -t random_split --overwrite path/to/dataset:voc
- Rename images in a project data source by a regex from frame_XXX to XXX:
datum create <...>
datum import <...> -n source-1
datum transform -t rename source-1 -- -e '|frame_(\d+)|\\1|'
Built-in transforms
Basic dataset item manipulations:
- rename - Renames dataset items by regular expression
- id_from_image_name - Renames dataset items to their image filenames
- reindex - Renames dataset items with numbers
- ndr - Removes duplicated images from dataset
- sampler - Runs inference and leaves only the most representative images
Subset manipulations:
- random_split - Splits dataset into subsets randomly
- split - Splits dataset into subsets for classification, detection, segmentation or re-identification
- map_subsets - Renames and removes subsets
Annotation manipulations:
- remap_labels - Renames, adds or removes labels in dataset
- project_labels - Sets dataset labels to the requested sequence
- shapes_to_boxes - Replaces spatial annotations with bounding boxes
- boxes_to_masks - Converts bounding boxes to instance masks
- polygons_to_masks - Converts polygons to instance masks
- masks_to_polygons - Converts instance masks to polygons
- anns_to_labels - Replaces annotations having labels with label annotations
- merge_instance_segments - Merges grouped spatial annotations into a mask
- crop_covered_segments - Removes occluded segments of covered masks
- bbox_value_decrement - Subtracts 1 from bbox coordinates
Examples:
- Split a dataset randomly to train and test subsets, ratio is 2:1
datum transform -t random_split -- --subset train:.67 --subset test:.33
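To make the ratio arguments more concrete, here is a minimal Python sketch of a ratio-based random split over item ids. It is an illustration only, not Datumaro's implementation: the helper name `random_split_ids`, the `seed` parameter and the rounding rule are assumptions made for this example.

```python
import random


def random_split_ids(item_ids, ratios, seed=None):
    """Assign item ids to subsets according to (name, ratio) pairs.

    Hypothetical helper mirroring the idea behind
    `--subset train:.67 --subset test:.33`; not Datumaro's actual code.
    """
    total = sum(ratio for _, ratio in ratios)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("subset ratios must sum to 1")

    ids = list(item_ids)
    random.Random(seed).shuffle(ids)

    subsets = {}
    start = 0
    cumulative = 0.0
    for index, (name, ratio) in enumerate(ratios):
        cumulative += ratio
        # The last subset takes the remainder so no item is lost to rounding.
        is_last = index == len(ratios) - 1
        end = len(ids) if is_last else round(cumulative * len(ids))
        subsets[name] = ids[start:end]
        start = end
    return subsets


parts = random_split_ids(range(300), [("train", 0.67), ("test", 0.33)], seed=0)
print({name: len(ids) for name, ids in parts.items()})
# prints {'train': 201, 'test': 99}
```

Every item lands in exactly one subset, which is the property a 2:1 split relies on.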
- Split a dataset for a specific task. The tasks supported are classification, detection, segmentation and re-identification.
datum transform -t split -- \
-t classification --subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- \
-t detection --subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- \
-t segmentation --subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- \
-t reid --subset train:.5 --subset val:.2 --subset test:.3 \
--query .5
- Convert spatial annotations from one type to another
datum transform -t boxes_to_masks
datum transform -t masks_to_polygons
datum transform -t polygons_to_masks
datum transform -t shapes_to_boxes
- Set dataset labels to {person, cat, dog}, remove others, add missing.
  Original labels (can be any): cat, dog, elephant, human
  New labels: person (added), cat (kept), dog (kept)
datum transform -t project_labels -- -l person -l cat -l dog
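The label projection in this example can be pictured as building a new label list plus a mapping from old label indices to new ones. The sketch below shows that idea in plain Python; the helper `project_labels` is hypothetical and does not reproduce Datumaro's internal code.

```python
def project_labels(src_labels, dst_labels):
    """Return the new label list and an old-index -> new-index map.

    Hypothetical helper for illustration; not Datumaro's actual code.
    A value of None marks a source label that is absent from the target list.
    """
    dst_index = {name: idx for idx, name in enumerate(dst_labels)}
    index_map = {old: dst_index.get(name) for old, name in enumerate(src_labels)}
    return list(dst_labels), index_map


labels, index_map = project_labels(
    ["cat", "dog", "elephant", "human"],  # original labels
    ["person", "cat", "dog"],             # requested labels
)
print(labels)     # ['person', 'cat', 'dog']
print(index_map)  # {0: 1, 1: 2, 2: None, 3: None}
```

Here `person` is added at index 0, `cat` and `dog` are kept with shifted indices, and `elephant` and `human` have no counterpart in the result.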
- Remap dataset labels, person to car and cat to dog, keep bus, remove others
datum transform -t remap_labels -- \
-l person:car -l bus:bus -l cat:dog \
--default delete
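The effect of the mapping together with `--default delete` can be sketched in a few lines of Python. This is a simplified model over label names only; the helper `remap_labels` and the alternative `"keep"` default are assumptions made for the example, not a description of Datumaro's internals.

```python
def remap_labels(src_labels, mapping, default="keep"):
    """Rename labels via `mapping`; treat unmapped labels according to `default`.

    Hypothetical helper for illustration; not Datumaro's actual code.
    """
    if default not in ("keep", "delete"):
        raise ValueError("default must be 'keep' or 'delete'")

    result = []
    for name in src_labels:
        if name in mapping:
            new_name = mapping[name]
        elif default == "keep":
            new_name = name
        else:
            continue  # unmapped label is dropped
        if new_name not in result:  # labels merged into one name appear once
            result.append(new_name)
    return result


print(remap_labels(
    ["person", "bus", "cat", "dog", "tree"],
    {"person": "car", "bus": "bus", "cat": "dog"},
    default="delete",
))
# prints ['car', 'bus', 'dog']
```

Note that `bus` survives only because it is mapped to itself; with the delete default, every label missing from the mapping is removed.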
- Rename dataset items by a regular expression
  - Replace pattern with replacement
  - Remove frame_ from item ids
datum transform -t rename -- -e '|pattern|replacement|'
datum transform -t rename -- -e '|frame_(\d+)|\\1|'
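The `|pattern|replacement|` form resembles a sed-style substitution. The Python sketch below models it under the assumption that the first character of the expression acts as the delimiter; the helper `rename_id` is hypothetical and is not taken from Datumaro. The expression is passed as a Python raw string, so the backreference is written `\1`.

```python
import re


def rename_id(expression, item_id):
    """Apply a '<delim>pattern<delim>replacement<delim>' expression to an id.

    Hypothetical helper; assumes the first character is the delimiter.
    """
    delimiter = expression[0]
    parts = expression.split(delimiter)
    # parts[0] is the empty string before the first delimiter.
    if len(parts) < 3:
        raise ValueError("expected <delim>pattern<delim>replacement<delim>")
    pattern, replacement = parts[1], parts[2]
    return re.sub(pattern, replacement, item_id)


print(rename_id(r"|frame_(\d+)|\1|", "frame_000123"))       # 000123
print(rename_id("|pattern|replacement|", "my_pattern_01"))  # my_replacement_01
```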
- Create a dataset from the K hardest items for a model. The dataset will
  be split into the sampled and unsampled subsets, based on the model
  confidence, which is stored in the scores annotation attribute.
  There are five methods of sampling (the -m/--method option):
  - topk - Return the k items with the highest uncertainty
  - lowk - Return the k items with the lowest uncertainty
  - randk - Return k random items
  - mixk - Return half of the items using the topk method and the rest using the lowk method
  - randtopk - First select 3 times k items randomly, then return the topk among them
datum transform -t sampler -- \
-a entropy \
-i train \
-o sampled \
-u unsampled \
-m topk \
-k 20
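The five sampling methods can be compared side by side with a small Python sketch. It assumes that uncertainty is measured as the Shannon entropy of each item's class scores; the helper names, the input layout (item id mapped to a list of probabilities) and the handling of odd k in mixk are assumptions made for this illustration, not Datumaro's implementation.

```python
import math
import random


def entropy(scores):
    """Shannon entropy of a probability vector; higher means less certain."""
    return -sum(p * math.log(p) for p in scores if p > 0)


def sample(items, k, method="topk", seed=None):
    """Pick k item ids from `items` (id -> class probabilities).

    Hypothetical helper illustrating the five methods; not Datumaro's code.
    """
    rng = random.Random(seed)
    ids = list(items)
    # Most uncertain items first.
    by_uncertainty = sorted(ids, key=lambda i: entropy(items[i]), reverse=True)

    if method == "topk":
        return by_uncertainty[:k]
    if method == "lowk":
        return by_uncertainty[::-1][:k]
    if method == "randk":
        return rng.sample(ids, min(k, len(ids)))
    if method == "mixk":
        top = by_uncertainty[: k // 2]
        rest = [i for i in reversed(by_uncertainty) if i not in top]
        return top + rest[: k - len(top)]
    if method == "randtopk":
        # Draw 3*k candidates at random, then keep the k most uncertain.
        pool = rng.sample(ids, min(3 * k, len(ids)))
        pool.sort(key=lambda i: entropy(items[i]), reverse=True)
        return pool[:k]
    raise ValueError(f"unknown method: {method}")


scores = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [0.99, 0.01], "d": [0.6, 0.4]}
print(sample(scores, 2, "topk"))  # ['a', 'd']
print(sample(scores, 2, "lowk"))  # ['c', 'b']
print(sample(scores, 2, "mixk"))  # ['a', 'c']
```

In this toy data, item `a` is the least certain prediction and `c` the most certain, so topk and lowk pick from opposite ends of the ranking.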
- Remove duplicated images from a dataset. Keep at most N resulting images.
  - Available sampling options (the -e parameter):
    - random - sample from removed data randomly
    - similarity - sample from removed data with ascending
  - Available sampling methods (the -u parameter):
    - uniform - sample data with uniform distribution
    - inverse - sample data with reciprocal of the number
datum transform -t ndr -- \
-w train \
-a gradient \
-k 100 \
-e random \
-u uniform