Transform Dataset
Datasets often need to be modified while preparing data for model training
and experiments. In trivial cases this can be done manually, e.g. by renaming
images or labels. In more complex cases, however, even simple modifications
can require too much effort and distract the user from the real work.
Datumaro provides the datum transform command to help in such cases.
This command modifies dataset images or annotations all at once.
This command is designed for batch dataset processing, so if you only need to modify a few elements of a dataset, other approaches may perform better. One possible solution is a simple script that uses the Datumaro API.
The command can be applied to a dataset or a project build target,
a stage or the combined project target, in which case all the project
targets will be affected. A build tree stage will be recorded
if --stage is enabled, and the resulting dataset(s) will be
saved if --apply is enabled.
By default, datasets are updated in-place. The -o/--output-dir
option can be used to specify another output directory. When
updating in-place, use the --overwrite parameter (in-place
updates fail by default to prevent data loss), unless a project
target is modified.
The current project (-p/--project) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project’s working tree is used.
Usage:
datum transform [-h] -t TRANSFORM [-o DST_DIR] [--overwrite]
[-p PROJECT_DIR] [--stage STAGE] [--apply APPLY] [target] [-- EXTRA_ARGS]
Parameters:
- <target> (string) - Target dataset revpath. By default, transforms all targets of the current project.
- -t, --transform (string) - Transform method name
- --stage (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, allowing the resulting dataset to be reproduced later. Applicable only to main project targets (i.e. data sources and the project target, but not intermediate stages). Enabled by default.
- --apply (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.
- -o, --output-dir (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the project target, but not intermediate stages) and dataset targets. If not specified, the results will be saved in-place.
- --overwrite - Allows overwriting existing files in the output directory, when it is specified and is not empty.
- -p, --project (string) - Directory of the project to operate on (default: current directory).
- -h, --help - Print the help message and exit.
- <extra args> - The list of extra transformation parameters. Should be passed after the -- separator after the main command arguments. See transform descriptions for info about extra parameters. Use the --help option to print parameter info.
Examples:
- Split a VOC-like dataset randomly:
datum transform -t random_split --overwrite path/to/dataset:voc
- Rename images in a project data source by a regex from frame_XXX to XXX:
datum create <...>
datum import <...> -n source-1
datum transform -t rename source-1 -- -e '|^frame_||'
Built-in transforms
Basic dataset item manipulations:
- rename - Renames dataset items by regular expression
- id_from_image_name - Renames dataset items to their image filenames
- reindex - Renames dataset items with numbers
- ndr - Removes duplicated images from dataset
- relevancy_sampler - Leaves only the most important images (requires model inference results)
- random_sampler - Leaves no more than k items from the dataset randomly
- label_random_sampler - Leaves at least k images with annotations per class
- resize - Resizes images and annotations in the dataset
- remove_images - Removes specific images
- remove_annotations - Removes annotations
- remove_attributes - Removes attributes
Subset manipulations:
- random_split - Splits dataset into subsets randomly
- split - Splits dataset into subsets for classification, detection, segmentation or re-identification
- map_subsets - Renames and removes subsets
Annotation manipulations:
- remap_labels - Renames, adds or removes labels in dataset
- project_labels - Sets dataset labels to the requested sequence
- shapes_to_boxes - Replaces spatial annotations with bounding boxes
- boxes_to_masks - Converts bounding boxes to instance masks
- polygons_to_masks - Converts polygons to instance masks
- masks_to_polygons - Converts instance masks to polygons
- anns_to_labels - Replaces annotations having labels with label annotations
- merge_instance_segments - Merges grouped spatial annotations into a mask
- crop_covered_segments - Removes occluded segments of covered masks
- bbox_values_decrement - Subtracts 1 from bbox coordinates
rename
Renames items in the dataset. Supports regular expressions.
The first character in the expression is a delimiter for
the pattern and replacement parts. The replacement part can also
contain str.format replacement fields with the item
(of type DatasetItem) object available.
Usage:
rename [-h] [-e REGEX]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -e, --regex (string) - Regex for renaming in the form <sep><search><sep><replacement><sep>
Examples:
Replace ‘pattern’ with ‘replacement’:
datum transform -t rename -- -e '|pattern|replacement|'
Remove the frame_ prefix from item ids:
datum transform -t rename -- -e '|^frame_||'
Collect images from subdirectories into the base image directory using regex (the third group captures the file name without its directory part):
datum transform -t rename -- -e '|^((.+[/\\])*)?(.+)$|\3|'
Add subset prefix to images:
datum transform -t rename -- -e '|(.*)|{item.subset}_\1|'
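The expression format described above can be illustrated with a minimal Python sketch. It is not Datumaro's implementation: the Item class and the rename function are hypothetical stand-ins, and applying str.format after the regex substitution is an assumption made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a dataset item (not Datumaro's DatasetItem)."""
    id: str
    subset: str


def rename(item: Item, expression: str) -> str:
    """Apply a '<sep><search><sep><replacement><sep>' expression to item.id."""
    sep = expression[0]  # the first character is the delimiter
    parts = expression.split(sep)
    if len(parts) != 4:
        raise ValueError(f"expected {sep}search{sep}replacement{sep}")
    _, search, replacement, _ = parts
    new_id = re.sub(search, replacement, item.id)
    # Assumption: str.format fields such as {item.subset} are filled afterwards
    return new_id.format(item=item)


print(rename(Item("frame_001", "train"), "|^frame_||"))
print(rename(Item("001", "val"), r"|^(.*)$|{item.subset}_\1|"))
```

The second call anchors the pattern with ^ and $ so that the whole id is matched exactly once.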
id_from_image_name
Renames items in the dataset using image file name (without extension).
Usage:
id_from_image_name [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
reindex
Replaces dataset item IDs with sequential indices.
Usage:
reindex [-h] [-s START]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -s, --start (int) - Start value for item ids (default: 1)
ndr
Removes near-duplicated images in a subset.
Keeps at most -k/--num_cut resulting images.
Available oversampling policies (the -e parameter):
- random - sample from removed data randomly
- similarity - sample from removed data with ascending similarity score
Available undersampling policies (the -u parameter):
- uniform - sample data with uniform distribution
- inverse - sample data with reciprocal of the number of items with the same similarity
Usage:
ndr [-h] [-w WORKING_SUBSET] [-d DUPLICATED_SUBSET] [-a {gradient}]
[-k NUM_CUT] [-e {random,similarity}] [-u {uniform,inverse}] [-s SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -w, --working_subset (str) - Name of the subset to operate on (default: None)
- -d, --duplicated_subset (str) - Name of the subset for the removed data after NDR runs (default: duplicated)
- -a, --algorithm (one of: gradient) - Name of the algorithm to use (default: gradient)
- -k, --num_cut (int) - Maximum output dataset size
- -e, --over_sample (one of: random, similarity) - The policy to use when num_cut is bigger than result length (default: random)
- -u, --under_sample (one of: uniform, inverse) - The policy to use when num_cut is smaller than result length (default: uniform)
- -s, --seed (int) - Random seed
Example: apply NDR, return no more than 100 images
datum transform -t ndr -- \
--working_subset train \
--algorithm gradient \
--num_cut 100 \
--over_sample random \
--under_sample uniform
relevancy_sampler
Sampler that analyzes model inference results on the dataset and picks the most relevant samples for training.
Creates a dataset from the -k/--count hardest items for a model.
The whole dataset or a single subset will be split into the sampled
and unsampled subsets based on the model confidence. The dataset
must contain model confidence values in the scores attributes
of annotations.
There are five methods of sampling (the -m/--sampling_method option):
- topk - Return the k items with the highest uncertainty
- lowk - Return the k items with the lowest uncertainty
- randk - Return random k items
- mixk - Return a half using topk, and the other half using lowk method
- randtopk - Select 3*k items randomly, and return the topk among them
Notes:
- Each image’s inference result must contain the probability for
all classes (in the scores attribute).
- Requesting a sample larger than the number of all images will return all images.
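The five sampling methods can be sketched in Python as follows. This is an illustration only, not Datumaro's code: the pick_samples function, the dictionary input layout and the rounding used by mixk for odd k are assumptions, and Shannon entropy stands in for the entropy algorithm.

```python
import math
import random


def entropy(scores):
    """Shannon entropy of a class-probability vector (higher = less certain)."""
    return -sum(p * math.log(p) for p in scores if p > 0)


def pick_samples(items, k, method="topk", seed=None):
    """items: dict of item id -> class probabilities. Returns sampled ids."""
    rng = random.Random(seed)
    ids = list(items)
    k = min(k, len(ids))  # an oversized request returns all items
    ranked = sorted(ids, key=lambda i: entropy(items[i]), reverse=True)
    if method == "topk":
        return ranked[:k]
    if method == "lowk":
        return ranked[::-1][:k]
    if method == "randk":
        return rng.sample(ids, k)
    if method == "mixk":
        top = ranked[: k // 2]
        low = [i for i in ranked[::-1] if i not in top][: k - len(top)]
        return top + low
    if method == "randtopk":
        pool = rng.sample(ids, min(3 * k, len(ids)))
        pool.sort(key=lambda i: entropy(items[i]), reverse=True)
        return pool[:k]
    raise ValueError(f"unknown method: {method}")


scores = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [1.0, 0.0], "d": [0.7, 0.3]}
print(pick_samples(scores, 2, "topk"))  # the two least certain items
```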
Usage:
relevancy_sampler [-h] -k COUNT [-a {entropy}] [-i INPUT_SUBSET]
[-o SAMPLED_SUBSET] [-u UNSAMPLED_SUBSET]
[-m {topk,lowk,randk,mixk,randtopk}] [-d OUTPUT_FILE]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -k, --count (int) - Number of items to sample
- -a, --algorithm (one of: entropy) - Sampling algorithm (default: entropy)
- -i, --input_subset (str) - Subset name to select sample from (default: None)
- -o, --sampled_subset (str) - Subset name to put sampled data to (default: sample)
- -u, --unsampled_subset (str) - Subset name to put the rest of the data to (default: unsampled)
- -m, --sampling_method (one of: topk, lowk, randk, mixk, randtopk) - Sampling method (default: topk)
- -d, --output_file (path) - A .csv file path to dump sampling results
Examples:
Select the most relevant data subset of 20 images
based on model certainty, put the result into the sample subset
and put all the rest into the unsampled subset, use the train subset
as input. The dataset must contain model confidence values in the scores
attributes of annotations.
datum transform -t relevancy_sampler -- \
--algorithm entropy \
--input_subset train \
--sampled_subset sample \
--unsampled_subset unsampled \
--sampling_method topk -k 20
random_sampler
Sampler that keeps no more than the required number of items in the dataset.
Notes:
- Items are selected uniformly (tries to keep original item distribution by subsets)
- Requesting a sample larger than the number of all images will return all images
Usage:
random_sampler [-h] -k COUNT [-s SUBSET] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -k, --count (int) - Maximum number of items to sample
- -s, --subset (str) - Limit changes to this subset (default: affect the whole dataset)
- --seed (int) - Initial value for random number generator
Examples:
Select a subset of 20 images randomly
datum transform -t random_sampler -- -k 20
Select a subset of 20 images, modify only the train subset
datum transform -t random_sampler -- -k 20 -s train
label_random_sampler
Sampler that keeps at least the required number of annotations of each class in the dataset for each subset separately.
Consider using the “stats” command to get class distribution in the dataset.
Notes:
- Items can contain annotations of several selected classes
(e.g. 3 bounding boxes per image). The number of annotations in the
resulting dataset varies between max(class counts) and sum(class counts)
- If the input dataset does not have enough class annotations, the result will contain only what is available
- Items are selected uniformly
- For the reasons above, the resulting class distribution in the dataset may not be the same as requested
- The resulting dataset will only keep annotations for classes with
specified count > 0
Usage:
label_random_sampler [-h] -k COUNT [-l LABEL_COUNTS] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -k, --count (int) - Minimum number of annotations of each class
- -l, --label (str; repeatable) - Minimum number of annotations of a specific class. Overrides the -k/--count setting for the class. The format is <label_name>:<count>
- --seed (int) - Initial value for random number generator
Examples:
Select a dataset with at least 10 images of each class:
datum transform -t label_random_sampler -- -k 10
Select a dataset with at least 20 cat images, 5 dog, 0 car and 10 of each
unmentioned class:
# keep 20 images with cats and 5 images with dogs,
# remove car annotations, keep 10 images for the remaining classes
datum transform -t label_random_sampler -- \
-l cat:20 \
-l dog:5 \
-l car:0 \
-k 10
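How the per-class -l overrides combine with the -k default can be sketched in Python. The resolve_label_counts function is a hypothetical helper written for this illustration, not part of Datumaro; rejecting labels that are missing from the dataset is an assumption.

```python
def resolve_label_counts(labels, default_count, overrides=()):
    """Return the requested minimum annotation count for every kept label.

    labels: all label names in the dataset
    default_count: the -k value, used for labels without an override
    overrides: strings in the '<label_name>:<count>' form (the -l values)
    """
    counts = {name: default_count for name in labels}
    for entry in overrides:
        name, _, count = entry.rpartition(":")
        if not name or name not in counts:
            raise ValueError(f"bad label override: {entry!r}")
        counts[name] = int(count)
    # Only classes with a count above zero keep their annotations
    return {name: count for name, count in counts.items() if count > 0}


print(resolve_label_counts(["cat", "dog", "car", "bus"], 10,
                           ["cat:20", "dog:5", "car:0"]))
```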
resize
Resizes images and annotations in the dataset to the specified size. Supports upscaling, downscaling and mixed variants.
Usage:
resize [-h] [-dw WIDTH] [-dh HEIGHT]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -dw, --width (int) - Destination image width
- -dh, --height (int) - Destination image height
Examples:
Resize all images to 256x256 size
datum transform -t resize -- -dw 256 -dh 256
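Resizing an image also requires rescaling its annotations by the same factors. The sketch below shows this for a bounding box; the resize_bbox function and the (x, y, w, h) box layout are assumptions for illustration, not Datumaro's API.

```python
def resize_bbox(bbox, src_size, dst_size):
    """Scale an (x, y, w, h) box from src_size to dst_size, both (width, height)."""
    x, y, w, h = bbox
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    return (x * sx, y * sy, w * sx, h * sy)


# A 512x256 image resized to 256x256: width is halved, height is unchanged
print(resize_bbox((10, 20, 100, 50), (512, 256), (256, 256)))
# -> (5.0, 20.0, 50.0, 50.0)
```

Because the two factors are independent, one axis can be upscaled while the other is downscaled, which corresponds to the mixed variant.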
remove_images
Removes specific dataset items by their ids.
Usage:
remove_images [-h] [--id IDs]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --id (str) - Item id to remove. Id is an ‘<id>:<subset>’ pair (repeatable)
Examples:
Remove specific images from the dataset
datum transform -t remove_images -- --id 'image1:train' --id 'image2:test'
remove_annotations
Removes annotations from specific dataset items.
Can be useful for cleaning broken or unnecessary annotations from the dataset.
Usage:
remove_annotations [-h] [--id IDs]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --id (str) - Item id to clean from annotations. Id is an ‘<id>:<subset>’ pair. If not specified, removes all annotations (repeatable)
Examples:
Remove annotations from specific items in the dataset
datum transform -t remove_annotations -- --id 'image1:train' --id 'image2:test'
remove_attributes
Removes item and annotation attributes in a dataset.
Can be useful for cleaning broken or unnecessary attributes from the dataset.
Usage:
remove_attributes [-h] [--id IDs] [--attr ATTRIBUTE_NAME]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --id (str) - Item id to clean from attributes. Id is an ‘<id>:<subset>’ pair. If not specified, affects all items and annotations (repeatable)
- -a, --attr (str) - Attribute name to be removed. If not specified, removes all attributes (repeatable)
Examples:
Remove the is_crowd attribute from dataset
datum transform -t remove_attributes -- \
--attr 'is_crowd'
Remove the occluded attribute from annotations of
the 2010_001705 item in the train subset
datum transform -t remove_attributes -- \
--id '2010_001705:train' --attr 'occluded'
random_split
Joins all subsets into one and splits the result into a few parts. It is expected that item ids are unique and subset ratios sum up to 1.
Usage:
random_split [-h] [-s SPLITS] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -s, --subset (str; repeatable) - Subsets in the form: ‘<subset>:<ratio>’ (default: {train: 0.67, test: 0.33})
- --seed (int) - Random seed
Example:
Split a dataset randomly into train and test subsets, ratio is 2:1
datum transform -t random_split -- --subset train:.67 --subset test:.33
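A ratio-based random split can be sketched in Python as follows. The random_split function here is an illustrative stand-in rather than Datumaro's implementation; computing subset boundaries from cumulative ratios is a design choice of this sketch.

```python
import random


def random_split(item_ids, splits, seed=None):
    """splits: list of (subset_name, ratio) pairs whose ratios sum to 1."""
    total = sum(ratio for _, ratio in splits)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"ratios must sum to 1, got {total}")
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)  # a fixed seed makes the split repeatable
    result, start, cumulative = {}, 0, 0.0
    for name, ratio in splits:
        cumulative += ratio
        # Boundaries come from the cumulative ratio so that every item
        # lands in exactly one subset
        end = round(cumulative * len(ids))
        result[name] = ids[start:end]
        start = end
    return result


parts = random_split(range(9), [("train", 0.67), ("test", 0.33)], seed=0)
print({name: len(ids) for name, ids in parts.items()})
# -> {'train': 6, 'test': 3}
```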
split
Splits a dataset for model training, using task information:
- classification splits
Splits dataset into subsets (train/val/test) in class-wise manner. Splits dataset images in the specified ratio, keeping the initial class distribution.
- detection & segmentation splits
Each image can have multiple object annotations - bbox, mask, polygon. Since an image shouldn’t be included in multiple subsets at the same time, and image annotations shouldn’t be split, in general, dataset annotations are unlikely to be split exactly in the specified ratio. This split tries to split dataset images as close as possible to the specified ratio, keeping the initial class distribution.
- reidentification splits
In this task, the test set should consist of images of people or objects unseen during the training phase. This function splits a dataset in the following way:
  - Splits the dataset into train + val and test sets based on person or object ID.
  - Splits the test set into test-gallery and test-query sets in class-wise manner.
  - Splits the train + val set into train and val sets in the same way.
The final subsets would be train, val, test-gallery and test-query.
Notes:
- Each image is expected to have only one Annotation. Unlabeled or multi-labeled images will be split into subsets randomly.
- If Labels also have attributes, also splits by attribute values.
- If there are not enough images in some class or attributes group, the split ratio can’t be guaranteed.
In reidentification task,
- Object ID can be described by Label, or by attribute (--attr parameter)
- The splits of the test set are controlled by the --query parameter. Gallery ratio would be 1.0 - query.
Usage:
split [-h] [-t {classification,detection,segmentation,reid}]
[-s SPLITS] [--query QUERY] [--attr ATTR_FOR_ID] [--seed SEED]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -t, --task (one of: classification, detection, segmentation, reid) - Dataset task (default: classification)
- -s, --subset (str; repeatable) - Subsets in the form: ‘<subset>:<ratio>’ (default: {train: 0.5, val: 0.2, test: 0.3})
- --query (float) - Query ratio in the test set (default: 0.5)
- --attr (str) - Attribute name representing the ID (default: use label)
- --seed (int) - Random seed
Example:
datum transform -t split -- -t classification \
--subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- -t detection \
--subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- -t segmentation \
--subset train:.5 --subset val:.2 --subset test:.3
datum transform -t split -- -t reid \
--subset train:.5 --subset val:.2 --subset test:.3 --query .5
Example: use the person_id attribute for splitting
datum transform -t split -- -t detection --attr person_id
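The class-wise idea behind the classification split can be sketched in Python. This is a simplified illustration assuming one label per item; the classwise_split function is hypothetical and does not reproduce Datumaro's handling of attributes, multi-labeled items or the detection and re-identification tasks.

```python
import random
from collections import defaultdict


def classwise_split(labels_by_item, splits, seed=None):
    """labels_by_item: item id -> single label; splits: (name, ratio) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in labels_by_item.items():
        by_label[label].append(item_id)
    result = {name: [] for name, _ in splits}
    # Split every class separately so each subset keeps the class distribution
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        start, cumulative = 0, 0.0
        for name, ratio in splits:
            cumulative += ratio
            end = round(cumulative * len(ids))
            result[name].extend(ids[start:end])
            start = end
    return result


labels = {f"cat{i}": "cat" for i in range(10)}
labels.update({f"dog{i}": "dog" for i in range(20)})
parts = classwise_split(labels, [("train", 0.5), ("val", 0.2), ("test", 0.3)], seed=1)
print({name: len(ids) for name, ids in parts.items()})
# -> {'train': 15, 'val': 6, 'test': 9}
```

With few items in a class, rounding makes the exact ratio unreachable, which matches the note that the split ratio can't always be guaranteed.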
map_subsets
Renames subsets in the dataset.
Usage:
map_subsets [-h] [-s MAPPING]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -s, --subset (str; repeatable) - Subset mapping of the form: src:dst
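The src:dst mapping can be illustrated with a short Python sketch. The map_subsets function and the tuple-based item layout are hypothetical, and leaving unmapped subsets unchanged is an assumption of this example.

```python
def map_subsets(items, mapping_args):
    """items: list of (item_id, subset); mapping_args: 'src:dst' strings."""
    mapping = dict(arg.split(":", 1) for arg in mapping_args)
    # Assumption: subsets without a mapping keep their original name
    return [(item_id, mapping.get(subset, subset)) for item_id, subset in items]


items = [("img1", "train"), ("img2", "valid"), ("img3", "test")]
print(map_subsets(items, ["valid:val"]))
# -> [('img1', 'train'), ('img2', 'val'), ('img3', 'test')]
```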
remap_labels
Changes labels in the dataset.
A label can be:
- renamed (and joined with existing) - when --label <old_name>:<new_name> is specified
- deleted - when --label <name>: is specified, or default action is delete and the label is not mentioned in the list. When a label is deleted, all the associated annotations are removed
- kept unchanged - when --label <name>:<name> is specified, or default action is keep and the label is not mentioned in the list
Annotations with no label are managed by the default action policy.
Usage:
remap_labels [-h] [-l MAPPING] [--default {keep,delete}]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -l, --label (str; repeatable) - Label in the form of: <src>:<dst>
- --default (one of: keep, delete) - Action for unspecified labels (default: keep)
Examples:
Remove the person label (and corresponding annotations):
datum transform -t remap_labels -- -l person: --default keep
Rename person to pedestrian and human to pedestrian, join annotations
that had different classes under the same class id for pedestrian,
don’t touch other classes:
datum transform -t remap_labels -- \
-l person:pedestrian -l human:pedestrian --default keep
Rename person to car and cat to dog, keep bus, remove others:
datum transform -t remap_labels -- \
-l person:car -l bus:bus -l cat:dog --default delete
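The rename, delete and keep rules can be sketched as a small Python function. It is an illustration of the rules described in this section, not Datumaro's code; the remap_labels function name and the use of None for a deleted label are assumptions.

```python
def remap_labels(labels, mapping_args, default="keep"):
    """Return old label name -> new label name (None means deleted)."""
    if default not in ("keep", "delete"):
        raise ValueError(f"unknown default action: {default}")
    mapping = dict(arg.split(":", 1) for arg in mapping_args)
    result = {}
    for name in labels:
        if name in mapping:
            # '<name>:' has an empty destination, which deletes the label
            result[name] = mapping[name] or None
        else:
            result[name] = name if default == "keep" else None
    return result


print(remap_labels(["person", "bus", "cat", "truck"],
                   ["person:car", "bus:bus", "cat:dog"], default="delete"))
# -> {'person': 'car', 'bus': 'bus', 'cat': 'dog', 'truck': None}
```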
project_labels
Changes the order of labels in the dataset from the existing to the desired one, removes unknown labels and adds new labels. Updates or removes the corresponding annotations.
Labels are matched by name (case-sensitive). Parent labels are only kept if they are present in the resulting set of labels. If new labels are added and the dataset has mask colors defined, the new labels will obtain generated colors.
Useful for merging similar datasets, whose labels need to be aligned.
Usage:
project_labels [-h] [-l DST_LABELS]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- -l, --label (str; repeatable) - Label name (ordered)
Examples:
Set dataset labels to [person, cat, dog], remove others, add missing.
Original labels (for example): cat, dog, elephant, human.
New labels: person (added), cat (kept), dog (kept).
datum transform -t project_labels -- -l person -l cat -l dog
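The index remapping implied by this example can be sketched in Python. The project_labels function below is a hypothetical illustration, not Datumaro's API; representing a removed label with None is an assumption.

```python
def project_labels(src_labels, dst_labels):
    """Return the new label list and a map of old index -> new index.

    Labels are matched by exact name; an old label that is absent from
    dst_labels maps to None, meaning its annotations would be removed.
    """
    new_index = {name: i for i, name in enumerate(dst_labels)}
    index_map = {i: new_index.get(name) for i, name in enumerate(src_labels)}
    return list(dst_labels), index_map


labels, index_map = project_labels(
    ["cat", "dog", "elephant", "human"], ["person", "cat", "dog"])
print(index_map)
# -> {0: 1, 1: 2, 2: None, 3: None}
```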
shapes_to_boxes
Converts spatial annotations (masks, polygons, polylines, points) to enclosing bounding boxes.
Usage:
shapes_to_boxes [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
Example: convert spatial annotations between different types
datum transform -t boxes_to_masks
datum transform -t masks_to_polygons
datum transform -t polygons_to_masks
datum transform -t shapes_to_boxes
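Computing an enclosing bounding box for a polygon, polyline or point set can be sketched as follows. The enclosing_box function, the flat coordinate list and the (x, y, w, h) output layout are assumptions made for this illustration.

```python
def enclosing_box(points):
    """points: flat [x0, y0, x1, y1, ...] list. Returns (x, y, w, h)."""
    xs, ys = points[0::2], points[1::2]
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0, max(ys) - y0)


# A triangle with vertices (1, 1), (5, 2) and (3, 6)
print(enclosing_box([1, 1, 5, 2, 3, 6]))
# -> (1, 1, 4, 5)
```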
boxes_to_masks
Converts bounding boxes to masks.
Usage:
boxes_to_masks [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
polygons_to_masks
Converts polygons to masks.
Usage:
polygons_to_masks [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
masks_to_polygons
Converts masks to polygons.
Usage:
masks_to_polygons [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
anns_to_labels
Collects all labels from annotations (of all types) and transforms
them into a set of annotations of type Label.
Usage:
anns_to_labels [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
merge_instance_segments
Replaces instance masks and, optionally, polygons with a single mask. A group of annotations with the same group id is considered an “instance”. The largest annotation in the group is considered the group “head”, so the resulting mask takes properties from that annotation.
Usage:
merge_instance_segments [-h] [--include-polygons]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
- --include-polygons (flag) - Include polygons
crop_covered_segments
Sorts polygons and masks (“segments”) according to z_order,
crops covered areas of underlying segments. If a segment is split
into several independent parts by the segments above, produces
the corresponding number of separate annotations joined into a group.
Usage:
crop_covered_segments [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit
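The cropping step can be sketched in Python with segments represented as sets of pixels. This is a simplified illustration under that assumption: the crop_covered function is hypothetical, and splitting a cropped segment into separate grouped parts is not shown.

```python
def crop_covered(segments):
    """segments: list of (z_order, set of (x, y) pixels); higher z is on top.

    Returns the segments in ascending z_order with covered pixels removed.
    """
    ordered = sorted(segments, key=lambda s: s[0], reverse=True)
    covered, result = set(), []
    for z, pixels in ordered:
        visible = pixels - covered  # drop pixels hidden by segments above
        covered |= pixels
        result.append((z, visible))
    return result[::-1]


bottom = (0, {(0, 0), (1, 0), (2, 0)})
top = (1, {(1, 0)})
print(crop_covered([bottom, top]))
```

In this example the bottom segment loses its middle pixel and is left with two disconnected parts.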
bbox_values_decrement
Subtracts one from the coordinates of bounding boxes.
Usage:
bbox_values_decrement [-h]
Optional arguments:
- -h, --help (flag) - Show this help message and exit