Filter datasets
This command allows to extract a sub-dataset from a dataset. The new dataset includes only items satisfying some condition. The XML XPath is used as a query format.
The command can be applied to a dataset or a project build target,
a stage or the combined project
target, in which case all the project
targets will be affected. A build tree stage will be recorded
if --stage
is enabled, and the resulting dataset(-s) will be
saved if --apply
is enabled.
By default, datasets are updated in-place. The -o/--output-dir
option can be used to specify another output directory. When
updating in-place, use the --overwrite
parameter (in-place
updates fail by default to prevent data loss), unless a project
target is modified.
The current project (-p/--project
) is also used as a context for
plugins, so it can be useful for dataset paths having custom formats.
When not specified, the current project’s working tree is used.
There are several filtering modes available (the -m/--mode
parameter).
Supported modes:
i
,items
a
,annotations
i+a
,a+i
,items+annotations
,annotations+items
When filtering annotations, use the items+annotations
mode to point that annotation-less dataset items should be
removed, otherwise they will be kept in the resulting dataset.
To select an annotation, write an XPath that returns annotation
elements (see examples).
Item representations can be printed with the --dry-run
parameter:
<item>
<id>290768</id>
<subset>minival2014</subset>
<image>
<width>612</width>
<height>612</height>
<depth>3</depth>
</image>
<annotation>
<id>80154</id>
<type>bbox</type>
<label_id>39</label_id>
<x>264.59</x>
<y>150.25</y>
<w>11.19</w>
<h>42.31</h>
<area>473.87</area>
</annotation>
<annotation>
<id>669839</id>
<type>bbox</type>
<label_id>41</label_id>
<x>163.58</x>
<y>191.75</y>
<w>76.98</w>
<h>73.63</h>
<area>5668.77</area>
</annotation>
...
</item>
The command can only be applied to a project build target, a stage or the
combined project
target, in which case all the targets will be affected.
A build tree stage will be added if --stage
is enabled, and the resulting
dataset(-s) will be saved if --apply
is enabled.
Usage:
datum filter [-h] [-e FILTER] [-m MODE] [--dry-run] [--stage STAGE]
[--apply APPLY] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR] [target]
Parameters:
<target>
(string) - Target dataset revpath. By default, filters all targets of the current project.-e, --filter
(string) - XML XPath filter expression for dataset items-m, --mode
(string) - The filtering mode. Default is thei
mode.--dry-run
- Print XML representations of the filtered dataset and exit.--stage
(bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, allowing to reproduce the resulting dataset later. Applicable only to main project targets (i.e. data sources and theproject
target, but not intermediate stages). Enabled by default.--apply
(bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default.-o, --output-dir
(string) - Output directory. Can be omitted for main project targets (i.e. data sources and theproject
target, but not intermediate stages) and dataset targets. If not specified, the results will be saved inplace.--overwrite
- Allows to overwrite existing files in the output directory, when it is specified and is not empty.-p, --project
(string) - Directory of the project to operate on (default: current directory).-h, --help
- Print the help message and exit.
Example: extract a dataset with images with width
< height
datum filter \
-p test_project \
-e '/item[image/width < image/height]'
Example: extract a dataset with images of the train
subset
datum filter \
-p test_project \
-e '/item[subset="train"]'
Example: extract a dataset with only large annotations of the cat
class and
any non-persons
datum filter \
-p test_project \
--mode annotations \
-e '/item/annotation[(label="cat" and area > 99.5) or label!="person"]'
Example: extract a dataset with non-occluded annotations, remove empty images. Use data only from the “s1” source of the project.
datum create
datum import --format voc -i <path/to/dataset1/> --name s1
datum import --format voc -i <path/to/dataset2/> --name s2
datum filter s1 \
-m i+a -e '/item/annotation[occluded="False"]'