Download datasets

This command downloads a publicly available dataset and saves it to a local directory. Its syntax is similar to that of convert, but it takes a dataset ID as the source instead of a local directory. Run this command with --help to list the supported datasets and output formats.

Currently, the only source of datasets is the TensorFlow Datasets (TFDS) library. To use this command, you must therefore install TensorFlow and TFDS, which you can do as follows:

pip install "datumaro[tf,tfds]"

To download through a proxy, set the conventional curl environment variables (such as http_proxy, https_proxy and no_proxy).
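A minimal sketch of a proxied download follows. The proxy address is hypothetical and must be replaced with your own; the variable names are the ones curl conventionally reads, and the download is only attempted if datum is installed.

```shell
# Hypothetical proxy address, used for illustration only.
export http_proxy="http://proxy.example.com:3128"
export https_proxy="http://proxy.example.com:3128"
# Hosts that should bypass the proxy.
export no_proxy="localhost,127.0.0.1"

# Run the download only when the datum CLI is available.
if command -v datum >/dev/null 2>&1; then
    datum download -i tfds:mnist
fi
```

The variables apply only to the current shell session and the processes started from it.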

Usage:

datum download [-h] -i DATASET_ID [-f OUTPUT_FORMAT] [-o DST_DIR]
               [--overwrite] [-s SUBSET] [-- EXTRA_EXPORT_ARGS]

Parameters:

  • -h, --help - Print the help message and exit.
  • -i, --dataset-id (string) - ID of the dataset to download.
  • -f, --output-format (string) - Output format. By default, the format of the original dataset is used.
  • -o, --output-dir (string) - Output directory. By default, a subdirectory in the current directory is used.
  • --overwrite - Allow overwriting existing files in the output directory if it is not empty.
  • -s, --subset (string) - Which subset of the dataset to save. By default, all subsets are saved. Note that due to limitations of TFDS, all subsets are downloaded even if this option is specified.
  • -- <extra export args> - Additional arguments for the format writer (use -- -h for help). Must be specified after the main command arguments.
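The two help requests mentioned above can be sketched as follows. The dataset ID and format are taken from the example in this section. This assumes the writer help is requested together with the required -i option, which is not confirmed here.

```shell
# Values used for illustration; taken from the MNIST example in this section.
DATASET_ID="tfds:mnist"
OUTPUT_FORMAT="imagenet_txt"

# Run only when the datum CLI is available.
if command -v datum >/dev/null 2>&1; then
    # List the supported dataset IDs and output formats.
    datum download --help
    # Show the extra arguments accepted by the selected format writer.
    datum download -i "$DATASET_ID" -f "$OUTPUT_FORMAT" -- -h
fi
```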

Example: download the MNIST dataset, saving it in the ImageNet text format:

datum download -i tfds:mnist -f imagenet_txt -- --save-images
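As a further illustration, the sketch below combines the remaining options: it saves only one subset into an explicit output directory and allows overwriting an earlier download. The directory name mnist_train is hypothetical, and train is assumed to be the subset name exposed for MNIST.

```shell
# Illustrative values; DST_DIR is a hypothetical directory name.
DATASET_ID="tfds:mnist"
SUBSET="train"
DST_DIR="mnist_train"

if command -v datum >/dev/null 2>&1; then
    # No -f is given, so the format of the original dataset is used.
    datum download -i "$DATASET_ID" -o "$DST_DIR" -s "$SUBSET" --overwrite
else
    echo "datum not found; install it with: pip install \"datumaro[tf,tfds]\"" >&2
fi
```

Because of the TFDS limitation noted for the subset option, this still downloads every subset and only restricts what is saved.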