Detect dataset format

This command attempts to detect the format of a dataset in a directory. Currently, only local directories are supported.

The detection result may be one of:

  • a single format being detected;
  • no formats being detected (if the dataset doesn’t match any known format);
  • multiple formats being detected (if the dataset is ambiguous).

The command outputs this result in a human-readable form and optionally as a machine-readable JSON report (see --json-report).

The format of the machine-readable report is as follows:

{
    "detected_formats": [
        "detected-format-name-1", "detected-format-name-2", ...
    ],
    "rejected_formats": {
        "rejected-format-name-1": {
            "reason": <reason-code>,
            "message": "line 1\nline 2\n...\nline N"
        },
        "rejected-format-name-2": ...,
        ...
    }
}

The <reason-code> can be one of:

  • "detection_unsupported": the corresponding format does not support detection.

  • "insufficient_confidence": the dataset matched the corresponding format, but it matched at least one other format better.

  • "unmet_requirements": the dataset didn’t meet at least one requirement of the corresponding format.

Other reason codes may be defined in the future.

Usage:

datum detect-format [-h] [-p PROJECT_DIR] [--show-rejections]
                    [--json-report JSON_REPORT]
                    url

Parameters:

  • <url> - Path to the dataset to analyse.
  • -h, --help - Print the help message and exit.
  • -p, --project (string) - Directory of the project to use as the context (default: current directory). The project might contain local plugins with custom formats, which will be used for detection.
  • --show-rejections - Describe why each supported format that wasn’t detected was rejected. This only affects the human-readable output; the machine-readable report always includes rejection information.
  • --json-report (string) - Path to which to save a JSON report describing detected and rejected formats. By default, no report is saved.

Example: detect the format of a dataset in a given directory, showing rejection information:

datum detect-format --show-rejections path/to/dataset