Validate Dataset

This command inspects annotations with respect to the task type and stores the results in JSON file.

The task types supported are classification, detection, and segmentation (the -t/--task-type parameter).

The validation result contains

  • annotation statistics based on the task type
  • validation reports, such as
    • items not having annotations
    • items having undefined annotations
    • imbalanced distribution in class/attributes
    • too small or large values
  • summary

Usage:

datum validate [-h] -t TASK [-s SUBSET_NAME] [-p PROJECT_DIR]
  [target] [-- EXTRA_ARGS]

Parameters:

  • <target> (string) - Target dataset revpath. By default, validates the current project.
  • -t, --task-type (string) - Task type for validation
  • -s, --subset (string) - Dataset subset to be validated
  • -p, --project (string) - Directory of the project to operate on (default: current directory).
  • -h, --help - Print the help message and exit.
  • <extra args> - The list of extra validation parameters. Should be passed after the -- separator after the main command arguments:
    • -fs, --few-samples-thr (number) - The threshold for giving a warning for minimum number of samples per class
    • -ir, --imbalance-ratio-thr (number) - The threshold for giving imbalance data warning
    • -m, --far-from-mean-thr (number) - The threshold for giving a warning that data is far from mean
    • -dr, --dominance-ratio-thr (number) - The threshold for giving a warning bounding box imbalance
    • -k, --topk-bins (number) - The ratio of bins with the highest number of data to total bins in the histogram

Example : give warning when imbalance ratio of data with classification task over 40

datum validate -p prj/ -t classification -- -ir 40

Here is the list of validation items(a.k.a. anomaly types).

Anomaly Type Description Task Type
MissingLabelCategories Metadata (ex. LabelCategories) should be defined common
MissingAnnotation No annotation found for an Item common
MissingAttribute An attribute key is missing for an Item common
MultiLabelAnnotations Item needs a single label classification
UndefinedLabel A label not defined in the metadata is found for an item common
UndefinedAttribute An attribute not defined in the metadata is found for an item common
LabelDefinedButNotFound A label is defined, but not found actually common
AttributeDefinedButNotFound An attribute is defined, but not found actually common
OnlyOneLabel The dataset consists of only label common
OnlyOneAttributeValue The dataset consists of only attribute value common
FewSamplesInLabel The number of samples in a label might be too low common
FewSamplesInAttribute The number of samples in an attribute might be too low common
ImbalancedLabels There is an imbalance in the label distribution common
ImbalancedAttribute There is an imbalance in the attribute distribution common
ImbalancedDistInLabel Values (ex. bbox width) are not evenly distributed for a label detection, segmentation
ImbalancedDistInAttribute Values (ex. bbox width) are not evenly distributed for an attribute detection, segmentation
NegativeLength The width or height of bounding box is negative detection
InvalidValue There’s invalid (ex. inf, nan) value for bounding box info. detection
FarFromLabelMean An annotation has an too small or large value than average for a label detection, segmentation
FarFromAttrMean An annotation has an too small or large value than average for an attribute detection, segmentation

Validation Result Format:

{
    'statistics': {
        ## common statistics
        'label_distribution': {
            'defined_labels': <dict>,   # <label:str>: <count:int>
            'undefined_labels': <dict>
            # <label:str>: {
            #     'count': <int>,
            #     'items_with_undefined_label': [<item_key>, ]
            # }
        },
        'attribute_distribution': {
            'defined_attributes': <dict>,
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_missing_attribute': [<item_key>, ]
            #     }
            # }
            'undefined_attributes': <dict>
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_with_undefined_attr': [<item_key>, ]
            #     }
            # }
        },
        'total_ann_count': <int>,
        'items_missing_annotation': <list>, # [<item_key>, ]

        ## statistics for classification task
        'items_with_multiple_labels': <list>, # [<item_key>, ]

        ## statistics for detection task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'x', 'y', 'width', 'height',
        #               'area(wxh)', 'ratio(w/h)', 'short', 'long'
        # - 'short' is min(w,h) and 'long' is max(w,h).
        'items_with_negative_length': <dict>,
        # '<item_key>': { <ann_id:int>: { <'width'|'height'>: <value>, }, }
        'bbox_distribution_in_label': <dict>, # <label:str>: <bbox_template>
        'bbox_distribution_in_attribute': <dict>,
        # <label:str>: {<attribute:str>: { <attr_value>: <bbox_template>, }, }
        'bbox_distribution_in_dataset_item': <dict>,
        # '<item_key>': <bbox count:int>

        ## statistics for segmentation task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'area', 'width', 'height'
        'mask_distribution_in_label': <dict>, # <label:str>: <mask_template>
        'mask_distribution_in_attribute': <dict>,
        # <label:str>: {
        #     <attribute:str>: { <attr_value>: <mask_template>, }
        # }
        'mask_distribution_in_dataset_item': <dict>,
        # '<item_key>': <mask/polygon count: int>
    },
    'validation_reports': <list>, # [ <validation_error_format>, ]
    # validation_error_format = {
    #     'anomaly_type': <str>,
    #     'description': <str>,
    #     'severity': <str>, # 'warning' or 'error'
    #     'item_id': <str>,  # optional, when it is related to a DatasetItem
    #     'subset': <str>,   # optional, when it is related to a DatasetItem
    # }
    'summary': {
        'errors': <count: int>,
        'warnings': <count: int>
    }
}

item_key is defined as,

item_key = (<DatasetItem.id:str>, <DatasetItem.subset:str>)

bbox_template and mask_template are defined as,

bbox_template = {
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>,
    'area(wxh)': <numerical_stat_template>,
    'ratio(w/h)': <numerical_stat_template>,
    'short': <numerical_stat_template>, # short = min(w, h)
    'long': <numerical_stat_template>   # long = max(w, h)
}
mask_template = {
    'area': <numerical_stat_template>,
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>
}

numerical_stat_template is defined as,

numerical_stat_template = {
    'items_far_from_mean': <dict>,
    # {'<item_key>': {<ann_id:int>: <value:float>, }, }
    'mean': <float>,
    'stddev': <float>,
    'min': <float>,
    'max': <float>,
    'median': <float>,
    'histogram': {
        'bins': <list>,   # [<float>, ]
        'counts': <list>, # [<int>, ]
    }
}