Validate Dataset

This command inspects annotations with respect to the task type and stores the results in a JSON file.

The supported task types are classification, detection, and segmentation; select one with the -t/--task-type parameter.

The validation result contains:

- annotation statistics based on the task type
- validation reports, such as
  - items not having annotations
  - items having undefined annotations
  - imbalanced distribution in class/attributes
  - too small or large values
- summary

Usage:

datum validate [-h] -t TASK [-s SUBSET_NAME] [-p PROJECT_DIR]
  [target] [-- EXTRA_ARGS]

Parameters:

<target> (string) - Target dataset revpath. By default, validates the current project.
-t, --task-type (string) - Task type for validation
-s, --subset (string) - Dataset subset to be validated
-p, --project (string) - Directory of the project to operate on (default: current directory).
-h, --help - Print the help message and exit.
<extra args> - The list of extra validation parameters. Should be passed after the -- separator after the main command arguments:
- -fs, --few-samples-thr (number) - The threshold for warning about the minimum number of samples per class
- -ir, --imbalance-ratio-thr (number) - The threshold for warning about imbalanced data
- -m, --far-from-mean-thr (number) - The threshold for warning that data is far from the mean
- -dr, --dominance-ratio-thr (number) - The threshold for warning about bounding box imbalance
- -k, --topk-bins (number) - The ratio of the bins with the highest number of data to the total number of bins in the histogram
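
When the command is launched from a script, the order of the parameters listed above matters: the extra validation parameters must come after the -- separator. The following Python sketch assembles such an argument list. The helper name build_validate_command is hypothetical, and it only uses the flags shown in the usage line.

```python
from typing import List, Optional, Sequence


def build_validate_command(
    task_type: str,
    target: Optional[str] = None,
    subset: Optional[str] = None,
    project_dir: Optional[str] = None,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list for `datum validate` (hypothetical helper)."""
    cmd = ["datum", "validate", "-t", task_type]
    if subset is not None:
        cmd += ["-s", subset]
    if project_dir is not None:
        cmd += ["-p", project_dir]
    if target is not None:
        cmd.append(target)
    if extra_args:
        # Extra validation parameters go after the "--" separator.
        cmd += ["--", *extra_args]
    return cmd


print(build_validate_command("classification", project_dir="prj/", extra_args=["-ir", "40"]))
```

The resulting list can be passed to subprocess.run(cmd, check=True) without shell quoting.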

Example: give a warning when the imbalance ratio of the data exceeds 40 for a classification task

datum validate -p prj/ -t classification -- -ir 40
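
To make the threshold in the example above concrete, the following Python sketch computes an imbalance ratio from per-label counts shaped like the defined_labels statistic. It assumes the ratio is the largest count divided by the smallest. This definition is an assumption for illustration and may differ from what the validator actually computes. The function name and the sample counts are hypothetical.

```python
from typing import Dict, Optional


def imbalance_ratio(label_counts: Dict[str, int]) -> Optional[float]:
    """Return max count / min count, or None when the ratio is undefined.

    Assumed definition for illustration; not taken from the tool itself.
    """
    counts = list(label_counts.values())
    if not counts or min(counts) == 0:
        return None
    return max(counts) / min(counts)


# Hypothetical counts in the shape of 'defined_labels': <label>: <count>
defined_labels = {"cat": 820, "dog": 400, "bird": 20}
threshold = 40  # same value as passed via -ir in the example above

ratio = imbalance_ratio(defined_labels)
if ratio is not None and ratio > threshold:
    print(f"imbalanced: ratio {ratio:.1f} exceeds {threshold}")
```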

Here is the list of validation items (a.k.a. anomaly types):

Anomaly Type	Description	Task Type
MissingLabelCategories	Metadata (e.g. LabelCategories) should be defined	common
MissingAnnotation	No annotation found for an Item	common
MissingAttribute	An attribute key is missing for an Item	common
MultiLabelAnnotations	Item needs a single label	classification
UndefinedLabel	A label not defined in the metadata is found for an item	common
UndefinedAttribute	An attribute not defined in the metadata is found for an item	common
LabelDefinedButNotFound	A label is defined but not found in the dataset	common
AttributeDefinedButNotFound	An attribute is defined but not found in the dataset	common
OnlyOneLabel	The dataset consists of only one label	common
OnlyOneAttributeValue	The dataset consists of only one attribute value	common
FewSamplesInLabel	The number of samples in a label might be too low	common
FewSamplesInAttribute	The number of samples in an attribute might be too low	common
ImbalancedLabels	There is an imbalance in the label distribution	common
ImbalancedAttribute	There is an imbalance in the attribute distribution	common
ImbalancedDistInLabel	Values (e.g. bbox width) are not evenly distributed for a label	detection, segmentation
ImbalancedDistInAttribute	Values (e.g. bbox width) are not evenly distributed for an attribute	detection, segmentation
NegativeLength	The width or height of a bounding box is negative	detection
InvalidValue	There is an invalid value (e.g. inf, nan) in the bounding box info	detection
FarFromLabelMean	An annotation has a value that is much smaller or larger than the average for a label	detection, segmentation
FarFromAttrMean	An annotation has a value that is much smaller or larger than the average for an attribute	detection, segmentation
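
Each entry in the validation reports carries one of the anomaly types listed above. As a minimal sketch, a Python script could tally the reports per anomaly type. The sample entries below are made up for illustration, including their severities, and use only the field names of the validation result format.

```python
from collections import Counter
from typing import Dict, List


def count_by_anomaly_type(reports: List[Dict[str, str]]) -> Counter:
    """Tally report entries by their 'anomaly_type' field."""
    return Counter(report["anomaly_type"] for report in reports)


# Hypothetical report entries; ids, texts and severities are illustrative.
reports = [
    {"anomaly_type": "MissingAnnotation", "description": "illustrative text",
     "severity": "warning", "item_id": "img_001", "subset": "train"},
    {"anomaly_type": "MissingAnnotation", "description": "illustrative text",
     "severity": "warning", "item_id": "img_002", "subset": "train"},
    {"anomaly_type": "UndefinedLabel", "description": "illustrative text",
     "severity": "error", "item_id": "img_003", "subset": "val"},
]

for anomaly_type, count in count_by_anomaly_type(reports).most_common():
    print(f"{anomaly_type}: {count}")
```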

Validation Result Format:

{
    'statistics': {
        ## common statistics
        'label_distribution': {
            'defined_labels': <dict>,   # <label:str>: <count:int>
            'undefined_labels': <dict>
            # <label:str>: {
            #     'count': <int>,
            #     'items_with_undefined_label': [<item_key>, ]
            # }
        },
        'attribute_distribution': {
            'defined_attributes': <dict>,
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_missing_attribute': [<item_key>, ]
            #     }
            # }
            'undefined_attributes': <dict>
            # <label:str>: {
            #     <attribute:str>: {
            #         'distribution': {<attr_value:str>: <count:int>, },
            #         'items_with_undefined_attr': [<item_key>, ]
            #     }
            # }
        },
        'total_ann_count': <int>,
        'items_missing_annotation': <list>, # [<item_key>, ]

        ## statistics for classification task
        'items_with_multiple_labels': <list>, # [<item_key>, ]

        ## statistics for detection task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'x', 'y', 'width', 'height',
        #               'area(wxh)', 'ratio(w/h)', 'short', 'long'
        # - 'short' is min(w,h) and 'long' is max(w,h).
        'items_with_negative_length': <dict>,
        # '<item_key>': { <ann_id:int>: { <'width'|'height'>: <value>, }, }
        'bbox_distribution_in_label': <dict>, # <label:str>: <bbox_template>
        'bbox_distribution_in_attribute': <dict>,
        # <label:str>: {<attribute:str>: { <attr_value>: <bbox_template>, }, }
        'bbox_distribution_in_dataset_item': <dict>,
        # '<item_key>': <bbox count:int>

        ## statistics for segmentation task
        'items_with_invalid_value': <dict>,
        # '<item_key>': {<ann_id:int>: [ <property:str>, ], }
        # - properties: 'area', 'width', 'height'
        'mask_distribution_in_label': <dict>, # <label:str>: <mask_template>
        'mask_distribution_in_attribute': <dict>,
        # <label:str>: {
        #     <attribute:str>: { <attr_value>: <mask_template>, }
        # }
        'mask_distribution_in_dataset_item': <dict>,
        # '<item_key>': <mask/polygon count: int>
    },
    'validation_reports': <list>, # [ <validation_error_format>, ]
    # validation_error_format = {
    #     'anomaly_type': <str>,
    #     'description': <str>,
    #     'severity': <str>, # 'warning' or 'error'
    #     'item_id': <str>,  # optional, when it is related to a DatasetItem
    #     'subset': <str>,   # optional, when it is related to a DatasetItem
    # }
    'summary': {
        'errors': <count: int>,
        'warnings': <count: int>
    }
}
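
Because the result is stored as JSON, it can be post-processed with any JSON library. The Python sketch below recounts the severities in validation_reports and compares them with the summary. The trimmed sample result is invented, and the expectation that the summary counts equal the per-severity totals is an assumption for illustration.

```python
import json
from collections import Counter
from typing import Any, Dict


def load_report(path: str) -> Dict[str, Any]:
    """Read a validation result stored as JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def recount_summary(result: Dict[str, Any]) -> Dict[str, int]:
    """Recompute error/warning totals from the 'validation_reports' list."""
    severities = Counter(r["severity"] for r in result["validation_reports"])
    return {"errors": severities["error"], "warnings": severities["warning"]}


# Hypothetical, heavily trimmed result used in place of a real file.
result = {
    "statistics": {"total_ann_count": 3},
    "validation_reports": [
        {"anomaly_type": "UndefinedLabel", "description": "illustrative",
         "severity": "error"},
        {"anomaly_type": "FewSamplesInLabel", "description": "illustrative",
         "severity": "warning"},
        {"anomaly_type": "ImbalancedLabels", "description": "illustrative",
         "severity": "warning"},
    ],
    "summary": {"errors": 1, "warnings": 2},
}

print(recount_summary(result) == result["summary"])
```

With a real report, result would instead come from load_report(path), where path points to the JSON file written by the command.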

item_key is defined as:

item_key = (<DatasetItem.id:str>, <DatasetItem.subset:str>)
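
Since item_key pairs an item id with its subset, lists of item keys can be regrouped per subset. The Python sketch below assumes the keys arrive as two-element sequences (JSON has no tuple type, so a tuple written by a standard JSON encoder is read back as a list). The sample ids are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def group_by_subset(item_keys: Iterable[Sequence[str]]) -> Dict[str, List[str]]:
    """Map each subset name to the ids of the items listed for it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for item_id, subset in item_keys:
        grouped[subset].append(item_id)
    return dict(grouped)


# Hypothetical content of 'items_missing_annotation'.
items_missing_annotation = [
    ["img_001", "train"],
    ["img_007", "val"],
    ["img_042", "train"],
]

print(group_by_subset(items_missing_annotation))
```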

bbox_template and mask_template are defined as:

bbox_template = {
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>,
    'area(wxh)': <numerical_stat_template>,
    'ratio(w/h)': <numerical_stat_template>,
    'short': <numerical_stat_template>, # short = min(w, h)
    'long': <numerical_stat_template>   # long = max(w, h)
}
mask_template = {
    'area': <numerical_stat_template>,
    'width': <numerical_stat_template>,
    'height': <numerical_stat_template>
}
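
The bounding box properties named in bbox_template can all be derived from a box's width and height. A minimal Python sketch follows. The function name is hypothetical, and returning inf for the ratio when the height is zero is an assumption, not the validator's documented behavior.

```python
import math
from typing import Dict


def bbox_properties(w: float, h: float) -> Dict[str, float]:
    """Compute the per-box values that bbox_template collects statistics on."""
    return {
        "width": w,
        "height": h,
        "area(wxh)": w * h,
        # Assumed handling of a zero height; chosen only to avoid a crash.
        "ratio(w/h)": w / h if h != 0 else math.inf,
        "short": min(w, h),  # short = min(w, h)
        "long": max(w, h),   # long = max(w, h)
    }


print(bbox_properties(40.0, 20.0))
```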

numerical_stat_template is defined as:

numerical_stat_template = {
    'items_far_from_mean': <dict>,
    # {'<item_key>': {<ann_id:int>: <value:float>, }, }
    'mean': <float>,
    'stddev': <float>,
    'min': <float>,
    'max': <float>,
    'median': <float>,
    'histogram': {
        'bins': <list>,   # [<float>, ]
        'counts': <list>, # [<int>, ]
    }
}
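
One way to read items_far_from_mean is to express each listed value as a distance from the mean in units of the standard deviation. The Python sketch below does this for a hand-made numerical_stat_template that fills in only the fields it uses. How the validator itself applies --far-from-mean-thr is not described here, so the stddev-based distance is an assumption, and the item keys are treated as opaque strings.

```python
from typing import Any, Dict, List, Tuple


def deviations_in_stddev(stat: Dict[str, Any]) -> List[Tuple[str, int, float]]:
    """Return (item_key, ann_id, (value - mean) / stddev) for every listed value."""
    mean, stddev = stat["mean"], stat["stddev"]
    if stddev == 0:
        # No spread in the data, so a relative distance is undefined.
        return []
    rows = []
    for item_key, anns in stat["items_far_from_mean"].items():
        for ann_id, value in anns.items():
            rows.append((item_key, ann_id, (value - mean) / stddev))
    return rows


# Hand-made example values; not taken from a real report.
width_stat = {
    "items_far_from_mean": {"item_a": {0: 150.0}, "item_b": {3: 0.0}},
    "mean": 50.0,
    "stddev": 10.0,
}

for item_key, ann_id, distance in deviations_in_stddev(width_stat):
    print(f"{item_key} ann {ann_id}: {distance:+.1f} stddev from mean")
```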