Evaluation metrics task force
- Classification
- Binary / Probabilistic / Multi Label segmentation
- Regression / Generative tasks
- Object detection
-
Comparison possible across architectures -> network agnostic
-
Most of the metrics can be used in different settings with different input types (1D, 2D, 3D arrays); the implementation needs to accommodate these different contexts.
-
Confusion matrix based evaluation measures should be available for both classification and segmentation tasks
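A minimal numpy sketch of why the confusion matrix transfers between the two tasks: flattening the input first makes the same counting code work for a vector of classification labels and for a 2D/3D segmentation mask (function name and API are illustrative, not a committed design):

```python
import numpy as np

def confusion_counts(ref, pred):
    """Return (TP, FP, FN, TN) for binary arrays of any dimensionality.

    Flattening first makes the same counting code work for a 1D vector of
    classification labels or a 2D/3D segmentation mask.
    """
    ref = np.asarray(ref).ravel().astype(bool)
    pred = np.asarray(pred).ravel().astype(bool)
    tp = np.sum(ref & pred)
    fp = np.sum(~ref & pred)
    fn = np.sum(ref & ~pred)
    tn = np.sum(~ref & ~pred)
    return tp, fp, fn, tn
```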
-
Binary vs Probabilistic
- For segmentation - binary is priority
- Probabilistic outputs transformed into multiple binary masks
- Measures at population level
- Volume correlations for instance
- Compare to baseline
-
Need for a report
- Include result image by image (ordered by metric with associated percentile performance over full set)
- Include aggregation of results (with a flag to aggregate specific groups of loads / characteristics)
- Choose a set of metrics always to be included in the report (core metrics); allow adding optional ones
- Basic statistics for aggregation over population
- Need guidelines on aggregation of metrics for reporting
- Implementation of the "rank" analysis (cf Medical Image Decathlon)
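A simplified sketch of the "rank then aggregate" idea referenced above: rank the competing algorithms on every case, then average the per-case ranks. Ties are broken arbitrarily here; a real implementation would need a proper tie rule (e.g. mean ranks):

```python
import numpy as np

def rank_analysis(scores, higher_is_better=True):
    """Per-case 'rank then aggregate' scheme (simplified sketch).

    scores: (n_cases, n_algorithms) array of metric values.
    Returns the mean rank of each algorithm (1 = best).
    Ties are broken arbitrarily; a full implementation would use
    a tie-aware ranking such as scipy.stats.rankdata.
    """
    scores = np.asarray(scores, dtype=float)
    if higher_is_better:
        scores = -scores
    # argsort twice yields 0-based ranks per row; +1 makes rank 1 the best
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)
```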
-
Importance of the documentation
- The documentation should include the contexts in which each evaluation measure is appropriate
- Synonyms and direct transformations to other usual measures should be considered
- Indication of similar / highly correlated metrics
-
Thresholds for different subgroups of evaluation (small - medium - large)
- Current problem is that we don’t know what the most appropriate thresholds are - they need to be settable as an option, with a suggested default
- Relevant for segmentation and object detection
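A sketch of exposing the subgroup thresholds as an option with a suggested default. The voxel cut-offs below are placeholders, since the appropriate values are exactly the open question noted above:

```python
import numpy as np

# Hypothetical default size thresholds (in voxels). The appropriate values
# are an open question, so they are exposed as an option with a default.
DEFAULT_SIZE_BINS = {"small": (0, 100), "medium": (100, 1000), "large": (1000, np.inf)}

def size_group(volume_voxels, bins=DEFAULT_SIZE_BINS):
    """Assign a lesion/object volume to a small/medium/large subgroup."""
    for name, (lo, hi) in bins.items():
        if lo <= volume_voxels < hi:
            return name
    return None
```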
-
Definition of specific cases of probabilistic outputs / multi thresholds analysis and multi label aggregation
- Probabilistic outputs
- Fuzzy metrics
- Choosing a set of thresholds and applying the binary metric at each
- Choosing different operating points on the ROC curve
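One way to realise "choose a set of thresholds and apply the binary metric at each": binarise the probabilistic map at every threshold and score each mask. Dice is used here purely as an example metric, and the threshold set is an assumption:

```python
import numpy as np

def dice(ref, pred, eps=1e-8):
    """Binary Dice overlap; eps keeps the empty/empty case finite."""
    ref, pred = np.asarray(ref, bool), np.asarray(pred, bool)
    return (2.0 * np.sum(ref & pred) + eps) / (ref.sum() + pred.sum() + eps)

def evaluate_at_thresholds(ref, prob, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Binarise a probabilistic output at each threshold and apply the
    binary metric to each resulting mask."""
    return {t: dice(ref, np.asarray(prob) >= t) for t in thresholds}
```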
-
Multilabel
- Specific multilabel metrics
- Cost according to distance -> needs additional input
- Cost of confusion -> providing typical ways of conveying the information
- Weighted / micro or macro metrics
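A sketch of the micro vs macro distinction for a multilabel Dice: micro averaging pools the overlap counts over all labels before dividing, while macro averaging computes a per-label score and takes the unweighted mean (illustrative code, not a fixed API):

```python
import numpy as np

def multilabel_dice(ref, pred, labels, average="macro"):
    """Dice over integer label maps with micro or macro averaging.

    micro: pool intersection/size counts over all labels, then divide.
    macro: per-label Dice, then unweighted mean (undefined per-label
    scores, when a label is absent from both maps, would need an
    explicit policy - see the NaN-handling discussion below).
    """
    ref, pred = np.asarray(ref), np.asarray(pred)
    inters = np.array([np.sum((ref == l) & (pred == l)) for l in labels], float)
    sizes = np.array([np.sum(ref == l) + np.sum(pred == l) for l in labels], float)
    if average == "micro":
        return 2.0 * inters.sum() / sizes.sum()
    return np.mean(2.0 * inters / sizes)
```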
-
Investigation on strategies of aggregation at population level
- Volumetric correlation
- Correlation with clinical status/measure
- Aggregating in specific groups (e.g. performance across specific lesion loads)
- Statistics over results (mean / median, std / IQR, min / max, 5% and 95% percentiles) with associated case ID -> important to suggest publication of average, best and worst results
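The statistics bullet above could be sketched like this, keeping the case IDs of the 5% / 95% percentile cases so near-worst and near-best images can be reported alongside the aggregate numbers (percentile-case selection is deliberately simplified):

```python
import numpy as np

def population_summary(case_ids, values):
    """Summary statistics over per-case results, with the case IDs of the
    cases sitting near the 5% and 95% percentiles of the metric."""
    values = np.asarray(values, float)
    order = np.argsort(values)
    i05 = order[int(0.05 * (len(values) - 1))]          # near-worst case
    i95 = order[int(round(0.95 * (len(values) - 1)))]   # near-best case
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": values.mean(), "median": float(np.median(values)),
        "std": values.std(), "iqr": q3 - q1,
        "min": values.min(), "max": values.max(),
        "p05_case": case_ids[i05], "p95_case": case_ids[i95],
    }
```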
-
Literature on the different metrics
- https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-015-0068-x
- https://www.sciencedirect.com/science/article/pii/S0169260709001424
- https://ieeexplore.ieee.org/document/1616166
- https://d1wqtxts1xzle7.cloudfront.net/37219940/5215ijdkp01.pdf
-
List of existing sources for implementation
Still in progress:
- Definition of what to do in edge cases (metric not defined), NaN values
- Specific tasks with particular metrics: tractography, vessel segmentation
- Metrics for assessment of distributions / evaluation of uncertainty
- Allow the evaluation suite to take in a single pair of ref/seg images or a folder of matching pairs (by subject name)
- Use np arrays in memory for the different images (i.e. not forcing everything to be in folders)
- Develop util functions to allow folder / file loading to memory
- Computation should be on CPU - ensure torch tensors are converted back to numpy arrays
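Two small utility sketches for the input-handling points above: normalising inputs to in-memory numpy arrays (accepting torch-like tensors without importing torch) and pairing ref/seg files by name. The matching-basename convention is an assumption:

```python
import os
import numpy as np

def as_numpy(img):
    """Normalise inputs to an in-memory numpy array.

    Accepts numpy arrays directly and anything exposing .cpu()/.numpy()
    (e.g. a torch tensor), so computation stays on CPU and callers are
    not forced to go through folders on disk.
    """
    if isinstance(img, np.ndarray):
        return img
    if hasattr(img, "cpu") and hasattr(img, "numpy"):
        return img.cpu().numpy()  # torch-like tensor -> CPU numpy array
    return np.asarray(img)

def pair_by_subject(ref_dir, seg_dir):
    """Pair ref/seg files sharing a filename (assumed naming convention:
    matching basenames in the two folders)."""
    ref = {f: os.path.join(ref_dir, f) for f in sorted(os.listdir(ref_dir))}
    seg = {f: os.path.join(seg_dir, f) for f in sorted(os.listdir(seg_dir))}
    return [(ref[k], seg[k]) for k in ref if k in seg]
```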
- Link classical metrics to their trainable counterparts (GPU-based if possible, with support for backpropagation)
- Allow for binary or probabilistic input
- For segmentation - provide results at different thresholds (potentially predefined by user)
- Allow for multi label input
- Produce a report CSV file for the evaluation with aggregate statistics over the different metrics
- Use a pandas DataFrame to gather all results; save to csv/xlsx depending on the evaluation (multi label / mono-label / probability thresholds…)
- Specify the output format as an option - suggest one according to task
- Csv/xlsx/html/ for individual subject
- Potentially html for aggregation building on challengeR (going towards WebToolkit) - to discuss with dev team on best way to integrate
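A minimal sketch of the DataFrame-based report: gather per-case metric dicts, append aggregate rows, and optionally write CSV (xlsx/html would go through the corresponding pandas writers; column names are illustrative):

```python
import pandas as pd

def build_report(per_case_results, out_csv=None):
    """Gather per-case metric dicts into a DataFrame, append aggregate
    statistics as extra rows, and optionally save to CSV."""
    df = pd.DataFrame(per_case_results).set_index("case_id")
    agg = df.agg(["mean", "median", "std", "min", "max"])
    report = pd.concat([df, agg])
    if out_csv is not None:
        report.to_csv(out_csv)  # other formats: to_excel / to_html
    return report
```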
- Implement Dice score metrics allowing for multiple options when the metric is not defined
- Add optional epsilon to handle nans if needed (both on numerator and denominator)
- Optional function to handle nans in aggregation
- Implement nan-handling functions
- For all metrics - 2 outputs - nan_handled / not nan_handled - To discuss further
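A sketch of the Dice items above: the epsilon is optional and applied to both numerator and denominator, the undefined case (both masks empty) can either yield NaN or a fallback value, and a NaN-aware aggregation function handles the population step. Parameter names are illustrative:

```python
import numpy as np

def dice_score(ref, pred, eps=None, nan_for_empty=True):
    """Dice score with explicit handling of the undefined case.

    With eps set, eps is added to numerator and denominator, so the
    empty/empty case yields 1 instead of NaN; otherwise the undefined
    case returns NaN (or 1.0) and is dealt with at aggregation time.
    """
    ref, pred = np.asarray(ref, bool), np.asarray(pred, bool)
    num = 2.0 * np.sum(ref & pred)
    den = ref.sum() + pred.sum()
    if eps is not None:
        return (num + eps) / (den + eps)
    if den == 0:
        return np.nan if nan_for_empty else 1.0
    return num / den

def nanmean_aggregate(values):
    """Optional NaN-aware aggregation over a population of scores."""
    return float(np.nanmean(values))
```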
- Implement Hausdorff distance using percentile as argument
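A brute-force sketch of the percentile Hausdorff distance (HD95 by default). It assumes surface/boundary points have already been extracted as coordinate arrays; for volumetric masks a distance-transform approach would be more efficient:

```python
import numpy as np

def percentile_hausdorff(pts_a, pts_b, percentile=95):
    """Hausdorff distance with a percentile argument.

    Operates on (n, dim) arrays of point coordinates (e.g. surface
    voxels); percentile=100 recovers the classic maximum Hausdorff.
    """
    a = np.asarray(pts_a, float)
    b = np.asarray(pts_b, float)
    # pairwise Euclidean distances between the two point sets
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_ab = np.percentile(d.min(axis=1), percentile)  # directed A -> B
    d_ba = np.percentile(d.min(axis=0), percentile)  # directed B -> A
    return max(d_ab, d_ba)
```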
- Implement binary based confusion matrix metrics
- Report on raw data from confusion matrix
- Implement GDSC
- Implement Surface dice
- Implement Average surface distance.
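The two surface metrics above could be sketched as below, again on precomputed surface point coordinates (surface extraction is assumed to happen upstream). This surface Dice counts points within tolerance rather than weighting by surface area, which is a simplification:

```python
import numpy as np

def _pairwise(a, b):
    """(n_a, n_b) matrix of Euclidean distances between two point sets."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def average_surface_distance(surf_a, surf_b):
    """Average symmetric surface distance over two sets of surface points."""
    d = _pairwise(surf_a, surf_b)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def surface_dice(surf_a, surf_b, tolerance=1.0):
    """Surface Dice: fraction of surface points lying within `tolerance`
    of the other surface, counted over both directions."""
    d = _pairwise(surf_a, surf_b)
    close_ab = (d.min(axis=1) <= tolerance).sum()
    close_ba = (d.min(axis=0) <= tolerance).sum()
    return (close_ab + close_ba) / (d.shape[0] + d.shape[1])
```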