Task And Evaluation - HEad and neCK TumOR Lesion Segmentation, Staging and Prognosis

Task & Evaluation

This challenge consists of three interconnected subtasks: tumor segmentation, TN staging classification, and prognosis prediction. Each subtask is evaluated using metrics specifically chosen to reflect its clinical objectives and technical challenges.

1. Segmentation Task

The segmentation task focuses on delineating two clinically relevant structures:

- Primary tumor (GTVp)
- Nodal tumors (GTVn)

Metrics

Segmentation performance is evaluated using the Dice Similarity Coefficient (DSC), which measures the spatial overlap between predicted and ground-truth segmentations.

Dice is computed separately for GTVp and GTVn, and the final segmentation score is obtained by averaging both.
GTVn may include zero, one, or multiple lesions per patient.
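
The averaging described above can be sketched as follows, using DSC = 2|A ∩ B| / (|A| + |B|). This is a minimal illustration, not the official evaluation code: the label values (1 for GTVp, 2 for GTVn), the flattened label-map input, and the score returned when both masks are empty are all assumptions of this sketch.

```python
from typing import Sequence


def dice(pred: Sequence[int], truth: Sequence[int], empty_value: float = 1.0) -> float:
    """Dice Similarity Coefficient between two flattened binary masks.

    `empty_value` is returned when both masks are empty. That convention is
    an assumption of this sketch; the challenge text does not specify it.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return empty_value
    return 2.0 * intersection / total


def segmentation_score(pred_labels: Sequence[int], truth_labels: Sequence[int],
                       gtvp_label: int = 1, gtvn_label: int = 2) -> float:
    """Mean of the GTVp and GTVn Dice scores for one flattened label map.

    The label values 1 and 2 are hypothetical defaults.
    """
    scores = []
    for label in (gtvp_label, gtvn_label):
        pred_mask = [int(v == label) for v in pred_labels]
        truth_mask = [int(v == label) for v in truth_labels]
        scores.append(dice(pred_mask, truth_mask))
    return sum(scores) / len(scores)
```

For example, `segmentation_score([0, 1, 1, 2, 2, 0], [0, 1, 0, 2, 2, 2])` averages a GTVp Dice of 2/3 and a GTVn Dice of 0.8.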

Justification

Dice is widely used in medical image segmentation due to its robustness to class imbalance and its direct measurement of spatial overlap.

The distinction between GTVp and GTVn reflects their clinical differences:

- GTVp typically consists of a single lesion and is well-suited to standard DSC evaluation.
- GTVn may contain multiple lesions or none, and underestimating or overestimating the number of lesions or their sizes directly affects the N stage in clinical settings.

2. TN Staging Task

The TN staging task is formulated as a multi-label, multi-class classification problem, predicting tumor (T) and nodal (N) stages.

Metrics

Balanced Accuracy

Justification

Balanced accuracy ensures that all classes are equally weighted, making it robust to class imbalance and preventing dominance by frequent stages.

Because balanced accuracy is the mean of per-class recall, it also emphasizes the model’s ability to correctly identify each disease stage, which is critical in clinical settings where missing a stage may significantly affect treatment decisions.
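
As an illustration, balanced accuracy can be computed as the unweighted mean of per-class recall and then averaged over the T and N predictions. This is a sketch only: the stage labels in the usage note are hypothetical, and restricting the mean to classes that occur in the ground truth is an assumption rather than a documented rule of the challenge.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = defaultdict(int)  # correctly predicted cases per true class
    support = defaultdict(int)  # number of cases per true class
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)


def tn_staging_score(t_true, t_pred, n_true, n_pred):
    """Mean balanced accuracy across the T and N stage predictions."""
    return 0.5 * (balanced_accuracy(t_true, t_pred)
                  + balanced_accuracy(n_true, n_pred))
```

With true T stages `["T1", "T1", "T1", "T2"]` and predictions `["T1", "T1", "T2", "T2"]`, the per-class recalls are 2/3 and 1, so the balanced accuracy is 5/6 even though plain accuracy is 3/4.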

3. Prognosis Task

The prognosis task aims to predict patient outcomes based on survival data.

Metric

Concordance Index (C-index)

Justification

The C-index evaluates the model’s ability to correctly rank patients by risk. It is particularly suitable for survival analysis as it:

- Handles censored data
- Provides a global measure of discrimination
- Generalizes the concept of AUC to time-to-event data
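
To make the handling of censored data concrete, here is a minimal sketch of Harrell's C-index. It assumes that a higher predicted score means higher risk (shorter expected time to event), that `events[i]` is 1 for an observed event and 0 for a censored patient, and that pairs with tied times are skipped. The official evaluation may use different conventions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i has an observed event and
    a strictly shorter time than patient j. The pair is concordant when
    patient i also has the higher predicted risk; tied risks count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly, and 0.5 corresponds to random ranking, which mirrors the interpretation of AUC.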

Ranking Methodology

Performance is first evaluated independently for each subtask:

Segmentation: Mean Dice score across GTVp and GTVn
TN Staging: Mean balanced accuracy across T and N predictions
Prognosis: C-index

Teams are ranked separately for each subtask based on these metrics. A final overall ranking is then computed using a weighted aggregation framework:

Segmentation: 0.25
TN staging: 0.35
Prognosis: 0.40

This weighting scheme reflects the increasing clinical complexity and impact of downstream tasks, with greater emphasis placed on TN staging and prognosis. While segmentation remains foundational, its influence is moderated since its performance propagates into subsequent tasks.
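
The text above does not spell out the aggregation formula, so the following is only one plausible reading: each team's per-task ranks (1 is best) are combined in a weighted sum using the stated weights, and the lowest total ranks first. The task keys, the tie handling within a task, and the choice to weight ranks rather than raw scores are all assumptions of this sketch.

```python
# Weights as stated in the ranking methodology; the dictionary keys are hypothetical names.
WEIGHTS = {"segmentation": 0.25, "tn_staging": 0.35, "prognosis": 0.40}


def rank_teams(scores):
    """Map {team: metric} to {team: rank}; higher metric is better, 1 is best.

    Teams with equal metrics share the best available rank.
    """
    ordered = sorted(scores.values(), reverse=True)
    return {team: ordered.index(value) + 1 for team, value in scores.items()}


def overall_ranking(task_scores, weights=WEIGHTS):
    """Return [(team, weighted_rank)] sorted from best to worst.

    `task_scores` maps each task name to a {team: metric} dictionary.
    Weighting ranks (not raw scores) is an assumption of this sketch.
    """
    per_task_ranks = {task: rank_teams(task_scores[task]) for task in weights}
    teams = list(next(iter(task_scores.values())))
    weighted = {
        team: sum(weights[task] * per_task_ranks[task][team] for task in weights)
        for team in teams
    }
    return sorted(weighted.items(), key=lambda item: item[1])
```

Under this reading, a team that ranks first in segmentation but third in both TN staging and prognosis gets 0.25·1 + 0.35·3 + 0.40·3 = 2.5, and would finish behind a team with ranks 2, 1, 1 (weighted rank 1.25).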

Tie-breaking and Statistical Analysis

To address potential ties, a consistency metric—defined as the difference between weighted and unweighted averages—will be used as a tie-breaker.

Additionally, statistical analyses (e.g., bootstrapping) will be conducted to estimate confidence intervals and assess whether differences between top-performing teams are statistically significant.
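
As a sketch of the kind of bootstrapping mentioned above, a percentile confidence interval can be estimated by resampling per-case scores with replacement. The resampling unit (one score per case), the number of resamples, the percentile method, and the fixed seed are assumptions chosen for illustration; the organizers' actual procedure is not described here.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def bootstrap_ci(values, metric=mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `metric(values)`.

    Cases are resampled with replacement; the seed makes the sketch
    reproducible and is not part of any official protocol.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    n = len(values)
    stats = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Non-overlapping intervals for two teams suggest a real difference, although a paired comparison on the same resampled cases would be a more sensitive test.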

Handling Missing Submissions

Segmentation: Missing predictions are treated as empty masks (no GTVp or GTVn predicted)
TN staging: Missing predictions are treated as misclassifications
Prognosis: Missing scores are treated as non-concordant pairs

This ensures fair and consistent evaluation across all participants.