Task & Evaluation

This challenge consists of three interconnected subtasks: tumor segmentation, TN staging classification, and prognosis prediction. Each subtask is evaluated using metrics specifically chosen to reflect its clinical objectives and technical challenges.


1. Segmentation Task

The segmentation task focuses on delineating two clinically relevant structures:

  • Primary tumor (GTVp)
  • Nodal tumors (GTVn)

Metrics

Segmentation performance is evaluated using the Dice Similarity Coefficient (DSC), which measures the spatial overlap between predicted and ground-truth segmentations.

  • Dice is computed separately for GTVp and GTVn, and the final segmentation score is obtained by averaging both.
  • GTVn may include zero, one, or multiple lesions per patient.
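As a rough illustration of the segmentation score, the sketch below computes Dice on flattened binary masks and averages the GTVp and GTVn values. The function names and the `empty_score` parameter are hypothetical. The value returned when both masks are empty (for example, a patient with no nodal lesions) is an assumption, since that convention is not stated here.

```python
from typing import Sequence


def dice(pred: Sequence[int], truth: Sequence[int], empty_score: float = 1.0) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks are empty. Returning empty_score is an assumption,
        # not a rule taken from the challenge description.
        return empty_score
    return 2.0 * intersection / total


def segmentation_score(dice_gtvp: float, dice_gtvn: float) -> float:
    """Average the two structure-wise Dice values into one score."""
    return (dice_gtvp + dice_gtvn) / 2.0


# Toy example: 4-voxel masks for each structure.
gtvp = dice([1, 1, 0, 0], [1, 0, 0, 0])  # one shared voxel, three foreground voxels in total
gtvn = dice([0, 0, 0, 0], [0, 0, 1, 1])  # nodal lesion missed entirely
score = segmentation_score(gtvp, gtvn)
```

A prediction that misses an existing nodal lesion scores a Dice of 0 for GTVn, which halves the averaged segmentation score at best.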

Justification

Dice is widely used in medical image segmentation due to its robustness to class imbalance and its direct measurement of spatial overlap.

The distinction between GTVp and GTVn reflects their clinical differences:

  • GTVp typically consists of a single lesion and is well suited to standard DSC evaluation.
  • GTVn may contain multiple lesions or none. Underestimating or overestimating the number of lesions or their sizes directly affects the N stage in clinical settings.


2. TN Staging Task

The TN staging task is formulated as a multi-label, multi-class classification problem, predicting tumor (T) and nodal (N) stages.

Metrics

  • Balanced Accuracy

Justification

Balanced accuracy ensures that all classes are equally weighted, making it robust to class imbalance and preventing dominance by frequent stages.

Because balanced accuracy is the mean of per-class recall, it also emphasizes the model’s ability to correctly identify each disease stage, which is critical in clinical settings where missing a stage may significantly affect treatment decisions.
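Balanced accuracy can be sketched as the unweighted mean of per-class recall, computed once for T and once for N and then averaged. The function names and the stage labels in the example are hypothetical, and classes are taken from the ground-truth labels only, which is an assumption.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        counts[truth] += 1
        hits[truth] += int(truth == pred)
    recalls = [hits[c] / counts[c] for c in counts]
    return sum(recalls) / len(recalls)


def staging_score(t_true, t_pred, n_true, n_pred):
    """Average the balanced accuracy of the T and N predictions."""
    return (balanced_accuracy(t_true, t_pred) + balanced_accuracy(n_true, n_pred)) / 2.0


# Illustrative labels only; the real label set is defined by the challenge data.
t_true, t_pred = ["T1", "T1", "T1", "T2"], ["T1", "T1", "T2", "T2"]
n_true, n_pred = ["N0", "N0", "N1", "N2"], ["N0", "N0", "N1", "N1"]
score = staging_score(t_true, t_pred, n_true, n_pred)
```

In the N example, plain accuracy would be 3/4, but the single N2 case that is never recovered pulls balanced accuracy down to 2/3, which is the behavior the metric is chosen for.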


3. Prognosis Task

The prognosis task aims to predict patient outcomes based on survival data.

Metric

  • Concordance Index (C-index)

Justification

The C-index evaluates the model’s ability to correctly rank patients by risk. It is particularly suitable for survival analysis as it:

  • Handles censored data
  • Provides a global measure of discrimination
  • Generalizes the concept of AUC to time-to-event data
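The handling of censoring can be made concrete with a minimal sketch of Harrell's C-index. Which C-index variant the challenge uses is not specified here, so this is an assumption, as are the function name, the convention that a higher risk score means shorter expected survival, and the choice to skip pairs with tied times.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk scores (higher = earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored patient cannot be the earlier member of a pair,
            # because the true event time is unknown.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third patient is censored at time 6.
c = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.7, 0.8, 0.1],
)
```

Here five pairs are comparable and four are ordered correctly, so the score is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.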


Ranking Methodology

Performance is first evaluated independently for each subtask:

  • Segmentation: Mean Dice score across GTVp and GTVn
  • TN Staging: Mean balanced accuracy across T and N predictions
  • Prognosis: C-index

Teams are ranked separately for each subtask based on these metrics. A final overall ranking is then computed using a weighted aggregation framework:

  • Segmentation: 0.25
  • TN staging: 0.35
  • Prognosis: 0.40

This weighting scheme reflects the increasing clinical complexity and impact of downstream tasks, with greater emphasis placed on TN staging and prognosis. While segmentation remains foundational, its influence is moderated since its performance propagates into subsequent tasks.
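One possible reading of the aggregation is sketched below: each team is ranked per subtask, and the ranks are combined with the stated weights so that a lower weighted rank is better. Applying the weights to ranks rather than to raw scores is an assumption, and the function names, team names, and scores are hypothetical.

```python
WEIGHTS = {"segmentation": 0.25, "staging": 0.35, "prognosis": 0.40}


def task_ranks(scores):
    """Rank teams on one subtask: 1 is best, tied scores share a rank."""
    return {
        team: 1 + sum(other > score for other in scores.values())
        for team, score in scores.items()
    }


def overall_ranking(results):
    """Combine per-subtask ranks with WEIGHTS; lower weighted rank is better.

    results maps subtask name -> {team: score}, higher score is better.
    """
    ranks = {task: task_ranks(results[task]) for task in WEIGHTS}
    teams = list(results["segmentation"])
    weighted = {
        team: sum(WEIGHTS[task] * ranks[task][team] for task in WEIGHTS)
        for team in teams
    }
    return sorted(weighted.items(), key=lambda item: item[1])


# Made-up scores for three hypothetical teams.
results = {
    "segmentation": {"A": 0.80, "B": 0.75, "C": 0.70},
    "staging": {"A": 0.60, "B": 0.70, "C": 0.65},
    "prognosis": {"A": 0.68, "B": 0.66, "C": 0.72},
}
ranking = overall_ranking(results)
```

In this toy example team C finishes first despite the weakest segmentation, because it ranks best on the most heavily weighted subtask.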

Tie-breaking and Statistical Analysis

To address potential ties, a consistency metric, defined as the difference between weighted and unweighted averages, will be used as a tie-breaker.

Additionally, statistical analyses (e.g., bootstrapping) will be conducted to estimate confidence intervals and assess whether differences between top-performing teams are statistically significant.
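As an illustration of the bootstrapping step, the sketch below computes a percentile confidence interval for the mean of per-patient values by resampling patients with replacement. The percentile method, the number of resamples, the function name, and the example values are all assumptions; the exact procedure is not specified here.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample patients with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))]
    return lower, upper


# Hypothetical per-patient Dice differences between two teams (team 1 minus team 2).
diffs = [0.05, 0.02, 0.08, 0.01, 0.04, 0.06, 0.03, 0.07]
low, high = bootstrap_ci(diffs)
```

If the interval for the paired differences excludes zero, the gap between the two teams would be considered significant at the chosen level under this procedure.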

Handling Missing Submissions

  • Segmentation: Missing predictions are treated as empty masks (no GTVp or GTVn predicted)
  • TN staging: Missing predictions are treated as misclassifications
  • Prognosis: Missing scores are treated as non-concordant pairs

This ensures fair and consistent evaluation across all participants.