Pediatric critical care is a dynamic, high-stakes process involving constant monitoring and adjustments in life-saving treatments. We frame clinical decision-making as imitation learning from observational data, in which actions are not directly observed. We find that the TabPFN-based approach consistently outperforms classical baselines, supporting its use as a strong clinician-behavior baseline for pediatric ECMO decision support.

Key challenges in pediatric ECMO

Building decision support systems for this domain is complicated by the incomplete and difficult-to-formalize nature of pediatric-specific protocols, alongside the ethical and practical impossibility of experimentation on critically ill children.

  1. Irregular time between actions. Clinical observations arrive at irregular intervals; we aggregate trajectories into hourly windows.
  2. Concurrent, high-dimensional action space. Several parameters can be changed simultaneously, without explicit annotations.
  3. No explicit intervention labels. Interventions are not explicitly timestamped in the raw dataset; we infer each action by monitoring its delta over a 1-hour window.
  4. Severe data scarcity. A large number of observations for a small number of patients, i.e., imbalance of the features vs examples.
  5. No explicit reward signal. One has to infer the success of a trajectory based on the patient outcome. Evaluating new strategies directly on live patients is ethically and practically unfeasible.

Imitation learning pipeline

Unlike standard reinforcement learning, which requires active environmental interaction, offline imitation learning allows us to exploit expert demonstrations already present in historical clinical records.


Fig. 1. Pipeline for learning clinical behavior from unlabeled ECMO trajectories. Raw physiologic telemetry is processed into discrete action labels using physician-defined thresholds. These discovered actions formulate an imitation learning task: predicting clinician "knob" adjustments from patient state.


Actionable features (knobs)

A discrete action is identified when the absolute change within a 60-minute window exceeds the specified threshold.

KnobThresholdAction space
Arterial PO₂25 mmHgIncrease / Same / Decrease
Arterial PCO₂5 mmHgIncrease / Same / Decrease
SpO₂5 %Increase / Same / Decrease
FiO₂10 %Increase / Same / Decrease


Pediatric ECMO cohort

78 pediatric ECMO trajectories obtained from Children’s Medical Center, Dallas. Cases with congenital heart disease were excluded due to the complexity of specialized management.

MetricValue
Patient trajectories78
Average patient age4.66 years
Average trajectory length215 hours
State features per hour48 (23 clinical variables + deltas + ECMO type + on-ECMO indicator)
Actionable knobs4 (PO₂, PCO₂, SpO₂, FiO₂)
SourceChildren's Medical Center, Dallas
Exclusion criterionCongenital Heart Disease cases


State features

For each variable, we include the mean value within a 60-minute window and its delta relative to the previous window. Heart rate (HR) and mean arterial pressure (ARTm) are age-normalized.

CategoryFeatures
HemodynamicsMean Arterial Pressure (ARTm), Heart Rate (HR), SpO₂, Cerebral Oximetry (rSO₂-1, rSO₂-2)
ECMO CircuitBlood Flow (mL/kg/min), Sweep Gas CO₂ Flow, Sweep Gas O₂ Flow, FiO₂ – ECMO, Volume Sensor
VentilatorMean Airway Pressure (PAW), PEEP, Oxygen Concentration (FiO₂), Tidal Volume, End-tidal CO₂ (etCO₂)
LaboratoryArterial PO₂, PCO₂, pH, Base Excess, Lactate, Ionized Calcium, Total CO₂, INR


Evaluation across three questions

We evaluate generalization performance, calibration, and alignment with clinician decisions using leave-one-out (LOO) cross-validation. Given the inherent class imbalance, we quantify performance using balanced accuracy and macro-F1 scores for each action head.

Q1: Do policies generalize?

TabPFN achieves the highest mean balanced accuracy across the majority of the actions. The consistent improvement in balanced accuracy suggests that TabPFN is more resilient to the class imbalance inherent to this setting and generalizes more effectively to held-out patients. Ranking consistent across all four action knobs: PO₂, PCO₂, SpO₂, FiO₂.


Q2: Are policies calibrated?

Expected Calibration Error (ECE) summarizes the average gap between a model’s stated confidence and its observed accuracy across probability bins. Lower ECE indicates more reliable estimates of uncertainty. TabPFN consistently achieves a lower ECE than the MLP and XGBoost baselines. The posterior probabilities produced by TabPFN’s in-context inference closely match its empirical accuracy, providing reliable confidence estimates.


Q3: Where do policies disagree with clinicians?

We identify regions of the state space with the highest model-expert disagreement by training shallow decision trees on the residual errors of their predicted probabilities. The table shows the most frequent features appearing in high-disagreement states, reported as the number of times each feature was selected across all knobs.

FeatureMLPTabPFNXGBoost
FiO₂255
SpO₂153
pH232
PO₂321
PCO₂231
Lactate311


Oxygenation-related signals, specifically FiO₂ and SpO₂, consistently define the boundaries where model predictions deviate from clinician actions. The models deviate from clinician actions in different regions of the state space, motivating future expert review of high-disagreement states.

Conclusion

We presented an imitation learning framework to model clinical decision-making in pediatric ECMO. Our framework maps patient states to concurrent clinical interventions by inferring action labels from patient trajectories and employing a factorized, hierarchical policy architecture.

Our empirical evaluation finds that TabPFN consistently outperforms traditional gradient-based baselines, suggesting that it can serve as an effective clinician-behavior baseline in challenging critical care settings. TabPFN encodes a broad tabular prior from its extensive pretraining on millions of synthetic datasets, enabling probabilistic inference in a single forward pass without requiring dataset-specific gradient-based training or hyperparameter tuning.

Future work

  • Analyze whether model-clinician disagreements correlate with patient outcomes, such as risk of neurological injury.
  • Explore policy improvement by incorporating expert-defined rewards.
  • Perform a systematic analysis on the high-disagreement extracted rules and use them to elicit expert feedback for policy refinement.

Citation

If you build on or use portions of this work, please cite:

@inproceedings{golivand2026ecmo,
  title     = {Imitation Learning for Clinical Decision Support in Pediatric {ECMO}},
  author    = {Golivand Darvishvand, Fateme and Skinner, Michael and Mathur, Saurabh
               and Soni, Ameet and Reeder, Phillip and Kersting, Kristian
               and Raman, Lakshmi and Natarajan, Sriraam},
  booktitle = {Proceedings of the 24th International Conference on
               Artificial Intelligence in Medicine (AIME)},
  year      = {2026},
  note      = {Supported by NIH award R01NS133142}
}

Acknowledgements

We gratefully acknowledge the support of NIH award R01NS133142.