We asked domain experts and three LLMs to construct causal graphs for life-space mobility in rural dementia patients. They agreed on most edges, but disagreed on one that reveals a fundamentally different view of what matters most in dementia care. When tested against real data, none of the graphs fit well, but they failed in different ways.

Problem: reasoning about interventions without enough data

Life-space mobility captures a person’s physical and social environment, movement, and daily activities. A decrease in life-space is associated with various measures of declining health in older adults, including cognitive decline. However, Life-Space Assessments (LSAs) have primarily focused on urban populations, limiting their applicability to rural or Indigenous settings where dementia risk is higher.

Developing interventions to improve outcomes for these populations requires causally modeling the relationship between life-space mobility and relevant demographic and environmental variables. This is a particularly hard problem, because the data needed for standard causal discovery does not exist.

Approach: four methods for causal graph construction

We consider a tiny dataset of 20 patients in rural and Indigenous communities in northern Minnesota, USA, and Ontario, Canada. For this domain, we study several approaches for building causal graphs (shown in Figure 1). We compare them structurally and evaluate their empirical validity.


Fig. 1. Causal graph construction methods, including expert elicitation (1), LLM consensus (Claude, Gemini, GPT) (2), hybrid subtractive refinement (3), and a data-driven FCI baseline (4).


Variables (9 boolean features, n=20)

Continuous variables were binarized using expert-provided or sample mean-based thresholds.
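The binarization step can be sketched as follows. This is a minimal illustration, not the study's preprocessing code: the function name `binarize`, the strict "greater than" comparison, and the example scores are all assumptions.

```python
from statistics import mean


def binarize(values, threshold=None):
    """Map continuous values to booleans: True when strictly above the cut-off.

    If no expert threshold is supplied, fall back to the sample mean.
    (Strict '>' is an assumption; the source does not say how ties are handled.)
    """
    cut = mean(values) if threshold is None else threshold
    return [v > cut for v in values]


# Hypothetical scores for five patients (NOT study data).
scores = [22.0, 48.5, 60.0, 35.0, 71.5]

by_mean = binarize(scores)                    # sample mean is 47.4
by_expert = binarize(scores, threshold=60.0)  # assumed expert cut-off

# Share of patients coded True, comparable to the "Prevalence" column.
prevalence = sum(by_mean) / len(by_mean)
```

With only 20 patients, the choice of threshold can move several patients across the cut, so the prevalence of each boolean feature is worth checking after binarization.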

| Code | Variable | Source | Prevalence |
|------|----------|--------|------------|
| LS | Life-space score (high) | Life-Space Assessment | 45% |
| TB | Total burden on caregiver (high) | Caregiver diary | 40% |
| ND | Non-routine days (above median) | Caregiver diary | 60% |
| CD | Challenging days (above median) | Caregiver diary | 45% |
| CT | Community type (rural) | Demographic | 55% |
| PS | Patient sex (female) | Demographic | 35% |
| CS | Caregiver sex (female) | Demographic | 65% |
| PE | Patient education (above threshold) | Demographic | 35% |
| CE | Caregiver education (above threshold) | Demographic | 50% |


Findings: consensus, contradictions, and incompatibility with data

The LLMs agree on six direct causal relationships. Experts agree with five, but invert the sixth in a way that reveals a fundamental difference in modeling perspectives.


Fig. 2. Expert-constructed (left) vs. LLM consensus (right) causal graphs. Gray arrows = shared edges; blue dashed arrows = expert-only socioeconomic pathways; red arrow = LLM-only inverted key edge; greyed nodes = variables excluded by LLMs. The key structural disagreement: experts place TB→LS; LLMs place LS→TB.


Experts say: life-space score (LS) is the final sink.
Experts treat LS as the patient outcome to improve, and include sex and education as causal drivers.

LLMs say: caregiver burden (TB) is the final sink.
LLMs consistently make TB the ultimate outcome, and exclude socioeconomic variables entirely.

This disparity suggests a fundamental difference in modeling perspectives. It may also point to a bidirectional TB–LS relationship, which a single directed edge cannot represent without disaggregating the variables across time. The LLMs also leave out the socioeconomic factors (patient sex, caregiver sex, patient education, and caregiver education) that the experts identify as significant causal drivers.
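A structural comparison of this kind can be sketched with plain edge sets. The two toy graphs below are placeholders: only the TB/LS orientation comes from the text above, and every other edge is invented for illustration and should not be read as the experts' or LLMs' actual graphs. The helper names `compare_graphs` and `sinks` are likewise hypothetical.

```python
def compare_graphs(a, b):
    """Split two directed edge sets into shared, reversed, and one-sided edges."""
    shared = a & b
    # Edges of `a` that appear in `b` with the opposite orientation only.
    reversed_in_b = {(u, v) for (u, v) in a if (v, u) in b and (u, v) not in b}
    only_a = a - shared - reversed_in_b
    only_b = b - shared - {(v, u) for (u, v) in reversed_in_b}
    return shared, reversed_in_b, only_a, only_b


def sinks(edges):
    """Nodes that receive at least one edge and emit none."""
    sources = {u for u, _ in edges}
    targets = {v for _, v in edges}
    return targets - sources


# Toy graphs: only TB->LS (expert) and LS->TB (LLM) are taken from the text.
# ND->TB, CD->TB and PE->LS are PLACEHOLDER edges for illustration.
expert = {("ND", "TB"), ("CD", "TB"), ("TB", "LS"), ("PE", "LS")}
llm = {("ND", "TB"), ("CD", "TB"), ("LS", "TB")}

shared, flipped, expert_only, llm_only = compare_graphs(expert, llm)
expert_sinks = sinks(expert)
llm_sinks = sinks(llm)
```

In this toy setup the flipped edge is TB→LS, and the sink moves from LS in the expert graph to TB in the LLM graph, which mirrors the disagreement described above.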

Empirical validation

We treat each graph as a model predicting conditional independencies (CIs) and test them against the data using the G-test. Since incorrectly assuming independence is more detrimental than missing a true independence, we focus on Precision and False Positive Rate (FPR).
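To make the test concrete, here is a minimal standard-library sketch of a G-test for conditional independence on boolean records. It is not the authors' implementation: the function names, the decision to skip degenerate strata, and the example records are assumptions, and the records are synthetic rather than study data.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2:  # odd df: start from Q(1/2, y) = erfc(sqrt(y))
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:       # even df: start from Q(1, y) = exp(-y)
        a, q = 1.0, math.exp(-y)
    while a < df / 2.0:  # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def g_test(rows, x, y, given=()):
    """G-test of `x` independent of `y` given the variables in `given`.

    Returns (G, dof, p_value). A stratum whose table has a single row or
    column carries no information and adds nothing to G or to dof.
    """
    strata = {}
    for r in rows:
        key = tuple(r[z] for z in given)
        strata.setdefault(key, Counter())[(r[x], r[y])] += 1

    g, dof = 0.0, 0
    for table in strata.values():
        n = sum(table.values())
        row, col = Counter(), Counter()
        for (a, b), count in table.items():
            row[a] += count
            col[b] += count
        if len(row) < 2 or len(col) < 2:
            continue
        dof += (len(row) - 1) * (len(col) - 1)
        for (a, b), count in table.items():
            expected = row[a] * col[b] / n
            g += 2.0 * count * math.log(count / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    p = chi2_sf(g, dof) if dof else 1.0
    return g, dof, p


# Synthetic records (NOT study data): TB and LS always take the same value.
values = [True] * 4 + [False] * 4
rows = [{"TB": v, "LS": v, "CT": i % 2 == 0} for i, v in enumerate(values)]

g0, dof0, p0 = g_test(rows, "TB", "LS")                 # marginal test
g1, dof1, p1 = g_test(rows, "TB", "LS", given=("CT",))  # conditional on CT
```

Conditioning splits the sample into strata, so with n=20 and several conditioning variables many strata become empty or degenerate. In that situation the test has very little power, which is one reason to read the results below cautiously.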

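| Model | Total CIs | Precision | FPR |
|-------|-----------|-----------|-----|
| Expert | 113 | 0.73 | 0.53 |
| GPT | 45 | 0.67 | 0.48 |
| Claude | 45 | 0.73 | 0.39 |
| Gemini | 23 | 0.70 | 0.33 |
| LLM Consensus | 31 | 0.68 | 0.48 |
| Overall Consensus | 39 | 0.72 | 0.52 |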
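For readers who want to reproduce metrics of this kind, the sketch below uses the standard definitions of precision and false positive rate. It assumes that a "positive" is a conditional independence implied by the graph and that the ground truth is whether the data fail to reject that independence. This page does not spell out the exact convention, so treat the labeling as an assumption. The input lists are made up.

```python
def precision_and_fpr(predicted_independent, observed_independent):
    """Precision = TP / (TP + FP); FPR = FP / (FP + TN).

    Each position is one candidate statement "X independent of Y given Z".
    predicted_independent: True if the graph implies the independence.
    observed_independent:  True if the data do not reject it (assumed truth).
    """
    tp = fp = tn = fn = 0
    for predicted, observed in zip(predicted_independent, observed_independent):
        if predicted and observed:
            tp += 1
        elif predicted and not observed:
            fp += 1  # graph assumes an independence the data contradict
        elif observed:
            fn += 1  # graph misses an independence the data support
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return precision, fpr


# Made-up outcomes for eight candidate statements (NOT study results).
predicted = [True, True, True, True, False, False, False, False]
observed = [True, True, True, False, False, False, True, False]

precision, fpr = precision_and_fpr(predicted, observed)
```

False negatives are counted but do not enter either metric, which matches the stated priority: wrongly assuming independence is treated as the costlier error.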


Refinement eliminated nearly all edges, leaving fewer than three in the final graphs. The data-driven FCI baseline identified no directed causal edges, only three undirected associations. Neither the LLM-generated nor the expert-constructed graphs are compatible with the data, but they are incompatible in different ways, illustrating the structural differences between expert and LLM-based causal models.

Conclusion

Our analysis of the differences among sources of causal knowledge (domain experts, LLMs, and tiny datasets) provides a foundation for hybrid causal models. The TB–LS edge dispute illustrates that LLMs and experts encode meaningfully different assumptions about what the primary patient outcome is and which variables drive it. Understanding this interplay is essential for developing robust clinical decision-support systems in data-scarce, underserved populations. Three open questions follow directly from our findings:

  • Is the TB–LS edge bidirectional? The disagreement may reflect a genuine feedback loop requiring temporal disaggregation rather than a single directed edge.
  • Can high-disagreement regions guide targeted elicitation? The structural differences identified here could be used to solicit more focused expert feedback and build better causal priors.
  • How do causal models for rural and Indigenous populations differ from urban ones? Understanding these differences has direct implications for intervention design in underserved communities.

Citation

If you build on or use portions of this work, please cite:

@inproceedings{singh2026lifespace,
  title     = {Causal Models with Tiny Data: The Case of Rural People Living with Dementia},
  author    = {Singh, Ranveer and Mathur, Saurabh and Komarasamy, Kavimayil P.
               and Soni, Ameet and Whetung, Cliff and Warry, Wayne
               and Jacklin, Kristen and Blind, Melissa and Natarajan, Sriraam},
  booktitle = {Proceedings of the 24th International Conference on
               Artificial Intelligence in Medicine (AIME)},
  year      = {2026},
  note      = {Supported by NIH awards R01NS133142 and 1R21AG072566}
}

Acknowledgements

We gratefully acknowledge the support of NIH awards R01NS133142 and 1R21AG072566.