pith. machine review for the scientific record.

arxiv: 2604.21076 · v1 · submitted 2026-04-22 · 💻 cs.CL · cs.AI

Recognition: unknown

Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords FHIR · medication reconciliation · LLM · serialization · clinical data · model scaling

The pith

Serialisation strategy has a large effect on LLM performance for medication reconciliation from FHIR data, with clinical narrative best for smaller models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares four ways to format FHIR patient records for LLMs in medication reconciliation tasks. It shows that for models up to 8 billion parameters, presenting the data as a clinical narrative yields much higher accuracy than raw JSON, with gains of up to 19 F1 points. This pattern reverses for a 70B model, where raw JSON performs best. The study also finds that models tend to miss medications rather than invent them, and that smaller models degrade sharply on patients with many concurrent drugs, plateauing at roughly 7-10 active medications.

Core claim

Serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B. This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all cases, precision exceeds recall, with omission as the dominant error.
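
The precision/recall asymmetry behind the core claim can be made concrete with set arithmetic over medication names. The sketch below uses invented medication lists; only the metric definitions (per-patient precision, recall, F1, and the omission/fabrication split) come from the paper.

```python
# Score one patient's predicted active-medication list against gold, and
# split the errors into omissions (missed drugs) and fabrications (invented
# drugs). Medication names here are illustrative.

def reconciliation_metrics(gold: set[str], predicted: set[str]) -> dict:
    true_pos = gold & predicted
    omissions = gold - predicted       # missed medications (recall errors)
    fabrications = predicted - gold    # invented medications (precision errors)
    precision = len(true_pos) / len(predicted) if predicted else 0.0
    recall = len(true_pos) / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "omissions": sorted(omissions), "fabrications": sorted(fabrications)}

gold = {"metformin", "lisinopril", "atorvastatin", "warfarin"}
pred = {"metformin", "lisinopril", "atorvastatin"}   # one drug missed
scores = reconciliation_metrics(gold, pred)
# Precision 1.0 > recall 0.75: an omission-dominated error profile, the
# pattern the paper reports across all 20 model/strategy combinations.
```

A model that omits a drug lowers recall but leaves precision untouched, which is why "precision exceeds recall everywhere" implies omission is the dominant failure mode.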

What carries the argument

Comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) evaluated across five open-weight LLMs on 200 synthetic patient records.
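
To make the contrast between strategies concrete, here is a minimal sketch of two of the four serialisations applied to a hand-written FHIR-style MedicationRequest list. The field names follow FHIR R4 conventions, but the resources and the narrative wording are assumptions, not the paper's actual pipeline.

```python
import json

# Two toy FHIR-style MedicationRequest resources (illustrative only).
meds = [
    {"resourceType": "MedicationRequest", "status": "active",
     "medication": "metformin 500 mg", "authoredOn": "2025-11-02"},
    {"resourceType": "MedicationRequest", "status": "stopped",
     "medication": "ibuprofen 200 mg", "authoredOn": "2025-08-19"},
]

def raw_json(resources: list[dict]) -> str:
    """Raw JSON strategy: pass the resources verbatim as a JSON document."""
    return json.dumps({"entry": resources}, indent=2)

def clinical_narrative(resources: list[dict]) -> str:
    """Clinical Narrative strategy: render the same facts as prose."""
    lines = []
    for r in resources:
        verb = "is taking" if r["status"] == "active" else "previously took"
        lines.append(f"The patient {verb} {r['medication']} "
                     f"(ordered {r['authoredOn']}).")
    return " ".join(lines)
```

Both renderings carry identical information; the paper's finding is that sub-8B models extract it far more reliably from the prose form.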

Load-bearing premise

The 200 synthetic patient records sufficiently capture the distribution, noise, and edge cases of real clinical FHIR data.

What would settle it

Evaluating the same models and strategies on a set of real, de-identified clinical FHIR records would test whether the performance differences hold outside the synthetic benchmark.

Figures

Figures reproduced from arXiv: 2604.21076 by Sanjoy Pator.

Figure 2. Mean F1 per strategy across model sizes.
Figure 4. Mean recall by number of active medications.
Figure 3. Precision vs. recall for each (model, strategy) pair.
Figure 6. BioMistral-7B failure mode breakdown by strategy (200 patients each). Garbled or incoherent output dominates across all four strategies. Prompt repetition (the model echoes the system prompt instead of responding) accounts for 11–68 patients per strategy. No strategy produces a parseable JSON response.
Figure 7. F1 distribution per model on best strategy.
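
The BioMistral-7B failure analysis hinges on whether a model's free-text reply contains a parseable JSON object at all. The paper does not publish its exact parser, so the tolerant extractor below (strip markdown fences, then try each balanced `{...}` span) is an illustrative stand-in for that check, not the authors' implementation.

```python
import json
import re

def extract_json(reply: str):
    """Return the first parseable JSON object in a model reply, else None."""
    text = re.sub(r"```(?:json)?", "", reply)  # drop markdown code fences
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:  # balanced span: try to parse it
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break
        start = text.find("{", start + 1)  # try the next opening brace
    return None

ok = extract_json('Here you go:\n```json\n{"active": ["metformin"]}\n```')
bad = extract_json("I am a language model trained to ...")  # prompt echo
```

Under a scorer like this, a reply that echoes the prompt or emits incoherent text scores as unparseable, which is how every BioMistral-7B run ends up in the failure buckets of Figure 6.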
Original abstract

Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).
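
The abstract's effect size (r = 0.617 for Clinical Narrative vs. Raw JSON on Mistral-7B) is the kind of statistic a Wilcoxon signed-rank test yields on paired per-patient F1 scores. The sketch below computes a matched-pairs rank-biserial correlation on made-up score pairs to illustrate the arithmetic; it does not reproduce the paper's exact test configuration, and ties in |difference| get consecutive rather than averaged ranks here.

```python
# Matched-pairs rank-biserial correlation: r = (W+ - W-) / (W+ + W-),
# where W+ and W- are the rank sums of positive and negative differences
# after ranking the absolute per-patient F1 differences.

def rank_biserial(pairs: list[tuple[float, float]]) -> float:
    diffs = [a - b for a, b in pairs if a != b]  # drop zero differences
    ranked = sorted(diffs, key=abs)              # rank by |difference|
    w_pos = sum(rank for rank, d in enumerate(ranked, 1) if d > 0)
    w_neg = sum(rank for rank, d in enumerate(ranked, 1) if d < 0)
    return (w_pos - w_neg) / (w_pos + w_neg)

# Hypothetical per-patient (narrative_f1, raw_json_f1) pairs:
pairs = [(0.9, 0.7), (0.8, 0.8), (0.95, 0.6), (0.7, 0.75), (1.0, 0.8)]
r = rank_biserial(pairs)  # positive r favours the narrative format
```

With 200 patients per condition, even modest per-patient gaps in the same direction produce the very small p-values the abstract reports.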

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents the first systematic empirical comparison of four FHIR serialization strategies (Raw JSON, Markdown Table, Clinical Narrative, Chronological Timeline) across five open-weight LLMs on medication reconciliation using a controlled benchmark of 200 synthetic patients and 4000 inference runs. It claims that serialization strategy has a large, statistically significant effect on performance for models up to 8B parameters (e.g., Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B with r=0.617, p<10^{-10}), with the advantage reversing at 70B where Raw JSON achieves the highest mean F1 of 0.9956; omissions are the dominant error mode in all conditions, smaller models plateau at 7-10 concurrent medications, and BioMistral-7B produces no usable output.

Significance. If the results hold, this work provides actionable, evidence-based guidance for clinical LLM deployment by quantifying the impact of input formatting and its interaction with model scale, supported by statistical rigor and full reproducibility on open-source tools. The observation that domain pretraining without instruction tuning is insufficient is a useful negative result for the field.

major comments (1)
  1. [Abstract and Methods] The deployment recommendations (Clinical Narrative for models ≤8B, Raw JSON for 70B+) rest on performance differences measured exclusively on 200 synthetic patient records. The manuscript provides no details on the synthetic data generator's construction, validation against real FHIR distributions, or coverage of documentation variability, missing fields, and polypharmacy edge cases. This assumption is load-bearing for generalizability, as the reported F1 gaps, precision-recall patterns, and polypharmacy limitations may not transfer to real clinical data.
minor comments (2)
  1. [Abstract] The statement that 'the complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance' should include an explicit link to the code repository to facilitate verification.
  2. [Results] The consistent finding that mean precision exceeds mean recall across all 20 model-strategy combinations would benefit from a brief discussion of how this affects clinical safety auditing priorities, as noted in the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive evaluation of the work's significance. We address the major comment point by point below, with plans to revise the manuscript to improve clarity on the synthetic benchmark.

Point-by-point responses
  1. Referee: [Abstract and Methods] The deployment recommendations (Clinical Narrative for models ≤8B, Raw JSON for 70B+) rest on performance differences measured exclusively on 200 synthetic patient records. The manuscript provides no details on the synthetic data generator's construction, validation against real FHIR distributions, or coverage of documentation variability, missing fields, and polypharmacy edge cases. This assumption is load-bearing for generalizability, as the reported F1 gaps, precision-recall patterns, and polypharmacy limitations may not transfer to real clinical data.

    Authors: We agree that explicit details on the synthetic data generator are necessary to evaluate the scope of our findings. The current manuscript describes the benchmark as a controlled set of 200 synthetic patients but does not elaborate on its construction. In the revised manuscript we will add a new subsection in Methods that specifies: (1) the generator's design, including sampling from empirical distributions for demographics, medication counts, and common polypharmacy patterns drawn from publicly available clinical literature; (2) explicit simulation of missing fields, documentation variability, and edge cases such as duplicate entries or inactive medications; and (3) the rationale for the observed 7–10 medication plateau. We will also add a dedicated Limitations paragraph acknowledging that, while the synthetic data were constructed to reflect realistic clinical distributions, we did not conduct formal statistical validation against a specific real-world FHIR corpus. This controlled synthetic design was chosen precisely to isolate serialization effects from the confounding variability of real EHR data; we will clarify that the reported F1 differences and error patterns are therefore best interpreted as evidence of format sensitivity under standardized conditions rather than direct predictions for live clinical deployment.
    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or self-referential reductions

full rationale

The paper reports results from a controlled set of 4,000 inference runs comparing four serialization formats on 200 synthetic patients across five LLMs. All performance claims (F1 differences, precision-recall patterns, model-size reversals) are direct empirical measurements with no equations, fitted parameters renamed as predictions, or load-bearing self-citations. The central findings rest on observable output statistics rather than any derivation that reduces to the authors' own modeling choices or prior work by the same team.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study introduces no new mathematical axioms, free parameters, or invented entities. It relies on standard LLM inference, synthetic data generation, and conventional F1/precision/recall metrics.

pith-pipeline@v0.9.0 · 5675 in / 1155 out tokens · 63853 ms · 2026-05-10T00:13:25.976780+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 14 canonical work pages · 3 internal anchors

  1. Marah Abdin and 1 others. 2024. https://arxiv.org/abs/2404.14219 Phi-3 technical report: A highly capable language model locally on your phone. Preprint, arXiv:2404.14219

  2. Elham Asgari, Nina Montaña-Brown, Magda Dubois, Saleh Khalil, Jasmine Balloch, Joshua Au Yeung, and Dominic Pimenta. 2025. https://doi.org/10.1038/s41746-025-01670-7 Clinical safety and hallucination framework for LLMs in medical text summarisation. npj Digital Medicine

  3. Julien Delaunay, Daniel Girbes, and Jordi Cusido. 2025. https://doi.org/10.3390/app15063379 Evaluating the effectiveness of large language models in converting clinical data to FHIR format. Applied Sciences, 15(6):3379

  4. Andy Field. 2009. Discovering Statistics Using SPSS, 3rd edition. SAGE Publications

  5. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, and 1 others. 2023. https://arxiv.org/abs/2310.06825 Mistral 7B. Preprint, arXiv:2310.06825

  6. Yubin Kim, Hyewon Jeong, Shan Chen, and 1 others. 2025. https://arxiv.org/abs/2503.05777 Medical hallucination in foundation models and their impact on healthcare. Preprint, arXiv:2503.05777

  7. Maya Kruse, Shiyue Hu, Nicholas Derby, Yifu Wu, Samantha Stonbraker, Bingsheng Yao, Dakuo Wang, Elizabeth Goldberg, and Yanjun Gao. 2025. Large language models with temporal reasoning for longitudinal clinical summarisation and prediction. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20715--20735

  8. Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. https://arxiv.org/abs/2402.10373 BioMistral: A collection of open-source pretrained large language models for medical domains. Preprint, arXiv:2402.10373

  9. Paula M. Lantz and Richard L. Lichtenstein. 2020. Prescription drug use among older adults: National Poll on Healthy Aging. National Poll on Healthy Aging

  10. Yikuan Li, Hanyin Wang, Halid Z. Yerebakan, Yoshihisa Shinagawa, and Yuan Luo. 2024. https://doi.org/10.1056/aics2300301 FHIR-GPT enhances health interoperability with large language models. NEJM AI

  11. Llama Team, AI @ Meta. 2024. https://arxiv.org/abs/2407.21783 The Llama 3 herd of models. Preprint, arXiv:2407.21783

  12. Ollama: Get Up and Running with Large Language Models Locally

  13. Paul Schmiedmayer, Adrit Rao, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Arash Fereydooni, and Oliver Aalami. 2024. https://arxiv.org/abs/2402.01711 LLM on FHIR: Demystifying health records. Preprint, arXiv:2402.01711

  14. Paul Schmiedmayer, Adrit Rao, Philipp Zagar, Lauren Aalami, Vishnu Ravi, Aydin Zahedivash, Dong-han Yao, Arash Fereydooni, and Oliver Aalami. 2025. https://doi.org/10.1016/j.jacadv.2025.101780 LLMonFHIR: A physician-validated, LLM-based mobile application for querying patient EHR data. JACC: Advances, 4(6)

  15. Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott McLachlan. 2018. https://doi.org/10.1093/jamia/ocx079 Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health record. Journal of the American Medical Informatics Association

  16. Frank Wilcoxon. 1945. https://doi.org/10.2307/3001968 Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80--83

  17. Kaiyuan Wu, Aditya Nagori, and Rishikesan Kamaleswaran. 2026. https://arxiv.org/abs/2601.21113 Planner--auditor twin: Agentic discharge planning with FHIR-based LLM planning, guideline recall, optional caching and self-improvement. Preprint, arXiv:2601.21113

  18. Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, and Saddam Mukta. 2025. https://arxiv.org/abs/2512.16189 Mitigating hallucinations in healthcare LLMs with granular fact-checking and domain-specific adaptation. Preprint, arXiv:2512.16189