Recognition: no theorem link
Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)
Pith reviewed 2026-05-15 04:53 UTC · model grok-4.3
The pith
Reinforcement learning post-training raises a Qwen3-8B model to 77 percent correctness on FHIR clinical queries, surpassing larger closed models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By casting FHIR reasoning as sequential decision-making over a queryable structured graph and post-training a multi-turn CodeAct agent with reinforcement learning using execution-grounded rewards from an LLM judge, the authors achieve 77 percent answer correctness on FHIR-AgentBench with a Qwen3-8B model, compared to 50 percent for o4-mini, while maintaining data-integrity constraints.
What carries the argument
The RL post-trained multi-turn CodeAct agent equipped with a custom harness and an LLM judge that supplies execution-grounded rewards for traversals across FHIR resource graphs.
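The reward design described here can be sketched as a small function; the weights and the `judge_score` input below are hypothetical stand-ins, since the paper's exact shaping formula is not reproduced in this review.

```python
def trajectory_reward(execution_ok: bool, constraints_ok: bool, judge_score: float) -> float:
    """Hypothetical execution-grounded reward for one multi-turn trajectory.

    execution_ok:   did the final FHIR query run without error?
    constraints_ok: were traversal/data-integrity constraints respected?
    judge_score:    LLM-judge correctness score in [0, 1].
    The component weights below are illustrative, not the paper's actual values.
    """
    if not execution_ok:
        return 0.0                   # no reward for trajectories that fail to execute
    reward = 0.2                     # small bonus for successful execution
    if constraints_ok:
        reward += 0.2                # bonus for respecting data-integrity constraints
        reward += 0.6 * judge_score  # bulk of the reward comes from judged correctness
    return reward
```

The key property this sketch captures is that the judge's score only matters once the environment has verified execution and constraint satisfaction, which is what "execution-grounded" means here.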
If this is right
- Smaller open-weight models can reach higher accuracy than larger closed models on structured healthcare question-answering tasks.
- Reinforcement learning enforces data-integrity constraints during agent reasoning over clinical graphs.
- An end-to-end post-training pipeline reliably improves multi-turn reasoning on queryable structured data.
- Performance gains arise from reward-driven training rather than prompt engineering alone.
Where Pith is reading between the lines
- The same reinforcement-learning harness could be applied to other graph-structured interoperability standards outside healthcare.
- Lower inference costs from using smaller models could make clinical decision-support tools more accessible in resource-limited settings.
- Varying the judge model or adding human verification steps would test how sensitive the gains are to reward quality.
- Combining this approach with retrieval or planning modules from other agent frameworks might further reduce error rates on complex aggregation queries.
Load-bearing premise
An LLM judge can supply reliable execution-grounded rewards without bias or systematic errors when scoring multi-step FHIR traversals.
What would settle it
An evaluation of the trained agent on a fresh held-out portion of FHIR-AgentBench questions, scored against independent ground-truth execution results, that shows correctness at or below 50 percent.
Original abstract
Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. A LLM Judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.
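The multi-step traversal the abstract describes can be illustrated with a toy in-memory resource graph; the resources, fields, and codes below are simplified stand-ins for real FHIR structures, not the benchmark's actual data model.

```python
# Toy FHIR-like resource graph: resources reference each other by "ResourceType/id".
resources = {
    "Patient/p1": {"resourceType": "Patient", "name": "Example"},
    "Encounter/e1": {"resourceType": "Encounter", "subject": "Patient/p1"},
    "Observation/o1": {"resourceType": "Observation", "subject": "Patient/p1",
                       "encounter": "Encounter/e1", "code": "glucose", "value": 5.4},
    "Observation/o2": {"resourceType": "Observation", "subject": "Patient/p1",
                       "encounter": "Encounter/e1", "code": "glucose", "value": 7.1},
}

def observations_for_patient(patient_ref: str, code: str) -> list[float]:
    """Traverse the graph: collect Observation values whose subject reference
    points at the given patient and whose code matches."""
    return [r["value"] for r in resources.values()
            if r["resourceType"] == "Observation"
            and r.get("subject") == patient_ref
            and r.get("code") == code]

# A clinically meaningful question then requires filtering plus aggregation,
# e.g. the patient's mean glucose across matching observations.
values = observations_for_patient("Patient/p1", "glucose")
mean_glucose = sum(values) / len(values)
```

Even in this toy form, the failure modes the abstract mentions are visible: selecting the wrong resource type, or following `encounter` when the question requires `subject`, silently yields a wrong aggregate.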
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an end-to-end pipeline for post-training a Qwen3-8B model using reinforcement learning for multi-turn tool-calling agents in FHIR data querying. Framing the problem as sequential decision-making over a structured graph, the authors employ a CodeAct agent and an LLM Judge to provide execution-grounded rewards, reporting an improvement in answer correctness from 50% (o4-mini baseline) to 77% on the FHIR-AgentBench benchmark.
Significance. If the results are robust, this work would be significant for demonstrating that RL can enhance smaller open models to outperform larger proprietary ones on complex, constraint-heavy reasoning tasks in healthcare interoperability. It provides a practical pipeline for building reliable agents over clinical data graphs, potentially reducing costs and improving accessibility while enforcing data integrity.
major comments (2)
- [Reward Design] The central empirical claim (50% to 77% correctness) depends on the LLM Judge supplying reliable rewards for multi-step FHIR traversals. However, the manuscript provides no calibration data, human agreement metrics, or ablation studies on the judge's prompting or consistency, making it impossible to rule out systematic biases as the source of the performance lift rather than the RL training itself.
- [Experimental Setup] The abstract reports a clear numerical gain but the full methods lack details on training curves, reward shaping specifics, baseline implementations, statistical significance testing, or error analysis, which are necessary to assess the robustness of the 27-point improvement.
minor comments (1)
- [Methods] Clarify the exact definition of the CodeAct agent and how it interfaces with the FHIR environment in the methods section.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version.
Point-by-point responses
Referee: [Reward Design] The central empirical claim (50% to 77% correctness) depends on the LLM Judge supplying reliable rewards for multi-step FHIR traversals. However, the manuscript provides no calibration data, human agreement metrics, or ablation studies on the judge's prompting or consistency, making it impossible to rule out systematic biases as the source of the performance lift rather than the RL training itself.
Authors: We agree that explicit validation of the LLM Judge is required to support the central claim. In the revised manuscript we will add a dedicated subsection reporting calibration results: human-expert agreement on a random sample of 150 multi-turn trajectories (Cohen's kappa and raw agreement), plus ablations on judge prompt phrasing and temperature. These data will be placed in the main text with full details in the appendix. We note that rewards remain execution-grounded (FHIR query success and constraint violations are verified by the environment), which limits the scope for judge-induced bias, but the requested metrics will allow readers to quantify any residual judge variance. revision: yes
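The judge-versus-human agreement check promised in this response could be computed as follows; the labels are hypothetical, and the function is a plain two-rater Cohen's kappa for binary correct/incorrect judgments.

```python
def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters giving binary labels (1 = correct, 0 = incorrect)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement under independence, from each rater's marginal rates.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Hypothetical labels on 10 trajectories: LLM judge vs. human expert.
judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
kappa = cohens_kappa(judge, human)
```

Reporting kappa alongside raw agreement, as the authors propose, matters because raw agreement is inflated whenever both raters mark most trajectories correct.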
Referee: [Experimental Setup] The abstract reports a clear numerical gain but the full methods lack details on training curves, reward shaping specifics, baseline implementations, statistical significance testing, or error analysis, which are necessary to assess the robustness of the 27-point improvement.
Authors: We accept that the current Methods section is insufficiently detailed for reproducibility and robustness assessment. The revision will expand this section to include: (i) training curves for both reward and correctness metrics across epochs, (ii) the exact reward-shaping formula with component weights, (iii) verbatim prompt templates and tool configurations used for the o4-mini baseline, (iv) statistical significance results (bootstrap 95% CI and McNemar test on paired outcomes), and (v) a categorized error analysis of the remaining 23% failures. These additions will be integrated into the main paper and will directly address concerns about the reported 27-point gain. revision: yes
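The paired significance test the authors promise can be sketched as follows; the counts are hypothetical (chosen to be consistent with a 50% to 77% gain on 100 paired questions), and the statistic is the standard uncorrected McNemar chi-square on discordant pairs.

```python
def mcnemar_chi2(b: int, c: int) -> float:
    """McNemar chi-square (1 df, no continuity correction).

    b: questions the baseline answered correctly but the RL agent missed.
    c: questions the RL agent answered correctly but the baseline missed.
    Only discordant pairs carry information about a paired accuracy difference.
    """
    return (b - c) ** 2 / (b + c)

# Hypothetical discordant counts: baseline-only correct b = 5, RL-only correct c = 32,
# so the net gain is 27 questions out of 100, matching 50% -> 77%.
chi2 = mcnemar_chi2(5, 32)  # compare against 3.84, the 5% critical value at 1 df
```

The paired test is the right choice here because both systems answer the same benchmark questions, so question-level difficulty cancels out of the comparison.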
Circularity Check
No significant circularity; empirical result rests on external benchmark
full rationale
The paper frames FHIR reasoning as a sequential decision problem and reports an empirical performance lift (50% to 77% on FHIR-AgentBench) obtained by RL post-training of Qwen3-8B with rewards supplied by an LLM judge. No equations, derivations, or parameter-fitting steps are described that would reduce the reported correctness metric to the inputs by construction. The central claim depends on an external benchmark and the unvalidated assumption that the LLM judge supplies reliable execution-grounded rewards; this is an empirical assumption rather than a self-referential definition or a prediction forced by fitted parameters. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The derivation chain is therefore self-contained as a standard RL experiment on a public benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the LLM Judge supplies execution-grounded rewards that correlate with true answer correctness