Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Pith reviewed 2026-05-14 21:12 UTC · model grok-4.3
The pith
A dataset of 2,000 multimodal check-up reports benchmarks structured Action Card generation for patients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Checkup2Action introduces a dataset of 2,000 de-identified multimodal clinical check-up reports paired with structured Action Cards. Each card addresses one issue by specifying its priority level, recommended medical department, follow-up time window, a patient-accessible explanation, and relevant questions for clinicians, without including diagnoses or treatments. The benchmark evaluates models on metrics including issue coverage, priority consistency, recommendation accuracy, complexity, usefulness, readability, and safety. Experiments reveal trade-offs between issue coverage, action correctness, conciseness, and safety alignment across general-purpose and medical LLMs.
What carries the argument
The Action Card format, a structured per-issue output listing priority, department, time window, patient explanation, and clinician questions derived from multimodal report elements.
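As an illustration, such a card can be pictured as a small typed record per issue. The Python sketch below is hypothetical: the field names, priority levels, and example values are assumptions made for clarity, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative rendering of an Action Card; field names and priority levels
# are assumptions, not the authors' schema.
@dataclass
class ActionCard:
    issue: str                 # the clinically relevant finding this card addresses
    priority: str              # e.g. "urgent" | "soon" | "routine" (assumed levels)
    department: str            # recommended medical department
    time_window: str           # follow-up window, e.g. "within 4 weeks"
    explanation: str           # patient-facing, plain-language explanation
    questions: List[str] = field(default_factory=list)  # questions to ask the clinician

card = ActionCard(
    issue="Elevated fasting glucose",
    priority="soon",
    department="Endocrinology",
    time_window="within 4 weeks",
    explanation="Your blood sugar was above the reference range; a repeat test "
                "can clarify whether this needs attention.",
    questions=["Should I repeat the fasting glucose test?",
               "Do I need an HbA1c test?"],
)
```

Note that the explanation stays descriptive and the card asks questions rather than asserting a diagnosis, mirroring the paper's non-diagnostic constraint.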
If this is right
- Models can be compared directly on safety compliance and issue coverage using a shared protocol.
- The dataset supports development of constrained generation methods suited to clinical evidence.
- Evaluation metrics provide standardized ways to assess patient-oriented medical summarization.
- The benchmark reveals specific limitations in current LLMs when handling heterogeneous inputs such as tables, numbers, and images.
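For instance, issue coverage and precision can be sketched as set overlap between predicted and reference issues. This assumes issues are matched by normalized name; the paper's exact matching procedure may differ.

```python
# Illustrative issue coverage (recall) and precision against reference
# Action Cards, matching issues by normalized name. A sketch of one
# plausible protocol, not the paper's exact metric.
def coverage_and_precision(predicted_issues, reference_issues):
    pred = {p.strip().lower() for p in predicted_issues}
    ref = {r.strip().lower() for r in reference_issues}
    matched = pred & ref
    coverage = len(matched) / len(ref) if ref else 1.0    # share of reference issues found
    precision = len(matched) / len(pred) if pred else 1.0 # share of predictions that are real
    return coverage, precision

cov, prec = coverage_and_precision(
    ["Elevated fasting glucose", "Low vitamin D", "Sinus arrhythmia"],
    ["elevated fasting glucose", "low vitamin d", "fatty liver on ultrasound"],
)
# here both coverage and precision come out to 2/3
```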
Where Pith is reading between the lines
- The benchmark could support apps that convert reports into personalized follow-up reminders for patients.
- Integration with electronic health records might increase actual follow-up completion rates.
- The per-issue card structure may extend to other document types such as discharge summaries or test result packets.
Load-bearing premise
The manually created action cards used as ground truth are clinically accurate and safe.
What would settle it
A study in which multiple independent clinicians generate action cards for the same reports and measure agreement with the dataset annotations or model outputs.
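Such an agreement study would typically report a chance-corrected statistic. Below is a minimal sketch of Cohen's kappa applied to two annotators' priority labels; the label set and data are invented for illustration.

```python
from collections import Counter

# Cohen's kappa for two annotators labeling the same items (e.g. priority
# levels assigned to the same reports). Values near 1 indicate agreement
# well beyond chance; the guard assumes chance agreement is below 1.
def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

a = ["urgent", "soon", "routine", "soon", "routine", "routine"]
b = ["urgent", "soon", "soon",    "soon", "routine", "routine"]
kappa = cohens_kappa(a, b)  # = 17/23, roughly 0.74
```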
Original abstract
Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present Checkup2Action, a multimodal clinical check-up report dataset and benchmark for structured Action Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, and imaging-related evidence. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Checkup2Action, a multimodal dataset of 2,000 de-identified real-world clinical check-up reports paired with manually created structured Action Cards. Each card specifies one clinically relevant issue along with its priority, recommended department, follow-up time window, patient-facing explanation, and clinician questions. The work formulates checkup-to-action generation as a constrained structured generation task, defines an evaluation protocol covering issue coverage, priority consistency, department/time accuracy, action complexity, usefulness, readability, and safety compliance, and benchmarks general-purpose and medical LLMs to reveal trade-offs among coverage, correctness, conciseness, and safety alignment.
Significance. If the reference Action Cards prove clinically reliable, the dataset and benchmark would fill a clear gap in patient-oriented clinical NLP by providing the first large-scale multimodal resource for translating heterogeneous check-up reports (layouts, tables, biomarkers, imaging) into safe, prioritized, non-diagnostic actions. The explicit release of the dataset together with a reproducible evaluation protocol is a concrete strength that enables community follow-up work.
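A "non-diagnostic, non-prescriptive" constraint of this kind is often enforced with an output screen. The sketch below is a deliberately naive keyword filter, shown only to make the constraint concrete; it is not the paper's safety-compliance metric, and the patterns are assumptions.

```python
import re

# Naive screen for diagnostic or treatment-prescriptive phrasing in generated
# card text. Patterns are illustrative assumptions; a real safety check would
# need far broader clinical coverage.
FORBIDDEN_PATTERNS = [
    r"\byou have\b",        # asserting a diagnosis
    r"\bdiagnos(is|ed)\b",
    r"\btake \d+\s?mg\b",   # prescribing a dose
    r"\bstart taking\b",
]

def violates_safety_constraint(card_text: str) -> bool:
    text = card_text.lower()
    return any(re.search(p, text) for p in FORBIDDEN_PATTERNS)

flagged = violates_safety_constraint("You have diabetes; take 500 mg metformin.")
allowed = violates_safety_constraint(
    "Your fasting glucose was above the reference range; "
    "consider discussing a follow-up test with endocrinology."
)
# flagged is True, allowed is False
```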
major comments (1)
- [§3] §3 (Dataset Construction): The 2,000 reference Action Cards are described as 'manually created' yet the manuscript supplies no annotation protocol details—number or qualifications of clinicians, blinding, adjudication process, or inter-annotator agreement statistics. Because every reported metric (issue coverage, priority consistency, safety compliance) is computed against these cards, the absence of agreement or validation data makes it impossible to assess whether systematic annotator bias or error affects the LLM comparisons.
minor comments (2)
- [Abstract] The abstract and §4 could more explicitly quantify the key experimental findings (e.g., which model achieved the highest safety compliance score) rather than only stating that 'clear trade-offs' exist.
- [§5] Figure captions and axis labels in the result plots should be enlarged for readability; several current labels are difficult to read at standard print size.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for identifying an important gap in the transparency of our dataset construction. We address the major comment below and will incorporate the requested details in the revised manuscript.
Point-by-point responses
-
Referee: [§3] §3 (Dataset Construction): The 2,000 reference Action Cards are described as 'manually created' yet the manuscript supplies no annotation protocol details—number or qualifications of clinicians, blinding, adjudication process, or inter-annotator agreement statistics. Because every reported metric (issue coverage, priority consistency, safety compliance) is computed against these cards, the absence of agreement or validation data makes it impossible to assess whether systematic annotator bias or error affects the LLM comparisons.
Authors: We agree that the current description of the annotation process is insufficient for readers to evaluate the reliability of the reference Action Cards. In the revised manuscript we will expand §3 with a dedicated subsection on the annotation protocol. This will specify: the number and clinical qualifications of the annotators (board-certified physicians across relevant specialties), the use of blinding, the step-by-step annotation guidelines, the adjudication procedure for resolving disagreements, and inter-annotator agreement statistics (Cohen's kappa for the priority, department, and time-window fields). These details reflect the actual process used to create the 2,000 cards and will enable direct assessment of potential bias or noise in the reference set.
Circularity Check
No circularity: dataset release and empirical benchmark with no derivations or self-referential predictions
Full rationale
The paper introduces a new multimodal dataset of 2,000 check-up reports paired with manually created Action Cards and evaluates LLMs on a structured generation task using custom metrics. No equations, parameter fitting, or predictive derivations appear in the provided text. Claims rest on the dataset construction and experimental results rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The work is self-contained as an empirical benchmark release.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
G. C. Araujo, C. B. Ribeiro, M. C. M. Costa, M. L. P. Evangelista, M. F. Lima, M. C. De Paula, V. L. Ferreira, and F. A. G. D. R. Araujo. Evidence-based periodic health examinations for adults: A practical guide. Cureus, 17(3):e79963, 2025. doi: 10.7759/cureus.79963.
-
[2]
Lydie Bednarczyk, Daniel Reichenpfader, Christophe Gaudet-Blavignac, Amon Kenna Ette, Jamil Zaghir, Yuanyuan Zheng, Adel Bensahla, Mina Bjelogrlic, and Christian Lovis. Scientific evidence for clinical text summarization using large language models: Scoping review. Journal of Medical Internet Research, 27:e68998, 2025. doi: 10.2196/68998.
-
[4]
Christian Bluethgen, Dave Van Veen, Daniel Truhn, Jakob Nikolas Kather, Michael Moor, Malgorzata Polacin, Akshay Chaudhari, Thomas Frauenfelder, Curtis P. Langlotz, Michael Krauthammer, and Farhad Nooralahzadeh. Agentic systems in radiology: Design, applications, evaluation, and challenges, 2025. URL https://arxiv.org/abs/2510.09404.
-
[5]
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070.
-
[6]
European Society of Radiology (ESR). ESR paper on structured reporting in radiology-update 2023. Insights into Imaging, 14(1):199, 2023. doi: 10.1186/s13244-023-01560-0.
-
[7]
US Preventive Services Task Force. Screening for hypertension in adults: US Preventive Services Task Force reaffirmation recommendation statement. JAMA, 325(16):1650–1656, 2021. doi: 10.1001/jama.2021.4987.
-
[8]
Cesar Abraham Gomez-Cabello, Srinivasagam Prabha, Syed Ali Haider, Ariana Genovese, Bernardo G. Collaco, Nadia G. Wood, Sanjay Bagaria, and Antonio Jorge Forte. Comparative evaluation of advanced chunking for retrieval-augmented generation in large language models for clinical decision support. Bioengineering, 12(11), 2025. ISSN 2306-5354. doi: 10.339...
-
[9]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams, 2020. URL https://arxiv.org/abs/2009.13081.
-
[10]
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
-
[11]
Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019.
-
[12]
Alex Z. Kadhim, Zachary Green, Iman Nazari, Jonathan Baker, Michael George, Ashley Heinson, Bhumita Vadgama, Matt Stammers, Christopher M. Kipps, R. Mark Beattie, James J. Ashton, and Sarah Ennis. Application of generative artificial intelligence to utilize unstructured clinical data for acceleration of inflammatory bowel disease research. Med, 7(1):...
-
[13]
Gerardo Lazaro. When positive is negative: Health literacy barriers to patient access to clinical laboratory test results. The Journal of Applied Laboratory Medicine, 8(6):1133–1147, 2023. doi: 10.1093/jalm/jfad045.
-
[14]
Meng Lu, Brandon Ho, Dennis Ren, and Xuan Wang. TriageAgent: Towards better multi-agents collaborations for large language model-based clinical triage. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5747–5764, Miami, Florida, USA, November 2024. Association for Computational Linguistics.
-
[15]
Mia Liza A. Lustria, Obianuju Aliche, Michael O. Killian, and Zhe He. Enhancing patient engagement and understanding: Is providing direct access to laboratory results through patient portals adequate? JAMIA Open, 8(2):ooaf009, 2025. doi: 10.1093/jamiaopen/ooaf009.
-
[16]
Lars Masanneck, Linea Schmidt, Antonia Seifert, Tristan Kölsche, Niklas Huntemann, Robin Jansen, Mohammed Mehsin, Michael Bernhard, Sven G. Meuth, Lennert Böhm, and Marc Pawlitzki. Triage performance across large language models, ChatGPT, and untrained doctors in emergency medicine: Comparative study. Journal of Medical Internet Research, 26:e53297, 2024.
-
[17]
O. Petrovskaya, A. Karpman, J. Schilling, S. Singh, L. Wegren, V. Caine, E. Kusi-Appiah, and W. Geen. Patient and health care provider perspectives on patient access to test results via web portals: Scoping review. Journal of Medical Internet Research, 25:e43765, 2023. doi: 10.2196/43765.
-
[18]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL https://arxiv.org/abs/2302.04761.
-
[19]
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan ...
-
[20]
Bryan D. Steitz, Robert W. Turer, Chen-Tan Lin, Scott MacDonald, Liz Salmi, Adam Wright, Christoph U. Lehmann, Karen Langford, Samuel A. McDonald, Thomas J. Reese, Paul Sternberg, Qingxia Chen, S. Trent Rosenbloom, and Catherine M. DesRoches. Perspectives of patients about immediate access to test results through an online patient portal. JAMA Network Open...
-
[21]
Bryan D. Steitz, Robert W. Turer, Liz Salmi, Uday Suresh, Scott MacDonald, Catherine M. DesRoches, Adam Wright, Jeremy Louissaint, and S. Trent Rosenbloom. Repeated access to patient portal while awaiting test results and patient-initiated messaging. JAMA Network Open, 8(4):e254019–e254019, 2025. doi: 10.1001/jamanetworkopen.2025.4...
-
[22]
N. W. Sterling, F. Brann, S. O. Frisch, and J. D. Schrager. Patient-readable radiology report summaries generated via large language model: Safety and quality. Journal of Patient Experience, 11, 2024. doi: 10.1177/23743735241259477.
-
[23]
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022. URL https://arxiv.org/abs/2009.01325.
-
[24]
Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang. A framework for human evaluation of large language models in healthcare derived from literature review, 2024.
-
[25]
N. E. Timbrell. The role and limitations of the reference interval within clinical chemistry and its reliability for disease detection. British Journal of Biomedical Science, 81:12339, 2024. doi: 10.3389/bjbs.2024.12339.
-
[26]
F. A. M. van der Mee, F. Schaper, J. Jansen, J. A. P. Bons, S. J. R. Meex, and J. W. L. Cals. Enhancing patient understanding of laboratory test results: Systematic review of presentation formats and their impact on perception, decision, action, and memory. Journal of Medical Internet Research, 26:e53993, 2024. doi: 10.2196/53993.
-
[27]
Dandan Wang and Shiqing Zhang. Large language models in medical and healthcare fields: Applications, advances, and challenges. Artificial Intelligence Review, 57(11):299, 2024. ISSN 1573-7462. doi: 10.1007/s10462-024-10921-0.
-
[28]
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI software developers as generalist agents, 2025.
-
[29]
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793.
-
[30]
Ziyu Yang, Santhosh Cherian, and Slobodan Vucetic. Data augmentation for radiology report simplification. In Andreas Vlachos and Isabelle Augenstein, editors, Findings of the Association for Computational Linguistics: EACL 2023, pages 1922–1932, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-eacl...
-
[31]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629.
-
[32]
Jonah Zaretsky, Jeong Min Kim, Samuel Baskharoun, Yunan Zhao, Jonathan Austrian, Yindalon Aphinyanaphongs, Ravi Gupta, Saul B. Blecker, and Jonah Feldman. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA Network Open, 7(3):e240357–e240357, 2024. doi: 10.1001/j...
-
[33]
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854.