Pith · machine review for the scientific record

arxiv: 2605.01474 · v1 · submitted 2026-05-02 · 💻 cs.CL


ReMedi: Reasoner for Medical Clinical Prediction


Pith reviewed 2026-05-09 14:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords electronic health records · clinical outcome prediction · large language models · reasoning enhancement · fine-tuning · preference tuning · medical AI

The pith

ReMedi improves prediction of clinical outcomes from electronic health records by training language models on rationale-answer pairs regenerated with ground-truth hints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes ReMedi as a way to strengthen how large language models predict future medical events from patient records. It does so by creating training examples that include both reasoning steps and answers, regenerated in a way that uses the correct outcome as a hint for difficult cases. The model is then fine-tuned and aligned using these examples. A sympathetic reader would care because better predictions could support earlier interventions in healthcare. The reported experiments show gains of up to 19.9 percent in F1 score over prior methods on several tasks.

Core claim

ReMedi generates rationale-answer pairs using a challenging sample regeneration mechanism for complex clinical questions, which leverages ground-truth answers as hints to enhance reasoning for further fine-tuning and preference tuning. ReMedi integrates ground-truth outcome guidance into the preference data construction loop, regenerating rationale-answer variants. By tuning on these rationale-answer pairs, the model improves its predictive performance.

What carries the argument

The challenging sample regeneration mechanism that creates rationale-answer pairs by using ground-truth outcomes as hints during regeneration for complex cases.
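The regeneration step can be sketched as a short loop. This is an illustrative reading, not the paper's implementation: the `generate` callable stands in for the LLM sampling call, and the retry criterion and field names are assumptions.

```python
def regenerate_pairs(examples, generate, max_attempts=2):
    """Build rationale-answer training pairs with hint-based regeneration.

    For each training example, first sample a rationale and answer with
    no hint. Only when the sampled answer misses the ground truth (a
    "challenging" case) is the example regenerated with the ground-truth
    outcome passed to the generator as a hint.
    """
    pairs = []
    for ex in examples:
        rationale, answer = generate(ex["ehr"], hint=None)
        attempts = 0
        # Challenging case: retry with the ground truth as a hint.
        while answer != ex["label"] and attempts < max_attempts:
            rationale, answer = generate(ex["ehr"], hint=ex["label"])
            attempts += 1
        if answer == ex["label"]:
            # The stored pair contains no hint field: the hint shapes
            # the rationale offline but never enters the training input.
            pairs.append({"ehr": ex["ehr"], "rationale": rationale,
                          "answer": answer})
    return pairs
```

Note the design point this sketch makes concrete: the hint is consumed only during offline data construction, so whether anything leaks into what the model sees at inference time depends entirely on the stored pair schema.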

If this is right

  • The approach yields substantial performance gains, reaching up to 19.9 percent higher F1 scores on EHR prediction tasks.
  • It applies across multiple different clinical outcome prediction tasks from electronic health records.
  • Preference tuning on the generated pairs helps the model better interpret contextual patient information.
  • Overall effectiveness is shown in real-world clinical prediction scenarios.
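The preference-tuning step could plausibly consume the regenerated variants as chosen/rejected pairs. A minimal sketch, assuming a candidate format of `(rationale, answer, is_correct)` tuples that is not specified in the source:

```python
def build_preference_data(samples):
    """Pair correct rationales with incorrect ones for preference tuning.

    `samples` maps an EHR prompt to the list of (rationale, answer,
    is_correct) candidates produced during generation and regeneration.
    Each correct candidate is paired with an incorrect candidate from
    the same prompt to form one chosen/rejected preference record.
    """
    records = []
    for prompt, candidates in samples.items():
        chosen = [c for c in candidates if c[2]]
        rejected = [c for c in candidates if not c[2]]
        for good, bad in zip(chosen, rejected):
            records.append({
                "prompt": prompt,
                "chosen": good[0],   # rationale leading to the right answer
                "rejected": bad[0],  # rationale leading to the wrong answer
            })
    return records
```

Records in this prompt/chosen/rejected shape are what DPO-style trainers typically expect; the pairing policy here (first correct with first incorrect) is an illustrative choice.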

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hint-based regeneration could be tested in non-medical prediction tasks to see if reasoning improves without domain-specific knowledge.
  • Checking performance on data where hints are not used at all during training would clarify if the gains come from true reasoning or from exposure to answers.
  • The method might enable more efficient use of limited medical datasets by focusing on reasoning enhancement rather than knowledge addition.

Load-bearing premise

Regenerating rationale-answer pairs with ground-truth hints genuinely improves the model's reasoning capability rather than introducing bias, leakage, or overfitting to the provided hints.

What would settle it

Evaluating the fine-tuned model on a completely new set of electronic health records where no ground-truth answers are available at any stage, to check whether the F1 score improvements remain.
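The check itself is mechanically simple: score hint-free predictions on a held-out split with binary F1. The metric below is standard; the prediction interface feeding it is assumed.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 from parallel label lists, no external dependencies."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred)
             if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

If the fine-tuned model's F1 on such a split, computed from predictions made with no ground-truth access at any stage, stays close to the reported numbers, the leakage concern largely dissolves.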

Figures

Figures reproduced from arXiv: 2605.01474 by Hongchao Jiang, Hung-yi Lee, Robby T. Tan, Yiming Chen, Yushi Cao.

Figure 1. Overview of ReMedi, which operates iteratively across three stages: (1) Sample Generation, (2) Challenging Sample Re-Generation, and (3) Model Training. The dotted orange line represents the data processing pipeline for DPO, while the solid blue line denotes the pipeline for SFT.
Original abstract

Predicting future clinical outcomes from electronic health records (EHR) remains challenging due to the complexity and heterogeneity of patient data. LLMs have shown strong potential for such predictive tasks, yet existing approaches mainly focus on enhancing medical knowledge through distillation or RAG while relying on the model's internal ability to interpret contextual information. In this work, we present ReMedi (Reasoner for Medical Clinical Prediction), a framework for improving clinical outcome prediction from EHR. ReMedi generates rationale-answer pairs using a challenging sample regeneration mechanism for complex clinical questions, which leverages ground-truth answers as hints to enhance reasoning for further fine-tuning and preference tuning. ReMedi integrates ground-truth outcome guidance into the preference data construction loop, regenerating rationale-answer variants. By tuning on these rationale-answer pairs, the model improves its predictive performance. Experiments on multiple EHR prediction tasks demonstrate substantial gains of up to 19.9 percent over state-of-the-art baselines in terms of F1 score, underscoring ReMedi's effectiveness in real-world clinical prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReMedi, a framework for clinical outcome prediction from EHR data using LLMs. It proposes generating rationale-answer pairs via a challenging sample regeneration mechanism that incorporates ground-truth answers as hints, followed by fine-tuning and preference tuning on these pairs. Experiments on multiple EHR tasks report gains of up to 19.9% F1 over SOTA baselines, attributing improvements to enhanced reasoning.

Significance. If the reported gains can be shown to stem from genuine reasoning improvements rather than label exposure, the work would offer a practical approach to boosting LLM performance on heterogeneous medical prediction tasks. The integration of ground-truth guidance into preference data construction is a distinctive element that, if validated, could inform future methods for handling complex clinical reasoning.

major comments (2)
  1. [Abstract] The abstract and the description of the challenging sample regeneration mechanism make clear that the method explicitly leverages ground-truth answers as hints to regenerate rationale-answer pairs for both fine-tuning and preference tuning. This embeds correct clinical outcomes into the training data, creating a risk of label leakage, since such outcomes are unavailable at inference time. No controls (e.g., hint-free regeneration, label-free validation splits, or an ablation removing the hints) are described to isolate whether gains arise from reasoning or from direct exposure to labels, directly undermining the central claim of up to 19.9% F1 improvement over baselines.
  2. [Experiments] Experiments section: the headline performance claim rests on the assumption that regenerated pairs improve reasoning capability from EHR context alone. Without reporting results from a control condition that regenerates pairs without ground-truth hints, or providing error analysis showing that predictions on held-out data do not benefit from leaked information, the 19.9% F1 delta cannot be confidently attributed to the proposed reasoning enhancement rather than data contamination.
minor comments (2)
  1. [Abstract] The abstract provides no summary of datasets, number of tasks, baseline methods, or statistical significance testing; these details should be added to the abstract or a dedicated experimental summary paragraph for clarity.
  2. [Method] Notation for the regeneration mechanism and preference tuning loop is introduced without a clear algorithmic pseudocode or diagram; adding one would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of distinguishing between reasoning improvements and potential label leakage in our ReMedi framework. We address each major comment below and commit to revisions that include additional controls and clarifications to strengthen the claims.

Point-by-point responses
  1. Referee: [Abstract] The abstract and the description of the challenging sample regeneration mechanism make clear that the method explicitly leverages ground-truth answers as hints to regenerate rationale-answer pairs for both fine-tuning and preference tuning. This embeds correct clinical outcomes into the training data, creating a risk of label leakage, since such outcomes are unavailable at inference time. No controls (e.g., hint-free regeneration, label-free validation splits, or an ablation removing the hints) are described to isolate whether gains arise from reasoning or from direct exposure to labels, directly undermining the central claim of up to 19.9% F1 improvement over baselines.

    Authors: The ground-truth answers serve as hints exclusively during the offline regeneration of rationale-answer pairs from the training set. This process aims to produce higher-quality rationales that better explain the clinical outcomes based on the EHR data. The resulting pairs are used for fine-tuning and preference tuning, where the model learns to generate appropriate rationales and predictions from the EHR context and question alone. At inference time, the model operates without any ground-truth hints, relying solely on the input EHR data. We note that this setup follows standard supervised fine-tuning practices for prediction tasks, where labels guide training but are absent at test. However, to rigorously isolate the effect of the hints, we will include an ablation study in the revised manuscript that compares hint-guided regeneration against hint-free regeneration. We will also add error analysis to demonstrate that improvements are not due to leaked information on held-out test data. revision: yes

  2. Referee: [Experiments] Experiments section: the headline performance claim rests on the assumption that regenerated pairs improve reasoning capability from EHR context alone. Without reporting results from a control condition that regenerates pairs without ground-truth hints, or providing error analysis showing that predictions on held-out data do not benefit from leaked information, the 19.9% F1 delta cannot be confidently attributed to the proposed reasoning enhancement rather than data contamination.

    Authors: We agree that additional controls would strengthen the attribution of gains to reasoning improvements. In the revised version, we will report results from a control experiment where rationale-answer pairs are regenerated without using ground-truth hints. Furthermore, we will provide a detailed error analysis on the held-out test sets to show that the model's predictions do not rely on any form of data contamination or label leakage. This will help confirm that the observed F1 improvements, up to 19.9%, arise from the enhanced reasoning capabilities fostered by ReMedi. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical method

Full rationale

The paper describes an empirical framework that regenerates rationale-answer pairs for LLM fine-tuning and preference tuning on EHR tasks by using ground-truth answers as hints inside a challenging-sample mechanism. No mathematical derivations, equations, or self-referential constructions are present in the provided text. The reported performance gains (up to 19.9% F1) are presented as experimental outcomes on multiple prediction tasks rather than quantities that reduce by construction to the training inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work appear. Potential concerns about label leakage during data construction affect validity and generalization but do not create a circular derivation chain; the central claim remains an externally measurable empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard LLM fine-tuning and preference tuning practices whose details are not specified here.

pith-pipeline@v0.9.0 · 5479 in / 1083 out tokens · 68452 ms · 2026-05-09T14:16:31.617337+00:00 · methodology


Reference graph

Works this paper leans on

300 extracted references · 37 canonical work pages · 11 internal anchors
