Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making in Large Language Models

Anastasiia Molodnitskaia; Victor Lavrenko

arxiv: 2606.21121 · v1 · pith:5QNYDESOnew · submitted 2026-06-19 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-26 14:10 UTC · model grok-4.3

keywords answer engineering · local trajectory editing · protocol compliance · large language models · clinical decision making · SSNHL benchmark · reasoning trajectory · runtime intervention

The pith

Local rule-guided edits to an LLM's visible reasoning trajectory raise balanced accuracy on a clinical protocol benchmark from 42% to 80.7%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Answer Engineering as a runtime layer that intervenes locally on an LLM's step-by-step reasoning path to enforce medical protocols. Standard chain-of-thought generation actually lowered compliance on the SSNHL task, but targeted edits during autoregressive generation restored high adherence without retraining or global search. A sympathetic reader would care because many high-stakes domains require outputs that follow explicit rules the base model cannot reliably internalize from training data alone. The work shows that auditable, deterministic control at the trajectory level can close the gap between confident generation and protocol-valid decisions.

Core claim

Answer Engineering applies localized rule-guided interventions to the visible reasoning trajectory during standard autoregressive generation. On a controlled clinical benchmark for sudden sensorineural hearing loss, step-by-step reasoning shifted rather than eliminated errors, dropping SSNHL compliance from 54.5% to 25.1% while raising acceptance on the conductive contrast condition from 1.6% to 58.9%. The editing layer raised SSNHL compliance to 83.5% and conductive-case adherence to 77.9%, lifting balanced accuracy from 42.0% under reasoning-only generation to 80.7%.
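The balanced-accuracy figures can be checked by hand. The sketch below assumes that balanced accuracy here is the unweighted mean of the SSNHL and conductive per-condition rates; this definition is not stated in the text above but reproduces both reported values. The function name and the dictionary layout are illustrative only.

```python
def balanced_accuracy(ssnhl_rate: float, conductive_rate: float) -> float:
    """Unweighted mean of the two per-condition rates, in percent.

    Assumption: this is the definition behind the reported numbers.
    """
    return (ssnhl_rate + conductive_rate) / 2


# (SSNHL compliance %, conductive-case rate %) as quoted in the core claim.
reported = {
    "reasoning_only": (25.1, 58.9),
    "trajectory_editing": (83.5, 77.9),
}

for name, (ssnhl, conductive) in reported.items():
    print(f"{name}: {balanced_accuracy(ssnhl, conductive):.1f}%")
```

Under this assumption, reasoning-only generation gives (25.1 + 58.9) / 2 = 42.0 and trajectory editing gives (83.5 + 77.9) / 2 = 80.7, matching the reported figures.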

What carries the argument

Answer Engineering, a deterministic runtime and authoring layer that applies localized rule-guided interventions to the visible reasoning trajectory during autoregressive generation.
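A minimal sketch of what such a runtime layer might look like, under stated assumptions. The `Rule` class, the `run_controlled` function, the toy transition table, and the `skip_check` / `run_check` tokens are all hypothetical illustrations, not the authors' implementation or rule language. A real controller would wrap an actual language model's decoding step rather than a lookup table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Rule:
    """Hypothetical rule: a trigger over the trajectory plus a local edit."""
    name: str
    trigger: Callable[[Sequence[str]], bool]
    edit: Callable[[List[str]], List[str]]


def run_controlled(
    next_token: Callable[[Sequence[str]], str],
    rules: Sequence[Rule],
    max_tokens: int = 50,
) -> Tuple[List[str], List[str]]:
    """Decode token by token; when a rule fires, edit the prefix and resume."""
    trajectory: List[str] = []
    audit_log: List[str] = []  # names of rules that fired, in order
    while len(trajectory) < max_tokens:
        token = next_token(trajectory)
        if token == "<eos>":
            break
        trajectory.append(token)
        for rule in rules:
            if rule.trigger(trajectory):
                trajectory = rule.edit(trajectory)
                audit_log.append(rule.name)
                break  # apply at most one rule per decoding step
    return trajectory, audit_log


# Toy stand-in for a language model: the next token depends on the last one.
TOY_TRANSITIONS: Dict[str, str] = {
    "": "assess",
    "assess": "skip_check",   # the protocol-invalid step the toy model prefers
    "skip_check": "decide",
    "run_check": "decide",
    "decide": "<eos>",
}


def toy_next_token(prefix: Sequence[str]) -> str:
    last = prefix[-1] if prefix else ""
    return TOY_TRANSITIONS[last]


# Hypothetical rule: replace the invalid step as soon as it appears.
require_check = Rule(
    name="require_check",
    trigger=lambda traj: traj[-1] == "skip_check",
    edit=lambda traj: traj[:-1] + ["run_check"],
)

trajectory, audit_log = run_controlled(toy_next_token, [require_check])
print(trajectory)
print(audit_log)
```

The audit log is the part that makes the control deterministic and inspectable in this sketch: every change to the trajectory is attributable to a named rule.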

If this is right

Protocol adherence can be improved through auditable runtime control of reasoning trajectories rather than model retraining.
Step-by-step reasoning can shift errors rather than eliminate them in protocol-constrained domains.
Limitations in the approach stem from rule coverage, trigger reliability, and persistent diagnosis-first generation dynamics.
The method leaves diagnosis-first biases intact while correcting downstream management steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local-editing pattern could be applied to other rule-heavy domains such as legal document drafting or financial compliance checks.
Human-authored rule sets may prove more maintainable than fine-tuning when protocols change frequently.
Persistent model tendencies like early diagnosis suggest that trajectory control may need to operate at multiple points along the generation path.

Load-bearing premise

Rule-guided local interventions can be authored and triggered reliably enough to cover the relevant protocol constraints without introducing new inconsistencies or missing cases.

What would settle it

Running the same editing rules on a fresh set of clinical protocols whose constraints overlap or require more context than the current authoring interface supports, then measuring whether net compliance falls below the no-editing baseline.

Figures

Figures reproduced from arXiv: 2606.21121 by Anastasiia Molodnitskaia, Victor Lavrenko.

**Figure 1.** Retroactive span editing. A triggered local span is replaced with the highest-scoring protocol-valid candidate, then decoding resumes from the rebuilt prefix. [Diagram labels: recently generated trajectory x1 … x8; guard scope; rollback to edit scope start; trigger; generated candidate continuations, one with score s1 marked invalid and one with score s2 marked valid.]

**Figure 2.** Local rollback and continuation probing.

**Figure 3.** Forced future insertion. When a rule requires a mandatory future statement, the controller appends the highest-scoring valid candidate and continues decoding from the enforced prefix. [Adjacent source text, Section 5, "Runtime Framework for Answer Engineering": Answer Engineering is implemented as a decoding-time runtime controller that operates alongside standard autoregressive language model inference. The method does not modify model pa…]

**Figure 4.** Conceptual runtime control loop. Autoregressive decoding produces tokens while the runtime monitors the trajectory for rule triggers. When a rule fires, candidate trajectory edits are generated (possibly via beam-style probing), evaluated under the model likelihood, and the selected intervention is applied before decoding continues from the modified prefix.

**Figure 5.** Two representative trajectories. Left: trajectory editing revises an SSNHL case by enforcing protocol-consistent interpretation of tuning fork findings, leading to protocol-consistent management. Right: in conductive hearing loss cases, trajectory editing can also improve reasoning fidelity to the stem, preserving the conductive diagnosis while reducing protocol-inconsistent detours. [Adjacent table header, truncated: SSNHL, Conductive, Balan…]

**Figure 6.** Comparison between explicit decision-tree logic and local trajectory constraint. Expert systems encode the diagnostic path directly, whereas Answer Engineering only removes invalid local continuations and leaves diagnosis generation to the language model. [Adjacent table, truncated. Columns: Failure pattern; Typical incorrect continuation; Intervention. First row: Contralateral Weber misinterpretation; Weber lateralizes to the opposite ear, followed by a c…]
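The two edit operations named in the Figure 1 and Figure 3 captions can be sketched as follows. This is an illustration under assumptions, not the paper's code: the function names, the `(tokens, score)` candidate format, the placeholder tokens, and the choice to leave the trajectory unchanged when no candidate is valid are all hypothetical. Scores stand in for a model log-likelihood.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Candidate = Tuple[List[str], float]  # (tokens, score such as a log-likelihood)


def best_valid(
    candidates: Sequence[Candidate],
    is_valid: Callable[[Sequence[str]], bool],
) -> Optional[List[str]]:
    """Return the highest-scoring candidate that passes the validity check."""
    valid = [(tokens, score) for tokens, score in candidates if is_valid(tokens)]
    if not valid:
        return None
    return max(valid, key=lambda item: item[1])[0]


def retroactive_span_edit(
    trajectory: Sequence[str],
    edit_start: int,
    candidates: Sequence[Candidate],
    is_valid: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Roll back to edit_start and splice in the best valid candidate."""
    replacement = best_valid(candidates, is_valid)
    if replacement is None:
        return list(trajectory)  # assumption: no valid candidate, no edit
    return list(trajectory[:edit_start]) + list(replacement)


def forced_future_insertion(
    trajectory: Sequence[str],
    candidates: Sequence[Candidate],
    is_valid: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Append the best valid candidate; decoding would resume after it."""
    addition = best_valid(candidates, is_valid)
    if addition is None:
        return list(trajectory)
    return list(trajectory) + list(addition)


# Placeholder tokens only; "bad" marks a protocol-invalid continuation.
generated = ["x1", "x2", "x3", "x4", "x5"]
probes: List[Candidate] = [
    (["bad", "g1"], -1.0),       # highest score, but invalid
    (["g2", "g3", "g4"], -2.5),  # best valid candidate
    (["g5"], -4.0),              # valid, lower score
]


def no_bad_token(tokens: Sequence[str]) -> bool:
    return "bad" not in tokens


edited = retroactive_span_edit(generated, 3, probes, no_bad_token)
extended = forced_future_insertion(generated, probes, no_bad_token)
print(edited)
print(extended)
```

In this toy run the top-scoring candidate is rejected because it fails the validity check, so the second candidate is spliced in after `x3`. This mirrors the "score s1, invalid" and "score s2, valid" labels recoverable from Figure 1.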

Original abstract

Large language models can produce confident but protocol-invalid answers in domains where procedural compliance is critical. This paper presents Answer Engineering, a deterministic runtime and authoring layer that applies localized rule-guided interventions to the visible reasoning trajectory during standard autoregressive generation, without retraining, modifying model weights, or performing global search. The method is evaluated on a controlled clinical benchmark for sudden sensorineural hearing loss (SSNHL), where correct management depends on protocol-consistent interpretation of symptom timing, Weber/Rinne tuning-fork findings, and otoscopic findings. In the benchmark, step-by-step reasoning shifted rather than eliminated errors: compliant outcomes for SSNHL decreased from 54.5% under unguided generation to 25.1%, while acceptance on the conductive contrast condition increased from 1.6% to 58.9%. Local trajectory editing increased SSNHL compliance to 83.5% and conductive-case adherence to 77.9%, raising balanced accuracy from 42.0% under reasoning-only generation to 80.7%. The results support a systems-level view in which protocol adherence can be improved through auditable runtime control of reasoning trajectories, while also identifying limitations caused by rule coverage, trigger reliability, and persistent diagnosis-first generation dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Answer Engineering shows a workable runtime edit for LLM protocol compliance on one clinical benchmark, but the writeup gives almost no methods or verification details.

The core contribution here is a deterministic, local editing layer that intervenes on the visible reasoning trace during ordinary generation to enforce domain rules. On the SSNHL benchmark it moves balanced accuracy from 42% (reasoning-only) to 80.7%, with SSNHL compliance at 83.5% and conductive-case adherence at 77.9%. That is a clear empirical lift over both unguided and reasoning baselines, and the idea of targeted trajectory edits without retraining or global search is distinct from the approaches cited.

What works is the demonstration that chain-of-thought can actively hurt protocol adherence on this task and that a lightweight authoring layer can recover most of the gap. The systems-level framing, that protocol control can be handled at runtime, is reasonable and worth testing.

The soft spots are exactly where the stress-test note flags them. The abstract itself lists rule coverage, trigger reliability, and diagnosis-first dynamics as limitations, yet supplies no counts of rules written, no coverage metrics, no trigger error rates, and no ablation on rule completeness. Dataset construction, prompt templates, and verification steps are also absent, so the numerical claims cannot be assessed for robustness or generality. Without those pieces it is impossible to tell whether the gains come from a broad mechanism or from careful patching of a narrow error set.

This paper is for groups working on runtime safeguards for LLMs in regulated settings. It is coherent on its own terms and shows honest engagement with the practical problem, so it deserves a serious referee who will examine the methods and any additional experiments. I would send it to review rather than desk-reject.

Referee Report

1 major / 0 minor

Summary. The paper introduces Answer Engineering, a deterministic runtime layer for local rule-guided editing of LLM reasoning trajectories during autoregressive generation to enforce protocol compliance without retraining or global search. On a controlled clinical benchmark for sudden sensorineural hearing loss (SSNHL) involving symptom timing, Weber/Rinne findings, and otoscopic interpretation, it reports that reasoning-only generation reduces SSNHL compliance to 25.1% (from 54.5% unguided) while increasing conductive-case acceptance to 58.9%; local editing then raises SSNHL compliance to 83.5%, conductive adherence to 77.9%, and balanced accuracy from 42.0% to 80.7%. The work frames this as a systems-level approach to auditable protocol adherence while noting limitations in rule coverage, trigger reliability, and diagnosis-first dynamics.

Significance. If the results hold, the contribution lies in demonstrating a practical, weight-agnostic method for runtime trajectory control that yields substantial lifts on a domain-specific benchmark with clear baseline comparisons. The explicit identification of limitations and focus on deterministic, auditable interventions provide a concrete starting point for protocol-constrained applications in medicine and similar fields. The empirical numbers on a controlled task offer falsifiable predictions that can be stress-tested in follow-up work.

major comments (1)

[Abstract] The central empirical claims (SSNHL compliance rising to 83.5%, balanced accuracy to 80.7%) rest on the authored rules covering protocol elements such as symptom timing, Weber/Rinne, and otoscopic findings without missing cases or introducing inconsistencies. However, no counts of rules, coverage analysis, trigger false-positive/negative rates, or ablation on rule completeness are supplied, so the gains may reflect patching of a narrow error distribution rather than a general solution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the recommendation for major revision. The concern about transparency in the rule set is well-taken and will be addressed by expanding the manuscript with the requested quantitative details on rule authoring and coverage.

Referee: [Abstract] The central empirical claims (SSNHL compliance rising to 83.5%, balanced accuracy to 80.7%) rest on the authored rules covering protocol elements such as symptom timing, Weber/Rinne, and otoscopic findings without missing cases or introducing inconsistencies. However, no counts of rules, coverage analysis, trigger false-positive/negative rates, or ablation on rule completeness are supplied, so the gains may reflect patching of a narrow error distribution rather than a general solution.

Authors: We agree that the manuscript would benefit from explicit quantification of the rule set to support the reported gains. The current version emphasizes the local-editing mechanism and the controlled benchmark results while noting limitations in rule coverage and trigger reliability; it does not include rule counts, coverage tables, trigger error rates, or ablations. In the revision we will add: (i) the total number of authored rules and their breakdown by protocol element (symptom timing, Weber/Rinne, otoscopy), (ii) a coverage matrix indicating which protocol requirements are addressed and any identified gaps, (iii) observed trigger activation statistics including false-positive and false-negative rates measured on the benchmark traces, and (iv) a short discussion of rule completeness that stops short of a full ablation study. These additions will make clear that the interventions target the specific error modes documented in the reasoning-only baseline rather than constituting narrow, post-hoc patches. Because the rules are deterministic and human-authored, the requested statistics can be supplied without new experiments or changes to the core method.

Revision: yes
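One way the promised trigger statistics might be tabulated is sketched below. It assumes each benchmark trace position has been annotated with whether a trigger fired and whether it should have fired. The function name, the observation format, and the toy data are hypothetical; none of the numbers come from the paper.

```python
from typing import Dict, Iterable, Tuple


def trigger_error_rates(observations: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute false-positive and false-negative rates for a rule trigger.

    Each observation is (fired, should_fire).
    """
    tp = fp = tn = fn = 0
    for fired, should_fire in observations:
        if fired and should_fire:
            tp += 1
        elif fired and not should_fire:
            fp += 1
        elif not fired and should_fire:
            fn += 1
        else:
            tn += 1
    # Guard against empty denominators when a class is absent from the data.
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}


# Made-up annotations for illustration only.
toy_observations = [
    (True, True), (True, True), (False, True), (False, True),
    (True, False), (False, False), (False, False), (False, False),
]
rates = trigger_error_rates(toy_observations)
print(rates)
```

Reporting both rates matters because they fail differently: a false positive edits a trajectory that was already valid, while a false negative lets a protocol-invalid continuation through unedited.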

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no definitional or fitted reductions.

The paper reports measured compliance rates (e.g., 83.5% SSNHL, 80.7% balanced accuracy) from direct application of rule-guided edits on a fixed clinical benchmark. No equations, parameters fitted to the target metrics, self-citations used as load-bearing uniqueness theorems, or renamings of known results appear in the provided text. The derivation chain consists of an external benchmark comparison rather than any self-referential construction. Limitations on rule coverage are noted but do not create circularity in the reported outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the proposed method itself is the main addition but lacks supporting ledger details.


work page doi:10.18653/v1/2025.findings-emnlp.196 2025

[16] [16]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[17] [17]

International Conference on Learning Representations (ICLR) , year =

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author =. International Conference on Learning Representations (ICLR) , year =

[18] [18]

The Lancet , volume =

Sudden Sensorineural Hearing Loss , author =. The Lancet , volume =. 2009 , doi =

2009

[19] [19]

ACM Computing Surveys , year=

A Survey of Reasoning with Large Language Models , author=. ACM Computing Surveys , year=

[20] [20]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =

The Impact of Format Restrictions on Performance of Large Language Models , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =

2024

[21] [21]

Proceedings of the Conference on Language Modeling (COLM) , year =

Automata-based Constraints for Language Model Decoding , author =. Proceedings of the Conference on Language Modeling (COLM) , year =

[22] [22]

arXiv preprint arXiv:2408.12599 , year =

Controllable Text Generation for Large Language Models: A Survey , author =. arXiv preprint arXiv:2408.12599 , year =

arXiv

[23] [23]

Advances in Neural Information Processing Systems , year =

Attention Is All You Need , author =. Advances in Neural Information Processing Systems , year =

[24] [24]

A Survey on Large Language Model Acceleration based on

Li, Haoyang and Li, Yiming and Tian, Anxin and Tang, Tianhao and Xu, Zhanchao and Chen, Xuejia and Hu, Nicole and Dong, Wei and Li, Qing and Chen, Lei , journal =. A Survey on Large Language Model Acceleration based on. 2025 , url =

2025

[25] [25]

and Salakhutdinov, Ruslan , booktitle =

Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc V. and Salakhutdinov, Ruslan , booktitle =. Transformer-. 2019 , pages =

2019

[26] [26]

arXiv preprint arXiv:2207.06881 , year =

Recurrent Memory Transformer , author =. arXiv preprint arXiv:2207.06881 , year =

arXiv

[27] [27]

and Stoica, Ion and Gonzalez, Joseph E

Packer, Charles and Wooders, Sarah and Lin, Kevin and Fang, Vivian and Patil, Shishir G. and Stoica, Ion and Gonzalez, Joseph E. , journal =. 2023 , url =

2023

[28] [28]

arXiv preprint arXiv:2405.17935 , year =

Tool Learning with Large Language Models: A Survey , author =. arXiv preprint arXiv:2405.17935 , year =

arXiv

[29] [29]

arXiv preprint arXiv:2305.04388 , year =

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting , author =. arXiv preprint arXiv:2305.04388 , year =

Pith/arXiv arXiv

[30] [30]

2025 , howpublished =

Reasoning Models Don't Always Say What They Think , author =. 2025 , howpublished =

2025

[31] [31]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year =

Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year =

2025

[32] [32]

International Conference on Learning Representations (ICLR) , year =

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing , author =. International Conference on Learning Representations (ICLR) , year =

[33] [33]

International Conference on Learning Representations (ICLR) , year =

SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning , author =. International Conference on Learning Representations (ICLR) , year =

[34] [34]

Expert Systems with Applications , volume =

Measuring the Complexity of Rule-Based Expert Systems , author =. Expert Systems with Applications , volume =. 1994 , doi =

1994

[35] [35]

Decision Support Systems , volume =

An Approach to Improving the Maintainability of Existing Rule Bases , author =. Decision Support Systems , volume =. 1996 , doi =

1996

[36] [36]

Proceedings of the 2022 ACM Southeast Conference , year =

From Past to Present: A Comprehensive Technical Review of Rule-Based Expert Systems from 1980--2021 , author =. Proceedings of the 2022 ACM Southeast Conference , year =. doi:10.1145/3476883.3520211 , url =

work page doi:10.1145/3476883.3520211 1980

[37] [37]

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics , year =

Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search , author =. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics , year =

[38] [38]

arXiv preprint arXiv:2309.15071 , year =

A Survey of Constrained Text Generation for Large Language Models , author =. arXiv preprint arXiv:2309.15071 , year =

arXiv

[39] [39]

Otolaryngology--Head and Neck Surgery , volume =

Clinical Practice Guideline: Sudden Hearing Loss (Update) , author =. Otolaryngology--Head and Neck Surgery , volume =. 2019 , month = aug, doi =

2019

[40] [40]

Nature , volume =

Large Language Models Encode Clinical Knowledge , author =. Nature , volume =. 2023 , month = aug, doi =

2023

[41] [41]

Applied Sciences , volume =

What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams , author =. Applied Sciences , volume =. 2021 , month = jul, doi =

2021

[42] [42]

arXiv preprint arXiv:2303.17651 , year =

Self-Refine: Iterative Refinement with Self-Feedback , author =. arXiv preprint arXiv:2303.17651 , year =

Pith/arXiv arXiv

[43] [43]

arXiv preprint arXiv:2212.08073 , year =

Constitutional AI: Harmlessness from AI Feedback , author =. arXiv preprint arXiv:2212.08073 , year =

Pith/arXiv arXiv

[44] [44]

arXiv preprint arXiv:2308.11462 , year =

LegalBench: A Collaboratively Built Benchmark for Legal Reasoning , author =. arXiv preprint arXiv:2308.11462 , year =

arXiv

[45] [45]

arXiv preprint arXiv:2302.04761 , year =

Toolformer: Language Models Can Teach Themselves to Use Tools , author =. arXiv preprint arXiv:2302.04761 , year =

Pith/arXiv arXiv

[46] [46]

2025 , howpublished =

Answer Engineering Architecture Overview , author =. 2025 , howpublished =

2025

[47] [47]

2025 , howpublished =

Answer Engineering Language Developer Specification v0.1 , author =. 2025 , howpublished =

2025

[48] [48]

2025 , howpublished =

Answer Engineering Extension Points , author =. 2025 , howpublished =

2025

[49] [49]

2021 , journal =

Training Verifiers to Solve Math Word Problems , author =. 2021 , journal =. 2110.14168 , archivePrefix=

Pith/arXiv arXiv 2021

[50] [50]

2021 , eprint =

Scratchpads for Intermediate Computation with Language Models , author =. 2021 , eprint =

2021

[51] [51]

2012 , institution =

International Standards on Combating Money Laundering and the Financing of Terrorism & Proliferation , author =. 2012 , institution =

2012

[52] [52]

2016 , institution =

Recommended Practices for Safety and Health Programs , author =. 2016 , institution =

2016

[53] [53]

OpenMeditron/Meditron3-8B , year =

[54] [54]

and Chandrasekhar, Sujana S

Stachler, Robert J. and Chandrasekhar, Sujana S. and Archer, Sharon M. and Rosenfeld, Richard M. and Schwartz, Seth R. and Barrs, David M. and Brown, Stephen R. and Fife, Terry D. and Ford, Paula and Ganiats, Theodore G. and Hollingsworth, David B. and Lewandowski, Cary A. and Montano, Joseph J. and Saunders, Joseph E. and Tucci, Debara L. and Valente, Mi...

work page doi:10.1177/0194599812436449 2012

[55] [55]

Nature Medicine , volume =

Evaluation and Mitigation of the Limitations of Large Language Models in Clinical Decision-Making , author =. Nature Medicine , volume =. 2024 , doi =

2024

[56] [56]

npj Digital Medicine , volume =

Autonomous Medical Evaluation for Guideline Adherence of Large Language Models , author =. npj Digital Medicine , volume =. 2024 , doi =

2024

[57] [57]

JMIRx Med , volume =

Assessing the Limitations of Large Language Models in Clinical Practice Guideline-Concordant Treatment Decision-Making on Real-World Data: Retrospective Study , author =. JMIRx Med , volume =. 2025 , month = nov, doi =

2025

[58] [58]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =. 2025 , address =. doi:10.18653/v1/2025.emnlp-main.1242 , url =

work page doi:10.18653/v1/2025.emnlp-main.1242 2025

[59] [59]

OpenReview , year =

Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once , author =. OpenReview , year =

[60] [60]

Findings of the Association for Computational Linguistics: EMNLP 2024 , year =

Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning , author =. Findings of the Association for Computational Linguistics: EMNLP 2024 , year =

2024

[61] [61]

Findings of the Association for Computational Linguistics: EMNLP 2025 , year =

Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards , author =. Findings of the Association for Computational Linguistics: EMNLP 2025 , year =

2025

[62] [62]

2023 , eprint =

Let’s Verify Step by Step , author =. 2023 , eprint =

2023

[63] [63]

2019 , url =

Sudden Hearing Loss Guideline Issue , organization =. 2019 , url =

2019

[64] [64]

and Flaherty, Alexander and Zhang, Jason A

Leung, Michael A. and Flaherty, Alexander and Zhang, Jason A. and Hara, Joseph , title =. Canadian Family Physician , volume =. 2016 , url =

2016

[65] [65]

Hearing Loss in Adults: Quality Standard 2 -- Sudden Onset of Hearing Loss , year =

[66] [66]

Hearing Loss in Adults: Assessment and Management , year =

[67] [67]

Plausibility: On the (Un)Reliability of Explanations from Large Language Models , author =

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models , author =. arXiv preprint arXiv:2402.04614 , year =

arXiv

[68] [68]

arXiv preprint , year =

STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning , author =. arXiv preprint , year =. 2203.14465 , eprinttype =

arXiv

[69] [69]

Proceedings of the ACM Symposium on Operating Systems Principles , year =

Efficient Memory Management for Large Language Model Serving with PagedAttention , author =. Proceedings of the ACM Symposium on Operating Systems Principles , year =

[70] [70]

arXiv preprint arXiv:2312.07104 , year =

SGLang: Efficient Execution of Structured Language Model Programs , author =. arXiv preprint arXiv:2312.07104 , year =

Pith/arXiv arXiv

[71] [71]

arXiv preprint arXiv:2307.11760 , year =

Guidance: A Language for Controlling Large Language Models , author =. arXiv preprint arXiv:2307.11760 , year =

arXiv

[72] [72]

arXiv preprint arXiv:2308.12372 , year =

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications , author =. arXiv preprint arXiv:2308.12372 , year =

arXiv

[73] [73]

Proceedings of the ACM on Programming Languages , year =

LMQL: A Programming Language for Large Language Models , author =. Proceedings of the ACM on Programming Languages , year =

[74] [74]

Proceedings of the AAAI Conference on Artificial Intelligence , year =

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models , author =. Proceedings of the AAAI Conference on Artificial Intelligence , year =

[75] [75]

International Conference on Learning Representations (ICLR) , year=

Improving Instruction-Following in Language Models through Activation Steering , author=. International Conference on Learning Representations (ICLR) , year=

[76] [76]

arXiv preprint arXiv:2410.12877 , year=

Improving Instruction-Following in Language Models through Activation Steering , author=. arXiv preprint arXiv:2410.12877 , year=

arXiv