Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

Haizhou Xia

arXiv: 2605.24613 · v1 · pith:GVLMMHVDnew · submitted 2026-05-23 · cs.CL · cs.AI · cs.SE

Pith reviewed 2026-06-30 13:35 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.SE

keywords: post-hoc repair, LLM mathematical reasoning, selective replacement, verification guards, harm-aware repair, GSM8K, ASDiv, answer verification

The pith

GuardedRepair replaces a cached LLM math reasoning trace only when deterministic verification guards support the swap, raising GSM8K accuracy from 95.60% to 96.89% with no measured broken-correct cases in the main run.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the asymmetric risk in post-hoc repair of LLM mathematical reasoning, where fixing an incorrect trace helps but replacing a correct one harms performance. It introduces GuardedRepair as a selective framework that diagnoses cached traces, triggers repair only when needed, and accepts an answer change solely when deterministic verification guards support the swap. The method combines symbolic checks with conservative policies and shows accuracy gains over both the initial reasoner and direct regeneration baselines that break many correct traces. This frames post-hoc repair as harm-aware selective replacement rather than open-ended re-solving.

Core claim

GuardedRepair is a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%.
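The three stages named above (diagnose, selectively trigger, accept only under guards) can be sketched as a keep-by-default control flow. This is a minimal illustration, not the paper's implementation: the `Trace` type, the callable signatures, the default `n=4`, and the choice to keep the cached trace when no guards are configured are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Trace:
    reasoning: str
    answer: str


# Hypothetical signatures for illustration; none are taken from the paper.
Diagnose = Callable[[str, Trace], bool]              # True -> trigger repair
Generate = Callable[[str, Trace, int], List[Trace]]  # bounded candidate generation
Guard = Callable[[str, Trace, Trace], bool]          # (question, cached, candidate)


def guarded_repair(question: str, cached: Trace, diagnose: Diagnose,
                   generate: Generate, guards: List[Guard], n: int = 4) -> Trace:
    """Return the cached trace unless a candidate passes every guard."""
    # Without guards there is no evidence for a swap, so keep the cached trace.
    if not guards:
        return cached
    # Selective trigger: traces that are not flagged are never regenerated.
    if not diagnose(question, cached):
        return cached
    # Bounded best-of-N: inspect at most n candidates.
    for candidate in generate(question, cached, n)[:n]:
        # Only answer-changing candidates are worth the replacement risk.
        if candidate.answer == cached.answer:
            continue
        # Conservative acceptance: every deterministic guard must agree.
        if all(guard(question, cached, candidate) for guard in guards):
            return candidate
    # Default action: preserve the original cached trace.
    return cached
```

Every exit path except one returns `cached`, which is the asymmetry the paper argues for: a replacement has to earn acceptance, while preservation needs no justification.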

What carries the argument

The guarded best-of-N repair framework with deterministic verification guards and conservative acceptance policies that decides whether a repaired candidate is safer than preserving the original cached trace.

If this is right

Accuracy on GSM8K rises from 95.60% to 96.89% by fixing 17 of the 58 remaining errors, with no measured broken-correct cases in the main run.
On ASDiv with a weak reasoner, accuracy rises from 78.40% to 87.60%.
Direct regeneration of all examples lowers accuracy to 93.03% and breaks 47 initially correct answers on GSM8K.
Guarded repair substantially improves the fixed/broken tradeoff compared to unconstrained baselines.
Replacement risk is reduced rather than eliminated by the guarded approach.
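The GSM8K figures above can be checked with simple bookkeeping: final correct = initial correct + fixed - broken. The sketch below assumes the percentages are computed over the 1,319 problems of the public GSM8K test split. The regeneration count of 1,227 correct answers and the resulting 13 fixed errors are inferred from the reported 93.03% and are not stated in the source.

```python
GSM8K_TEST_SIZE = 1319  # size of the public GSM8K test split


def pct(correct: int, total: int = GSM8K_TEST_SIZE) -> float:
    """Accuracy in percent, rounded to two decimals as in the reported figures."""
    return round(100 * correct / total, 2)


def final_correct(initial: int, fixed: int, broken: int) -> int:
    """Each fixed error adds one correct answer; each broken trace removes one."""
    return initial + fixed - broken


# 58 remaining errors are reported, so 1319 - 58 = 1261 initially correct.
initial = GSM8K_TEST_SIZE - 58

# Guarded repair: 17 fixed, 0 measured broken in the main run.
guarded = final_correct(initial, fixed=17, broken=0)

# Direct regeneration: 93.03% and 47 broken are reported.
# Assumption: 1227 correct is the count that rounds to 93.03%.
regen = 1227
regen_fixed = regen - initial + 47  # inferred, not reported

print(pct(initial), pct(guarded), pct(regen), regen_fixed)
```

Under these assumptions the regeneration baseline has a net change of 13 - 47 = -34 correct answers, while guarded repair has a net change of +17.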

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar guarded selection could apply to other LLM tasks such as code generation where preserving already-correct outputs matters.
Production systems could cache initial correct answers and apply this selective update only when guards approve.
Testing the guards on adversarial or out-of-distribution math problems would measure how robust the deterministic checks remain.

Load-bearing premise

The deterministic verification guards and conservative acceptance policies can be implemented to correctly distinguish safe replacements from harmful ones.
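To make the premise concrete, here is one kind of lightweight symbolic check a deterministic guard could perform: re-evaluating every inline equation of the form `a op b op ... = c` found in a trace. It is a hypothetical example, not one of the paper's guards. The regular expression, the function names, and the exact-equality rule are assumptions, and the check deliberately ignores anything it cannot parse (currency symbols, thousands separators, parentheses).

```python
import re
from fractions import Fraction

NUMBER = r"\d+(?:\.\d+)?"
# An expression of two or more non-negative operands, then "= result".
EQUATION = re.compile(
    rf"({NUMBER}(?:\s*[+\-*/]\s*{NUMBER})+)\s*=\s*(-?{NUMBER})"
)
TOKEN = re.compile(rf"{NUMBER}|[+\-*/]")


def evaluate(expression: str) -> Fraction:
    """Evaluate a flat arithmetic expression exactly, honouring * and / precedence."""
    tokens = TOKEN.findall(expression)
    terms = [Fraction(tokens[0])]  # additive terms, summed at the end
    for op, number in zip(tokens[1::2], tokens[2::2]):
        value = Fraction(number)
        if op == "+":
            terms.append(value)
        elif op == "-":
            terms.append(-value)
        elif op == "*":
            terms[-1] *= value
        else:
            terms[-1] /= value  # raises ZeroDivisionError for division by zero
    return sum(terms)


def arithmetic_consistent(trace: str) -> bool:
    """Return False if any parsable equation in the trace is numerically wrong."""
    for expression, stated in EQUATION.findall(trace):
        try:
            if evaluate(expression) != Fraction(stated):
                return False
        except ZeroDivisionError:
            return False
    # Traces with no parsable equations pass vacuously.
    return True
```

A check like this is deterministic and cheap, but it shows the premise's difficulty: it passes traces whose arithmetic is right and whose setup is wrong, and it rejects traces that round a result such as `10 / 3 = 3.33`. Whether a guard set is reliable or merely restrictive cannot be read off its definition alone.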

What would settle it

A new test set where at least one initially correct trace is broken by an accepted replacement despite the guards, or where the method produces no net accuracy gain.
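The settling condition can be stated as a small tally over per-example correctness before and after repair. The sketch below is an illustration under assumed names (`RepairTally`, `tally`, `claim_challenged`); the source does not describe its evaluation code.

```python
from typing import NamedTuple, Sequence


class RepairTally(NamedTuple):
    fixed: int   # wrong before, correct after
    broken: int  # correct before, wrong after

    @property
    def net(self) -> int:
        return self.fixed - self.broken


def tally(before: Sequence[bool], after: Sequence[bool]) -> RepairTally:
    """Count fixed and broken examples from aligned correctness flags."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same examples")
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    broken = sum(1 for b, a in zip(before, after) if b and not a)
    return RepairTally(fixed, broken)


def claim_challenged(result: RepairTally) -> bool:
    """True if a correct trace was broken or there is no net accuracy gain."""
    return result.broken > 0 or result.net <= 0
```

For example, `tally([True, False, False], [True, True, False])` gives one fixed and zero broken, so `claim_challenged` returns `False`. A single broken example flips it to `True` regardless of how many errors were fixed.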

Figures

Figures reproduced from arXiv: 2605.24613 by Haizhou Xia.

**Figure 1.** Guarded best-of-N post-hoc repair. The default action is to keep the cached trace; replacement occurs only when a triggered repair candidate passes deterministic guards.

The default action is to keep the original trace. A replacement is made only when a candidate passes the guarded acceptance policy:

$$
r_f =
\begin{cases}
r_c, & \text{if a candidate repair passes all gates,} \\
r_0, & \text{otherwise.}
\end{cases}
\tag{2}
$$

This framing decomposes repair into d…

Original abstract

Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GuardedRepair shows that selective repair can lift GSM8K accuracy from 95.60% to 96.89% with no broken-correct cases reported, beating blind regeneration, but the zero-harm result depends on guard details that are not shown.

The main thing to know is that GuardedRepair gives a practical way to selectively fix LLM mathematical reasoning errors while trying to avoid harming already correct traces, with reported gains on standard benchmarks.

The paper introduces the GuardedRepair framework under a selective replacement setting. It diagnoses cached traces, triggers repair selectively, and only accepts answer-changing candidates when deterministic verification guards support it. The combination of symbolic checks, semantic-risk diagnostics, bounded generation, and conservative policies is presented as the key. On GSM8K with a strong initial reasoner at 95.60%, it reaches 96.89% by fixing 17 errors with no measured broken-correct cases. On ASDiv with a weaker reasoner, it goes from 78.40% to 87.60%. The regeneration baseline is shown to lower accuracy to 93.03% on GSM8K and break 47 correct answers. Additional analyses indicate the guarded method improves the fixed/broken tradeoff.

This work does well in providing a direct empirical comparison that highlights the advantage of selectivity over unconstrained re-solving. The framing as harm-aware rather than automatic repair is a useful perspective, and the results are reported on full test sets with explicit numbers.

The soft spots are around the implementation of the guards. The central claim of no broken-correct cases and reduced replacement risk relies on the guards working correctly to identify safe replacements. The abstract describes what the guards do at a high level but does not include details on their exact operation, thresholds, or how they handle potential edge cases. Without that, it's difficult to fully verify if the zero count comes from effective harm detection or from a very restrictive policy. The paper acknowledges that risk is reduced rather than eliminated, which is appropriate.

This paper is for researchers focused on improving LLM reasoning through post-hoc methods and those concerned with the risks of automatic repair. A reader looking for empirical evidence on selective repair strategies would find value in the benchmark results and baseline comparisons. It shows clear thinking on the problem and engages with the practical issues, so it deserves a serious referee even if revisions are needed for more guard details.

I recommend engaging with the work and sending it for peer review.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces GuardedRepair, a guarded best-of-N framework for harm-aware post-hoc replacement of LLM mathematical reasoning traces. It uses deterministic verification guards, symbolic checks, surface semantic-risk diagnostics, and conservative acceptance policies to selectively repair incorrect cached traces while avoiding changes to correct ones. On the full GSM8K test set the method raises accuracy from 95.60% to 96.89% (fixing 17 of 58 errors with zero measured broken-correct cases); on a weak-reasoner ASDiv setting accuracy rises from 78.40% to 87.60%. Direct regeneration baselines are shown to break many initially correct answers, supporting the claim that guarded selective replacement improves the fixed/broken tradeoff.

Significance. If the guard mechanisms can be shown to reliably distinguish safe replacements, the work would be significant for high-accuracy LLM reasoning deployments: it reframes post-hoc repair as a controlled, harm-aware decision rather than unconstrained re-solving and supplies concrete empirical evidence that selective replacement can yield net gains without the breakage observed in regeneration.

major comments (2)

[Abstract] The headline result of 95.60% → 96.89% accuracy on GSM8K with zero broken-correct cases rests on the unshown correctness of the deterministic verification guards and conservative acceptance policies; without definitions, pseudocode, or edge-case coverage for these components it is impossible to determine whether the zero count reflects reliable harm detection or simply an overly conservative rejection policy that accepts few candidates.
[Methods / Experimental Protocol] The experimental protocol section (implementation of guards, symbolic checks, and acceptance criteria): the absence of concrete implementation details or pseudocode for the guards leaves the central no-broken-correct claim only partially verifiable, which is load-bearing for the paper's distinction between harm-aware repair and plain regeneration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of detailed guard specifications. We address the major comments point-by-point below and will incorporate additional implementation details in the revised manuscript.

Referee: [Abstract] The headline result of 95.60% → 96.89% accuracy on GSM8K with zero broken-correct cases rests on the unshown correctness of the deterministic verification guards and conservative acceptance policies; without definitions, pseudocode, or edge-case coverage for these components it is impossible to determine whether the zero count reflects reliable harm detection or simply an overly conservative rejection policy that accepts few candidates.

Authors: The abstract summarizes the key empirical outcome, but the full manuscript provides descriptions of the guard components. To make the zero broken-correct claim more verifiable, we will add explicit definitions, pseudocode for the acceptance policy, and discussion of edge cases in a new subsection of the Methods section. This revision will clarify the conditions under which candidates are accepted or rejected.

Revision: yes

Referee: [Methods / Experimental Protocol] The experimental protocol section (implementation of guards, symbolic checks, and acceptance criteria): the absence of concrete implementation details or pseudocode for the guards leaves the central no-broken-correct claim only partially verifiable, which is load-bearing for the paper's distinction between harm-aware repair and plain regeneration.

Authors: We agree that the current presentation lacks sufficient concrete details for full verification. The manuscript describes the framework at a high level, combining symbolic checks and conservative policies, but does not include pseudocode. In the revision, we will provide pseudocode for the guard logic, symbolic verification steps, and the decision criteria for accepting a replacement. This will strengthen the distinction from regeneration baselines.

Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical framework with benchmark measurements.

The paper describes an empirical selective-repair framework (GuardedRepair) and reports accuracy deltas on public benchmarks (GSM8K, ASDiv) against regeneration baselines. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central claims rest on measured outcomes (e.g., 95.60% → 96.89% with zero observed broken-correct cases) rather than any derivation that reduces to its own inputs by construction. The analysis is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that lightweight symbolic checks and surface semantic-risk diagnostics can be defined to enable reliable conservative acceptance without introducing fitted parameters or new entities.

axioms (1)

Domain assumption: Deterministic verification guards and conservative acceptance policies can be implemented to correctly identify safe replacements.
The reported absence of broken-correct cases and the claim of reduced replacement risk depend on this assumption holding.

Reference graph

Works this paper leans on

19 extracted references · 4 canonical work pages · 2 internal anchors

[3] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
[4] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In International Conference on Learning Representations.
[5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
[6] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.
[7] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Process...
[8] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations.
[9] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided language models. In Proceedings of the 40th International Conference on Machine Learning.
[10] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research.
[11] Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
[12] Arkadiy Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of NAACL-HLT.
[13] Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[14] Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. 2025. R-PRM: Reasoning-driven process reward modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
[15] Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, and Meng Jiang. 2025. Enhancing mathematical reasoning in LLMs by stepwise correction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
[16] Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. 2025. Self-rewarding correction for mathematical reasoning. arXiv preprint arXiv:2502.19613.
[17] Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, and Bo An. 2025. Incentivizing LLMs to self-verify their answers. arXiv preprint arXiv:2506.01369.
[18] Ran El-Yaniv and Yair Wiener. 2010. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641.
[19] Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems.
