Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Abdul Rehman; Muhammad Talha Sharif

arxiv: 2606.05704 · v1 · pith:C65CN6V5new · submitted 2026-06-04 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-28 01:25 UTC · model grok-4.3

keywords: multi-agent LLM, mathematical reasoning, critic feedback, error correction, GSM8K benchmark, heterogeneous agents, validator framework

The pith

A critic-guided multi-agent framework improves LLM math reasoning accuracy by up to 13 percent on GSM8K through adaptive error correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a system of heterogeneous LLM agents that collaborate on math problems, with one agent acting as validator to critique intermediate steps and direct the generator toward corrected solutions. This generator-validator loop is meant to catch hallucinations and prevent error cascades that plague single-pass reasoning. A sympathetic reader would care because it offers a path to more dependable outputs from language models without requiring ever-larger single models. Experiments indicate the main lift comes from the critique mechanism rather than raw model size, and smaller agents reach parity when paired with the feedback process. The result frames reliable reasoning as an outcome of structured collaboration and correction rather than isolated generation.

Core claim

The central claim is that a heterogeneous multi-agent setup built around a generator-validator framework produces reliable mathematical reasoning. The validator both judges correctness and supplies critiques for regeneration, which enables adaptive error correction and blocks cascading mistakes. The reported result is up to 13 percent higher accuracy on GSM8K than single-shot or non-critic baselines, with smaller models performing comparably to larger ones.

What carries the argument

The generator-validator framework with critic-driven adaptive feedback, in which the validator assesses intermediate reasoning and supplies guidance for solution regeneration.
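
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Verdict`, `solve_with_critic`, `toy_generator`, and `toy_validator` are hypothetical, the round budget of 3 is arbitrary, and the toy agents stand in for LLM calls so the control flow can be run without a model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    ok: bool        # validator's correctness judgement
    critique: str   # feedback handed back to the generator when ok is False


# Hypothetical agent signatures: a generator maps (problem, critique) to a
# candidate solution; a validator maps (problem, solution) to a Verdict.
Generator = Callable[[str, Optional[str]], str]
Validator = Callable[[str, str], Verdict]


def solve_with_critic(problem: str, generate: Generator, validate: Validator,
                      max_rounds: int = 3) -> tuple[str, int]:
    """Regenerate until the validator accepts or the round budget runs out.

    Returns the last candidate and the number of generator calls used.
    """
    critique: Optional[str] = None
    solution = ""
    for round_no in range(1, max_rounds + 1):
        solution = generate(problem, critique)
        verdict = validate(problem, solution)
        if verdict.ok:
            return solution, round_no
        critique = verdict.critique  # feed the critique into the next attempt
    return solution, max_rounds


# Toy stand-ins for LLM agents (assumptions for illustration only).
def toy_generator(problem: str, critique: Optional[str]) -> str:
    # The first attempt contains an arithmetic slip; any critique triggers the fix.
    return "3 * 4 = 12" if critique else "3 * 4 = 14"


def toy_validator(problem: str, solution: str) -> Verdict:
    lhs, rhs = solution.split("=")
    a, b = (int(x) for x in lhs.split("*"))
    if a * b == int(rhs):
        return Verdict(True, "")
    return Verdict(False, f"Recompute {a} * {b}; it is not {rhs.strip()}.")


if __name__ == "__main__":
    print(solve_with_critic("What is 3 * 4?", toy_generator, toy_validator))
```

Returning the call count alongside the answer is a deliberate choice in this sketch: it makes the extra inference cost of the loop visible, which matters when comparing against single-shot baselines.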

If this is right

Heterogeneity across agent specialties plus the critique loop reduces dependence on large model scale.
Ablation results indicate performance gains trace primarily to the critic feedback mechanism rather than model size.
The approach yields more interpretable reasoning traces because each correction step is guided by explicit validator comments.
Smaller models reach parity with larger ones when the adaptive correction loop is active.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same validator-critique pattern could be tested on non-math reasoning tasks such as code generation if the critic is specialized accordingly.
Computational cost might drop further by routing easy problems to smaller agents and reserving critique only for hard cases.
Longer reasoning chains could reveal whether critic bias accumulates across multiple regeneration rounds.
Pairing the framework with external symbolic checkers might strengthen the validator without increasing LLM calls.
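
As a rough illustration of the last extension, a symbolic checker can verify simple arithmetic claims in a reasoning trace without any LLM call. The sketch below is an assumption about what such a checker might look like; `check_arithmetic` and the pattern `STEP` are hypothetical names, and it only handles binary claims of the form `a op b = c`.

```python
import re
from fractions import Fraction

# Matches simple claims such as "12 * 3 = 36" or "7.5 + 2 = 9.5".
NUM = r"-?\d+(?:\.\d+)?"
STEP = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def check_arithmetic(trace: str) -> list[str]:
    """Return one message per arithmetic claim in the trace that does not hold."""
    problems = []
    for a, op, b, claimed in STEP.findall(trace):
        x, y, z = Fraction(a), Fraction(b), Fraction(claimed)
        if op == "/" and y == 0:
            problems.append(f"{a} / {b}: division by zero")
            continue
        actual = OPS[op](x, y)  # exact rational arithmetic, no float rounding
        if actual != z:
            problems.append(
                f"{a} {op} {b} = {claimed} is wrong; expected {float(actual):g}"
            )
    return problems


if __name__ == "__main__":
    trace = "She buys 3 * 4 = 12 eggs, then has 12 + 5 = 18 in total."
    for message in check_arithmetic(trace):
        print(message)
```

The messages are phrased so they could be passed to a generator as critiques. Exact `Fraction` arithmetic avoids flagging correct decimal steps because of floating-point rounding.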

Load-bearing premise

The validator can reliably spot reasoning errors and deliver critiques that improve the next attempt without adding new hallucinations or systematic biases of its own.

What would settle it

Run the same GSM8K problems with the critic disabled or replaced by a random feedback generator and measure whether accuracy drops back to single-shot levels or rises further.
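
A harness for this comparison might look like the following sketch. Everything here is an assumption for illustration: the function names, the toy agents, and the two-item dataset are hypothetical and are not GSM8K or the paper's setup. The random critic always objects, so it uses at least as many generator calls as the real critic, which keeps the control from being cheaper than the treatment.

```python
import random
from typing import Callable, Optional

# Hypothetical signatures: generate(problem, critique) returns an answer;
# a critic returns a critique string, or None to accept the answer.
Generate = Callable[[str, Optional[str]], str]
Critic = Callable[[str, str], Optional[str]]


def accuracy(dataset, generate: Generate, critic: Optional[Critic],
             max_rounds: int = 3) -> float:
    """Fraction solved under one condition; critic=None means single-shot."""
    correct = 0
    for problem, gold in dataset:
        answer = generate(problem, None)
        for _ in range(max_rounds - 1):
            critique = critic(problem, answer) if critic else None
            if critique is None:
                break
            answer = generate(problem, critique)
        correct += answer == gold
    return correct / len(dataset)


def random_critic(canned: list[str], seed: int = 0) -> Critic:
    """Control: always objects, with feedback unrelated to the answer."""
    rng = random.Random(seed)
    return lambda problem, answer: rng.choice(canned)


# Toy agents and data (assumptions for illustration, not GSM8K).
def toy_generate(problem: str, critique: Optional[str]) -> str:
    a, b = map(int, problem.split("+"))
    # Off by one unless the critique points at the addition.
    hit = critique is not None and "addition" in critique
    return str(a + b if hit else a + b + 1)


def toy_critic(problem: str, answer: str) -> Optional[str]:
    a, b = map(int, problem.split("+"))
    return None if int(answer) == a + b else "Check the addition."


DATA = [("2+3", "5"), ("10+4", "14")]
CONDITIONS = {
    "single-shot": None,
    "real critic": toy_critic,
    "random critic": random_critic(["Try again.", "Be careful."]),
}

if __name__ == "__main__":
    for name, critic in CONDITIONS.items():
        print(name, accuracy(DATA, toy_generate, critic))
```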

Original abstract

Recent Large Language Models (LLMs) have shown impressive reasoning abilities; but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single-shot and non-critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies reveal the main performance gains are due to the critic-based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi-agent collaboration and critique to obtain reliable and interpretable reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a multi-agent critic loop that lifts GSM8K accuracy by up to 13% and helps smaller models, but the experimental reporting is too thin to judge whether the gains are stable.

The main takeaway is a generator-validator system built from heterogeneous LLM agents: one agent produces reasoning steps, and a critic agent checks intermediate results and suggests fixes. The authors test this on GSM8K and report up to 13% better accuracy than single-shot or non-critic baselines, plus evidence that the setup reduces reliance on model scale.

The ablations are the clearest part of the work. They indicate that the feedback loop, not just larger models, accounts for most of the lift. That is a practical observation for anyone trying to improve reliability without extra compute.

The soft spots sit in the evaluation. The abstract gives no error bars, run counts, statistical tests, or precise baseline implementations, so it is hard to tell whether the 13% reflects consistent improvement or selection effects. The claim that the critic reliably catches errors also rests on an untested assumption that the critic itself does not introduce new systematic mistakes.

This paper is for researchers building agent-based reasoning tools for math or education applications. Someone looking for concrete patterns in multi-agent critique would find usable architecture details, even if the numbers need independent checks.

I would send it to peer review once the authors add proper statistical reporting and controls. The core idea is straightforward and the empirical angle is there, so referees could help tighten the evidence without starting from scratch.

Referee Report

3 major / 2 minor

Summary. The paper introduces a critic-guided heterogeneous multi-agent framework for LLM-based mathematical reasoning. It uses specialized generator and validator agents in an adaptive loop where the validator provides critiques to correct intermediate errors, preventing cascading mistakes. Experiments on GSM8K report up to 13% accuracy gains over single-shot and non-critic baselines, with ablations attributing gains to the feedback mechanism rather than model scale, and suggesting smaller models can match larger ones under this setup.

Significance. If the reported gains are robust, the work would provide empirical support for critique-driven multi-agent collaboration as a path to more reliable reasoning without relying solely on scale. The ablation results, if properly controlled, would strengthen the case that feedback loops are a key driver of performance.

major comments (3)

[Experimental results section] The central 13% accuracy improvement claim lacks any description of the experimental protocol, including data splits on GSM8K, number of runs, error bars, statistical significance tests, or exact baseline implementations (e.g., whether single-shot and non-critic models use identical model sizes and prompting). This information is load-bearing for interpreting whether the delta reflects genuine generalization.
[Ablation studies paragraph] The claim that 'main performance gains are due to the critic-based feedback loop and not model size' requires explicit compute-matched or parameter-matched controls; without them, the heterogeneity benefit cannot be isolated from potential differences in total inference steps or agent configurations.
[Method description] Validator/critic description: the framework assumes the critic reliably detects errors and provides useful guidance without introducing new hallucinations or systematic biases, but no dedicated evaluation (e.g., critic accuracy on held-out error cases or failure mode analysis) is reported to support this assumption, which is the weakest in the paper.

minor comments (2)

[Abstract and introduction] The term 'heterogeneous' is used without a precise definition of how agent specialties differ (e.g., distinct model families, fine-tunes, or prompt roles).
[Results] The manuscript would benefit from a table summarizing all compared methods, model sizes, and exact accuracy numbers with standard deviations.
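
The statistical reporting requested in these comments is inexpensive to produce. The sketch below shows one way to compute mean ± sample standard deviation over seeds and a paired t statistic between two methods, using only the Python standard library. The function names are hypothetical, and the accuracy lists are made-up placeholders, not values from the paper.

```python
import math
import statistics


def summarize(accs: list[float]) -> str:
    """Format per-seed accuracies as 'mean ± sample std'."""
    return f"{statistics.mean(accs):.2f} ± {statistics.stdev(accs):.2f}"


def paired_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder accuracies in percent, one per seed. NOT results from the paper.
    with_critic = [71.0, 73.0, 72.0]
    baseline = [69.0, 70.0, 71.0]
    print("with critic:", summarize(with_critic))
    print("baseline:   ", summarize(baseline))
    t, df = paired_t(with_critic, baseline)
    print(f"paired t = {t:.3f} with {df} degrees of freedom")
```

Pairing is only valid when run i of each method shares the same seed and test items. Converting the statistic to a p-value needs the t distribution, for example via `scipy.stats.ttest_rel` if SciPy is available.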

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. We agree that additional experimental details and controls are needed and will revise the manuscript accordingly.

Referee: [Experimental results section] The central 13% accuracy improvement claim lacks any description of the experimental protocol, including data splits on GSM8K, number of runs, error bars, statistical significance tests, or exact baseline implementations (e.g., whether single-shot and non-critic models use identical model sizes and prompting). This information is load-bearing for interpreting whether the delta reflects genuine generalization.

Authors: We agree that these protocol details were insufficiently described. In the revised manuscript we will add an 'Experimental Setup' subsection that specifies the standard GSM8K train/test split, five independent runs with different seeds, mean accuracy ± standard deviation, paired t-test p-values against baselines, and confirmation that all single-shot and non-critic baselines use identical model sizes and the same base prompting template (adapted only for agent roles). Revision: yes.

Referee: [Ablation studies paragraph] The claim that 'main performance gains are due to the critic-based feedback loop and not model size' requires explicit compute-matched or parameter-matched controls; without them, the heterogeneity benefit cannot be isolated from potential differences in total inference steps or agent configurations.

Authors: The referee correctly identifies that our current ablations do not fully isolate the feedback-loop contribution from possible differences in total inference compute. We will add new compute-matched ablation experiments in which non-critic baselines are configured to use an equivalent total number of LLM calls or token budget, allowing a clearer attribution of gains to the critic mechanism. Revision: yes.

Referee: [Method description] Validator/critic description: the framework assumes the critic reliably detects errors and provides useful guidance without introducing new hallucinations or systematic biases, but no dedicated evaluation (e.g., critic accuracy on held-out error cases or failure mode analysis) is reported to support this assumption, which is the weakest in the paper.

Authors: We acknowledge that a direct evaluation of critic reliability was not provided. In the revision we will include a new analysis subsection that reports critic precision and recall on a held-out set of 200 manually annotated error cases, together with qualitative examples of any detected hallucinated critiques or systematic biases. Revision: yes.
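
The critic precision and recall mentioned in the last response can be computed from two aligned label lists. The sketch below is illustrative only: the function name and the five example labels are hypothetical. Precision is only informative if the annotated set also contains correct steps, since a set made up entirely of error cases cannot produce false positives.

```python
def critic_precision_recall(flagged: list[bool],
                            has_error: list[bool]) -> tuple[float, float]:
    """Precision and recall of the critic's 'this step is wrong' flags.

    flagged[i]   is True when the critic objected to case i.
    has_error[i] is True when a human annotator marked case i as erroneous.
    """
    if len(flagged) != len(has_error):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(flagged, has_error))
    tp = sum(f and e for f, e in pairs)        # real errors the critic caught
    fp = sum(f and not e for f, e in pairs)    # correct steps wrongly flagged
    fn = sum(e and not f for f, e in pairs)    # real errors the critic missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up labels for five cases, for illustration only.
    flagged = [True, True, False, False, True]
    has_error = [True, False, True, False, True]
    p, r = critic_precision_recall(flagged, has_error)
    print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Low precision would indicate the critic is objecting to correct steps, which is the failure mode the load-bearing premise worries about. Low recall would indicate errors passing through unchallenged.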

Circularity Check

0 steps flagged

No significant circularity; purely empirical study

The work describes an empirical multi-agent framework evaluated on GSM8K with reported accuracy gains and ablations. No mathematical derivation chain, first-principles results, or equations are present that could reduce to inputs by construction. Claims rest on experimental outcomes rather than any self-definitional, fitted-prediction, or self-citation load-bearing steps. This is the expected outcome for a non-derivational empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no parameter counts, and no explicit modeling assumptions beyond the high-level description of agents and critic.

pith-pipeline@v0.9.1-grok · 5724 in / 1096 out tokens · 23613 ms · 2026-06-28T01:25:53.292874+00:00 · methodology

