ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization

Xinglei Yu; Yixu Huang; Zhongyu Wei

arxiv: 2605.16299 · v2 · pith:7JNFRSSZnew · submitted 2026-04-17 · 💻 cs.SE · cs.AI

ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization

Yixu Huang , Xinglei Yu , Zhongyu Wei This is my paper

Pith reviewed 2026-05-22 10:11 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords self-evolving code generationadversarial unit testsexecution-based supervisionpreference optimizationsolver-adversary frameworkLLM coding

0 comments

The pith

A single LLM improves its own code generation by creating adversarial tests that expose failures, using only execution outcomes for supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ACE, a framework in which one LLM alternates between producing candidate programs and generating adversarial unit test inputs designed to trigger runtime errors, exceptions, or non-termination. Execution results alone supply the signal for selecting robust programs for fine-tuning and for optimizing the test generator via preference optimization, eliminating any need for ground-truth code or external reward models. This directly counters the degradation seen in solver-verifier systems once models reach moderate strength, because the adversary keeps hunting for remaining weaknesses instead of confirming correctness. Experiments show consistent 3-7% gains in pass@1 across CodeContests, MBPP, and LiveCodeBench, with larger lifts on out-of-distribution sets and no loss in inference speed.

Core claim

ACE shows that a solver-adversary loop inside a single LLM can drive self-evolution in code generation: the model produces both candidate solutions and adversarial tests that are optimized, through execution-derived preferences and Kahneman-Tversky Optimization, to maximize the discovery of failure modes, allowing robust programs to be reinforced without any external correctness labels.

What carries the argument

Solver-adversary architecture in which the same LLM generates candidate code and then produces adversarial unit tests optimized to induce execution failures, with supervision drawn solely from pass/fail outcomes.

If this is right

Code-generation training loops can run indefinitely using only program execution as the source of preference signals.
Out-of-distribution coding performance improves more than in-distribution performance, indicating the method strengthens generalization.
Inference cost remains competitive or lower while accuracy rises, because the same model handles both generation and test creation.
The approach removes dependence on large annotated solution datasets or separate verifier models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same execution-only preference loop could be tested on other domains that have cheap simulators, such as simple game agents or symbolic math.
If the adversary component continues to scale with model size, the framework might reduce the need for human-written test suites in future training pipelines.
A natural next measurement would be to track how the diversity of discovered failure modes changes across training iterations.

Load-bearing premise

Adversarial unit tests produced by the LLM itself will keep exposing genuine remaining failures rather than mostly confirming that the current code is already correct.

What would settle it

Measure whether the generated adversarial tests still produce high failure rates on successively stronger versions of the model, or whether their failure rate drops to near zero while pass@1 on held-out benchmarks stays flat.

Figures

Figures reproduced from arXiv: 2605.16299 by Xinglei Yu, Yixu Huang, Zhongyu Wei.

**Figure 2.** Figure 2: We first describe the adversarial unit test generation process, followed by the construction [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 2.** Figure 2: Overview of ACE Method. A single LLM alternates between the Solver role, generating [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization of verifiable and adversarial unit [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Accuracy over ACE training rounds on the training set and the CodeContests benchmark. Both Qwen3-4B-ACE and Qwen2.5-7B-ACE exhibit consistent performance improvements as training progresses and converge after several rounds. In-distribution Performance. On CodeContests, ACE achieves the best performance across all pass@k metrics and both model scales. With the 4B backbone, pass@1 increases from 41.7 for … view at source ↗

read the original abstract

Large Language Models (LLMs) excel at code generation but remain heavily reliant on large-scale annotated solutions and verification-based supervision, which constrains scalability and hinders sustained self-improvement. Recent solver--verifier frameworks exploit program execution as an automatic supervision signal, but their effectiveness degrades as solvers become moderately strong: verifier-generated tests increasingly confirm semantic correctness rather than exposing the remaining failure modes. We propose \textbf{ACE}, a self-evolving code generation framework based on a solver--adversary architecture that prioritizes active failure discovery through execution-centric supervision. A single LLM alternates between generating candidate programs and producing adversarial unit test inputs optimized to induce execution-level failures, such as runtime errors, exceptions, or non-termination. Supervision is derived solely from execution outcomes: robust programs are selected for supervised fine-tuning, while adversarial tests are optimized via Kahneman--Tversky Optimization using execution-derived preferences. Notably, the entire training loop requires no ground-truth code or external reward models. Experiments on CodeContests, MBPP, and LiveCodeBench demonstrate that ACE consistently outperforms strong solver--verifier baselines, achieving 3--7\% absolute gains in pass@1, with larger improvements on out-of-distribution benchmarks, while maintaining competitive or improved inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ACE's solver-adversary loop has one LLM generate both code and tests meant to trigger execution failures, then trains on the outcomes via SFT and KTO without ground truth, but the same-model setup risks the adversary running out of new failures to expose.

read the letter

The main thing here is a single-LLM solver-adversary architecture. The model alternates between writing candidate programs and creating unit tests optimized to cause runtime errors, exceptions, or non-termination. Execution results alone decide which programs get supervised fine-tuning and supply the preference pairs for Kahneman-Tversky Optimization. No annotated solutions or external reward models are used at any point.

Referee Report

2 major / 2 minor

Summary. The paper proposes ACE, a self-evolving LLM coding framework using a single model that alternates between generating candidate code solutions and adversarial unit test inputs designed to trigger runtime errors, exceptions, or non-termination. Execution outcomes alone provide supervision: successful programs undergo supervised fine-tuning while preference optimization via Kahneman-Tversky Optimization (KTO) is applied to execution-derived signals. No ground-truth code or external reward models are used. Experiments on CodeContests, MBPP, and LiveCodeBench report 3-7% absolute pass@1 gains over solver-verifier baselines, with larger improvements on out-of-distribution sets and maintained inference efficiency.

Significance. If the central claims hold after verification, ACE would offer a scalable path to self-improvement in code generation by turning execution feedback into an active adversarial loop, addressing the degradation of verifier signals as solvers strengthen. This could reduce dependence on annotated data and external verifiers, with potential implications for autonomous coding agents.

major comments (2)

[Abstract, §4] Abstract and §4 (Experiments): The central performance claim of consistent 3-7% pass@1 gains (with larger OOD improvements) is presented without reported statistical significance tests, exact hyperparameter choices for the KTO loop, or controls for post-hoc selection of adversarial tests; this information is load-bearing for assessing whether the gains exceed noise or baseline variance.
[§3] §3 (Method): The assumption that the same LLM acting as adversary will continue to generate tests exposing new failure modes (rather than confirming correctness) as the solver improves via iterative SFT+KTO is not supported by any analysis or ablation; if this assumption fails, execution outcomes supply only weak or self-confirming preference signals, directly undermining the no-ground-truth self-evolution premise.

minor comments (2)

[§3] Notation for the adversary generation objective and the precise KTO preference construction from execution outcomes (pass/fail, timeout, exception) should be formalized with equations for reproducibility.
[§4] Figure or table reporting per-benchmark pass@1 with and without the adversarial component would clarify the contribution of the solver-adversary loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and outline specific revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (Experiments): The central performance claim of consistent 3-7% pass@1 gains (with larger OOD improvements) is presented without reported statistical significance tests, exact hyperparameter choices for the KTO loop, or controls for post-hoc selection of adversarial tests; this information is load-bearing for assessing whether the gains exceed noise or baseline variance.

Authors: We agree that statistical significance testing, precise hyperparameter reporting, and explicit controls are essential for validating the reported gains. In the revised manuscript we will add bootstrap confidence intervals and paired t-tests (across 5 random seeds) for all pass@1 improvements in §4 and the appendix. We will also list the complete KTO hyperparameter set (learning rate, β, number of preference pairs per iteration, and early-stopping criterion) used in all experiments. Finally, we will clarify the fixed, non-post-hoc procedure for adversarial test selection (top-k by execution failure rate, with a deterministic tie-breaker) to rule out selection bias. These changes will be incorporated in the next version. revision: yes
Referee: [§3] §3 (Method): The assumption that the same LLM acting as adversary will continue to generate tests exposing new failure modes (rather than confirming correctness) as the solver improves via iterative SFT+KTO is not supported by any analysis or ablation; if this assumption fails, execution outcomes supply only weak or self-confirming preference signals, directly undermining the no-ground-truth self-evolution premise.

Authors: We recognize that the continued effectiveness of the adversary is a central assumption and that the current manuscript lacks direct supporting analysis. We will add a new ablation subsection (or appendix) that tracks adversary strength over iterations by measuring (i) the fraction of generated tests that induce runtime errors or non-termination on the current solver and (ii) the diversity of failure modes (via error-type histograms). This will demonstrate that the adversary does not collapse to trivial or confirming tests. The analysis will be performed on the same training trajectories reported in the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: execution provides independent external labels

full rationale

The described ACE loop derives supervision exclusively from program execution outcomes (runtime errors, exceptions, non-termination) rather than from model-generated signals or fitted parameters. Candidate programs are selected for SFT and preferences for KTO are labeled by whether execution succeeds or fails on the generated tests; these labels are produced by an external interpreter and do not reduce to the LLM's own outputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force the result, and the empirical gains are measured against external benchmarks (CodeContests, MBPP, LiveCodeBench). The framework therefore remains self-contained against independent oracles.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only: the framework assumes execution outcomes (errors, exceptions, non-termination) provide sufficient and unbiased supervision signals for selecting robust programs and optimizing adversarial tests without external rewards or ground-truth code.

axioms (1)

domain assumption Execution outcomes alone suffice as supervision for both program selection and adversarial test optimization in LLM fine-tuning.
Stated in abstract as the basis for the self-evolving loop without ground-truth code.

pith-pipeline@v0.9.0 · 5757 in / 1256 out tokens · 32177 ms · 2026-05-22T10:11:09.391423+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A single LLM alternates between generating candidate programs and producing adversarial unit test inputs optimized to induce execution-level failures... Supervision is derived solely from execution outcomes
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

adversarial tests are optimized via Kahneman–Tversky Optimization using execution-derived preferences

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 8 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language 8 Published as a Workshop Paper at RSI, ICLR 2026 models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Uagent: Adversarial co-evolution for targeted bug revelation in unit testing

Yulong Fu and Yingtong Zhang. Uagent: Adversarial co-evolution for targeted bug revelation in unit testing. InProceedings of the 2025 Workshop on Recent Advances in Resilient and Trustworthy MAchine learning-driveN systems, pp. 51–56,

work page 2025
[5]

Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications.arXiv preprint arXiv:2503.01245,

Nam Huynh and Beiyu Lin. Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications.arXiv preprint arXiv:2503.01245,

work page arXiv
[6]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Reveal: Self-evolving code agents via iterative generation-verification.arXiv preprint arXiv:2506.11442,

Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, and Jing Bai. Reveal: Self-evolving code agents via iterative generation-verification.arXiv preprint arXiv:2506.11442,

work page arXiv
[8]

Atgen: Adversarial reinforcement learning for test case generation.arXiv preprint arXiv:2510.14635,

Qingyao Li, Xinyi Dai, Weiwen Liu, Xiangyang Li, Yasheng Wang, Ruiming Tang, Yong Yu, and Weinan Zhang. Atgen: Adversarial reinforcement learning for test case generation.arXiv preprint arXiv:2510.14635,

work page arXiv
[9]

Learning to solve and verify: A self-play framework for code and test generation.arXiv preprint arXiv:2502.14948,

Zi Lin, Sheng Shen, Jingbo Shang, Jason Weston, and Yixin Nie. Learning to solve and verify: A self-play framework for code and test generation.arXiv preprint arXiv:2502.14948,

work page arXiv
[10]

Dstc: Direct preference learning with only self-generated tests and code to improve code lms.arXiv preprint arXiv:2411.13611,

Zhihan Liu, Shenao Zhang, Yongfei Liu, Boyi Liu, Yingxiang Yang, and Zhaoran Wang. Dstc: Direct preference learning with only self-generated tests and code to improve code lms.arXiv preprint arXiv:2411.13611,

work page arXiv
[11]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

arXiv preprint arXiv:2404.14387 , year=

9 Published as a Workshop Paper at RSI, ICLR 2026 Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387,

work page arXiv 2026
[13]

Co-evolving llm coder and unit tester via reinforcement learning.arXiv preprint arXiv:2506.03136,

Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. Co-evolving llm coder and unit tester via reinforcement learning.arXiv preprint arXiv:2506.03136,

work page arXiv
[14]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Self-rewarding correction for mathematical reasoning.arXiv preprint arXiv:2502.19613,

Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. Self-rewarding correction for mathematical reasoning.arXiv preprint arXiv:2502.19613,

work page arXiv
[16]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Unless otherwise specified, all hyperparameters are fixed across experiments

A APPENDIX A.1 TRAININGDETAIL We adopt parameter-efficient fine-tuning for both solver and adversary optimization using LoRA. Unless otherwise specified, all hyperparameters are fixed across experiments. LoRA Setup.We apply LoRA to the backbone model with rankr= 64for all fine-tuning stages. We use separate LoRA adapters for the solver and adversary on to...

work page 2026

[1] [1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language 8 Published as a Workshop Paper at RSI, ICLR 2026 models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Uagent: Adversarial co-evolution for targeted bug revelation in unit testing

Yulong Fu and Yingtong Zhang. Uagent: Adversarial co-evolution for targeted bug revelation in unit testing. InProceedings of the 2025 Workshop on Recent Advances in Resilient and Trustworthy MAchine learning-driveN systems, pp. 51–56,

work page 2025

[5] [5]

Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications.arXiv preprint arXiv:2503.01245,

Nam Huynh and Beiyu Lin. Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications.arXiv preprint arXiv:2503.01245,

work page arXiv

[6] [6]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Reveal: Self-evolving code agents via iterative generation-verification.arXiv preprint arXiv:2506.11442,

Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, and Jing Bai. Reveal: Self-evolving code agents via iterative generation-verification.arXiv preprint arXiv:2506.11442,

work page arXiv

[8] [8]

Atgen: Adversarial reinforcement learning for test case generation.arXiv preprint arXiv:2510.14635,

Qingyao Li, Xinyi Dai, Weiwen Liu, Xiangyang Li, Yasheng Wang, Ruiming Tang, Yong Yu, and Weinan Zhang. Atgen: Adversarial reinforcement learning for test case generation.arXiv preprint arXiv:2510.14635,

work page arXiv

[9] [9]

Learning to solve and verify: A self-play framework for code and test generation.arXiv preprint arXiv:2502.14948,

Zi Lin, Sheng Shen, Jingbo Shang, Jason Weston, and Yixin Nie. Learning to solve and verify: A self-play framework for code and test generation.arXiv preprint arXiv:2502.14948,

work page arXiv

[10] [10]

Dstc: Direct preference learning with only self-generated tests and code to improve code lms.arXiv preprint arXiv:2411.13611,

Zhihan Liu, Shenao Zhang, Yongfei Liu, Boyi Liu, Yingxiang Yang, and Zhaoran Wang. Dstc: Direct preference learning with only self-generated tests and code to improve code lms.arXiv preprint arXiv:2411.13611,

work page arXiv

[11] [11]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

arXiv preprint arXiv:2404.14387 , year=

9 Published as a Workshop Paper at RSI, ICLR 2026 Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387,

work page arXiv 2026

[13] [13]

Co-evolving llm coder and unit tester via reinforcement learning.arXiv preprint arXiv:2506.03136,

Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. Co-evolving llm coder and unit tester via reinforcement learning.arXiv preprint arXiv:2506.03136,

work page arXiv

[14] [14]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Self-rewarding correction for mathematical reasoning.arXiv preprint arXiv:2502.19613,

Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. Self-rewarding correction for mathematical reasoning.arXiv preprint arXiv:2502.19613,

work page arXiv

[16] [16]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Unless otherwise specified, all hyperparameters are fixed across experiments

A APPENDIX A.1 TRAININGDETAIL We adopt parameter-efficient fine-tuning for both solver and adversary optimization using LoRA. Unless otherwise specified, all hyperparameters are fixed across experiments. LoRA Setup.We apply LoRA to the backbone model with rankr= 64for all fine-tuning stages. We use separate LoRA adapters for the solver and adversary on to...

work page 2026