pith. machine review for the scientific record.

arxiv: 2604.11188 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

Recognition: unknown

MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords mathematical reasoning · data synthesis · constraint graphs · adversarial evolution · fine-tuning · out-of-distribution generalization · benchmark evaluation

The pith

Adversarial evolution of constraint graphs synthesizes math reasoning data that lets 1K fine-tuning samples beat standard datasets on eight benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats mathematical reasoning data synthesis as an unsupervised optimization problem over constraint graphs rather than as direct text generation or simple seed mutation. One component, the Legislator, adversarially evolves structured logical blueprints while the other, the Executor, converts those blueprints into varied natural-language problems. This separation lets the process prioritize complex logical structures over linguistic variety. Experiments show that fine-tuning models on only one thousand samples produced this way yields stronger results than fine-tuning on established datasets of similar size, with better generalization to new problems.

Core claim

Formulating data synthesis as adversarial optimization over constraint graphs in a Legislator-Executor setup produces training examples with higher logical complexity and diversity than prior mutation or prompting methods, so that models fine-tuned on one thousand such examples outperform models trained on LIMO or s1K across eight mathematical reasoning benchmarks while showing improved out-of-distribution performance.

What carries the argument

The Legislator-Executor paradigm: the Legislator adversarially evolves constraint graphs that serve as generation blueprints encoding problem constraints, and the Executor instantiates those graphs into natural-language scenarios.
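The division of labor can be sketched as a toy loop. This is an illustrative assumption, not the paper's implementation: the real Proposer, Critic, and Moderator are LLM agents, whereas here they are simple heuristics, and the constraint graph is modeled as a dict mapping each variable to the variables it depends on.

```python
import random

def propose(graph, rng):
    """Proposer (toy): mutate the graph by adding one new dependent variable."""
    new_var = f"x{len(graph)}"
    parents = rng.sample(sorted(graph), k=min(2, len(graph)))
    mutated = {k: list(v) for k, v in graph.items()}
    mutated[new_var] = parents
    return mutated

def critique(graph):
    """Critic (toy): score structural complexity as nodes plus edges."""
    return len(graph) + sum(len(v) for v in graph.values())

def evolve(seed_graph, steps=10, rng=None):
    """Moderator (toy): keep a proposal only if it raises the complexity score."""
    rng = rng or random.Random(0)
    graph, best = seed_graph, critique(seed_graph)
    for _ in range(steps):
        candidate = propose(graph, rng)
        score = critique(candidate)
        if score > best:
            graph, best = candidate, score
    return graph

def execute(graph):
    """Executor (toy): ground the blueprint into a templated problem statement."""
    lines = [f"Let {v} be determined by {', '.join(ps) or 'a free choice'}."
             for v, ps in graph.items()]
    return " ".join(lines) + " Find the value of the last variable."

blueprint = evolve({"x0": []}, steps=8)  # evolved constraint graph
problem = execute(blueprint)             # natural-language instantiation
```

The point of the separation is visible even in the sketch: `evolve` only ever reasons about graph structure, while `execute` only handles surface realization, so structural complexity can be optimized without entangling it with linguistic variety.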

If this is right

  • Smaller synthesized datasets can replace or exceed larger human-curated or mutated ones for mathematical fine-tuning.
  • Data synthesis can be reframed as optimization over logical constraint structures instead of direct text generation.
  • The resulting models exhibit stronger generalization on unseen mathematical problems compared with baselines.
  • The approach scales across multiple model families including Qwen, Llama, Mistral, and Gemma.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-evolution loop could be tested on code or scientific reasoning tasks to check whether logical structure quality transfers.
  • If constraint graphs capture the essential reasoning skeleton, they may reduce the volume of human-annotated data needed for training capable reasoners.
  • Running the synthesis at even smaller scales, such as a few hundred samples, would test the lower bound on data volume required for strong benchmark gains.

Load-bearing premise

Adversarially evolving constraint graphs without human priors will reliably produce more complex and diverse logical structures than seed mutation or prompt engineering.

What would settle it

Fine-tuning the same models on one thousand samples from this method and finding no improvement over LIMO or s1K scores on the eight benchmarks, especially on out-of-distribution tests.

Figures

Figures reproduced from arXiv: 2604.11188 by Bohan Li, Guhan Chen, Jiansheng Wei, Jun Rao, Min Zhang, Songtao Tian, Xiaojun Meng, Zixiong Yu.

Figure 1
Figure 1. The MathAgent framework. The framework consists of two decoupled phases: (1) Meta-Level Structural Evolution, where a tri-agent Legislator system (Proposer, Critic, and Moderator) iteratively optimizes a Constraint Graph G based on Style Tokens S; and (2) Base-Level Semantic Instantiation, where the Executor grounds the optimized structural blueprint into natural language problems Q and reasoning chains A.
Figure 2
Figure 2. Quality and difficulty distributions. Quality and difficulty increase from left to right. Our method shows a significant advantage in generating high-quality, high-difficulty mathematical problems.
Figure 4
Figure 4. t-SNE visualization of knowledge points. The extensive coverage of the dark blue points (representing our method) demonstrates the significant diversity of the generated mathematical problems.
Figure 5
Figure 5. Performance scaling analysis. The x-axis is plotted on a logarithmic scale for clarity. While performance generally improves with increased data scale, our method maintains a consistent and significant performance advantage over the baselines.
Figure 6
Figure 6. Prompt templates. Simple prompt: "Question: {input} Answer: Let's think step by step." Complex prompt: a chat-format system/user/assistant template instructing the model to reason step by step and put its final answer within \boxed{}.
Figure 7
Figure 7. Average maximum similarity between training and test datasets. A lower similarity score indicates a reduced likelihood of data leakage. The figure shows that the synthetic data does not carry a higher risk of data leakage than LIMO or s1K.
Figure 8
Figure 8. Case study.
Figure 9
Figure 9. Prompt for the Proposer.
Figure 10
Figure 10. Prompt for the Critic.
Figure 11
Figure 11. Prompt for the Moderator.
Figure 12
Figure 12. Prompt for the Question Synthesizer.
Figure 13
Figure 13. Prompts for evaluating mathematical problem quality and difficulty.
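The leakage metric in Figure 7 (average maximum similarity between training and test sets) can be sketched as follows. The paper's actual similarity function is not specified on this page; `difflib.SequenceMatcher` is used here purely as a stand-in for whatever embedding- or n-gram-based measure the authors employ.

```python
from difflib import SequenceMatcher

def avg_max_similarity(train_set, test_set):
    """For each training item, take its maximum similarity to any test item,
    then average over the training set. Values near 1.0 suggest leakage."""
    def sim(a, b):
        return SequenceMatcher(None, a, b).ratio()
    per_item_max = [max(sim(tr, te) for te in test_set) for tr in train_set]
    return sum(per_item_max) / len(per_item_max)

# Toy data: one exact overlap, one unrelated item.
train = ["solve 2x + 3 = 7 for x", "a train leaves the station at noon"]
test = ["solve 2x + 3 = 7 for x", "integrate x^2 from 0 to 1"]
score = avg_max_similarity(train, test)
```

Because the metric takes a per-item maximum before averaging, a single memorized test problem in the training data pulls the score up even when the rest of the corpus is clean, which is what makes it a reasonable leakage screen.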
read the original abstract

Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes MathAgent, a hierarchical framework for synthesizing mathematical reasoning data without human priors. It formulates synthesis as unsupervised optimization over constraint graphs using a Legislator-Executor paradigm: the Legislator adversarially evolves structured blueprints encoding logical constraints, while the Executor instantiates them into natural language scenarios. The central empirical claim is that fine-tuning 10 models (Qwen, Llama, Mistral, Gemma series) on 1K synthesized samples outperforms widely-used datasets of similar scale (LIMO, s1K) across eight mathematical benchmarks and exhibits superior out-of-distribution generalization.

Significance. If the results hold after proper validation, the work could be significant for automated data synthesis in mathematical reasoning. By decoupling constraint-graph evolution from linguistic realization, it targets mode collapse and limited logical complexity in prior methods, potentially enabling more efficient, scalable generation of high-quality training data that improves LLM generalization on math tasks.

major comments (3)
  1. [Abstract] Abstract: the headline result (1K samples outperforming LIMO/s1K on eight benchmarks with better OOD generalization) is presented without methodological details, baseline comparisons, statistical tests, or error analysis, so the central claim cannot be evaluated from the given text.
  2. [Method] Method (Legislator-Executor paradigm): the claim that adversarial evolution of constraint graphs reliably yields higher structural complexity and diversity than seed mutation or prompt engineering lacks any quantitative metrics on graph properties (e.g., average constraint depth, number of interdependent variables, logical step count).
  3. [Experiments] Experiments: no ablations isolate the adversarial Legislator component from the overall framework, Executor instantiation, or synthesis-model strength; without them, performance gains cannot be attributed to the claimed mechanism rather than incidental factors such as topic coverage.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'notable results' is vague; specific deltas, benchmark names, and significance levels should be stated.
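The graph-level metrics requested in major comment 2 could be computed directly from the blueprints. A sketch, assuming a constraint graph is encoded as a dict mapping each variable to the variables it depends on (a DAG); the metric names follow the review's wording, and the paper may define them differently:

```python
def constraint_depth(graph):
    """Longest dependency chain ending at any node."""
    memo = {}
    def depth(v):
        if v not in memo:
            memo[v] = 1 + max((depth(p) for p in graph.get(v, [])), default=0)
        return memo[v]
    return max(depth(v) for v in graph)

def interdependent_variables(graph):
    """Count variables that both depend on something and are depended upon."""
    used = {p for ps in graph.values() for p in ps}
    return sum(1 for v in graph if graph.get(v) and v in used)

def logical_step_count(graph):
    """One derivation step per dependency edge."""
    return sum(len(ps) for ps in graph.values())

# Toy blueprint: a <- b <- c <- d, with c also depending directly on a.
g = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c"]}
stats = (constraint_depth(g), interdependent_variables(g), logical_step_count(g))
# → (4, 2, 4)
```

Reporting these three numbers for graphs produced by adversarial evolution versus seed mutation and prompting baselines would directly test the complexity claim without touching downstream fine-tuning at all.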

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline result (1K samples outperforming LIMO/s1K on eight benchmarks with better OOD generalization) is presented without methodological details, baseline comparisons, statistical tests, or error analysis, so the central claim cannot be evaluated from the given text.

    Authors: We agree the abstract is concise and omits supporting details due to typical length limits. Methodological elements of the Legislator-Executor paradigm appear in Section 3, while baseline comparisons, statistical tests (paired t-tests with p-values), and error analysis (standard deviations across runs) are in Section 5. We will revise the abstract to briefly describe the framework and direct readers to the experiments for full evaluation details. revision: yes

  2. Referee: [Method] Method (Legislator-Executor paradigm): the claim that adversarial evolution of constraint graphs reliably yields higher structural complexity and diversity than seed mutation or prompt engineering lacks any quantitative metrics on graph properties (e.g., average constraint depth, number of interdependent variables, logical step count).

    Authors: The manuscript supports the claim via downstream gains and qualitative examples, but direct quantitative graph metrics are indeed absent. We will add these in revision, reporting average constraint depth, number of interdependent variables, and logical step counts for our method versus seed mutation and prompt engineering baselines in a new analysis subsection. revision: yes

  3. Referee: [Experiments] Experiments: no ablations isolate the adversarial Legislator component from the overall framework, Executor instantiation, or synthesis-model strength; without them, performance gains cannot be attributed to the claimed mechanism rather than incidental factors such as topic coverage.

    Authors: We recognize that component ablations are needed to attribute gains specifically to the adversarial Legislator. Current results compare full methods but lack targeted ablations. We will add experiments in revision that disable adversarial evolution (e.g., non-adversarial or fixed graphs) while holding the Executor and topics constant, to isolate its role and control for coverage. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims rest on external benchmark comparisons

full rationale

The paper frames data synthesis as an unsupervised optimization over constraint graphs using a Legislator-Executor paradigm, then validates via fine-tuning 1K samples and measuring performance on eight external mathematical benchmarks against LIMO and s1K. No equations or steps reduce the claimed superiority to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The adversarial evolution is presented as a methodological choice whose efficacy is tested empirically rather than assumed by construction. The derivation chain remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that constraint graphs can encode sufficient logical complexity and that adversarial evolution improves diversity without introducing artifacts.

axioms (1)
  • domain assumption Adversarial evolution of constraint graphs produces logically complex and diverse structures superior to human priors or simple mutations.
    This is the core premise enabling the claim of better data quality but is not justified or evidenced in the abstract.

pith-pipeline@v0.9.0 · 5503 in / 1207 out tokens · 40858 ms · 2026-05-10T15:18:54.358896+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

    cs.AI 2026-05 unverdicted novelty 5.0

    SCM-GRPO grounds multi-hop fact verification in structural causal models and applies GRPO reinforcement learning to optimize reasoning chain length, outperforming baselines on HoVer and EX-FEVER.

  2. Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

    cs.AI 2026-05 unverdicted novelty 4.0

    The SCM-GRPO framework models multi-hop fact verification as causal inference and applies reinforcement learning to optimize reasoning depth, reporting outperformance on HoVer and EX-FEVER.

Reference graph

Works this paper leans on

6 extracted references · 4 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

     In The Twelfth International Conference on Learning Representations

     Alpagasus: Training a better alpaca with fewer data. In The Twelfth International Conference on Learning Representations. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas...

  2. [2]

     Evaluating Large Language Models Trained on Code

     Evaluating large language models trained on code. Preprint, arXiv:2107.03374. Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. 2025. Arrows of math reasoning data synthesis for large language models: Diversity, complexity and correctness. In Proceedings of the 34th ACM International Conference on Information ...

  3. [3]

     Training Verifiers to Solve Math Word Problems

     Training verifiers to solve math word problems. Preprint, arXiv:2110.14168. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 181 others. 2025a. Deepseek-r1: Incentivizing rea...

  4. [4]

     Mistral 7B

     Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio...

  5. [5]

     Gemma 2: Improving Open Language Models at a Practical Size

     IEEE. Jun Rao, Yunjie Liao, Xuebo Liu, Zepeng Lin, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, and Min Zhang. 2025a. Seapo: Strategic error amplification for robust preference optimization of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics. Jun Rao, Zepeng Lin, Xu...

  6. [6]

     Concept: Description

     datasets as high-quality baselines. These datasets are constructed through expert-designed pipelines and rigorous screening to ensure reasoning depth, representing the state-of-the-art in small-scale, curated reasoning data. For s1K, which offers two reasoning model versions based on Gemini (Google, 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025a) re...
    datasets as high-quality baselines. These datasets are constructed through expert-designed pipelines and rigorous screening to ensure reason- ing depth, representing the state-of-the-art in small- scale, curated reasoning data. For s1K, which of- fers two reasoning model versions based onGemini (Google, 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025a) re...