pith. machine review for the scientific record.

arxiv: 2604.16322 · v1 · submitted 2026-02-27 · 💻 cs.SE · cs.AI · cs.PL

Recognition: 1 theorem link

· Lean Theorem

Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 17:56 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.PL
keywords instruction following · coding data synthesis · schema co-evolution · MCTS sampler · LLM performance · parametric schemas · IFCodeBench

The pith

Parametric schemas co-evolve with an actor model to synthesize large-scale instruction-following coding data that lifts open models to match proprietary SOTA performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework called IFCodeEvolve that represents instructions as parametric function schemas to build a library covering diverse coding tasks through dynamic constraint instantiation. Monte Carlo Tree Search then navigates this space using feedback from the actor model as a termination signal, while a co-evolution loop iteratively mutates and composes schemas based on sampler statistics to generate progressively harder data. This produces training data that improves LLM instruction following for code, with a 32B model reaching parity with closed-source leaders, and includes the IFCodeBench benchmark for verification. A reader would care because manually ensuring logical compatibility across multiple constraints in coding instructions is intractable at scale, limiting automated programming systems.

Core claim

IFCodeEvolve constructs a schema library that covers the instruction space via parametric function schemas and dynamic constraint instantiation. An MCTS sampler navigates this space with actor model feedback serving as the dynamic termination signal. A co-evolving paradigm then advances both the actor and the schema library through composition and mutation driven by sampler statistics, yielding data that significantly boosts base model performance on instruction-following coding tasks to the point that a 32B model achieves parity with proprietary state-of-the-art systems.
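The sampling loop in this claim can be caricatured in a few lines. This is an illustrative toy, not the paper's system: the real sampler is MCTS over a schema library with an LLM actor, whereas here `actor_success_rate`, the constraint pool, and the threshold are all invented stand-ins. Only the control-flow idea, actor feedback serving as the dynamic termination signal, is taken from the abstract.

```python
import random

# Hypothetical pool of atomic constraints (invented for illustration).
POOL = ["use a list comprehension", "avoid for loops",
        "import the math library", "use a set", "use snake_case names"]

def actor_success_rate(constraints, rng):
    # Stub standing in for real actor-model feedback: longer constraint
    # sets are assumed to succeed less often.
    return max(0.0, 1.0 - 0.25 * len(constraints) + rng.uniform(-0.05, 0.05))

def sample_instruction(rng, threshold=0.3):
    """Greedily grow a constraint set; stop when the actor struggles."""
    constraints = []
    for candidate in rng.sample(POOL, len(POOL)):
        trial = constraints + [candidate]
        if actor_success_rate(trial, rng) < threshold:
            break  # actor feedback acts as the dynamic termination signal
        constraints = trial
    return constraints

print(sample_instruction(random.Random(7)))
```

The greedy walk here is a stand-in for tree search; the point is that difficulty is calibrated online by the actor rather than fixed in advance.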

What carries the argument

The actor-parametric schema co-evolution framework, where instructions are encoded as parametric function schemas allowing dynamic constraint instantiation, navigated by MCTS with actor feedback and iteratively mutated for harder problems.
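A minimal sketch of what "parametric function schemas with dynamic constraint instantiation" could look like. The class and field names below are hypothetical, since the abstract does not specify the paper's schema format; the template string echoes the constraint style visible in the paper's examples.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class ConstraintSchema:
    """A parametric instruction template with slots filled at sampling time."""
    template: str   # e.g. "Include exactly {n} {construct} in your code."
    params: dict    # parameter name -> list of allowed values

    def instantiate(self, rng: random.Random) -> str:
        """Dynamic constraint instantiation: pick a value for each slot."""
        filled = {name: rng.choice(values) for name, values in self.params.items()}
        return self.template.format(**filled)

schema = ConstraintSchema(
    template="Include exactly {n} {construct} in your code.",
    params={"n": [1, 2], "construct": ["while loop", "list comprehension"]},
)

rng = random.Random(0)
instruction = schema.instantiate(rng)
print(instruction)
```

One schema thus covers a family of concrete instructions, which is what lets a finite library span a large instruction space.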

Load-bearing premise

That representing instructions as parametric function schemas with dynamic constraint instantiation produces logically compatible combinations of multiple constraints without introducing inconsistencies or biases.

What would settle it

Train two otherwise identical base models, one on data generated by this method and one on standard coding data, then compare instruction-following accuracy on held-out multi-constraint problems. The premise would fail if the synthesized-data model shows no gain, or if its generated code contains logical errors traceable to incompatible constraints.

read the original abstract

Interpreting and following human instructions is a critical capability of large language models (LLMs) in automatic programming. However, synthesizing large-scale instruction-paired coding data remains largely unexplored and is particularly challenging when ensuring logical compatibility among multiple constraints. In this study, we propose IFCodeEvolve, an actor-schema co-evolution framework for instruction following coding data generation. By representing instructions as parametric function schema, we construct a library that covers the vast instruction space via dynamic constraint instantiation. Building upon this, Monte Carlo Tree Search (MCTS) sampler is applied to efficiently navigate this space, utilizing actor model feedback as a dynamic termination signal. Furthermore, to progressively explore challenging problems, we introduce a co-evolving paradigm that iteratively advances both the actor model and the schema library, via schema composition and mutation, based on sampler statistics. Empirical results demonstrate that IFCodeEvolve significantly boosts base model performance, with our 32B model achieving parity with proprietary SOTA models. Additionally, we contribute IFCodeBench, a comprehensive human-verified benchmark equipped with solutions and robust AST-based verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes IFCodeEvolve, an actor-schema co-evolution framework for synthesizing large-scale instruction-following coding data. Instructions are represented as parametric function schemas with dynamic constraint instantiation to cover the instruction space; MCTS sampling uses actor-model feedback as a termination signal, and a co-evolution loop iteratively refines both the actor and the schema library via composition and mutation based on sampler statistics. The central empirical claim is that fine-tuning on the resulting data significantly boosts base-model performance, with the authors' 32B model reaching parity with proprietary SOTA models; the paper also contributes the human-verified IFCodeBench benchmark with AST-based verification.

Significance. If the performance claims are substantiated, the work would be significant for the field of LLM-based code generation: it directly tackles the open problem of scalable, logically consistent instruction-paired data synthesis and demonstrates that open 32B models can match closed SOTA systems on instruction following. The introduction of a verified benchmark with robust AST checking would also provide a reusable resource for future research.

major comments (2)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: the claim that the 32B model achieves parity with proprietary SOTA models is presented without any description of the experimental setup, baselines, evaluation metrics, number of runs, or error bars. This information is load-bearing for the central performance claim and must be supplied before the result can be assessed.
  2. [Method] Method section (schema co-evolution and dynamic constraint instantiation): the construction does not include an explicit check or proof that simultaneously instantiated constraints remain satisfiable. The termination signal comes only from downstream actor utility; without an upstream consistency verifier, it is possible that a non-negligible fraction of the synthetic pairs contain incompatible constraints, which would undermine the attribution of gains to the steerable synthesis method.
minor comments (1)
  1. [Abstract] The abstract refers to 'IFCodeBench' but provides no citation or pointer to its release location or exact composition; this should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your valuable feedback on our manuscript. We address the major comments point by point below, agreeing that additional details and checks are warranted, and will update the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the claim that the 32B model achieves parity with proprietary SOTA models is presented without any description of the experimental setup, baselines, evaluation metrics, number of runs, or error bars. This information is load-bearing for the central performance claim and must be supplied before the result can be assessed.

    Authors: We fully agree with this observation. The current manuscript provides insufficient detail on the experimental protocol supporting the performance claims. In the revised version, we will include a dedicated subsection in the Experimental Results that describes the full setup: the models compared (including specific proprietary SOTA systems), the evaluation metrics used on IFCodeBench, the number of independent runs performed, and statistical measures such as standard deviations or error bars. This will allow proper assessment of the claim that the 32B model reaches parity with closed-source SOTA. revision: yes

  2. Referee: [Method] Method section (schema co-evolution and dynamic constraint instantiation): the construction does not include an explicit check or proof that simultaneously instantiated constraints remain satisfiable. The termination signal comes only from downstream actor utility; without an upstream consistency verifier, it is possible that a non-negligible fraction of the synthetic pairs contain incompatible constraints, which would undermine the attribution of gains to the steerable synthesis method.

    Authors: This is a valid concern. Although the actor feedback in MCTS serves as a practical filter by assigning low utility to unsatisfiable or inconsistent cases, an explicit upstream verification would enhance reliability. We will revise the Method section to incorporate a consistency verification step during dynamic constraint instantiation. This verifier will check for logical compatibility of simultaneously instantiated constraints using a simple satisfiability solver or rule-based checks before proceeding with sampling. We believe this addition will strengthen the attribution of performance gains to the proposed synthesis method. revision: yes
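A rule-based consistency check of the kind this response proposes might look like the following sketch. The encoding (requires/forbids feature sets plus an implication table) is an assumption for illustration, not the paper's actual verifier; the conflicting pair is the paper's own example ("exactly 1 list comprehension" vs. "no for loops").

```python
from dataclasses import dataclass, field

# Implications between code features, e.g. a list comprehension
# necessarily contains a `for` clause. Table invented for illustration.
IMPLIES = {"list_comprehension": {"for_keyword"}, "for_loop": {"for_keyword"}}

@dataclass
class Constraint:
    text: str
    requires: set = field(default_factory=set)
    forbids: set = field(default_factory=set)

def closure(features: set) -> set:
    """Expand a feature set with everything it implies."""
    out = set(features)
    for f in features:
        out |= IMPLIES.get(f, set())
    return out

def compatible(constraints: list) -> bool:
    """A constraint set is consistent if nothing required is also forbidden."""
    required = set().union(*(closure(c.requires) for c in constraints))
    forbidden = set().union(*(c.forbids for c in constraints))
    return not (required & forbidden)

a = Constraint("Your code must utilize exactly 1 list comprehension.",
               requires={"list_comprehension"})
b = Constraint("Your code must not use any for loops.",
               forbids={"for_keyword"})
print(compatible([a, b]))  # → False
```

Such a check is cheap enough to run upstream of sampling, before any actor rollout is spent on an unsatisfiable instruction.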

Circularity Check

0 steps flagged

No circularity in empirical synthesis and evaluation pipeline

full rationale

The paper describes an empirical data-generation pipeline (parametric schemas + MCTS + actor feedback + co-evolution) whose performance claims rest on downstream fine-tuning results and a separately contributed human-verified benchmark (IFCodeBench) with AST verification. No equations, predictions, or uniqueness claims are shown to reduce by construction to fitted inputs, self-citations, or ansatzes imported from prior author work. The central result is therefore an observed empirical outcome rather than a self-referential derivation.
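The AST-based verification the benchmark claim rests on can be illustrated with Python's `ast` module. The specific constraint checked below, no calls to the built-in `max`, appears among the paper's example instructions; the checker itself is a sketch, not IFCodeBench's actual harness.

```python
import ast

def calls_builtin_max(source: str) -> bool:
    """Return True if the code calls a bare name `max(...)` anywhere."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "max"):
            return True
    return False

ok_code = (
    "def largest(xs):\n"
    "    best = xs[0]\n"
    "    for x in xs:\n"
    "        if x > best:\n"
    "            best = x\n"
    "    return best\n"
)
bad_code = "def largest(xs):\n    return max(xs)\n"

print(calls_builtin_max(ok_code), calls_builtin_max(bad_code))  # → False True
```

Static checks of this shape need no runtime execution, which is what makes large-scale automated verification of structural constraints tractable.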

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities with independent evidence are detailed. The 'parametric function schema' and 'actor model' appear as core constructs but lack specification of their fitting procedures or underlying assumptions.

pith-pipeline@v0.9.0 · 5498 in / 1087 out tokens · 44719 ms · 2026-05-15T17:56:30.893420+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Scaling instruction-finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

  4. [4]

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms

    Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–...

  5. [5]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. arXiv preprint arXiv:2406.13542, 2024

  6. [6]

    Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861, 2023

  7. [7]

    Accelerating scientific discovery with autonomous goal-evolving agents

    Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, et al. Accelerating scientific discovery with autonomous goal-evolving agents. arXiv preprint arXiv:2512.21782, 2025

  8. [8]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 2025

  9. [9]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    Automated Design of Agentic Systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024

  12. [12]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  13. [13]

    Followbench: A multi-level fine-grained constraints following benchmark for large language models

    Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, and Wei Wang. Followbench: A multi-level fine-grained constraints following benchmark for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4667–4688, 2024

  14. [14]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  15. [15]

    Highly accurate protein structure prediction with alphafold

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021

  16. [16]

    Competition-level code generation with alphacode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022

  17. [17]

    Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents

    Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents. arXiv preprint arXiv:2508.02085, 2025

  18. [18]

    A comprehensive survey on instruction following

    Renze Lou, Kai Zhang, and Wenpeng Yin. A comprehensive survey on instruction following. arXiv preprint arXiv:2303.10475, 1, 2023

  19. [19]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  20. [20]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜ u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  21. [21]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  22. [22]

    Generalizing verifiable instruction following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. arXiv preprint arXiv:2507.02833, 2025

  23. [23]

    Seed-coder: Let the code model curate data for itself

    ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, et al. Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524, 2025

  24. [24]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  25. [25]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

  26. [26]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  27. [27]

    Self-instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  28. [28]

    Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms

    Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms. arXiv preprint arXiv:2504.14655, 2025

  29. [29]

    Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data

    Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data. arXiv preprint arXiv:2405.14333, 2024

  30. [30]

    Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving

    Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Ming Ding. Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32588–32599, 2025

  31. [31]

    Wizardlm: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, 2024

  32. [32]

    Codeif: Benchmarking the instruction-following capabilities of large language models for code generation

    Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Jingyi Xu, Yaonan Gu, and Zhoujun Li. Codeif: Benchmarking the instruction-following capabilities of large language models for code generation. arXiv preprint arXiv:2502.19166, 2025

  33. [33]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  34. [34]

    Ifevalcode: Controlled code generation

    Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, et al. Ifevalcode: Controlled code generation. arXiv preprint arXiv:2507.22462, 2025

  35. [35]

    Optimizing generative ai by backpropagating language model feedback

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609–616, 2025

  36. [36]

    Multi-swe-bench: A multilingual benchmark for issue resolving

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving. arXiv preprint arXiv:2504.02605, 2025

  37. [37]

    Star: Self-taught reasoner bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In Proc. of the 36th International Conference on Neural Information Processing Systems, volume 1126, 2024

  38. [38]

    A study of smoothing methods for language models applied to ad hoc information retrieval

    Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In ACM SIGIR Forum, volume 51, pages 268–276. ACM, New York, NY, USA, 2017

  39. [39]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024

  40. [40]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

  41. [41]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  42. [43]

    It is forbidden to use the built-in ’max’ function in your code

  43. [44]

    The code must make use of list comprehension. # Example Input and Output Example Input 1: [(’Juan Whelan’,90),(’Sabah Colley’,88),(’Peter Nichols’,7),(’Juan Whelan’,122),(’Sabah Colley’,84)] Example Output 1: (’Juan Whelan’, 212) Example Input 2: [(’Juan Whelan’,10),(’Sabah Colley’,20),(’Peter Nichols’,30),(’Juan Whelan’,40),(’Sabah Colley’,50)] Example O...

  44. [45]

    Develop the solution using Python

  45. [46]

    Define a variable called found and set its initial value to False in your code

  46. [47]

    Utilize the set data structure in your implementation

  47. [48]

    Include a switch (or match/case) statement within the code

  48. [49]

    Make sure to import the math library in your code. # Example Input and Output Example 1: Input: 25 Output: True (since 3²+ 4²= 25) Example 2: Input: 24 Output: False # Function Signature def sum_Square(n): pass # Solution import math def sum_Square(n): found = False squares = set() i = 1 while i <= math.isqrt(n): square = i * i squares.add(square) remaind...

  49. [51]

    Ensure every variable name in your code adheres to the snake_case naming convention

  50. [52]

    Do not import the math library in your code

  51. [53]

    Make use of list comprehension in your implementation

  52. [54]

    Avoid using any for loops in your code

  53. [55]

    Incorporate the set data structure in your solution

  54. [56]

    Include exactly one while loop in your code. # Example Input and Output Example Input 1: 12 Example Output 1: True Example Input 2: 15 Example Output 2: False # Function Signature def check_abundant(n): pass # Solution def check_abundant(n): divisors = set() i = 1 while i * i <= n: if n % i == 0: divisors.add(i) if i != n // i: divisors.add(n // i) i += 1...

  55. [57]

    Implement the solution using Python

  56. [58]

    The code must include exactly one list comprehension and one while loop

  57. [59]

    Ensure your code does not define a variable with the name "index"

  58. [60]

    It is mandatory to incorporate a switch (or match/case) statement in the code.

  59. [61]

    The solution must make use of a generator expression

  60. [62]

    You must employ the built-in ‘collections.Counter‘ function

  61. [63]

    A global variable named ‘palindrome_check_enabled‘ must be defined in the code

  62. [64]

    Full type annotations (type hints) are required for every function and variable

  63. [65]

    The built-in ‘sum‘ function is strictly prohibited from use in the solution

    The built-in ‘sum‘ function is strictly prohibited from use in the solution. # Example Input and Output Example 1: Input: s = "code" Output: false Example 2: Input: s = "aab" Output: true Example 3: Input: s = "carerac" Output: true # Function Signature def canPermutePalindrome(s: str) -> bool: pass # Solution from collections import Counter from typing i...

  64. [66]

    ‘<question>’: The original problem description, existing instructions, the reference solution (Code), programming language, and language

  65. [67]

    Each mutation is a template for a new instruction and a list of ‘<params>’ required to instantiate that template

    ‘<Mutations>’: A list of ‘<Mutation>’ tags. Each mutation is a template for a new instruction and a list of ‘<params>’ required to instantiate that template. # Workflow (Step-by-Step) Before generating the final XML output, you must perform the following reasoning steps inside a ‘<thought>’ tag:

  66. [68]

    Analyze the Original Code: - Understand the algorithm, complexity, and existing instructions of the seed code

  67. [69]

    Strategic Parameter Selection and Applicability Check: - Iterate through **each** provided ‘<Mutation>’ in the list. - For each candidate, evaluate two criteria: - **Compatibility**: Does this mutation make sense for this problem and the current instructions? - **Challenge Level**: How much does this force a refactor? 26 - **Selection Strategy**: - Discar...

  68. [70]

    - If ‘<Mutation>/<params>’ is empty, skip parameter selection and proceed directly to conflict detection

    Parameter Instantiation: - For the selected mutation, look at its ‘<params>’. - If ‘<Mutation>/<params>’ is empty, skip parameter selection and proceed directly to conflict detection. The instruction text is fixed. - If ‘<Mutation>/<params>’ is not empty: - **Maximize Challenge**: Choose parameter options that contradict the *current* implementation (e.g....

  69. [71]

    - Such as ’Your code must utilize exactly 1 list comprehension.’ and ’Your code must not use any for loops.’, which are incompatible with each other

    Conflict Detection: - Ensure the selected mutation and its parameters do not contradict the original ‘<instruction>’. - Such as ’Your code must utilize exactly 1 list comprehension.’ and ’Your code must not use any for loops.’, which are incompatible with each other

  70. [72]

    - The modified code must produce the exact same output for the same inputs as the original code

    Refactor Code (Only if Successful): - If successful, rewrite the reference code to **strictly adhere** to the new instantiated constraint. - The modified code must produce the exact same output for the same inputs as the original code

  71. [73]

    - If ‘<success>false</success>’, **STOP** after closing the tag

    Synthesize Output: - Generate the ‘<success>’ tag first. - If ‘<success>false</success>’, **STOP** after closing the tag. Do not generate params or question. - If ‘<success>true</success>’, generate ‘<instantiated_params>’ and the modified ‘<question>’. - If the input ‘<Mutation>/<params>’ was empty, the tag <instantiated_params> must also be empty (i.e.,...

  72. [74]

    - Decision: Selected Mutation ID [X] because

    Evaluation: - Mutation ID 1: [Compatibility: Yes/No] | [Challenge: Low/Med/High] | [Reasoning] - Mutation ID 2: ... - Decision: Selected Mutation ID [X] because

  73. [75]

    Parameter Selection: [Reasoning for chosen params]

  74. [76]

    Sum of list

    Refactoring Strategy: [How the code will change] ]]> </thought> <success>[true/false]</success> <instantiated_params> <param> <name>[Parameter Name, e.g., variable_name]</name> <value>[Selected Value, e.g., total_score]</value> </param> </in...

  75. [77]

    - You must explicitly justify this subsumption

    Subsumption - Any program that satisfies the EVOLVED instruction MUST also satisfy the ORIGINAL instruction. - You must explicitly justify this subsumption. - The evolved instruction MUST introduce at most ONE new AST-level constraint beyond those already enforced by the original instruction. - The evolved instruction shouldn’t be overly complicated

  76. [78]

    - These failures must arise from a GENERAL, STRUCTURAL, AST-level constraint

    Examples-as-Negative - ALL provided example programs MUST FAIL the EVOLVED instruction. - These failures must arise from a GENERAL, STRUCTURAL, AST-level constraint. - You MUST NOT reference specific identifiers, literals, or fingerprints unique to the examples

  77. [79]

    - No runtime execution, no I/O, no performance, no semantic reasoning

    AST-Checkability - The evolved instruction MUST be checkable using static AST analysis. - No runtime execution, no I/O, no performance, no semantic reasoning

  78. [80]

    readable

    Generalization - The evolved instruction MUST generalize beyond the given examples. - You are FORBIDDEN from writing constraints that merely exclude the examples without structural meaning. You MUST NOT: - Mention or encode specific variable names, constants, or literals that appear only in the examples. - Refer to line counts, whitespace, formatting, or ...

  79. [81]

    Use synonyms, change sentence structures, and vary the tone to express the exact same requirement

    **Rephrase**: This is your main tool. Use synonyms, change sentence structures, and vary the tone to express the exact same requirement

  80. [82]

    **Combine**: Merge two or multiple instructions into a single instruction

Showing first 80 references.