pith. machine review for the scientific record.

arxiv: 2604.16322 · v1 · submitted 2026-02-27 · 💻 cs.SE · cs.AI · cs.PL

Recognition: 1 theorem link

· Lean Theorem

Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 17:56 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.PL
keywords instruction following · coding data synthesis · schema co-evolution · MCTS sampler · LLM performance · parametric schemas · IFCodeBench

The pith

Parametric schemas co-evolve with an actor model to synthesize large-scale instruction-following coding data that lifts open models to match proprietary SOTA performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework called IFCodeEvolve that represents instructions as parametric function schemas to build a library covering diverse coding tasks through dynamic constraint instantiation. Monte Carlo Tree Search then navigates this space using feedback from the actor model as a termination signal, while a co-evolution loop iteratively mutates and composes schemas based on sampler statistics to generate progressively harder data. This produces training data that improves LLM instruction following for code, with a 32B model reaching parity with closed-source leaders, and includes the IFCodeBench benchmark for verification. A reader would care because manually ensuring logical compatibility across multiple constraints in coding instructions is intractable at scale, limiting automated programming systems.

Core claim

IFCodeEvolve constructs a schema library that covers the instruction space via parametric function schemas and dynamic constraint instantiation. An MCTS sampler navigates this space with actor model feedback serving as the dynamic termination signal. A co-evolving paradigm then advances both the actor and the schema library through composition and mutation driven by sampler statistics, yielding data that significantly boosts base model performance on instruction-following coding tasks to the point that a 32B model achieves parity with proprietary state-of-the-art systems.
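The sampling loop in this claim can be caricatured in a few lines. This is an illustrative toy, not the paper's system: the real sampler is MCTS over a schema library with an LLM actor, whereas here `actor_success_rate`, the constraint pool, and the threshold are all invented stand-ins. Only the control-flow idea, actor feedback serving as the dynamic termination signal, is taken from the abstract.

```python
import random

# Hypothetical pool of atomic constraints (invented for illustration).
POOL = ["use a list comprehension", "avoid for loops",
        "import the math library", "use a set", "use snake_case names"]

def actor_success_rate(constraints, rng):
    # Stub standing in for real actor-model feedback: longer constraint
    # sets are assumed to succeed less often.
    return max(0.0, 1.0 - 0.25 * len(constraints) + rng.uniform(-0.05, 0.05))

def sample_instruction(rng, threshold=0.3):
    """Greedily grow a constraint set; stop when the actor struggles."""
    constraints = []
    for candidate in rng.sample(POOL, len(POOL)):
        trial = constraints + [candidate]
        if actor_success_rate(trial, rng) < threshold:
            break  # actor feedback acts as the dynamic termination signal
        constraints = trial
    return constraints

print(sample_instruction(random.Random(7)))
```

The greedy walk here is a stand-in for tree search; the point is that difficulty is calibrated online by the actor rather than fixed in advance.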

What carries the argument

The actor-parametric schema co-evolution framework, where instructions are encoded as parametric function schemas allowing dynamic constraint instantiation, navigated by MCTS with actor feedback and iteratively mutated for harder problems.
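A minimal sketch of what "parametric function schemas with dynamic constraint instantiation" could look like. The class and field names below are hypothetical, since the abstract does not specify the paper's schema format; the template string echoes the constraint style visible in the paper's examples.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class ConstraintSchema:
    """A parametric instruction template with slots filled at sampling time."""
    template: str   # e.g. "Include exactly {n} {construct} in your code."
    params: dict    # parameter name -> list of allowed values

    def instantiate(self, rng: random.Random) -> str:
        """Dynamic constraint instantiation: pick a value for each slot."""
        filled = {name: rng.choice(values) for name, values in self.params.items()}
        return self.template.format(**filled)

schema = ConstraintSchema(
    template="Include exactly {n} {construct} in your code.",
    params={"n": [1, 2], "construct": ["while loop", "list comprehension"]},
)

rng = random.Random(0)
instruction = schema.instantiate(rng)
print(instruction)
```

One schema thus covers a family of concrete instructions, which is what lets a finite library span a large instruction space.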

Load-bearing premise

That representing instructions as parametric function schemas with dynamic constraint instantiation produces logically compatible combinations of multiple constraints without introducing inconsistencies or biases.

What would settle it

Train two otherwise identical base models, one on data generated by this method and one on standard coding data, then compare instruction-following accuracy on held-out multi-constraint problems. The premise would fail if the synthesized-data model shows no gain, or if its generated code contains logical errors traceable to incompatible constraints.

read the original abstract

Interpreting and following human instructions is a critical capability of large language models (LLMs) in automatic programming. However, synthesizing large-scale instruction-paired coding data remains largely unexplored and is particularly challenging when ensuring logical compatibility among multiple constraints. In this study, we propose IFCodeEvolve, an actor-schema co-evolution framework for instruction following coding data generation. By representing instructions as parametric function schema, we construct a library that covers the vast instruction space via dynamic constraint instantiation. Building upon this, Monte Carlo Tree Search (MCTS) sampler is applied to efficiently navigate this space, utilizing actor model feedback as a dynamic termination signal. Furthermore, to progressively explore challenging problems, we introduce a co-evolving paradigm that iteratively advances both the actor model and the schema library, via schema composition and mutation, based on sampler statistics. Empirical results demonstrate that IFCodeEvolve significantly boosts base model performance, with our 32B model achieving parity with proprietary SOTA models. Additionally, we contribute IFCodeBench, a comprehensive human-verified benchmark equipped with solutions and robust AST-based verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes IFCodeEvolve, an actor-schema co-evolution framework for synthesizing large-scale instruction-following coding data. Instructions are represented as parametric function schemas with dynamic constraint instantiation to cover the instruction space; MCTS sampling uses actor-model feedback as a termination signal, and a co-evolution loop iteratively refines both the actor and the schema library via composition and mutation based on sampler statistics. The central empirical claim is that fine-tuning on the resulting data significantly boosts base-model performance, with the authors' 32B model reaching parity with proprietary SOTA models; the paper also contributes the human-verified IFCodeBench benchmark with AST-based verification.

Significance. If the performance claims are substantiated, the work would be significant for the field of LLM-based code generation: it directly tackles the open problem of scalable, logically consistent instruction-paired data synthesis and demonstrates that open 32B models can match closed SOTA systems on instruction following. The introduction of a verified benchmark with robust AST checking would also provide a reusable resource for future research.

major comments (2)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: the claim that the 32B model achieves parity with proprietary SOTA models is presented without any description of the experimental setup, baselines, evaluation metrics, number of runs, or error bars. This information is load-bearing for the central performance claim and must be supplied before the result can be assessed.
  2. [Method] Method section (schema co-evolution and dynamic constraint instantiation): the construction does not include an explicit check or proof that simultaneously instantiated constraints remain satisfiable. The termination signal comes only from downstream actor utility; without an upstream consistency verifier, it is possible that a non-negligible fraction of the synthetic pairs contain incompatible constraints, which would undermine the attribution of gains to the steerable synthesis method.
minor comments (1)
  1. [Abstract] The abstract refers to 'IFCodeBench' but provides no citation or pointer to its release location or exact composition; this should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your valuable feedback on our manuscript. We address the major comments point by point below, agreeing that additional details and checks are warranted, and will update the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the claim that the 32B model achieves parity with proprietary SOTA models is presented without any description of the experimental setup, baselines, evaluation metrics, number of runs, or error bars. This information is load-bearing for the central performance claim and must be supplied before the result can be assessed.

    Authors: We fully agree with this observation. The current manuscript provides insufficient detail on the experimental protocol supporting the performance claims. In the revised version, we will include a dedicated subsection in the Experimental Results that describes the full setup: the models compared (including specific proprietary SOTA systems), the evaluation metrics used on IFCodeBench, the number of independent runs performed, and statistical measures such as standard deviations or error bars. This will allow proper assessment of the claim that the 32B model reaches parity with closed-source SOTA. revision: yes

  2. Referee: [Method] Method section (schema co-evolution and dynamic constraint instantiation): the construction does not include an explicit check or proof that simultaneously instantiated constraints remain satisfiable. The termination signal comes only from downstream actor utility; without an upstream consistency verifier, it is possible that a non-negligible fraction of the synthetic pairs contain incompatible constraints, which would undermine the attribution of gains to the steerable synthesis method.

    Authors: This is a valid concern. Although the actor feedback in MCTS serves as a practical filter by assigning low utility to unsatisfiable or inconsistent cases, an explicit upstream verification would enhance reliability. We will revise the Method section to incorporate a consistency verification step during dynamic constraint instantiation. This verifier will check for logical compatibility of simultaneously instantiated constraints using a simple satisfiability solver or rule-based checks before proceeding with sampling. We believe this addition will strengthen the attribution of performance gains to the proposed synthesis method. revision: yes
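A rule-based consistency check of the kind this response proposes might look like the following sketch. The encoding (requires/forbids feature sets plus an implication table) is an assumption for illustration, not the paper's actual verifier; the conflicting pair is the paper's own example ("exactly 1 list comprehension" vs. "no for loops").

```python
from dataclasses import dataclass, field

# Implications between code features, e.g. a list comprehension
# necessarily contains a `for` clause. Table invented for illustration.
IMPLIES = {"list_comprehension": {"for_keyword"}, "for_loop": {"for_keyword"}}

@dataclass
class Constraint:
    text: str
    requires: set = field(default_factory=set)
    forbids: set = field(default_factory=set)

def closure(features: set) -> set:
    """Expand a feature set with everything it implies."""
    out = set(features)
    for f in features:
        out |= IMPLIES.get(f, set())
    return out

def compatible(constraints: list) -> bool:
    """A constraint set is consistent if nothing required is also forbidden."""
    required = set().union(*(closure(c.requires) for c in constraints))
    forbidden = set().union(*(c.forbids for c in constraints))
    return not (required & forbidden)

a = Constraint("Your code must utilize exactly 1 list comprehension.",
               requires={"list_comprehension"})
b = Constraint("Your code must not use any for loops.",
               forbids={"for_keyword"})
print(compatible([a, b]))  # → False
```

Such a check is cheap enough to run upstream of sampling, before any actor rollout is spent on an unsatisfiable instruction.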

Circularity Check

0 steps flagged

No circularity in empirical synthesis and evaluation pipeline

full rationale

The paper describes an empirical data-generation pipeline (parametric schemas + MCTS + actor feedback + co-evolution) whose performance claims rest on downstream fine-tuning results and a separately contributed human-verified benchmark (IFCodeBench) with AST verification. No equations, predictions, or uniqueness claims are shown to reduce by construction to fitted inputs, self-citations, or ansatzes imported from prior author work. The central result is therefore an observed empirical outcome rather than a self-referential derivation.
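The AST-based verification the benchmark claim rests on can be illustrated with Python's `ast` module. The specific constraint checked below, no calls to the built-in `max`, appears among the paper's example instructions; the checker itself is a sketch, not IFCodeBench's actual harness.

```python
import ast

def calls_builtin_max(source: str) -> bool:
    """Return True if the code calls a bare name `max(...)` anywhere."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "max"):
            return True
    return False

ok_code = (
    "def largest(xs):\n"
    "    best = xs[0]\n"
    "    for x in xs:\n"
    "        if x > best:\n"
    "            best = x\n"
    "    return best\n"
)
bad_code = "def largest(xs):\n    return max(xs)\n"

print(calls_builtin_max(ok_code), calls_builtin_max(bad_code))  # → False True
```

Static checks of this shape need no runtime execution, which is what makes large-scale automated verification of structural constraints tractable.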

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities with independent evidence are detailed. The 'parametric function schema' and 'actor model' appear as core constructs but lack specification of their fitting procedures or underlying assumptions.

pith-pipeline@v0.9.0 · 5498 in / 1087 out tokens · 44719 ms · 2026-05-15T17:56:30.893420+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Scaling instruction-finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

  4. [4]

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms

    Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–...

  5. [5]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. arXiv preprint arXiv:2406.13542, 2024

  6. [6]

    Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861, 2023

  7. [7]

    Accelerating scientific discovery with autonomous goal-evolving agents

    Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, et al. Accelerating scientific discovery with autonomous goal-evolving agents. arXiv preprint arXiv:2512.21782, 2025

  8. [8]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 2025

  9. [9]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    Automated Design of Agentic Systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024

  12. [12]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  13. [13]

    Followbench: A multi-level fine-grained constraints following benchmark for large language models

    Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, and Wei Wang. Followbench: A multi-level fine-grained constraints following benchmark for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4667–4688, 2024

  14. [14]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  15. [15]

    Highly accurate protein structure prediction with alphafold

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021

  16. [16]

    Competition-level code generation with alphacode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022

  17. [17]

    Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents

    Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents. arXiv preprint arXiv:2508.02085, 2025

  18. [18]

    A comprehensive survey on instruction following

    Renze Lou, Kai Zhang, and Wenpeng Yin. A comprehensive survey on instruction following. arXiv preprint arXiv:2303.10475, 1, 2023

  19. [19]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  20. [20]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜ u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  21. [21]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  22. [22]

    Generalizing verifiable instruction following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. arXiv preprint arXiv:2507.02833, 2025

  23. [23]

    Seed-coder: Let the code model curate data for itself

    ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, et al. Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524, 2025

  24. [24]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  25. [25]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

  26. [26]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  27. [27]

    Self-instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  28. [28]

    Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms

    Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms. arXiv preprint arXiv:2504.14655, 2025

  29. [29]

    Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data

    Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data. arXiv preprint arXiv:2405.14333, 2024

  30. [30]

    Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving

    Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Ming Ding. Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32588–32599, 2025

  31. [31]

    Wizardlm: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, 2024

  32. [32]

    Codeif: Benchmarking the instruction-following capabilities of large language models for code generation

    Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Jingyi Xu, Yaonan Gu, and Zhoujun Li. Codeif: Benchmarking the instruction-following capabilities of large language models for code generation. arXiv preprint arXiv:2502.19166, 2025

  33. [33]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  34. [34]

    Ifevalcode: Controlled code generation

    Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, et al. Ifevalcode: Controlled code generation. arXiv preprint arXiv:2507.22462, 2025

  35. [35]

    Optimizing generative ai by backpropagating language model feedback

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609–616, 2025

  36. [36]

    Multi-swe-bench: A multilingual benchmark for issue resolving

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving. arXiv preprint arXiv:2504.02605, 2025

  37. [37]

    Star: Self-taught reasoner bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In Proc. of the 36th International Conference on Neural Information Processing Systems, volume 1126, 2024

  38. [38]

    A study of smoothing methods for language models applied to ad hoc information retrieval

    Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In ACM SIGIR Forum, volume 51, pages 268–276. ACM, New York, NY, USA, 2017

  39. [39]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024

  40. [40]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

  41. [41]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  42. [43]

    It is forbidden to use the built-in ’max’ function in your code

  43. [44]

    The code must make use of list comprehension. # Example Input and Output Example Input 1: [(’Juan Whelan’,90),(’Sabah Colley’,88),(’Peter Nichols’,7),(’Juan Whelan’,122),(’Sabah Colley’,84)] Example Output 1: (’Juan Whelan’, 212) Example Input 2: [(’Juan Whelan’,10),(’Sabah Colley’,20),(’Peter Nichols’,30),(’Juan Whelan’,40),(’Sabah Colley’,50)] Example O...

  44. [45]

    Develop the solution using Python

  45. [46]

    Define a variable called found and set its initial value to False in your code

  46. [47]

    Utilize the set data structure in your implementation

  47. [48]

    Include a switch (or match/case) statement within the code

  48. [49]

    Make sure to import the math library in your code. # Example Input and Output Example 1: Input: 25 Output: True (since 3²+ 4²= 25) Example 2: Input: 24 Output: False # Function Signature def sum_Square(n): pass # Solution import math def sum_Square(n): found = False squares = set() i = 1 while i <= math.isqrt(n): square = i * i squares.add(square) remaind...

  49. [51]

    Ensure every variable name in your code adheres to the snake_case naming convention

  50. [52]

    Do not import the math library in your code

  51. [53]

    Make use of list comprehension in your implementation

  52. [54]

    Avoid using any for loops in your code

  53. [55]

    Incorporate the set data structure in your solution

  54. [56]

    Include exactly one while loop in your code. # Example Input and Output Example Input 1: 12 Example Output 1: True Example Input 2: 15 Example Output 2: False # Function Signature def check_abundant(n): pass # Solution def check_abundant(n): divisors = set() i = 1 while i * i <= n: if n % i == 0: divisors.add(i) if i != n // i: divisors.add(n // i) i += 1...

  55. [57]

    Implement the solution using Python

  56. [58]

    The code must include exactly one list comprehension and one while loop

  57. [59]

    Ensure your code does not define a variable with the name "index"

  58. [60]

    It is mandatory to incorporate a switch (or match/case) statement in the code.

  59. [61]

    The solution must make use of a generator expression

  60. [62]

    You must employ the built-in ‘collections.Counter‘ function

  61. [63]

    A global variable named ‘palindrome_check_enabled‘ must be defined in the code

  62. [64]

    Full type annotations (type hints) are required for every function and variable

  63. [65]

    The built-in ‘sum‘ function is strictly prohibited from use in the solution

    The built-in ‘sum‘ function is strictly prohibited from use in the solution. # Example Input and Output Example 1: Input: s = "code" Output: false Example 2: Input: s = "aab" Output: true Example 3: Input: s = "carerac" Output: true # Function Signature def canPermutePalindrome(s: str) -> bool: pass # Solution from collections import Counter from typing i...

  64. [66]

    ‘<question>’: The original problem description, existing instructions, the reference solution (Code), programming language, and language

  65. [67]

    Each mutation is a template for a new instruction and a list of ‘<params>’ required to instantiate that template

    ‘<Mutations>’: A list of ‘<Mutation>’ tags. Each mutation is a template for a new instruction and a list of ‘<params>’ required to instantiate that template. # Workflow (Step-by-Step) Before generating the final XML output, you must perform the following reasoning steps inside a ‘<thought>’ tag:

  66. [68]

    Analyze the Original Code: - Understand the algorithm, complexity, and existing instructions of the seed code

  67. [69]

    Strategic Parameter Selection and Applicability Check: - Iterate through **each** provided ‘<Mutation>’ in the list. - For each candidate, evaluate two criteria: - **Compatibility**: Does this mutation make sense for this problem and the current instructions? - **Challenge Level**: How much does this force a refactor? 26 - **Selection Strategy**: - Discar...

  68. [70]

    - If ‘<Mutation>/<params>’ is empty, skip parameter selection and proceed directly to conflict detection

    Parameter Instantiation: - For the selected mutation, look at its ‘<params>’. - If ‘<Mutation>/<params>’ is empty, skip parameter selection and proceed directly to conflict detection. The instruction text is fixed. - If ‘<Mutation>/<params>’ is not empty: - **Maximize Challenge**: Choose parameter options that contradict the *current* implementation (e.g....

  69. [71]

    - Such as ’Your code must utilize exactly 1 list comprehension.’ and ’Your code must not use any for loops.’, which are incompatible with each other

    Conflict Detection: - Ensure the selected mutation and its parameters do not contradict the original ‘<instruction>’. - Such as ’Your code must utilize exactly 1 list comprehension.’ and ’Your code must not use any for loops.’, which are incompatible with each other

  70. [72]

    - The modified code must produce the exact same output for the same inputs as the original code

    Refactor Code (Only if Successful): - If successful, rewrite the reference code to **strictly adhere** to the new instantiated constraint. - The modified code must produce the exact same output for the same inputs as the original code

  71. [73]

    - If ‘<success>false</success>’, **STOP** after closing the tag

    Synthesize Output: - Generate the ‘<success>’ tag first. - If ‘<success>false</success>’, **STOP** after closing the tag. Do not generate params or question. - If ‘<success>true</success>’, generate ‘<instantiated_params>’ and the modified ‘<question>’. - If the input ‘<Mutation>/<params>’ was empty, the tag <instantiated_params> must also be empty (i.e.,...

  72. [74]

    - Decision: Selected Mutation ID [X] because

    Evaluation: - Mutation ID 1: [Compatibility: Yes/No] | [Challenge: Low/Med/High] | [Reasoning] - Mutation ID 2: ... - Decision: Selected Mutation ID [X] because

  73. [75]

    Parameter Selection: [Reasoning for chosen params]

  74. [76]

    Sum of list

    Refactoring Strategy: [How the code will change] ]]> </thought> <success>[true/false]</success> <instantiated_params> <param> <name>[Parameter Name, e.g., variable_name]</name> <value>[Selected Value, e.g., total_score]</value> </param> </in...

  75. [77]

    - You must explicitly justify this subsumption

    Subsumption - Any program that satisfies the EVOLVED instruction MUST also satisfy the ORIGINAL instruction. - You must explicitly justify this subsumption. - The evolved instruction MUST introduce at most ONE new AST-level constraint beyond those already enforced by the original instruction. - The evolved instruction shouldn’t be overly complicated

  76. [78]

    - These failures must arise from a GENERAL, STRUCTURAL, AST-level constraint

    Examples-as-Negative - ALL provided example programs MUST FAIL the EVOLVED instruction. - These failures must arise from a GENERAL, STRUCTURAL, AST-level constraint. - You MUST NOT reference specific identifiers, literals, or fingerprints unique to the examples

  77. [79]

    - No runtime execution, no I/O, no performance, no semantic reasoning

    AST-Checkability - The evolved instruction MUST be checkable using static AST analysis. - No runtime execution, no I/O, no performance, no semantic reasoning

  78. [80]

    readable

    Generalization - The evolved instruction MUST generalize beyond the given examples. - You are FORBIDDEN from writing constraints that merely exclude the examples without structural meaning. You MUST NOT: - Mention or encode specific variable names, constants, or literals that appear only in the examples. - Refer to line counts, whitespace, formatting, or ...

  79. [81]

    Use synonyms, change sentence structures, and vary the tone to express the exact same requirement

    **Rephrase**: This is your main tool. Use synonyms, change sentence structures, and vary the tone to express the exact same requirement

  80. [82]

    **Combine**: Merge two or multiple instructions into a single instruction

Showing first 80 references.