pith. machine review for the scientific record.

arxiv: 2605.07465 · v1 · submitted 2026-05-08 · 💻 cs.CL

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

Fei Yu, Han Xia, Jiajie Zhu, Jiaqing Liang, Jingwen Chang, Qianyu He, Qingyu Ren, Xingzhou Chen, Yanghua Xiao, Zeye Sun

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords instruction following · self-evolution · reinforcement learning · large language models · self-supervised training · co-evolution · RL without external supervision

The pith

Language models improve instruction following by generating harder tasks for themselves and learning from their own reward signals in a closed loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SEIF to overcome the limits of static training data or expensive external supervision for building better instruction-following in large language models. It creates a four-role system in which an Instructor produces progressively more difficult instructions while a Follower practices them under reinforcement learning, with a Judger scoring outcomes and a Filter maintaining data quality. The Instructor and Follower alternate training so that rising task difficulty drives capability growth and stronger models in turn support harder instructions. Experiments across model sizes and types show steady gains on instruction-following benchmarks. The work also identifies a practical schedule of heavy early training followed by lighter later stages to build a foundation without overfitting on open-ended tasks.

Core claim

SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. The Instructor generates increasingly challenging instructions, the Filter removes invalid ones, the Follower learns via reinforcement learning with rewards supplied by the Judger, and the Instructor and Follower are alternately trained to co-evolve.

What carries the argument

The SEIF four-role closed loop in which instruction generation, filtering, following, and self-judged reinforcement rewards alternate and mutually reinforce one another.
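
As a reading aid, the sketch below spells out one way the four-role loop could be wired together. It is a minimal Python rendering of the description above, not the authors' released code: the role objects, their methods (evolve, is_valid, generate, score, rl_update, update), the batch size, and the per-turn step counts are all hypothetical placeholders.

    import random

    def self_evolve(instructor, filter_role, follower, judger,
                    seed_instructions, steps_per_turn=(400, 200, 100),
                    batch_size=8):
        # Closed loop: difficulty evolution (Instructor + Filter) alternates
        # with capability evolution (Follower trained by RL on Judger rewards).
        instructions = list(seed_instructions)
        for n_steps in steps_per_turn:
            # Instructor proposes harder variants of the current pool.
            candidates = [instructor.evolve(x) for x in instructions]
            # Filter drops conflicting or unverifiable instructions.
            instructions = [c for c in candidates if filter_role.is_valid(c)]
            # Follower trains with RL; the Judger supplies scalar rewards.
            # Decreasing step counts echo the heavy-early, lighter-late schedule.
            for _ in range(n_steps):
                batch = random.sample(instructions, min(batch_size, len(instructions)))
                responses = [follower.generate(x) for x in batch]
                rewards = [judger.score(x, y) for x, y in zip(batch, responses)]
                follower.rl_update(batch, responses, rewards)
            # Alternate: update the Instructor against the improved Follower
            # so instruction difficulty and model capability co-evolve.
            instructor.update(instructions, follower)
        return follower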

If this is right

  • Instruction difficulty and model capability co-evolve, enabling continuous gains beyond static-difficulty training.
  • Performance improves consistently across multiple model scales and architectures, indicating the approach generalizes.
  • A training schedule of sufficient early-stage work followed by moderate late-stage work yields better final results by avoiding overfitting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the self-judgment mechanism proves stable, the method could lower reliance on human feedback for other LLM capabilities such as reasoning or tool use.
  • Repeated cycles might allow models to discover new task distributions that were not present in the original training data.
  • The same role-separation structure could be tested on domains with clearer success metrics to check whether the loop remains stable without an explicit Filter.

Load-bearing premise

The Judger supplies reliable reward signals from self-generated data without systematic bias or reward hacking, and the Instructor produces valid instructions that genuinely advance the Follower's capabilities rather than reinforcing existing limitations.
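
One way to stress-test the reward-hacking half of this premise, not something the summary above shows the paper doing, is to probe the Judger with deliberately degenerate outputs (keyword-stuffed or constraint-parroting responses). The sketch assumes only that the Judger exposes a scalar score(instruction, response); the function name and interface are illustrative.

    def probe_reward_hacking(judger, instructions, good_responses, degenerate_responses):
        # If degenerate responses score nearly as well as genuine ones,
        # the self-generated reward is likely gameable.
        good = [judger.score(x, y) for x, y in zip(instructions, good_responses)]
        bad = [judger.score(x, y) for x, y in zip(instructions, degenerate_responses)]
        return {
            "mean_good": sum(good) / len(good),
            "mean_degenerate": sum(bad) / len(bad),
            "gap": sum(good) / len(good) - sum(bad) / len(bad),
        }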

What would settle it

Measure instruction-following benchmark scores after multiple full evolution cycles; if scores plateau or decline rather than rise, or if human raters find that Judger scores diverge from actual task success, the loop claim would be falsified.
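
A minimal sketch of the human-rater half of that check, assuming parallel lists of scalar Judger scores and human ratings for the same responses; repeating it after each evolution cycle and watching the correlation fall (or benchmark scores plateau) would be the falsifying signal described above.

    from scipy.stats import spearmanr

    def judger_human_agreement(judger_scores, human_scores):
        # Rank correlation between self-generated rewards and human ratings
        # on the same responses; repeat after each evolution cycle.
        rho, p_value = spearmanr(judger_scores, human_scores)
        return {"spearman_rho": float(rho), "p_value": float(p_value)}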

Figures

Figures reproduced from arXiv: 2605.07465 by Fei Yu, Han Xia, Jiajie Zhu, Jiaqing Liang, Jingwen Chang, Qianyu He, Qingyu Ren, Xingzhou Chen, Yanghua Xiao, Zeye Sun.

Figure 1: Comparison of training paradigms for instruction-following. Existing methods either rely on …
Figure 2: Overview of the SEIF framework, which consists of two stages. The Instructor and Follower …
Figure 3: PCA visualization of training data representations across three iteration turns. The star …
Figure 4: Top-6 constraint types with the largest proportional increases and decreases across the …
Figure 5: Reward dynamics of different training strategies across different iteration turns on SEIF-7B.
read the original abstract

Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SEIF, a self-evolving reinforcement learning framework for improving instruction-following in LLMs. It organizes training around four roles that form a closed loop: an Instructor that generates progressively harder instructions, a Filter that removes invalid ones, a Follower that learns via RL on the evolved instructions, and a Judger that supplies reward signals. The Instructor and Follower are alternately trained to enable mutual evolution of instruction difficulty and model capability. Experiments across multiple model scales and architectures report consistent performance gains, accompanied by analysis identifying an effective strategy of sufficient early-stage training followed by moderate late-stage training to mitigate overfitting. Code and data are released publicly.

Significance. If the reported gains reflect genuine capability advancement rather than self-reinforcement artifacts, SEIF could reduce dependence on external supervision for LLM improvement and provide a scalable path for self-evolution on open-ended tasks. The public code and data release is a clear strength that supports reproducibility and extension by the community.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'instruction difficulty evolution and model capability evolution reinforce each other' is load-bearing for the contribution, yet the manuscript provides no tracked metrics of instruction difficulty over iterations, no reward-distribution diagnostics, and no ablations isolating the Judger's self-generated signals from potential bias amplification.
  2. [Experiments] Experiments (as summarized): The statement of 'consistent improvements across multiple model scales and architectures' lacks reported exact metrics, baseline comparisons, statistical significance tests, or controls for self-reinforcement artifacts, leaving open whether gains exceed what would arise from amplified model preferences alone.
minor comments (1)
  1. [Abstract] Abstract: The interactions among the four roles (particularly how the Filter prevents propagation of model-specific biases into the Judger's rewards) could be stated more explicitly to clarify the claimed independence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify how to strengthen the presentation of SEIF's self-evolution mechanism. We agree that more explicit evidence for mutual reinforcement between instruction difficulty and model capability, along with fuller experimental reporting, will improve the manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'instruction difficulty evolution and model capability evolution reinforce each other' is load-bearing for the contribution, yet the manuscript provides no tracked metrics of instruction difficulty over iterations, no reward-distribution diagnostics, and no ablations isolating the Judger's self-generated signals from potential bias amplification.

    Authors: We acknowledge that the manuscript would benefit from more direct evidence supporting the mutual-evolution claim. In the revised version we will add (i) tracked metrics of instruction difficulty across iterations (using both surface proxies such as length and lexical diversity and model-based complexity estimates), (ii) reward-distribution histograms and statistics from the Judger at multiple training stages, and (iii) an ablation that replaces the self-generated Judger signals with either a fixed external judge or randomized rewards while keeping all other components identical. These additions will be placed in a new subsection of the Experiments section and will directly address concerns about bias amplification. revision: yes
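
For illustration only, the tracking promised in (i) and (ii) needs very little machinery; the proxies below (whitespace tokenization, a fixed [0, 1] reward range) are stand-ins for this sketch, not the metrics the authors will actually report.

    import numpy as np

    def difficulty_proxies(instructions):
        # Surface proxies for instruction difficulty at one iteration:
        # mean token length and type-token ratio (lexical diversity).
        tokens = [x.split() for x in instructions]
        lengths = [len(t) for t in tokens]
        ttrs = [len(set(t)) / max(len(t), 1) for t in tokens]
        return {"mean_length": float(np.mean(lengths)),
                "type_token_ratio": float(np.mean(ttrs))}

    def reward_diagnostics(rewards, bins=10):
        # Reward-distribution summary (histogram + moments) for one stage.
        hist, edges = np.histogram(rewards, bins=bins, range=(0.0, 1.0))
        return {"mean": float(np.mean(rewards)), "std": float(np.std(rewards)),
                "histogram": hist.tolist(), "bin_edges": edges.tolist()}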

  2. Referee: [Experiments] Experiments (as summarized): The statement of 'consistent improvements across multiple model scales and architectures' lacks reported exact metrics, baseline comparisons, statistical significance tests, or controls for self-reinforcement artifacts, leaving open whether gains exceed what would arise from amplified model preferences alone.

    Authors: We agree that the current reporting is insufficiently precise. The revised manuscript will include: exact numerical results (means and standard deviations) for every model scale and architecture; expanded baseline tables that compare against standard RLHF, static self-play, and non-evolving instruction sets; statistical significance tests (paired t-tests or Wilcoxon tests with p-values) computed over multiple random seeds; and an explicit control experiment that trains the Follower on instructions generated without the Instructor's evolution loop. These results will be presented in updated tables and a new figure, allowing readers to assess whether observed gains exceed self-reinforcement artifacts. revision: yes
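
A small sketch of the promised seed-level testing, assuming one benchmark score per random seed for SEIF and for a baseline, paired by seed; the tests (paired t and Wilcoxon signed-rank) are the ones named in the response.

    from scipy.stats import ttest_rel, wilcoxon

    def paired_significance(seif_scores, baseline_scores):
        # Paired tests over per-seed benchmark scores; inputs are parallel
        # lists with one score per random seed.
        t_stat, t_p = ttest_rel(seif_scores, baseline_scores)
        w_stat, w_p = wilcoxon(seif_scores, baseline_scores)
        return {"paired_t_p": float(t_p), "wilcoxon_p": float(w_p)}

Reporting both tests guards against the normality assumption behind the paired t-test when the number of seeds is small.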

Circularity Check

0 steps flagged

No circularity: self-evolution is an explicit training mechanism with independent empirical validation.

full rationale

The paper proposes SEIF as a four-role closed loop (Instructor, Filter, Follower, Judger) in which difficulty and capability co-evolve via alternating training on self-generated data. The central claim—that this process improves instruction-following—is not derived by reducing any quantity to its inputs by definition, nor by renaming a known result, nor by a self-citation chain that supplies the uniqueness theorem. Instead, the authors report performance gains measured on held-out benchmarks across model scales and architectures. The self-referential data generation is the method itself, not a hidden tautology that forces the reported outcome; the experiments supply the external check. No load-bearing step equates the final performance metric to the initial model outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The framework rests on standard RL assumptions plus new components for self-generation and self-judgment whose quality is not independently verified outside the loop.

axioms (2)
  • domain assumption: A learned Judger can provide accurate scalar rewards for instruction-following quality without external ground truth.
    Invoked to enable the reinforcement learning update for the Follower.
  • domain assumption: Self-generated instructions can be filtered to remain valid and progressively more difficult without external supervision.
    Required for the Instructor to drive capability evolution.
invented entities (3)
  • Instructor role · no independent evidence
    purpose: Generate increasingly challenging instructions from the evolving model
    Core new component of the closed loop.
  • Filter role · no independent evidence
    purpose: Remove invalid or conflicting instructions to maintain data quality
    Core new component of the closed loop.
  • Judger role · no independent evidence
    purpose: Supply reward signals for reinforcement learning
    Core new component of the closed loop.

pith-pipeline@v0.9.0 · 5573 in / 1528 out tokens · 58284 ms · 2026-05-11T01:47:23.258881+00:00 · methodology

Reference graph

Works this paper leans on

152 extracted references · 152 canonical work pages · 9 internal anchors

  1. [1]

    Ultraif: Advancing instruction following from the wild

    Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, and Baobao Chang. Ultraif: Advancing instruction following from the wild. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18722–18737, 2025

  2. [2]

    Introducing claude opus 4.7, 2026

    Anthropic. Introducing claude opus 4.7, 2026. URL https://www.anthropic.com/news/claude-opus-4-7

  3. [3]

    Self-evolving curriculum for LLM reasoning.arXiv preprint arXiv:2505.14970, 2025

    Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for llm reasoning. arXiv preprint arXiv:2505.14970, 2025

  4. [4]

    SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

    Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, and Minlie Huang. Spar: Self-play with tree-search refinement to improve instruction-following in large language models.arXiv preprint arXiv:2412.11605, 2024

  5. [5]

    Llm self-correction with decrim: Decompose, critique, and refine for enhanced following of instructions with multiple constraints

    Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, and Nanyun Peng. Llm self-correction with decrim: Decompose, critique, and refine for enhanced following of instructions with multiple constraints. In Findings of the Association for Computational Linguistics: E...

  6. [6]

    Mixture of cluster-conditional lora experts for vision-language instruction tuning.IEEE Transactions on Image Processing, 35:3881–3892, 2026

    Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. Mixture of cluster-conditional lora experts for vision-language instruction tuning.IEEE Transactions on Image Processing, 35:3881–3892, 2026

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    Ifdecorator: Wrapping instruction following reinforcement learning with verifiable rewards.arXiv preprint arXiv:2508.04632, 2025

    Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, and Kai Chen. Ifdecorator: Wrapping instruction following reinforcement learning with verifiable rewards.arXiv preprint arXiv:2508.04632, 2025

  9. [9]

    From general to targeted rewards: Surpassing gpt-4 in open-ended long-context generation

    Zhihan Guo, Jiele Wu, Wenqian Cui, Yifei Zhang, Minda Hu, Yufei Wang, and Irwin King. From general to targeted rewards: Surpassing gpt-4 in open-ended long-context generation. arXiv preprint arXiv:2506.16024, 2025

  10. [10]

    From complex to simple: Enhancing multi-constraint complex instruction following ability of large language models

    Qianyu He, Jie Zeng, Qianxi He, Jiaqing Liang, and Yanghua Xiao. From complex to simple: Enhancing multi-constraint complex instruction following ability of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10864–10882, 2024

  11. [11]

    Multi-if: Benchmarking llms on multi-turn and multilingual instructions following

    Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, et al. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following.arXiv preprint arXiv:2410.15553, 2024

  12. [12]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  13. [13]

    Musc: Improving complex instruction following with multi-granularity self-contrastive training

    Hui Huang, Jiaheng Liu, Yancheng He, Shilong Li, Bing Xu, Conghui Zhu, Muyun Yang, and Tiejun Zhao. Musc: Improving complex instruction following with multi-granularity self-contrastive training. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10667–10686, 2025

  14. [14]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 10

  15. [15]

    Followbench: A multi-level fine-grained constraints following benchmark for large language models

    Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, and Wei Wang. Followbench: A multi-level fine-grained constraints following benchmark for large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4667–4688, 2024

  16. [16]

    Big-bench extra hard

    Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Yuanzhu Peter Chen, et al. Big-bench extra hard. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26473–26501, 2025

  17. [17]

    Openassistant conversations-democratizing large language model alignment.Advances in neural information processing systems, 36:47669–47681, 2023

    Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations-democratizing large language model alignment. Advances in neural information processing systems, 36:47669–47681, 2023

  18. [18]

    Self-alignment with instruction backtranslation

    Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason Weston, and Mike Lewis. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259, 2023

  19. [19]

    I-sheep: Self-alignment of llm from scratch through an iterative self-enhancement paradigm.arXiv preprint arXiv:2408.08072, 2024

    Yiming Liang, Ge Zhang, Xingwei Qu, Tianyu Zheng, Jiawei Guo, Xinrun Du, Zhenzhu Yang, Jiaheng Liu, Chenghua Lin, Lei Ma, et al. I-sheep: Self-alignment of llm from scratch through an iterative self-enhancement paradigm.arXiv preprint arXiv:2408.08072, 2024

  20. [20]

    Self: Language-driven self-evolution for large language model

    Jianqiao Lu, Wanjun Zhong, Wenyong Huang, Yufei Wang, Fei Mi, Baojun Wang, Weichao Wang, Lifeng Shang, and Qun Liu. Self: Language-driven self-evolution for large language model. 2023

  21. [21]

    Rubriceval: A rubric-level meta-evaluation benchmark for llm judges in instruction following.arXiv preprint arXiv:2603.25133, 2026

    Tianjun Pan, Xuan Lin, Wenyan Yang, Qianyu He, Shisong Chen, Licai Qi, Wanqing Xu, Hongwei Feng, Bo Xu, and Yanghua Xiao. Rubriceval: A rubric-level meta-evaluation benchmark for llm judges in instruction following. arXiv preprint arXiv:2603.25133, 2026

  22. [22]

    Verif: Verification engineering for reinforcement learning in instruction following

    Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, and Juanzi Li. Verif: Verification engineering for reinforcement learning in instruction following. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30312–30327, 2025

  23. [23]

    Generalizing verifiable instruction following.arXiv preprint arXiv:2507.02833,

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. arXiv preprint arXiv:2507.02833, 2025

  24. [24]

    Agentif: Benchmarking instruction following of large language models in agentic scenarios

    Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, and Juanzi Li. Agentif: Benchmarking instruction following of large language models in agentic scenarios. arXiv preprint arXiv:2505.16944, 2025

  25. [25]

    Constraint back-translation improves complex instruction following of large language models

    Yunjia Qi, Hao Peng, Xiaozhi Wang, Bin Xu, Lei Hou, and Juanzi Li. Constraint back-translation improves complex instruction following of large language models. InProceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 2388–2398, 2025

  26. [26]

    Incentivizing reasoning for advanced instruction-following of large language models

    Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, and Xing Sun. Incentivizing reasoning for advanced instruction-following of large language models. arXiv preprint arXiv:2506.01413, 2025

  27. [27]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

  28. [28]

    Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

    Qingyu Ren, Qianyu He, Powei Chang, Jie Zeng, Zeye Sun, Fei Yu, Jiaqing Liang, and Yanghua Xiao. Instructions are all you need: Self-supervised reinforcement learning for instruction following.arXiv preprint arXiv:2510.14420, 2025. 11

  29. [29]

    Hi Robot: Open-ended instruction following with hierarchical vision-language-action models.arXiv preprint arXiv:2502.19417,

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models.arXiv preprint arXiv:2502.19417, 2025

  30. [30]

    Conifer: Improving complex constrained instruction-following ability of large language models

    Haoran Sun, Lixin Liu, Junjie Li, Fengyu Wang, Baohua Dong, Ran Lin, and Ruohui Huang. Conifer: Improving complex constrained instruction-following ability of large language models. arXiv preprint arXiv:2404.02823, 2024

  31. [31]

    Qwq-32b: Embracing the power of reinforcement learning, 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, 2025

  32. [32]

    Light-if: Endowing llms with generalizable reasoning via preview and self-checking for complex instruction following

    Chenyang Wang, Liang Wen, Shousheng Jia, Xiangzheng Zhang, and Liang Xu. Light-if: Endowing llms with generalizable reasoning via preview and self-checking for complex instruction following. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33422–33430, 2026

  33. [33]

    Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 conference on empirical methods in natural language processing...

  34. [34]

    Self-instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  35. [35]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  36. [36]

    Followir: Evaluating and teaching information retrieval models to follow instructions

    Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and Luca Soldaini. Followir: Evaluating and teaching information retrieval models to follow instructions. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolog...

  37. [37]

    IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

    Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang, Pei Ke, Hongning Wang, and Minlie Huang. If-rewardbench: Benchmarking judge models for instruction-following evaluation.arXiv preprint arXiv:2603.04738, 2026

  38. [38]

    Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge

    Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason E Weston, and Sainbayar Sukhbaatar. Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11548–11565, 2025

  39. [39]

    Writingbench: A comprehensive benchmark for generative writing.CoRR, abs/2503.05244,

    Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, et al. Writingbench: A comprehensive benchmark for generative writing.arXiv preprint arXiv:2503.05244, 2025

  40. [40]

    Easytool: Enhancing llm-based agents with concise tool instruction

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages ...

  41. [41]

    Self-Rewarding Language Models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models.arXiv preprint arXiv:2401.10020, 2024

  42. [42]

    Following length constraints in instructions

    Weizhe Yuan, Ilia Kulikov, Ping Yu, Kyunghyun Cho, Sainbayar Sukhbaatar, Jason E Weston, and Jing Xu. Following length constraints in instructions. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24243–24254, 2025. 12

  43. [43]

    Recommendation as instruction following: A large language model empowered recommendation approach.ACM Transactions on Information Systems, 43(5):1–37, 2026

    Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Recommendation as instruction following: A large language model empowered recommendation approach.ACM Transactions on Information Systems, 43(5):1–37, 2026

  44. [44]

    Cfbench: A comprehensive constraints-following benchmark for llms

    Tao Zhang, Chenglin Zhu, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, et al. Cfbench: A comprehensive constraints-following benchmark for llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32926–32944, 2025

  45. [45]

    Iopo: Empowering llms with complex instruction following via input-output preference optimization

    Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, and Yongbin Li. Iopo: Empowering llms with complex instruction following via input-output preference optimization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22185–22200, 2025

  46. [46]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335, 2025

  47. [47]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

  48. [48]

    EasyR1: An efficient, scalable, multi-modality rl training framework, 2025

    Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. EasyR1: An efficient, scalable, multi-modality rl training framework, 2025. URLhttps://github.com/hiyouga/EasyR1

  49. [49]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  50–80. Internal anchors drawn from the paper's own appendix rather than external works: the Instructor prompt template (identify atomic, specific, verifiable constraints that can be added to a seed question and return them as JSON, without modifying or answering the question), worked examples of constraints accumulating across iteration turns T1–T3 (word-count limits, lexical content, tense, markdown formatting, tone, and style-imitation constraints on restaurant-description and electric-vehicle prompts), and the GRPO training hyperparameters of Table 8 (framework EasyR1, global batch size 96, micro batch size per device 4 for update and 8 for experience, KL loss enabled, rollout batch size 384, rollout n 5, max prompt length 2048).

Showing the first 80 of 152 extracted references.