Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Fei Yu; Jiaqing Liang; Jie Zeng; Powei Chang; Qianyu He; Qingyu Ren; Yanghua Xiao; Zeye Sun

arxiv: 2510.14420 · v4 · submitted 2025-10-16 · 💻 cs.CL · cs.AI

Qingyu Ren, Qianyu He, Powei Chang, Jie Zeng, Zeye Sun, Fei Yu, Jiaqing Liang, Yanghua Xiao

Pith reviewed 2026-05-18 06:51 UTC · model grok-4.3

keywords: self-supervised reinforcement learning, instruction following, constraint decomposition, reward modeling, language models, multi-turn tasks, agentic tasks

The pith

Language models can improve at following complex instructions by learning rewards directly from the instructions using self-supervised reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models often fail at following instructions that contain multiple constraints at once. This paper proposes a self-supervised reinforcement learning method that creates its own reward signals straight from those instructions. The key step is to split each instruction into separate constraints and then run a simple binary check on each one to build pseudo-labels. Those pseudo-labels train a reward model that guides further learning. The resulting system produces better results on both familiar and new datasets, including harder multi-turn and agentic cases.

Core claim

By decomposing instructions into constraints and using constraint-wise binary classification to generate pseudo-labels, the method produces reward signals sufficient to train an effective reward model without any external supervision or human labels, which then supports reinforcement learning that improves instruction following across multiple datasets.

What carries the argument

Constraint decomposition paired with constraint-wise binary classification to generate pseudo-labels directly from instructions for reward model training.
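As a rough illustration of how such pseudo-labels might be assembled, the sketch below follows the (o_k, c_k, 1) and (o_{k-1}, c_k, 0) pattern quoted from the paper further down this page. It assumes an instruction is built up one constraint at a time, that a response written for the first k constraints is treated as satisfying constraint k, and that a response written for only the first k-1 constraints is treated as violating it. The `PseudoLabel` type and the `build_pseudo_labels` helper are hypothetical names chosen for this example, not the authors' released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoLabel:
    response: str    # candidate output o
    constraint: str  # a single constraint c
    label: int       # 1 = assumed satisfied, 0 = assumed violated


def build_pseudo_labels(constraints, responses):
    """Pair responses with constraints to form binary pseudo-labels.

    responses[k] is the output produced when the prompt contained the first
    k constraints, so responses[0] is the output for the bare task.
    (This indexing convention is an assumption made for the sketch.)
    """
    if len(responses) != len(constraints) + 1:
        raise ValueError("need one response per constraint prefix, including the empty prefix")
    labels = []
    for k, constraint in enumerate(constraints, start=1):
        labels.append(PseudoLabel(responses[k], constraint, 1))      # (o_k, c_k, 1)
        labels.append(PseudoLabel(responses[k - 1], constraint, 0))  # (o_{k-1}, c_k, 0)
    return labels


constraints = ["Use at most 50 words.", "End with a question."]
responses = ["draft with no constraints", "draft obeying c1", "draft obeying c1 and c2"]
pairs = build_pseudo_labels(constraints, responses)
for pair in pairs:
    print(pair.label, "|", pair.constraint, "|", pair.response)
```

Each constraint contributes one positive and one negative example, so the resulting classification set is balanced by construction. Whether the assumed labels are actually correct is exactly the premise questioned later on this page.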

If this is right

Better results on the three in-domain instruction-following datasets used in the experiments.
Stronger generalization to the five out-of-domain datasets, including agentic and multi-turn settings.
No need for external reward signals or human annotations during training.
Computational efficiency is preserved while handling the sparse-reward problem of multi-constraint tasks.
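The sparse-reward point can be made concrete with a small sketch. It contrasts an all-or-nothing reward with a constraint-wise one built from per-constraint binary checks. Averaging the checks is an assumption made here for illustration; the paper may aggregate constraint scores differently, and both function names are hypothetical.

```python
def sparse_reward(satisfied):
    """All-or-nothing reward: 1.0 only if every constraint is satisfied."""
    return 1.0 if all(satisfied) else 0.0


def constraint_wise_reward(satisfied):
    """Denser reward: fraction of constraints judged satisfied."""
    if not satisfied:
        raise ValueError("at least one constraint is required")
    return sum(1 for ok in satisfied if ok) / len(satisfied)


# Binary verdicts for one response against four constraints (made-up values).
checks = [True, True, False, True]
print(sparse_reward(checks), constraint_wise_reward(checks))
```

Under the all-or-nothing rule, a response that misses a single constraint is indistinguishable from one that misses all of them, which is why partial credit gives the policy more signal to learn from.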

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition-plus-binary-check pattern could be tested on other tasks where behavior is defined by explicit constraints, such as tool-use sequences or planning problems.
Scaling the approach to larger base models might reveal whether pseudo-label quality improves or plateaus as model capacity grows.
Combining the instruction-derived rewards with a small amount of human preference data could be explored as a hybrid training signal.

Load-bearing premise

Rewards created by splitting instructions into constraints and running binary checks on each one are accurate enough to train a useful reward model without outside labels.

What would settle it

If human raters scored model outputs on a fresh set of multi-constraint instructions and their judgments showed no alignment with the pseudo-rewards produced by the binary classification step, the load-bearing premise would fail.
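One way such a check could be scored is with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between binary human judgments and binary pseudo-rewards for the same (response, constraint) pairs. The choice of kappa, the function name, and the toy labels are assumptions for illustration; none of them come from the paper.

```python
def cohen_kappa(human, pseudo):
    """Cohen's kappa for two equal-length sequences of binary labels (0 or 1)."""
    if len(human) != len(pseudo) or not human:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(human)
    observed = sum(h == p for h, p in zip(human, pseudo)) / n
    p_human = sum(human) / n
    p_pseudo = sum(pseudo) / n
    # Agreement expected by chance if the two raters labeled independently.
    expected = p_human * p_pseudo + (1 - p_human) * (1 - p_pseudo)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels, used only to exercise the function.
human_labels = [1, 0, 1, 0]
pseudo_labels = [1, 1, 0, 0]
print(cohen_kappa(human_labels, pseudo_labels))
```

A kappa near zero means the pseudo-rewards agree with human raters no better than chance, which is the outcome that would undercut the premise above.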

Figures

Figures reproduced from arXiv: 2510.14420 by Fei Yu, Jiaqing Liang, Jie Zeng, Powei Chang, Qianyu He, Qingyu Ren, Yanghua Xiao, Zeye Sun.

**Figure 1.** Comparison between our self-supervised RL method and previous methods. Our method does not rely on external sources to generate outputs or labels, which is both effective and efficient.

**Figure 2.** Overview of our self-supervised RL framework: from label-free instructions, we generate pseudo…

**Figure 4.** Comparison of reward density.

As shown in Tab. 3, our method improves the model's instruction-following ability on out-of-domain tasks. Specifically, after training, 0528-Distill-Qwen3-8B achieves performance gains of 6.1 and 7.3 on AgentIF and MultiChallenge, respectively, and achieves state-of-the-art performance on AgentIF.

**Figure 5.** Qualitative analysis of consistency between model's thinking and outputs before and after training.

**Figure 6.** Quantitative analysis of consistency scores.

**Figure 8.** Sample distribution across task types, constraint categories, and number of constraints.

**Figure 9.** The reward and response length of w/o incremental constraint curriculum.

Original abstract

Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper's self-supervised reward derivation from instruction constraints is a reasonable extension, but the missing checks on pseudo-label accuracy leave the performance claims on shaky ground.

The main thing to know is that this work tries to train a reward model for multi-constraint instruction following without any human labels or external supervision. It decomposes instructions into constraints, runs binary classifiers on them to create pseudo-labels, and then uses those for RL. That specific pipeline is not something I've seen in the prior self-supervised RL literature they cite, so the combination counts as new enough to notice. They also release code and data, which is useful for anyone who wants to test the claims directly. The reported gains on three in-domain and five out-of-domain sets, including agentic and multi-turn cases, are the part that would interest applied NLP people if the numbers hold up.

What the paper does cleanly is frame the sparse-reward problem in instruction following and show a way to generate training signals straight from the input text. That avoids the usual annotation cost, which is a practical plus.

The soft spot is the pseudo-label step: there is no reported measurement of how accurate or unbiased the pseudo-labels actually are. If the decomposition step or the binary classifiers systematically err on complex or conflicting constraints, the RL stage is just optimizing against noise. The abstract gives no error rates, no human validation of the pseudo-labels, and no ablation on the decomposition quality. Without those, the out-of-domain improvements are hard to interpret. The baselines and statistical details are also thin in the summary, which makes it difficult to judge whether the gains are robust or just the product of a favorable setup.

This paper is for researchers working on label-efficient RL for LLMs and agentic instruction following. A reader who cares about reducing supervision in reward modeling could get value from the method description and the released code, even if they end up re-running the experiments with their own checks.

It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee rather than a desk reject. The idea is worth testing properly, but the current evidence needs more direct support on the pseudo-label step before the generalization claims can be taken at face value.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a label-free self-supervised RL framework for language model instruction following. Reward signals are derived directly from input instructions via constraint decomposition strategies followed by constraint-wise binary classification to generate pseudo-labels for reward model training. This eliminates external supervision and addresses sparse rewards in multi-constraint settings. Experiments report strong improvements on 3 in-domain and 5 out-of-domain datasets, including agentic and multi-turn tasks, with code and data released publicly.

Significance. If the pseudo-labels prove accurate, the approach would be significant for scaling RL-based instruction tuning without human labels or external rewards, directly tackling sparse reward issues in complex instructions. Public release of data and code supports reproducibility and is a clear strength.

major comments (2)

[Abstract] Abstract: the central claim that constraint decomposition plus binary classification produces sufficiently accurate pseudo-labels for effective RL training is load-bearing but unsupported. No direct measurement of pseudo-label quality (e.g., error rates against human references or failure modes on multi-constraint/agentic instructions) is provided, leaving open whether the reported gains arise from the method or from distorted reward landscapes.
[Experiments] Experiments section (implied by abstract claims): performance gains are asserted across 3 in-domain and 5 out-of-domain datasets without reported baselines, statistical tests, data exclusion criteria, or ablation on the decomposition step. This prevents verification that the self-supervised signal generalizes rather than fitting to artifacts of the pseudo-labeling process.

minor comments (2)

Clarify the precise constraint decomposition rules for different instruction types (e.g., agentic vs. multi-turn) and how binary classifiers are trained and thresholded.
Add a dedicated subsection or table quantifying pseudo-label agreement or downstream sensitivity to label noise.
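A label-noise sensitivity study of the kind requested in the last comment could start from a controlled corruption of the pseudo-labels. The sketch below flips each binary label with a chosen probability, so a reward model can be retrained at several noise levels and compared. The `flip_labels` helper and the noise levels are hypothetical; the manuscript does not describe such an experiment in the material summarized here.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary labels with each entry flipped with probability noise_rate."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    rng = random.Random(seed)  # local generator keeps the corruption reproducible
    return [1 - y if rng.random() < noise_rate else y for y in labels]


clean = [1, 0, 1, 1, 0, 0, 1, 0]
for rate in (0.0, 0.1, 0.3):
    noisy = flip_labels(clean, rate, seed=42)
    flipped = sum(a != b for a, b in zip(clean, noisy))
    print(f"noise_rate={rate:.1f} flipped={flipped}/{len(clean)}")
```

Plotting downstream benchmark scores against the injected noise rate would show how much pseudo-label error the RL stage tolerates before the gains disappear.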

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the evidence for our claims.

Referee: [Abstract] Abstract: the central claim that constraint decomposition plus binary classification produces sufficiently accurate pseudo-labels for effective RL training is load-bearing but unsupported. No direct measurement of pseudo-label quality (e.g., error rates against human references or failure modes on multi-constraint/agentic instructions) is provided, leaving open whether the reported gains arise from the method or from distorted reward landscapes.

Authors: We agree that direct measurement of pseudo-label quality would provide stronger support for the central claim. Although generalization to out-of-domain tasks offers indirect validation, we have added a new analysis subsection reporting pseudo-label accuracy against human references on a held-out set, including error rates and discussion of failure modes for multi-constraint and agentic instructions. This revision clarifies that performance gains stem from the method rather than reward distortion.
revision: yes

Referee: [Experiments] Experiments section (implied by abstract claims): performance gains are asserted across 3 in-domain and 5 out-of-domain datasets without reported baselines, statistical tests, data exclusion criteria, or ablation on the decomposition step. This prevents verification that the self-supervised signal generalizes rather than fitting to artifacts of the pseudo-labeling process.

Authors: The original manuscript reports baseline comparisons and results across the specified datasets. To address the concerns, we have added statistical significance tests across multiple runs, explicit data exclusion criteria, and an ablation isolating the constraint decomposition step. These revisions demonstrate that the self-supervised signal generalizes and is not an artifact of the pseudo-labeling process.
revision: yes
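The rebuttal does not say which significance test was added. As one possibility, the sketch below runs a one-sided paired bootstrap over per-example scores from a baseline and a trained model. The paired bootstrap is a technique swapped in here for illustration, and the function name, resample count, and score lists are all assumptions.

```python
import random


def paired_bootstrap_p(baseline, treated, n_resamples=2000, seed=0):
    """One-sided paired bootstrap on per-example scores.

    Returns the fraction of resamples in which the treated system's total
    score does not exceed the baseline's, so small values favor the treated system.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("score lists must be non-empty and equal in length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples


# Made-up per-example pass/fail scores for the same eight prompts.
base_scores = [0, 1, 0, 0, 1, 0, 1, 0]
new_scores = [1, 1, 0, 1, 1, 0, 1, 1]
print(paired_bootstrap_p(base_scores, new_scores))
```

Pairing by example matters because benchmark prompts differ widely in difficulty; resampling the two systems independently would inflate the variance of the estimated difference.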

Circularity Check

0 steps flagged

No significant circularity; reward construction is explicit and externally validated

The paper constructs reward signals directly from input instructions using constraint decomposition followed by constraint-wise binary classification to produce pseudo-labels for the reward model. This is a definitional procedure applied to the instructions themselves rather than a fitted parameter or prediction that reduces back to the same fitted values. The subsequent RL stage optimizes the policy against the resulting reward model, with performance measured on separate in-domain and out-of-domain datasets. No equations or self-citations are shown to create a closed loop where a claimed result equals its own input by construction. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that instruction constraints can be evaluated independently to produce reliable pseudo-labels; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Instructions contain decomposable constraints that can be evaluated independently via binary classification to generate usable reward signals. Invoked to address sparse reward challenges in multi-constraint tasks.

pith-pipeline@v0.9.0 · 5673 in / 1239 out tokens · 46755 ms · 2026-05-18T06:51:31.049660+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "pseudo-labels: $(o_k, c_k, \text{label}=1)$ ... $(o_{k-1}, c_k, \text{label}=0)$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following
cs.CL · 2026-05 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks
cs.CL · 2026-04 · unverdicted · novelty 6.0

WEval and WRL introduce fine-grained benchmarking and requirement-selective sample construction for training writing reward models, yielding substantial gains on writing benchmarks with strong generalization.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1] Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–18702. Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou

[2] Self-play with execution feedback: Improving instruction-following capabilities of large language models. arXiv preprint arXiv:2406.13542. Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, and Huajun Chen

[3] Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783....

[4] Big-bench extra hard. arXiv preprint arXiv:2502.19187. Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, and 1 others. 2024. Openassistant conversations-democratizing large language model alignment. Advances in Neural Information Processing Sys...

[5] Agentif: Benchmarking instruction following of large language models in agentic scenarios. arXiv preprint arXiv:2505.16944. Yunjia Qi, Hao Peng, Xiaozhi Wang, Bin Xu, Lei Hou, and Juanzi Li. 2024. Constraint back-translation improves complex instruction following of large language models. arXiv preprint arXiv:2410.24175. Yulei Qin, Gang Li, Zongyi Li,...

[6] Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling. Qingyu Ren, Jie Zeng, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, and Fei Yu. 2025. Step-by-step mastery: Enhancing soft constraint following ability of large language models. Preprint, arXiv:2501.04945. ByteDance Seed, Jiaze Chen, Tiantian Fan, ...

[7] Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning. arXiv preprint arXiv:2504.13914. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint ar...

[8] Conifer: Improving complex constrained instruction-following ability of large language models. arXiv preprint arXiv:2404.02823. Qwen Team. 2024. Qwen2 technical report. arXiv preprint arXiv:2412.15115. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022a. Self-instruct: Aligning language m...

[9] Qwen3 technical report. arXiv preprint arXiv:2505.09388. Shunyu Yao, Howard Chen, Austin W Hanjie, Runzhe Yang, and Karthik Narasimhan. 2023. Collie: Systematic construction of constrained text generation tasks. arXiv preprint arXiv:2307.08689. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, ...

[10] Cfbench: A comprehensive constraints-following benchmark for llms. arXiv preprint arXiv:2408.01122. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911. A Appendix A.1 Dataset Analysis A.1.1 Constraint D...

[11] I currently have a seed question, but the seed questions are relatively simple. To make the instructions more complex, I want you to identify and return five atomic constraints that can be added to the seed question

[12] I will provide [Seed Question] and [Constraint References], and you can use these references to propose five constraints that would increase the difficulty of the seed question

[13] [Constraint References] are just suggestions. You may choose one or more constraints from the list or propose new ones if needed

[14] Do not modify or rewrite the seed question. Your task is only to generate new constraints that can be added to it

[15] Return the added constraints in the following JSON format: json { "c1": "<first constraint>", "c2": "<second constraint>", "c3": "<third constraint>", "c4": "<fourth constraint>", "c5": "<fifth constraint>" }

[16] Do not return anything else. No explanation, no reformulated question, no analysis - only the JSON structure. [Constraint References]

[17] Lexical content constraint: {Definition} {Example}

[18] Word Count: {Definition} {Example}

[19] Rule Constraint: {Definition} {Example} [Seed Question] {raw_question} Table 11: Prompt for generating constraints. dataset accordingly. WritingBench (Wu et al., 2025): It is a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. Collie (...
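
Entry [15] above quotes a prompt that asks the model to return exactly five constraints as a JSON object with keys c1 through c5. A reply in that shape could be validated with a sketch like the one below. The `parse_constraints` helper and its error handling are assumptions made for this example and are not taken from the paper's released code.

```python
import json

EXPECTED_KEYS = ["c1", "c2", "c3", "c4", "c5"]


def parse_constraints(raw):
    """Return the five constraints in order, or raise ValueError on a malformed reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or sorted(data) != EXPECTED_KEYS:
        raise ValueError("reply must be an object with exactly the keys c1..c5")
    constraints = [data[key] for key in EXPECTED_KEYS]
    if not all(isinstance(c, str) and c.strip() for c in constraints):
        raise ValueError("every constraint must be a non-empty string")
    return [c.strip() for c in constraints]


# Made-up reply in the requested format.
reply = json.dumps({
    "c1": "Use at most 100 words.",
    "c2": "Include the word 'lantern'.",
    "c3": "Write in the second person.",
    "c4": "End with a question.",
    "c5": "Avoid adjectives of color.",
})
print(parse_constraints(reply))
```

Rejecting malformed replies up front keeps a bad generation from silently producing fewer constraints than the rest of a pipeline expects.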
