pith. the verified trust layer for science.

arxiv: 2507.02833 · v3 · pith:RXRGM7AQnew · submitted 2025-07-03 · 💻 cs.CL

Generalizing Verifiable Instruction Following

Pith reviewed 2026-05-19 05:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords instruction following · reinforcement learning · verifiable rewards · generalization · language models · benchmarks · output constraints
Add this Pith Number to your LaTeX paper
\usepackage{pith}
\pithnumber{RXRGM7AQ}

Prints a linked pith:RXRGM7AQ badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files.

The pith

Reinforcement learning with verifiable rewards improves language models' generalization to unseen output constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models tend to overfit to a narrow set of known output constraints like yes/no answers or word repetitions and fail on novel ones users might add. This paper creates IFBench with 58 fresh, diverse verifiable constraints to measure true generalization beyond training data. It demonstrates that reinforcement learning with rewards from automatic verification modules markedly raises success rates on these new constraints. Readers care because precise following of custom instructions is essential for practical, reliable AI assistants that adapt to arbitrary user needs.

Core claim

Models overfit on common verifiable constraints and generalize poorly to unseen ones; training via reinforcement learning with verifiable rewards using hand-designed verification functions significantly raises adherence rates on out-of-domain constraints.

What carries the argument

Reinforcement learning with verifiable rewards (RLVR) paired with constraint-specific verification modules that score outputs during training.
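
As a rough illustration of how verification modules could supply a training reward, the sketch below combines several constraint checks into one score. The `Verifier` type alias, the function names, the two toy checks, and both aggregation rules (strict all-or-nothing and fraction satisfied) are assumptions made for illustration; this page does not state which aggregation the paper uses.

```python
from typing import Callable, Sequence

# Assumption: a verifier is any function mapping a response to pass/fail.
Verifier = Callable[[str], bool]


def strict_reward(response: str, verifiers: Sequence[Verifier]) -> float:
    """Return 1.0 only if every constraint check passes, else 0.0."""
    return 1.0 if all(v(response) for v in verifiers) else 0.0


def fractional_reward(response: str, verifiers: Sequence[Verifier]) -> float:
    """Return the fraction of constraint checks that pass."""
    if not verifiers:
        return 0.0
    return sum(v(response) for v in verifiers) / len(verifiers)


# Two toy checks, hypothetical and not taken from IFBench.
def ends_with_period(response: str) -> bool:
    return response.strip().endswith(".")


def under_ten_words(response: str) -> bool:
    return len(response.split()) < 10
```

A strict reward gives no partial credit when an instruction carries several constraints, while the fractional variant does; which is preferable depends on the training setup.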

If this is right

  • Models can be made to follow a broader range of user-specified output formats without retraining on every possible rule.
  • Verifiable reward signals provide a scalable path to reduce overfitting in instruction-following tasks.
  • Releasing the 29 new training constraints and verification code enables others to replicate and extend the training setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verifiable-reward approach may extend to other controllable generation problems where partial automation of checks is possible.
  • Wider adoption could decrease reliance on massive supervised datasets that try to cover every edge case in advance.
  • If verification modules can be learned rather than hand-written, the method might apply to even more open-ended instructions.

Load-bearing premise

The 58 constraints in IFBench are truly novel and representative of real user instructions that models have not already encountered.

What would settle it

A model trained with RLVR shows no higher success rate than a baseline on the full set of 58 IFBench constraints.

Figures

Figures reproduced from arXiv: 2507.02833 by Hamish Ivison, Hannaneh Hajishirzi, Nathan Lambert, Pradeep Dasigi, Saumya Malik, Shengyi Huang, Valentina Pyatkin, Victoria Graf.

Figure 1: Model performance on IFEval and IFBENCH (single-turn). Left models: out-of-the-box performance. Right models: after IF-RLVR training. IFBENCH has either 1 or 2 constraints per instruction. … models display good accuracy on IFEval. The scores on our new unseen benchmark, IFBENCH, on the other hand, are much lower due to the verifiable constraints being different, despite the task and evaluation setup being th…
Figure 2: Training on 1-6 constraints per instruction.
Figure 4: Training on IFTrain (ood) + n constraints (in-domain) from IFEval.
Figure 5: Experiments with variable ranges. (TÜLU-DPO policy) Most of the constraint templates contain variables which can be instantiated with different values. "In your response, all lowercase words should appear at most N times.", for example, has the variable N which could in theory be any number. For both the IFEval and the IFBENCH benchmarks, variables are instantiated for each instruction instance from a fixe…
Figure 6: Removing a constraint category from training. (TÜLU-DPO policy) We designed the new training constraints so that they would cover IF skills models are currently lacking in, such as copying from the input, counting, and formatting. We find that GRPO training on our new constraints shows targeted improvements in all these areas. As seen in…
Figure 7: An example output of a model being overoptimized to follow constraints.
Figure 8: Comparing the model before vs. after RLVR training: LLM-as-judge scores vs. verifiable…
Figure 9: Chat template for IF-RLVR training from base.
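
The Figure 5 caption describes constraint templates whose variables are filled in for each instruction instance. A minimal sketch of that idea follows. The `instantiate` helper and the range of 1 to 5 for N are made up for illustration, and reading the lowercase-word constraint as a cap on the total number of all-lowercase words is also an assumption, since the caption's wording admits other interpretations.

```python
import random
from typing import Tuple

TEMPLATE = "In your response, all lowercase words should appear at most {N} times."


def instantiate(template: str, low: int, high: int, rng: random.Random) -> Tuple[str, int]:
    """Fill the {N} slot with a value drawn uniformly from [low, high]."""
    n = rng.randint(low, high)
    return template.format(N=n), n


def lowercase_words_at_most(response: str, n: int) -> bool:
    """Assumed reading: count words made only of lowercase letters."""
    count = sum(1 for w in response.split() if w.isalpha() and w.islower())
    return count <= n


rng = random.Random(0)  # fixed seed so the sketch is reproducible
instruction, n = instantiate(TEMPLATE, 1, 5, rng)
```

Widening or shifting the `[low, high]` range at training time is one way to probe whether a model has learned the constraint itself or only the values it saw.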
Original abstract

A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
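
The two example constraints quoted in the abstract can each be checked by a small function. The following is a minimal sketch of what such verification functions might look like. The function names and the exact matching rules (case-insensitive, tolerant of trailing punctuation) are assumptions, not the paper's released code.

```python
import re


def only_yes_or_no(response: str) -> bool:
    """Check that the whole response is just 'yes' or 'no'.

    Assumption: case is ignored and trailing '.' or '!' is tolerated.
    """
    cleaned = response.strip().strip(".!").strip().lower()
    return cleaned in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, n: int) -> bool:
    """Check that `word` occurs as a whole word at least `n` times."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= n
```

Because both checks are deterministic string tests, they can score a model output without a human or an LLM judge, which is what makes the constraints "verifiable".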

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that language models overfit to small sets of verifiable constraints in existing benchmarks and fail to generalize to unseen output constraints. It introduces IFBench, a benchmark of 58 new, diverse, and challenging verifiable out-of-domain constraints, releases 29 hand-annotated training constraints with verification functions, and shows that reinforcement learning with verifiable rewards (RLVR) significantly improves precise instruction following generalization.

Significance. If the results hold, the work is significant for identifying a key limitation in current instruction-following capabilities and providing both a new evaluation benchmark and an RLVR-based training approach to address generalization. The open release of IFBench, training constraints, verification modules, prompts, and code supports reproducibility and further research in verifiable instruction following.

major comments (2)
  1. [§3] §3 (IFBench construction): The central generalization claim requires that the 58 constraints are genuinely out-of-domain and unseen. The manuscript describes them as 'new, diverse, and challenging verifiable out-of-domain constraints' but provides no explicit checks (n-gram overlap, embedding similarity, or membership tests against pretraining corpora) to rule out overlap with base model training data. This is load-bearing for the 'unseen' claim.
  2. [§4] §4 (RLVR experiments): The claim that RLVR significantly improves instruction following generalization is load-bearing, yet the provided abstract lacks quantitative metrics, baseline comparisons (e.g., vs. SFT), error bars, or data split details. The full experimental section must include these to allow evaluation of the magnitude and robustness of the reported gains.
minor comments (2)
  1. [Abstract] Abstract: Including one or two key quantitative results (e.g., accuracy deltas on IFBench) would strengthen the summary and allow immediate assessment of the improvement.
  2. [§3.1] Notation and figures: Ensure verification function pseudocode and example constraint-output pairs are consistently formatted across sections for clarity.
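
The overlap check requested in the first major comment could be approximated as follows. This is a sketch under stated assumptions: word-level n-grams, Jaccard similarity as the overlap score, and helper names chosen here for illustration. The referee does not prescribe a specific metric.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)
```

A low score against every existing benchmark constraint would be evidence of novelty at the surface level only; it says nothing about paraphrases or semantic overlap.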

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each major comment below and outline the revisions we will make to strengthen the manuscript's claims on generalization and experimental rigor.

Point-by-point responses
  1. Referee: [§3] §3 (IFBench construction): The central generalization claim requires that the 58 constraints are genuinely out-of-domain and unseen. The manuscript describes them as 'new, diverse, and challenging verifiable out-of-domain constraints' but provides no explicit checks (n-gram overlap, embedding similarity, or membership tests against pretraining corpora) to rule out overlap with base model training data. This is load-bearing for the 'unseen' claim.

    Authors: We appreciate the referee highlighting this point, as the out-of-domain status is indeed central to our generalization claims. The 58 constraints were newly hand-designed by the authors to target verifiable output behaviors absent from prior benchmarks such as IFEval, with verification functions implemented from scratch. Nevertheless, we agree that quantitative overlap checks would provide stronger evidence. In the revised manuscript we will add an appendix section reporting (i) n-gram overlap statistics between IFBench constraints and both existing benchmarks and samples drawn from common pretraining corpora, and (ii) average cosine similarity of constraint embeddings (using a standard sentence transformer) to further substantiate minimal overlap with base-model training data. revision: yes

  2. Referee: [§4] §4 (RLVR experiments): The claim that RLVR significantly improves instruction following generalization is load-bearing, yet the provided abstract lacks quantitative metrics, baseline comparisons (e.g., vs. SFT), error bars, or data split details. The full experimental section must include these to allow evaluation of the magnitude and robustness of the reported gains.

    Authors: We thank the referee for this observation. The full experimental section (§4) already reports quantitative accuracy gains on IFBench, direct comparisons to SFT and other baselines, standard deviations across multiple random seeds (error bars), and explicit train/validation/test split details for both the 29 training constraints and the 58 IFBench constraints. To improve accessibility, we will revise the abstract to include a concise summary of the key numerical results (e.g., absolute and relative improvements under RLVR) and will add a results overview table at the beginning of §4 that consolidates metrics, baselines, and statistical details. revision: yes
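
The embedding-similarity check proposed in the first response can be illustrated with a deliberately simple stand-in. The sketch below uses bag-of-words count vectors instead of a sentence transformer, so it only shows the cosine computation itself. The function names are hypothetical, and these vectors are not the embeddings the authors describe.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for a learned sentence embedding."""
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Swapping `bow_vector` for a real embedding model would leave the cosine step unchanged, apart from operating on dense vectors rather than counters.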

Circularity Check

0 steps flagged

No significant circularity; empirical generalization claims rest on external benchmarks and verification functions

Full rationale

The paper's central result—that RLVR on 29 hand-annotated training constraints improves performance on the separate 58-constraint IFBench—is an empirical measurement, not a quantity defined by construction from the authors' own prior equations or fitted parameters. Verification modules are designed and applied to produce rewards during training and to score held-out test items; the test constraints are presented as new and out-of-domain relative to both the training set and prior benchmarks. No self-definitional loop, fitted-input-renamed-as-prediction, or load-bearing self-citation chain appears in the derivation. The work is therefore self-contained against external benchmarks and does not reduce its headline claim to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of accurate, automatically computable verification functions for the new constraints and on the assumption that RLVR training on these functions produces genuine generalization rather than benchmark-specific overfitting.

axioms (1)
  • domain assumption: Verification modules can be designed to correctly and automatically determine whether a model output satisfies each constraint.
    Stated in the abstract as part of the method for RLVR.

pith-pipeline@v0.9.0 · 5754 in / 1264 out tokens · 124335 ms · 2026-05-19T05:35:04.827796+00:00 · methodology



Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  2. Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

    cs.AI 2026-04 unverdicted novelty 8.0

    User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

  3. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  4. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  5. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  6. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

    cs.CL 2026-04 accept novelty 7.0

    SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

  7. CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

    cs.CL 2026-04 unverdicted novelty 7.0

    CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...

  8. Many-Tier Instruction Hierarchy in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

  9. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  10. Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

    cs.SE 2026-02 unverdicted novelty 7.0

    IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.

  11. Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

    eess.AS 2025-09 unverdicted novelty 7.0

    Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.

  12. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  13. SEIF: Self-Evolving Reinforcement Learning for Instruction Following

    cs.CL 2026-05 conditional novelty 6.0

    SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

  14. Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    Dynamic Boundary Evaluation adaptively identifies each LLM's performance boundary on a shared difficulty scale using a calibrated item bank and a search algorithm.

  15. AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

    cs.CL 2026-04 unverdicted novelty 6.0

    AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

  16. GroupDPO: Memory efficient Group-wise Direct Preference Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

  17. Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

    cs.CL 2026-04 unverdicted novelty 6.0

    RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

  18. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  19. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...

  20. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  21. Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

    cs.LG 2025-09 conditional novelty 5.0

    The paper identifies confounds in RLVR evaluations that inflate apparent gains and proposes a minimum standard for budget-matched, contamination-aware assessment with calibration tracking.
