Generalizing Verifiable Instruction Following
Pith reviewed 2026-05-19 05:35 UTC · model grok-4.3
The pith
Reinforcement learning with verifiable rewards improves language models' generalization to unseen output constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models overfit on common verifiable constraints and generalize poorly to unseen ones; training via reinforcement learning with verifiable rewards using hand-designed verification functions significantly raises adherence rates on out-of-domain constraints.
What carries the argument
Reinforcement learning with verifiable rewards (RLVR) paired with constraint-specific verification modules that score outputs during training.
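A minimal sketch of how a verification module might be turned into a training reward. The `Verifier` type, the function names, and the choice to score the fraction of satisfied constraints are assumptions made for illustration; the paper's actual reward shaping is not described on this page.

```python
from typing import Callable, Sequence

# Hypothetical type: a verifier maps a model response to pass/fail.
Verifier = Callable[[str], bool]


def verifiable_reward(response: str, verifiers: Sequence[Verifier]) -> float:
    """Return the fraction of attached constraints the response satisfies.

    Assumption: a prompt may carry several constraints, and the reward is
    the share of them that pass (1.0 means every constraint is met).
    """
    if not verifiers:
        return 0.0
    passed = sum(1 for check in verifiers if check(response))
    return passed / len(verifiers)


# Two illustrative constraints (not taken from the paper's released set).
def ends_with_question_mark(response: str) -> bool:
    return response.rstrip().endswith("?")


def under_ten_words(response: str) -> bool:
    return len(response.split()) < 10
```

Because the score comes from a deterministic check rather than a learned reward model, the same function can score rollouts during training and held-out outputs at evaluation time.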
If this is right
- Models can be made to follow a broader range of user-specified output formats without retraining on every possible rule.
- Verifiable reward signals provide a scalable path to reduce overfitting in instruction-following tasks.
- Releasing the 29 new training constraints and verification code enables others to replicate and extend the training setup.
Where Pith is reading between the lines
- The same verifiable-reward approach may extend to other controllable generation problems where partial automation of checks is possible.
- Wider adoption could decrease reliance on massive supervised datasets that try to cover every edge case in advance.
- If verification modules can be learned rather than hand-written, the method might apply to even more open-ended instructions.
Load-bearing premise
The 58 constraints in IFBench are truly novel and representative of real user instructions that models have not already encountered.
What would settle it
A model trained with RLVR shows no higher success rate than a baseline on the full set of 58 IFBench constraints.
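One way such a comparison could be run is sketched below: per-prompt pass/fail outcomes for a baseline and an RLVR-trained model are compared with a paired bootstrap. The function names, the bootstrap procedure, and its defaults are assumptions for illustration and are not taken from the paper.

```python
import random
from typing import Sequence, Tuple


def accuracy_gain(baseline: Sequence[bool], treated: Sequence[bool]) -> float:
    """Difference in success rate (treated minus baseline) on the same prompts."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty outcome lists of equal length")
    return (sum(treated) - sum(baseline)) / len(baseline)


def paired_bootstrap_ci(
    baseline: Sequence[bool],
    treated: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the gain, resampling prompts in pairs."""
    rng = random.Random(seed)
    n = len(baseline)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(sum(treated[i] - baseline[i] for i in idx) / n)
    gains.sort()
    lower = gains[int((alpha / 2) * n_resamples)]
    upper = gains[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the interval for the gain sits at or below zero on the full benchmark, that would be the outcome described above.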
Original abstract
A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
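The two example constraints quoted in the abstract can be checked mechanically. The sketch below shows roughly what such verification functions could look like; the function names and the normalization choices (case-insensitive matching, tolerating a trailing period or exclamation mark) are assumptions, not the authors' released code.

```python
import re


def only_yes_or_no(response: str) -> bool:
    """True if the response is just 'yes' or 'no'.

    Assumption: case, surrounding whitespace, and a trailing '.' or '!'
    are ignored; anything else in the response fails the check.
    """
    cleaned = response.strip().strip(".!").strip().lower()
    return cleaned in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, min_count: int) -> bool:
    """True if `word` appears as a whole word at least `min_count` times."""
    pattern = r"\b" + re.escape(word) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= min_count
```

Whole-word matching is a design choice here: it keeps a longer token such as "abrakadabras" from counting toward the quota, which a plain substring count would allow.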
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that language models overfit to small sets of verifiable constraints in existing benchmarks and fail to generalize to unseen output constraints. It introduces IFBench, a benchmark of 58 new, diverse, and challenging verifiable out-of-domain constraints, releases 29 hand-annotated training constraints with verification functions, and shows that reinforcement learning with verifiable rewards (RLVR) significantly improves precise instruction following generalization.
Significance. If the results hold, the work is significant for identifying a key limitation in current instruction-following capabilities and providing both a new evaluation benchmark and an RLVR-based training approach to address generalization. The open release of IFBench, training constraints, verification modules, prompts, and code supports reproducibility and further research in verifiable instruction following.
major comments (2)
- [§3] §3 (IFBench construction): The central generalization claim requires that the 58 constraints are genuinely out-of-domain and unseen. The manuscript describes them as 'new, diverse, and challenging verifiable out-of-domain constraints' but provides no explicit checks (n-gram overlap, embedding similarity, or membership tests against pretraining corpora) to rule out overlap with base model training data. This is load-bearing for the 'unseen' claim.
- [§4] §4 (RLVR experiments): The claim that RLVR significantly improves instruction following generalization is load-bearing, yet the provided abstract lacks quantitative metrics, baseline comparisons (e.g., vs. SFT), error bars, or data split details. The full experimental section must include these to allow evaluation of the magnitude and robustness of the reported gains.
minor comments (2)
- [Abstract] Abstract: Including one or two key quantitative results (e.g., accuracy deltas on IFBench) would strengthen the summary and allow immediate assessment of the improvement.
- [§3.1] Notation and figures: Ensure verification function pseudocode and example constraint-output pairs are consistently formatted across sections for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our work. We address each major comment below and outline the revisions we will make to strengthen the manuscript's claims on generalization and experimental rigor.
-
Referee: [§3] §3 (IFBench construction): The central generalization claim requires that the 58 constraints are genuinely out-of-domain and unseen. The manuscript describes them as 'new, diverse, and challenging verifiable out-of-domain constraints' but provides no explicit checks (n-gram overlap, embedding similarity, or membership tests against pretraining corpora) to rule out overlap with base model training data. This is load-bearing for the 'unseen' claim.
Authors: We appreciate the referee highlighting this point, as the out-of-domain status is indeed central to our generalization claims. The 58 constraints were newly hand-designed by the authors to target verifiable output behaviors absent from prior benchmarks such as IFEval, with verification functions implemented from scratch. Nevertheless, we agree that quantitative overlap checks would provide stronger evidence. In the revised manuscript we will add an appendix section reporting (i) n-gram overlap statistics between IFBench constraints and both existing benchmarks and samples drawn from common pretraining corpora, and (ii) average cosine similarity of constraint embeddings (using a standard sentence transformer) to further substantiate minimal overlap with base-model training data. revision: yes
-
Referee: [§4] §4 (RLVR experiments): The claim that RLVR significantly improves instruction following generalization is load-bearing, yet the provided abstract lacks quantitative metrics, baseline comparisons (e.g., vs. SFT), error bars, or data split details. The full experimental section must include these to allow evaluation of the magnitude and robustness of the reported gains.
Authors: We thank the referee for this observation. The full experimental section (§4) already reports quantitative accuracy gains on IFBench, direct comparisons to SFT and other baselines, standard deviations across multiple random seeds (error bars), and explicit train/validation/test split details for both the 29 training constraints and the 58 IFBench constraints. To improve accessibility, we will revise the abstract to include a concise summary of the key numerical results (e.g., absolute and relative improvements under RLVR) and will add a results overview table at the beginning of §4 that consolidates metrics, baselines, and statistical details. revision: yes
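The overlap checks promised in the first response could be sketched roughly as follows. This is an illustration using only the standard library: word-level n-gram Jaccard overlap, plus a bag-of-words cosine that stands in for the sentence-transformer embeddings the authors mention. All function names are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap between the n-gram sets of two constraint texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors.

    Assumption: this is a crude stand-in for embedding similarity; a real
    check would swap in vectors from a sentence encoder.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0
```

Applied to each IFBench constraint against every constraint in an earlier benchmark, the maximum score per constraint would flag near-duplicates for manual review.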
Circularity Check
No significant circularity; empirical generalization claims rest on external benchmarks and verification functions
The paper's central result—that RLVR on 29 hand-annotated training constraints improves performance on the separate 58-constraint IFBench—is an empirical measurement, not a quantity defined by construction from the authors' own prior equations or fitted parameters. Verification modules are designed and applied to produce rewards during training and to score held-out test items; the test constraints are presented as new and out-of-domain relative to both the training set and prior benchmarks. No self-definitional loop, fitted-input-renamed-as-prediction, or load-bearing self-citation chain appears in the derivation. The work is therefore self-contained against external benchmarks and does not reduce its headline claim to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Verification modules can be designed to correctly and automatically determine whether a model output satisfies each constraint.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We introduce a new benchmark, IFBENCH, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints... reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Training on a combination of constraints improves both in-domain and out-of-domain performance... wider constraint variable ranges for the training prompts in the RL stage
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
-
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
-
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
-
CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...
-
Many-Tier Instruction Hierarchy in LLM Agents
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution
IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.
-
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
Dynamic Boundary Evaluation adaptively identifies each LLM's performance boundary on a shared difficulty scale using a calibrated item bank and a search algorithm.
-
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
-
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
-
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
-
Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards
The paper identifies confounds in RLVR evaluations that inflate apparent gains and proposes a minimum standard for budget-matched, contamination-aware assessment with calibration tracking.
Reference graph
Works this paper leans on
-
[1]
Nemotron-4 340b technical report
Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024
-
[2]
Models that prove their own correctness
Noga Amit, Shafi Goldwasser, Orr Paradise, and Guy Rothblum. Models that prove their own correctness. arXiv preprint arXiv:2405.15722, 2024
-
[3]
Scaling instruction-finetuned language models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res., 2024
-
[4]
Training on the test task confounds evaluation and emergence
Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence. arXiv preprint arXiv:2407.07890, 2024
-
[6]
Time travel in llms: Tracing data contamination in large language models
Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. In The Twelfth International Conference on Learning Representations
-
[7]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
-
[8]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
-
[9]
A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility
Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, and Matthias Bethge. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility. 2025
-
[10]
Training chain-of-thought via latent-variable inference
Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A Saurous. Training chain-of-thought via latent-variable inference. In NeurIPS, 2023
-
[11]
Camels in a changing climate: Enhancing lm adaptation with tulu 2
Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew E. Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hanna Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2. ArXiv, abs/2311.10702, 2023
-
[12]
Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, and Wei Wang. Followbench: A multi-level fine-grained constraints following benchmark for large language models. CoRR, 2023
-
[13]
Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment
Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. 2024
-
[14]
A systematic examination of preference learning through the lens of instruction-following
Joongwon Kim, Anirudh Goyal, Aston Zhang, Bo Xiong, Rui Hou, Melanie Kambadur, Dhruv Mahajan, Hannaneh Hajishirzi, and Liang Tan. A systematic examination of preference learning through the lens of instruction-following. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the As...
-
[15]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024
-
[16]
Wildifeval: Instruction following in the wild
Gili Lior, Asaf Yehudai, Ariel Gera, and Liat Ein-Dor. Wildifeval: Instruction following in the wild. arXiv preprint arXiv:2503.06573, 2025
-
[17]
Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models
Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations
-
[18]
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024
-
[19]
Infobench: Evaluating instruction following ability in large language models
Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. Infobench: Evaluating instruction following ability in large language models. In Findings of the Association for Computational Linguistics ACL 2024, pages 13025–13048, 2024
-
[20]
Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley. To the cutoff... and beyond? a longitudinal perspective on llm data contamination. In The Twelfth International Conference on Learning Representations
-
[21]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
-
[22]
Improving instruction-following in language models through activation steering
Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering. arXiv preprint arXiv:2410.12877, 2024
-
[23]
Evaluating large language models on controlled generation tasks
Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Frederick Wieting, Nanyun Peng, and Xuezhe Ma. Evaluating large language models on controlled generation tasks. In The 2023 Conference on Empirical Methods in Natural Language Processing
-
[24]
Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, and Yun-Nung Chen. Let me speak freely? a study on the impact of format restrictions on performance of large language models. arXiv preprint arXiv:2408.02442, 2024
-
[25]
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025
-
[27]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
-
[28]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024
-
[29]
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025
-
[30]
Verifiable format control for large language model generations
Zhaoyang Wang, Jinqi Jiang, Huichi Zhou, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, and Huaxiu Yao. Verifiable format control for large language model generations. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3499–3513, 2025.
-
[31]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
-
[32]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024
-
[33]
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022
-
[34]
Wildchat: 1m chatgpt interaction logs in the wild
Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild. In The Twelfth International Conference on Learning Representations
-
[35]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023
-
[36]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
A Out-of-Distribution Test Constraints Instruction Group Instruction Description count conjunctions Use at least {N} different coordinating conjunc- tio...