Generalizing Verifiable Instruction Following

Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi · 2025 · cs.CL · arXiv 2507.02833

Mixed citation behavior. Most common role is background (44%).

32 Pith papers citing it

Background: 44% of classified citations

abstract

A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
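
To make the idea of a verifiable output constraint concrete, here is a minimal Python sketch of checkers for the two example constraints quoted in the abstract. The function names (only_yes_or_no, mentions_word_at_least, reward) and the averaging reward are illustrative assumptions, not the verification functions or reward design released with IFBench.

```python
import re
from typing import Callable, Sequence


def only_yes_or_no(response: str) -> bool:
    """Hypothetical checker: the response is exactly 'yes' or 'no',
    ignoring case, surrounding whitespace, and trailing '.' or '!'."""
    return response.strip().strip(".!").lower() in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, min_count: int) -> bool:
    """Hypothetical checker: `word` appears as a whole word at least
    `min_count` times, case-insensitively."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def reward(response: str, checks: Sequence[Callable[[str], bool]]) -> float:
    """Assumed reward shape: the fraction of constraints satisfied.
    An all-or-nothing reward would be an equally plausible choice."""
    if not checks:
        return 0.0
    return sum(check(response) for check in checks) / len(checks)
```

Because each checker is a deterministic function of the response text, its result could serve directly as a reward signal, which is the property the abstract refers to as verifiable.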

citation-role summary

background 5 · dataset 3 · method 1

citation-polarity summary

background 4 · use dataset 3 · unclear 1 · use method 1

representative citing papers

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

IFMTBench is a new benchmark for multilingual translation instruction following that tests models on single and multi-constraint scenarios using deterministic checkers and LLM judges.

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

eess.AS · 2025-09-30 · unverdicted · novelty 7.0

Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.

Many-Tier Instruction Hierarchy in LLM Agents

cs.CL · 2026-04-10 · unverdicted · novelty 7.0

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

cs.SE · 2026-02-27 · unverdicted · novelty 7.0

IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

cs.AI · 2026-06-25 · unverdicted · novelty 6.0

BINEVAL turns evaluation criteria into binary questions answered by LLMs to yield transparent multi-dimensional scores that match human judgments on standard benchmarks and support iterative prompt optimization.

MobileMoE: Scaling On-Device Mixture of Experts

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.

Post-Trained MoE Can Skip Half Experts via Self-Distillation

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

ZEDA turns post-trained static MoE models into dynamic ones via zero-output expert injection and two-stage self-distillation, cutting over 50% expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash with small accuracy drops across 11 benchmarks.

Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces

cs.AI · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Shepherd provides a reversible execution trace substrate for LLM agents that enables meta-agents to inspect and transform runs, yielding reported gains on coding and terminal benchmarks via supervision, counterfactual repair, and RL credit assignment.

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

cs.CL · 2025-09-25 · unverdicted · novelty 6.0

RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

citing papers explorer

Showing 28 of 28 citing papers after filters.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · none · ref 23

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · none · ref 13

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · none · ref 57 · internal anchor

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

cs.CL · 2026-05-27 · unverdicted · none · ref 5 · internal anchor

IFMTBench is a new benchmark for multilingual translation instruction following that tests models on single and multi-constraint scenarios using deterministic checkers and LLM judges.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · none · ref 9

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

cs.LG · 2026-05-08 · unverdicted · none · ref 25

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · none · ref 16

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

cs.CL · 2026-04-14 · unverdicted · none · ref 16

CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.

Many-Tier Instruction Hierarchy in LLM Agents

cs.CL · 2026-04-10 · unverdicted · none · ref 21

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · none · ref 91

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

cs.SE · 2026-02-27 · unverdicted · none · ref 22

IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

cs.AI · 2026-06-25 · unverdicted · none · ref 1 · internal anchor

BINEVAL turns evaluation criteria into binary questions answered by LLMs to yield transparent multi-dimensional scores that match human judgments on standard benchmarks and support iterative prompt optimization.

MobileMoE: Scaling On-Device Mixture of Experts

cs.LG · 2026-05-26 · unverdicted · none · ref 48 · internal anchor

MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · unverdicted · none · ref 15 · 2 links · internal anchor

LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.

Post-Trained MoE Can Skip Half Experts via Self-Distillation

cs.LG · 2026-05-18 · unverdicted · none · ref 16 · 2 links · internal anchor

ZEDA turns post-trained static MoE models into dynamic ones via zero-output expert injection and two-stage self-distillation, cutting over 50% expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash with small accuracy drops across 11 benchmarks.

Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces

cs.AI · 2026-05-11 · unverdicted · none · ref 29 · 2 links · internal anchor

Shepherd provides a reversible execution trace substrate for LLM agents that enables meta-agents to inspect and transform runs, yielding reported gains on coding and terminal benchmarks via supervision, counterfactual repair, and RL credit assignment.

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

cs.LG · 2026-05-08 · unverdicted · none · ref 23

FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · none · ref 23

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unverdicted · none · ref 15

Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

cs.CL · 2026-04-24 · unverdicted · none · ref 22

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

cs.CL · 2026-04-17 · unverdicted · none · ref 34

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · none · ref 6

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

cs.CL · 2026-06-09 · unverdicted · none · ref 13 · internal anchor

APEX dynamically tiers data into Easy/Hard/Mixed based on optimization lineage and prioritizes Mixed examples, reporting 11.2% and 6.8% average gains over baseline prompts on two models under a 5,000-call budget.

ComplexConstraints and Beyond: Expert Rubrics for RLVR

cs.AI · 2026-06-08 · unverdicted · none · ref 16 · internal anchor

Expert-curated rubrics in the new ComplexConstraints dataset improve LLM instruction following by 12-15% when used as RL training signals, with gains transferring to out-of-distribution agentic benchmarks.

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

cs.LG · 2026-05-30 · unverdicted · none · ref 31 · internal anchor

CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

cs.CL · 2026-05-12 · unverdicted · none · ref 32

Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.

Qwen3.5-Omni Technical Report

cs.CL · 2026-04-17 · unverdicted · none · ref 31 · 2 links

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.

OpenCompass: A Universal Evaluation Platform for Large Language Models

cs.CL · 2026-05-19 · unverdicted · none · ref 14 · 2 links · internal anchor

OpenCompass is presented as a one-stop, scalable, high-concurrency LLM evaluation platform with modular architecture supporting multiple domains and evaluator types.
