pith. machine review for the scientific record.

Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

2 Pith papers cite this work. Polarity classification is still indexing.
abstract

Language models often struggle to follow the multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependence on external supervision and from the sparse reward signals of multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates this dependence on external supervision by deriving reward signals directly from the instructions themselves and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address the sparse-reward challenge while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
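The core idea in the abstract, decomposing a multi-constraint instruction into atomic constraints and scoring a response with constraint-wise binary checks rather than one sparse pass/fail signal, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the decomposition rules, the `check_constraint` verifier, and all function names are hypothetical stand-ins (the paper trains a reward model on pseudo-labels instead of using hand-written regex checks).

```python
# Hypothetical sketch of constraint decomposition + constraint-wise binary
# reward. Names and rules are illustrative assumptions, not the paper's code.
import re

def decompose_constraints(instruction: str) -> list[str]:
    # Split a multi-constraint instruction into atomic constraints.
    # A real system would use an LLM or a rule library; splitting on
    # simple conjunctions is a stand-in for illustration.
    parts = re.split(r";|\band\b", instruction)
    return [p.strip() for p in parts if p.strip()]

def check_constraint(constraint: str, response: str) -> bool:
    # Binary verifier for one constraint. Only two checkable patterns are
    # handled here; the paper instead trains a binary classifier per constraint.
    m = re.search(r"at most (\d+) words", constraint)
    if m:
        return len(response.split()) <= int(m.group(1))
    m = re.search(r'include the word "([^"]+)"', constraint)
    if m:
        return m.group(1).lower() in response.lower()
    return True  # unverifiable constraints pass by default in this sketch

def reward(instruction: str, response: str) -> float:
    # Dense reward: fraction of constraints satisfied, instead of a sparse
    # all-or-nothing signal over the whole instruction.
    constraints = decompose_constraints(instruction)
    passed = sum(check_constraint(c, response) for c in constraints)
    return passed / max(len(constraints), 1)

inst = 'Reply in at most 10 words and include the word "hello"'
print(reward(inst, "hello there, world"))  # both constraints satisfied -> 1.0
print(reward(inst, "goodbye"))             # one of two satisfied -> 0.5
```

Because each constraint contributes independently to the reward, a partially compliant response still receives a graded signal, which is what makes the reward dense enough for RL on multi-constraint tasks.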

fields: cs.CL (2)

years: 2026 (2)
