hub Canonical reference

Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al · 2022

Canonical reference. 83% of citing Pith papers cite this work as background.

81 Pith papers citing it

Background 83% of classified citations

citation-role summary

background 21 · method 3

citation-polarity summary

background 20 · use method 3 · support 1
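
The 83% background share quoted near the top of this page is consistent with the polarity counts above (20 of 24 classified citations). The page does not say how the figure is computed, so the sketch below is only an assumption about its origin; the names `polarity_counts` and `share` are hypothetical and introduced here for illustration.

```python
# Counts copied from the citation-polarity summary on this page.
# Assumption: the headline percentage is the background count divided by
# the total number of classified citations.
polarity_counts = {"background": 20, "use method": 3, "support": 1}


def share(counts, label):
    """Return the fraction of classified citations carrying `label`."""
    total = sum(counts.values())
    return counts[label] / total


background_share = share(polarity_counts, "background")
print(f"{background_share:.0%}")  # 20 of 24 classified citations
```

Under this reading, the role summary (21 background of 24) would give a different share, which suggests the headline figure follows the polarity classification rather than the role classification.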

claims ledger

background step limit, demonstrating a failure to incorporate short-term state and past actions into decision-making. Agents also typically disregard previously entered inputs or action history [162]. Modern pretraining and supervised fine-tuning paradigms on dialogue-style data, which trains the model to learn short-term instruction-response behavior (while deprioritizing long-term embodied sequential state tracking), are likely resulting in these shortcomings [165, 162]. Premature termination and achieva
background further advances the alignment of LLMs with human intent by applying supervised fine-tuning (SFT) on instruction-following datasets, followed by reinforcement learning from human feedback (RLHF). Since then, alignment techniques have been extensively studied to ensure that large AI models behave in accordance with safety considerations, human preferences, and values [52]. These technological advances have led to the development of highly capable commercial LLMs, such as GPT-4 [3] and Claude, wh
method seamlessly integrates reasoning and action generation, allowing adaptive switching between direct trajectory generation and CoT reasoning. In supervised fine-tuning (SFT), we leverage both trajectory-only data and CoT reasoning data to equip the model with dual-process capabilities (fast and slow thinking). Furthermore, we propose reinforcement fine-tuning (RFT) [48], utilizing Group Relative Policy Optimization (GRPO) [49] with verifiable planning reward functions. This enables adaptive reason
background Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. [3] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katar
background Large language models (LLMs) are increasingly deployed in high-stakes settings, spanning scientific research [17], cybersecurity [28], and medical consultation [14], making misuse prevention a central safety challenge. Recent advances in model reasoning, safety alignment, and external guardrails have made frontier systems more effective at refusing explicit harmful requests [20, 1, 11, 42]. However, these improvements have also changed how attacks are carried out: rather than stating a harmful o
background that increasing instruction variation yields larger gains than scaling the number of training instances. Our work differs from these prior approaches by generating multiple response variations per question via heuristic-conditioned prompting and studying the effect of this diversity during mid-training on subsequent RL. Reinforcement Learning for LLMs. Reinforcement Learning from Human Feedback (RLHF) [38] has become a standard post-training step, aligning models with human preferences by trainin

representative citing papers

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

GRASP: Deterministic argument ranking in interaction graphs

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

Do Coding Agents Understand Least-Privilege Authorization?

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.

Query-Conditioned Test-Time Self-Training for Large Language Models

cs.CL · 2026-05-13 · conditional · novelty 7.0 · 2 refs

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

cs.CR · 2026-05-04 · unverdicted · novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.

Bounded Ratio Reinforcement Learning

cs.LG · 2026-04-20 · conditional · novelty 7.0

BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

Step-level Denoising-time Diffusion Alignment with Multiple Objectives

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

cs.LG · 2026-04-12 · unverdicted · novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

cs.CL · 2026-01-05 · unverdicted · novelty 7.0

ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.

On the Sample Complexity of Differentially Private Policy Optimization

cs.LG · 2025-10-24 · unverdicted · novelty 7.0

Differential privacy in policy optimization adds sample complexity costs that often appear as lower-order terms rather than dominating the bounds.

Beyond Syntax: Action Semantics Learning for App Agents

cs.AI · 2025-06-21 · unverdicted · novelty 7.0

Action Semantics Learning trains app agents to align with the semantic effects of actions via a Semantic Estimator module, improving robustness to out-of-distribution scenarios over syntax-matching fine-tuning.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

CLORE: Content-Level Optimization for Reasoning Efficiency

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.

citing papers explorer

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · none · ref 3

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

cs.LG · 2026-05-11 · unverdicted · none · ref 46

MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

cs.CV · 2026-05-10 · unverdicted · none · ref 27

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · none · ref 39 · 2 links

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

cs.AI · 2026-05-12 · unverdicted · none · ref 28

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

cs.LG · 2026-05-12 · unverdicted · none · ref 7 · 2 links

GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.

What should post-training optimize? A test-time scaling law perspective

cs.LG · 2026-05-11 · unverdicted · none · ref 13

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

cs.AI · 2026-05-10 · unverdicted · none · ref 22 · 2 links

PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

cs.CL · 2026-05-10 · unverdicted · none · ref 27

DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

Reinforcing Multimodal Reasoning Against Visual Degradation

cs.CV · 2026-05-10 · unverdicted · none · ref 23

ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

cs.CR · 2026-05-10 · unverdicted · none · ref 26

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · none · ref 34

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

cs.CL · 2026-05-07 · unverdicted · none · ref 20 · 2 links

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

cs.LG · 2025-03-31 · unverdicted · none · ref 6

A simple PPO-based RL training pipeline on base models scales reasoning performance and response length, outperforming prior work on math and science benchmarks with one-tenth the training steps.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · none · ref 45

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

cs.CL · 2026-05-10 · unverdicted · none · ref 51

Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

cs.AI · 2026-05-08 · unverdicted · none · ref 38

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

Incentivizing User Data Contributions for LLM Improvement under Withdrawal Rights

cs.GT · 2026-05-08 · unverdicted · none · ref 3

Withdrawal rights paired with centralized cost-based assignment prevent subsidy waste by collecting data only when the improvement threshold is sustainably reachable, turning infeasible cases into null outcomes.

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

cs.RO · 2025-07-02 · unverdicted · none · ref 52

The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

cs.AI · 2025-10-27 · unverdicted · none · ref 165

A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
