Expanding the capabilities of reinforcement learning via text feedback

Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, Andrea Zanette · 2026 · arXiv 2602.02482

7 Pith papers cite this work. Polarity classification is still indexing.



representative citing papers

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

cs.AI · 2026-05-01 · accept · novelty 7.0 · 2 refs

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

cs.LG · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates to ~50% in binary-reward GRPO, delivering 2.01x and 1.55x speedups on Qwen3-14B/32B with slight score improvements on SWE-bench Verified.

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.

Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems

cs.IR · 2026-04-11 · unverdicted · novelty 6.0

CoARS enables co-evolving recommender and user agents by using interaction-derived rewards and self-distilled credit assignment to internalize multi-turn feedback into model parameters, outperforming prior agentic baselines.

Self-Improving 4D Perception via Self-Distillation

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight benchmarks.

Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect ones, outperforming baselines without ground-truth contexts.

citing papers explorer

Showing 7 of 7 citing papers.

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR cs.LG · 2026-05-11 · unverdicted · none · ref 22
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding cs.AI · 2026-05-01 · accept · none · ref 30 · 2 links
Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime cs.LG · 2026-05-06 · unverdicted · none · ref 34 · 2 links
FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents cs.CL · 2026-05-04 · unverdicted · none · ref 61
Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems cs.IR · 2026-04-11 · unverdicted · none · ref 21
Self-Improving 4D Perception via Self-Distillation cs.CV · 2026-04-09 · unverdicted · none · ref 54
Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback cs.LG · 2026-05-08 · unverdicted · none · ref 37
