Reinforcement learning for long-horizon interactive LLM agents

Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun · 2025 · arXiv 2502.01600

Canonical reference. 71% of citing Pith papers cite this work as background.

20 Pith papers citing it

citation-role summary

background 7

citation-polarity summary

background 5 · unclear 2

representative citing papers

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

cs.AI · 2026-04-08 · unverdicted · novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

cs.AI · 2025-06-04 · unverdicted · novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

cs.LG · 2026-05-14 · conditional · novelty 6.0

ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

HCL-GP learns parameterized policies and reuses extracted components to achieve 98% accuracy on AppWorld benchmark tasks for LLM agents, outperforming static synthesis by 15.8 points on challenges.

A Survey on LLM-based Conversational User Simulation

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

A survey that introduces a taxonomy for LLM-based conversational user simulation, analyzes core techniques and evaluation methods, and identifies open challenges in the field.

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

cs.LG · 2026-04-12 · unverdicted · novelty 6.0

Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.

World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

cs.RO · 2025-09-29 · unverdicted · novelty 6.0

World-Env replaces physical robot interactions with a world model-based virtual environment and VLM-guided rewards to enable efficient RL post-training for VLA models, showing gains with only five demonstrations per task.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

cs.AI · 2026-05-07 · unverdicted · novelty 5.0 · 3 refs

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards

cs.AI · 2026-04-29 · unverdicted · novelty 5.0 · 4 refs

FutureWorld is a modified verl-tool framework that enables delayed real-world outcome rewards for training LLM-based predictive agents, yielding consistent gains in accuracy, scoring, and calibration across three open-source models.

GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving

cs.CL · 2026-05-25 · unverdicted · novelty 4.0

GeoMathCode interleaves math reasoning with programmatic code outputs for geometry problems in MLLMs and shows that reasoning steps and hierarchical code structures become disentangled in latent space after SFT.

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

cs.AI · 2026-05-01 · unverdicted · novelty 4.0 · 2 refs

AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.

A Survey of Scaling in Large Language Model Reasoning

cs.AI · 2025-04-02 · unverdicted · novelty 3.0

A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

cs.AI · 2026-04-04
