hub Canonical reference

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang · 2026 · cs.CL · arXiv 2603.10165

Canonical reference. 83% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 35 citing papers arXiv PDF

abstract

Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 baseline 1

citation-polarity summary

background 10 baseline 1 unclear 1

representative citing papers

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

cs.LG · 2026-05-11 · conditional · novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

cs.CL · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

cs.AI · 2026-06-03 · unverdicted · novelty 6.0

AgentJet presents a decoupled multi-node swarm architecture for LLM agent RL that enables heterogeneous multi-model training, multi-task isolation, fault tolerance, live code iteration, context-optimized training, and an autonomous research system.

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.

Enhancing LLM Metacognition via Cognitive Pairwise Training

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.

ECHO: Terminal Agents Learn World Models for Free

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

ECHO is a hybrid RL objective that trains agents to predict environment observation tokens from their actions, doubling GRPO pass@1 on TerminalBench-2.0 while improving dynamics prediction on held-out trajectories.

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

HINT-SD improves long-horizon LLM agent training by using hindsight to target self-distillation on failure-relevant action spans, delivering up to 18.8% higher performance and 2.26x lower time per step than dense per-turn feedback.

OpenJarvis: Personal AI, On Personal Devices

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud baselines on average across eight tasks.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

When Model Editing Meets Service Evolution: A Knowledge-Update Perspective for Service Recommendation

cs.SE · 2026-04-29 · unverdicted · novelty 6.0

EVOREC integrates locate-then-edit model editing with FA-constrained decoding to improve LLM-based service recommendation under evolution, reporting 25.9% average relative gain in Recall@5 over baselines and 22.3% over fine-tuning in dynamic scenarios.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

Multilingual Safety Alignment via Self-Distillation

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

ClawGUI delivers a unified open-source stack for stable RL training of GUI agents, standardized evaluation on 6 benchmarks with 95.8% reproduction, and real-device deployment, yielding a 2B model at 17.1% success rate on MobileWorld GUI-Only.

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

cs.LG · 2026-04-12 · unverdicted · novelty 6.0

Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

cs.AI · 2026-06-03 · unverdicted · novelty 5.0

An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.

citing papers explorer

Showing 35 of 35 citing papers.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values cs.AI · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
Continual Harness: Online Adaptation for Self-Improving Foundation Agents cs.LG · 2026-05-11 · conditional · none · ref 22 · internal anchor
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models cs.AI · 2026-04-02 · unverdicted · none · ref 18 · internal anchor
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments cs.CL · 2026-06-04 · unverdicted · none · ref 53 · internal anchor
CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.
EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing cs.LG · 2026-05-30 · unverdicted · none · ref 112 · internal anchor
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations cs.CL · 2026-05-14 · unverdicted · none · ref 1 · 2 links · internal anchor
GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
Learning Agentic Policy from Action Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 61 · internal anchor
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning cs.AI · 2026-05-11 · unverdicted · none · ref 30 · internal anchor
TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents cs.LG · 2026-04-27 · unverdicted · none · ref 17
TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning cs.AI · 2026-06-03 · unverdicted · none · ref 16 · internal anchor
AgentJet presents a decoupled multi-node swarm architecture for LLM agent RL that enables heterogeneous multi-model training, multi-task isolation, fault tolerance, live code iteration, context-optimized training, and an autonomous research system.
What to Format and How: A Benchmark and Workflow Approach for Document Formatting cs.CL · 2026-06-01 · unverdicted · none · ref 18 · internal anchor
Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.
Enhancing LLM Metacognition via Cognitive Pairwise Training cs.LG · 2026-05-30 · unverdicted · none · ref 4 · internal anchor
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
ECHO: Terminal Agents Learn World Models for Free cs.LG · 2026-05-23 · unverdicted · none · ref 18 · internal anchor
ECHO is a hybrid RL objective that trains agents to predict environment observation tokens from their actions, doubling GRPO pass@1 on TerminalBench-2.0 while improving dynamics prediction on held-out trajectories.
HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents cs.LG · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
HINT-SD improves long-horizon LLM agent training by using hindsight to target self-distillation on failure-relevant action spans, delivering up to 18.8% higher performance and 2.26x lower time per step than dense per-turn feedback.
OpenJarvis: Personal AI, On Personal Devices cs.LG · 2026-05-16 · unverdicted · none · ref 90 · internal anchor
OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud baselines on average across eight tasks.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 41 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 43 · internal anchor
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
When Model Editing Meets Service Evolution: A Knowledge-Update Perspective for Service Recommendation cs.SE · 2026-04-29 · unverdicted · none · ref 42 · internal anchor
EVOREC integrates locate-then-edit model editing with FA-constrained decoding to improve LLM-based service recommendation under evolution, reporting 25.9% average relative gain in Recall@5 over baselines and 22.3% over fine-tuning in dynamic scenarios.
SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL · 2026-05-08 · unverdicted · none · ref 34
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
Multilingual Safety Alignment via Self-Distillation cs.LG · 2026-05-03 · unverdicted · none · ref 12 · 2 links
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0) cs.CL · 2026-04-18 · unverdicted · none · ref 3
GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents cs.LG · 2026-04-13 · unverdicted · none · ref 3
ClawGUI delivers a unified open-source stack for stable RL training of GUI agents, standardized evaluation on 6 benchmarks with 95.8% reproduction, and real-device deployment, yielding a 2B model at 17.1% success rate on MobileWorld GUI-Only.
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents cs.LG · 2026-04-12 · unverdicted · none · ref 30
Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline cs.AI · 2026-06-03 · unverdicted · none · ref 37 · internal anchor
An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.
BraveGuard: From Open-World Threats to Safer Computer-Use Agents cs.CR · 2026-05-31 · unverdicted · none · ref 37 · internal anchor
BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution cs.CL · 2026-05-18 · unverdicted · none · ref 63 · internal anchor
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
ClawGym: A Scalable Framework for Building Effective Claw Agents cs.CL · 2026-04-29 · unverdicted · none · ref 2 · 2 links · internal anchor
ClawGym provides synthetic data generation, SFT and RL training, and a benchmark for developing Claw-style agents that operate over local files and tools.
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent cs.LG · 2026-04-07 · unverdicted · none · ref 25
AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-optimal accuracy on benchmarks.
ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration cs.RO · 2026-04-06 · unverdicted · none · ref 33
ROSClaw is a hierarchical framework that unifies vision-language model control with e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-agent robots.
CLaaS: Continual learning as a service for sample efficient online learning cs.LG · 2026-06-04 · unverdicted · none · ref 24 · internal anchor
CLaaS enables sample-efficient online continual learning for agents via replay-buffered parametric updates, outperforming in-context learning in forward transfer and retention on an adversarial task.
Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents cs.DC · 2026-07-01 · unverdicted · none · ref 2 · internal anchor
Enterprise self-evolving agents require new agentic RL systems built around standardized trajectory data protocols, workload-to-learning data proxies, and automatic policy evolution control planes.
Rethinking Agentic Reinforcement Learning In Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 93 · 3 links · internal anchor
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
Xiaomi-GUI-0 Technical Report cs.AI · 2026-06-30 · unreviewed · ref 44 · internal anchor
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents cs.AI · 2026-04-20 · unreviewed · ref 77
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data cs.LG · 2026-04-15 · unreviewed · ref 35

OpenClaw-RL: Train Any Agent Simply by Talking

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer