{"work":{"id":"78607317-8305-4515-8dc3-20b4ff5b8f3a","openalex_id":null,"doi":null,"arxiv_id":"2603.10165","raw_key":null,"title":"OpenClaw-RL: Train Any Agent Simply by Talking","authors":null,"authors_text":"Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang","year":2026,"venue":"cs.CL","abstract":"Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.","external_url":"https://arxiv.org/abs/2603.10165","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-13T21:03:20.146427+00:00","pith_arxiv_id":"2603.10165","created_at":"2026-05-09T05:45:22.450185+00:00","updated_at":"2026-05-14T17:03:06.181135+00:00","title_quality_ok":true,"display_title":"OpenClaw-RL: Train Any Agent Simply by Talking","render_title":"OpenClaw-RL: Train Any Agent Simply by Talking"},"hub":{"state":{"work_id":"78607317-8305-4515-8dc3-20b4ff5b8f3a","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":19,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2026-04-02T17:57:29+00:00","last_pith_cited_at":"2026-05-12T17:57:04+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T22:06:16.128702+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":1}],"polarity_counts":[{"context_polarity":"background","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}