StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Daoyu Wang; Enhong Chen; Jie Ouyang; Mingyue Cheng; Qi Liu; Qingchuan Li; Shuo Yu

arxiv: 2604.18401 · v4 · pith:GBMGNK6Qnew · submitted 2026-04-20 · 💻 cs.CL

Pith reviewed 2026-05-10 04:23 UTC · model grok-4.3

keywords: agentic reinforcement learning, step-level MDP, LLM agents, policy optimization, credit assignment, multi-turn interactions, tool use, decision making

The pith

For multi-turn RL, LLM agents need a step-level MDP and step-level credit assignment rather than token-level modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional token-level MDP modeling inherited from standard LLM training struggles to capture the delayed rewards, sparse signals, and long variable contexts that arise in multi-turn agent interactions. The paper advances a step-level MDP formulation in which entire steps function as the atomic actions, with corresponding step-level credit assignment to propagate rewards at the natural granularity of agent decisions. This alignment lets policy optimization target core capabilities such as decision making and tool use directly. If the claim holds, agentic RL would become substantially more effective at training general agents.

Core claim

The paper claims that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. Step-level credit assignment is introduced as the matching optimization method, aligning policy updates and reward propagation with the scale of actual agent behavior. Preliminary experiments supply initial support for the approach.

What carries the argument

Step-level MDP formulation in which steps serve as actions, together with step-level credit assignment that propagates rewards at the granularity of agent decisions.
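
A minimal sketch of what step-level credit assignment could look like, assuming a trajectory has already been segmented into steps, each with a scalar reward and a token count. The function names, the discount factor, and the choice to copy one step-level return onto every token of that step are illustrative assumptions, not the estimator used in the paper.

```python
from typing import List


def step_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go computed once per step (not once per token)."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for k in reversed(range(len(step_rewards))):
        running = step_rewards[k] + gamma * running
        returns[k] = running
    return returns


def broadcast_to_tokens(step_values: List[float], tokens_per_step: List[int]) -> List[float]:
    """Give every token inside a step the same step-level value."""
    if len(step_values) != len(tokens_per_step):
        raise ValueError("one value is needed per step")
    per_token: List[float] = []
    for value, n_tokens in zip(step_values, tokens_per_step):
        per_token.extend([value] * n_tokens)
    return per_token


# Hypothetical 3-step episode with a sparse reward on the final step.
rewards = [0.0, 0.0, 1.0]
tokens_per_step = [2, 3, 1]
returns = step_returns(rewards, gamma=0.5)
token_weights = broadcast_to_tokens(returns, tokens_per_step)
```

In this sketch the credit signal is defined over three decisions, and the token-level view is only a broadcast of it, so a long step and a short step are weighted by their position in the episode rather than by their length.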

If this is right

Credit assignment operates at the scale of real agent decisions rather than individual tokens.
Policy optimization directly targets multi-turn behaviors such as tool use and planning.
Sparse and delayed rewards become easier to propagate across variable-length interactions.
Training focuses on decision-level outcomes instead of token prediction accuracy.
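
A worked illustration of the third point, under the assumption (not stated in the source) that a discount factor below 1 is applied once per action and that the only reward R arrives on the final action. The values 0.99, 2000 tokens, and 10 steps are made up for the arithmetic.

```latex
% Return seen from the first action when the only reward R arrives on the last action.
% Token-level MDP: T actions (tokens).  Step-level MDP: K actions (steps).
\[
  G_0^{\text{token}} = \gamma^{T-1} R, \qquad
  G_0^{\text{step}}  = \gamma^{K-1} R .
\]
% With \gamma = 0.99, T = 2000, K = 10:
\[
  0.99^{1999} \approx 1.9 \times 10^{-9}, \qquad
  0.99^{9} \approx 0.914 .
\]
```

With a discount factor of exactly 1 the two weights coincide, so this particular effect depends on discounting being used at all.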

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New training pipelines could track decision boundaries explicitly to enable step-level logging and replay.
Benchmarks may shift toward evaluating complete steps rather than token sequences to better match the new MDP.
The formulation could combine with hierarchical RL techniques to scale to still longer agent horizons.
Agent harnesses might need updated interfaces that expose step boundaries for reward shaping.
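
The first and last points above can be sketched as a small logging structure that records step boundaries explicitly. `StepRecord`, `Trajectory`, and their fields are hypothetical names chosen for illustration, and representing tokens as strings is an assumption; none of this is an interface described in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StepRecord:
    """One agent decision: what was observed, what was emitted, what reward came back."""
    observation: str
    action_tokens: List[str]
    reward: float = 0.0


@dataclass
class Trajectory:
    steps: List[StepRecord] = field(default_factory=list)

    def log_step(self, observation: str, action_tokens: List[str], reward: float = 0.0) -> None:
        self.steps.append(StepRecord(observation, list(action_tokens), reward))

    def step_boundaries(self) -> List[Tuple[int, int]]:
        """Half-open (start, end) offsets of each step within the concatenated action tokens."""
        bounds: List[Tuple[int, int]] = []
        offset = 0
        for step in self.steps:
            end = offset + len(step.action_tokens)
            bounds.append((offset, end))
            offset = end
        return bounds


# Hypothetical two-step episode: a tool call followed by a final answer.
traj = Trajectory()
traj.log_step("user question", ["search", "(", "query", ")"])
traj.log_step("search results", ["answer", "!"], reward=1.0)
boundaries = traj.step_boundaries()
```

Keeping the boundaries alongside the tokens lets a trainer recover either view: the flat token sequence for the language-model loss, or the per-step records for reward shaping and replay.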

Load-bearing premise

Redefining the MDP and credit assignment at step granularity will meaningfully address delayed and sparse rewards plus long-context challenges in multi-turn agent settings.

What would settle it

A head-to-head experiment on a multi-turn tool-use benchmark in which step-level credit assignment produces measurably higher success rates than matched token-level baselines.
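
One way to make "measurably higher" concrete is a significance test on the two success rates. The sketch below uses a standard two-proportion z-test with only the Python standard library; the function name and the episode counts are invented for illustration and are not results from the paper.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the null hypothesis of equal success rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one episode")
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Every episode succeeded or every episode failed: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return z, p_value


# Made-up counts purely to exercise the function: A = step-level, B = token-level.
z, p = two_proportion_z_test(success_a=60, n_a=100, success_b=45, n_b=100)
```

This test treats the two sets of episodes as independent samples. If both methods are evaluated on the same tasks with matched seeds, a paired test such as McNemar's would be the more appropriate choice.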

Original abstract

Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic trajectory representations. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, StepPO optimizes agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents StepPO as a position paper on agentic reinforcement learning for LLMs. It argues that the conventional token-level MDP formulation is inadequate for multi-turn interactive agent settings and should be replaced by a step-level MDP in which steps, rather than tokens, constitute the atomic actions. It proposes step-level credit assignment as the corresponding optimization mechanism, intended to align with agent decisions and to mitigate delayed and sparse rewards. It also discusses the required systems-level designs and reports preliminary experiments as initial supporting evidence.

Significance. If the step-level MDP can be rigorously formalized and implemented, the perspective could supply a more natural granularity for credit assignment and policy optimization in long-horizon LLM agent tasks, potentially improving sample efficiency and addressing limitations of token-centric RLHF/RLVR approaches in harness-style agent training.

major comments (2)

[Step-level MDP formulation (proposal section)] The central claim that advancing to a step-level MDP yields a well-defined Markovian process whose credit assignment directly mitigates sparse/delayed rewards is load-bearing yet lacks an explicit transition kernel P(s_{t+1}|s_t, step) or intra-step reward decomposition. Without these, it remains unclear whether step boundaries chosen post-hoc preserve the Markov property or whether partial observability inside a step invalidates the claimed advantage over token-level MDPs.
[Preliminary experiments] The preliminary experiments are invoked to provide 'initial evidence' for the effectiveness of the step-aligned paradigm, but no quantitative results, baselines, task descriptions, or evaluation protocols are supplied. This leaves the empirical support too thin to substantiate the position's practical claims.

minor comments (2)

[Abstract and introduction] The abstract and introduction could more explicitly contrast the proposed step-level credit assignment with existing hierarchical or option-based RL methods to clarify novelty.
[Notation and definitions] Notation for the step-level action space and state representation is introduced informally; adding a compact table or equation block defining A_step, S_step, and the reward function at step granularity would improve clarity.
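
The equation block requested in the second minor comment, together with the transition kernel asked for in the first major comment, might look like the following. Every symbol here is an assumption made for illustration (the block also assumes the amsmath package); the paper's own definitions may differ.

```latex
% Hypothetical notation for a step-level MDP; not taken from the paper.
\[
\begin{aligned}
  \mathcal{M}_{\text{step}} &= (\mathcal{S}_{\text{step}}, \mathcal{A}_{\text{step}}, P, R, \gamma) \\
  s_t &\in \mathcal{S}_{\text{step}}
      && \text{interaction history up to the $t$-th step boundary} \\
  a_t = (y_{t,1}, \dots, y_{t,n_t}) &\in \mathcal{A}_{\text{step}}
      && \text{all tokens emitted within step $t$} \\
  s_{t+1} &\sim P(\cdot \mid s_t, a_t)
      && \text{environment feedback appended to the history} \\
  r_t &= R(s_t, a_t)
      && \text{reward assigned at the step boundary} \\
  \pi_\theta(a_t \mid s_t) &= \prod_{i=1}^{n_t} \pi_\theta(y_{t,i} \mid s_t, y_{t,<i})
      && \text{step probability factorized over its tokens}
\end{aligned}
\]
```

Under this reading the state is the full history at a step boundary, so the Markov property holds by construction, at the cost of a state that grows with the length of the interaction.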

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for acknowledging the potential significance of a step-level MDP perspective for agentic RL. We address the two major comments point by point below and will revise the manuscript to incorporate the requested clarifications and details.

Referee: [Step-level MDP formulation (proposal section)] The central claim that advancing to a step-level MDP yields a well-defined Markovian process whose credit assignment directly mitigates sparse/delayed rewards is load-bearing yet lacks an explicit transition kernel P(s_{t+1}|s_t, step) or intra-step reward decomposition. Without these, it remains unclear whether step boundaries chosen post-hoc preserve the Markov property or whether partial observability inside a step invalidates the claimed advantage over token-level MDPs.

Authors: We agree that the current presentation would benefit from greater formal rigor. In the revised manuscript we will explicitly define the step-level transition kernel P(s_{t+1} | s_t, a_step), where a_step denotes the atomic step action (e.g., a tool invocation or a complete response turn). We will also supply an intra-step reward decomposition that assigns terminal rewards at step boundaries while allowing optional dense signals inside steps. On the Markov property, we will clarify that step boundaries are not chosen post-hoc but are aligned with natural decision points in agent harnesses (after environment feedback), thereby ensuring the state representation at each step boundary captures the necessary history to reduce partial observability relative to token-level modeling. This formulation directly supports more effective credit assignment for delayed rewards.
Revision: yes

Referee: [Preliminary experiments] The preliminary experiments are invoked to provide 'initial evidence' for the effectiveness of the step-aligned paradigm, but no quantitative results, baselines, task descriptions, or evaluation protocols are supplied. This leaves the empirical support too thin to substantiate the position's practical claims.

Authors: We acknowledge that the current description of the preliminary experiments is too brief to serve as convincing initial evidence. In the revision we will expand the section to report quantitative metrics, explicit task descriptions (multi-turn tool-use and decision-making benchmarks), comparison baselines (token-level PPO and standard RLHF variants), and the evaluation protocols employed. These additions will be presented as preliminary while preserving the position-paper character of the work.
Revision: yes
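
The reward decomposition promised in the first response is not specified in the text. One possible reading, sketched here with hypothetical names and an arbitrary weighting, places the outcome reward on the final step and adds an optional scaled dense signal to each step.

```python
from typing import List, Optional


def assign_step_rewards(
    n_steps: int,
    terminal_reward: float,
    dense_signals: Optional[List[float]] = None,
    dense_weight: float = 0.1,
) -> List[float]:
    """Per-step rewards: optional shaped signal on each step, outcome reward on the last."""
    if n_steps <= 0:
        raise ValueError("an episode needs at least one step")
    if dense_signals is None:
        dense_signals = [0.0] * n_steps
    if len(dense_signals) != n_steps:
        raise ValueError("one dense signal is needed per step")
    rewards = [dense_weight * d for d in dense_signals]
    rewards[-1] += terminal_reward
    return rewards


# Sparse case: only the outcome reward, placed at the last step boundary.
sparse = assign_step_rewards(n_steps=3, terminal_reward=1.0)

# Shaped case: made-up per-step signals (e.g. "tool call parsed correctly").
shaped = assign_step_rewards(
    n_steps=3, terminal_reward=1.0, dense_signals=[1.0, 0.0, 1.0], dense_weight=0.5
)
```

The `dense_weight` parameter is an assumed knob for keeping shaped signals small relative to the outcome reward; whether the authors intend anything similar is not stated.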

Circularity Check

0 steps flagged

No circularity: position paper advances conceptual reformulation without self-referential reduction

The manuscript is explicitly framed as a position paper proposing a shift from token-level to step-level MDP for agentic RL. The core argument identifies limitations of existing token-centric modeling for multi-turn settings and advocates redefining actions and credit assignment at step granularity. No equations, fitted parameters, or derivations are supplied that reduce by construction to the inputs (e.g., no self-defined MDP transition kernel or prediction that is statistically forced by a fit). Preliminary experiments are cited only as initial evidence, not as the load-bearing justification. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The proposal remains a forward-looking perspective rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central proposal rests on one domain assumption about the superiority of step-level modeling; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: A step-level MDP formulation captures LLM agent behavior better than a token-level MDP and enables more effective credit assignment.
This is the load-bearing premise stated in the abstract for advancing agentic RL.

pith-pipeline@v0.9.0 · 5623 in / 1135 out tokens · 53313 ms · 2026-05-10T04:23:57.354913+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SocraticPO: Policy Optimization via Interactive Guidance
cs.LG 2026-06 unverdicted novelty 6.0

SocraticPO adds Socratic-style teacher guidance and reward decay to RL rollouts for LLMs, improving performance on scientific reasoning benchmarks over baselines.

TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning
cs.CL 2026-06 unverdicted novelty 5.0

TabClaw is an interactive LLM agent for spreadsheets that exposes editable plans, uses parallel specialist agents, streams ReAct loops, and distills skills from user feedback, reporting improved benchmark task completion.

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning
cs.LG 2026-06 unverdicted novelty 4.0

Claw-R1 provides a Gateway Server and Data Pool to manage step-level agent interaction traces as structured data assets for agentic RL training.

Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 3.0

The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...