3SPO: State-Score-Supervised Policy Optimization for LLM Agents
Pith reviewed 2026-06-27 17:28 UTC · model grok-4.3
The pith
3SPO updates LLM agent policies after every step using state scores from historical success rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-Score-Supervised Policy Optimization (3SPO) performs post-step policy optimization with dynamic state score supervision based on historical success rates. This enables step-wise credit assignment and adaptive rollout decisions without requiring value function estimation. Under a per-state bandit abstraction the method achieves logarithmic allocation regret and supplies sample-complexity guarantees for action identification, score distinguishability, and filtering stability. On ALFWorld and WebShop the approach outperforms GRPO while using comparable resources.
What carries the argument
Dynamic state score computed from historical success rates, which supervises step-wise credit assignment, adaptive rollout, and post-step policy optimization.
If this is right
- Achieves logarithmic allocation regret under the per-state bandit abstraction.
- Provides sample-complexity guarantees for action identification, score distinguishability, and filtering stability.
- Yields 2.4 times more state exploration and 1.8 times faster convergence than trajectory-level baselines.
- Delivers 22.6 percent higher success on ALFWorld and 15.6 points higher success on WebShop with comparable resources.
Where Pith is reading between the lines
- The same score-supervision idea could be tested in other sequential decision settings that suffer from delayed rewards.
- Combining state scores with existing value-function methods might further reduce variance in long-horizon credit assignment.
- The bandit abstraction implies that score-based filtering could stabilize training in environments where success rates change slowly.
Load-bearing premise
The per-state bandit abstraction accurately captures the credit-assignment dynamics of the actual multi-turn agent environments used in the experiments.
What would settle it
Replacing the historical-success-based state score with random values and observing whether the reported gains on ALFWorld and WebShop disappear.
Figures
read the original abstract
Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose \textbf{State-Score-Supervised Policy Optimization (3SPO)}, a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by $+22.6\%$ on ALFWorld and $+15.6$ points on WebShop, while using comparable resources to achieve $2.4\times$ more state exploration and $1.8\times$ faster convergence. Code is available at https://github.com/genalyu/3SPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces 3SPO, an RL algorithm for training LLM agents that performs post-step policy optimization using dynamic state scores derived from historical success rates. This enables step-wise credit assignment without value functions or auxiliary models. Under a per-state bandit abstraction, it claims logarithmic allocation regret and sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments with Qwen2.5 models on ALFWorld and WebShop report consistent outperformance over GRPO (+22.6% and +15.6 points respectively), with 2.4× more state exploration and 1.8× faster convergence using comparable resources. Code is released.
Significance. If the empirical gains and theoretical guarantees hold under the stated conditions, 3SPO would offer a practical advance in credit assignment for sparse-reward, multi-turn LLM agents by avoiding trajectory-level optimization and extra models. Public code availability strengthens reproducibility. However, the significance is tempered by the unverified applicability of the independent per-state bandit model to the actual sequential environments.
major comments (3)
- [theoretical analysis section] Theoretical analysis (per-state bandit abstraction): The derivation of logarithmic allocation regret and sample-complexity guarantees assumes independent per-state bandits with stationarity. This does not hold in the ALFWorld and WebShop environments, where states lie in correlated trajectories with delayed sparse rewards and propagating action effects, violating the independence and stationarity conditions required for the bounds to apply to the credit-assignment mechanism.
- [method and experiments] State score computation (method description): State scores are computed from historical success rates collected during the same rollouts used for optimization. This introduces potential circular dependence between the supervision signal and the policy updates, which is not addressed in the analysis of filtering stability or score distinguishability.
- [experiments] Empirical claims (results section): The reported gains (+22.6% on ALFWorld, +15.6 on WebShop) and efficiency metrics (2.4× exploration, 1.8× convergence) are presented without error bars, statistical tests, or ablation isolating the contribution of the state-score supervision versus the bandit allocation rule.
minor comments (2)
- [abstract] The abstract states performance numbers and theoretical results but the full text should include explicit derivation details, experimental protocols, and error analysis to support the claims.
- [method] Notation for state scores and allocation mechanism should be clarified with explicit equations to distinguish from standard bandit formulations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made.
read point-by-point responses
-
Referee: [theoretical analysis section] Theoretical analysis (per-state bandit abstraction): The derivation of logarithmic allocation regret and sample-complexity guarantees assumes independent per-state bandits with stationarity. This does not hold in the ALFWorld and WebShop environments, where states lie in correlated trajectories with delayed sparse rewards and propagating action effects, violating the independence and stationarity conditions required for the bounds to apply to the credit-assignment mechanism.
Authors: The manuscript explicitly frames the analysis as holding under a per-state bandit abstraction (see Section 4). This abstraction enables derivation of the logarithmic allocation regret and sample-complexity bounds for action identification, score distinguishability, and filtering stability. We agree that the actual environments exhibit trajectory correlations and non-stationarity. The abstraction is intended to isolate the credit-assignment mechanism rather than claim exact equivalence. In revision we will add a limitations paragraph explicitly discussing the gap between the abstraction and sequential decision processes. revision: partial
-
Referee: [method and experiments] State score computation (method description): State scores are computed from historical success rates collected during the same rollouts used for optimization. This introduces potential circular dependence between the supervision signal and the policy updates, which is not addressed in the analysis of filtering stability or score distinguishability.
Authors: State scores are computed from success rates accumulated in prior iterations (historical data), not from the current batch of rollouts undergoing optimization. This temporal separation is stated in the method but can be made more explicit. The filtering-stability and distinguishability analyses treat scores as fixed during each allocation phase. We will revise the method section to clarify the use of lagged historical statistics and add a short paragraph explaining why this avoids intra-update circularity. revision: yes
-
Referee: [experiments] Empirical claims (results section): The reported gains (+22.6% on ALFWorld, +15.6 on WebShop) and efficiency metrics (2.4× exploration, 1.8× convergence) are presented without error bars, statistical tests, or ablation isolating the contribution of the state-score supervision versus the bandit allocation rule.
Authors: We agree that the results section would be strengthened by statistical reporting. In the revised version we will report means and standard deviations over multiple random seeds, include paired t-tests or Wilcoxon tests for the main comparisons, and add an ablation that isolates the state-score supervision component from the bandit allocation rule. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper presents 3SPO as an algorithm that computes state scores directly from historical success rates observed in rollouts and uses them for post-step optimization; the theoretical results are explicitly conditioned on a per-state bandit abstraction as a modeling assumption rather than derived from the target multi-turn environments. No equations or steps in the provided text reduce a claimed prediction or guarantee to a fitted parameter or self-citation by construction. The performance claims are empirical comparisons, and the bandit analysis is offered as an independent layer under stated assumptions. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The environment dynamics admit a per-state bandit abstraction sufficient for proving logarithmic allocation regret and sample-complexity guarantees.
Forward citations
Cited by 1 Pith paper
-
BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents
BiPACE improves LLM agent policy optimization by using bisimulation proxies from hidden states for step clustering and action-conditioned baselines for advantage estimation, raising success rates on ALFWorld, WebShop,...
Reference graph
Works this paper leans on
-
[2]
This means I need to open the cabinet to check if there is a dirty apple inside
I am currently at cabinet 1, and the cabinet is closed. This means I need to open the cabinet to check if there is a dirty apple inside
-
[4]
</think> <action>open cabinet 1</action> Environment (Step 3) You open the cabinet 1
If there is no dirty apple in the cabinet, I will need to check other locations such as other cabinets, the counter top, or the sink basin, but cabinet 1 is the most likely place to find the apple right now. </think> <action>open cabinet 1</action> Environment (Step 3) You open the cabinet 1. The cabinet 1 is open. In it, you see nothing. Agent (Step 3) <think>
-
[6]
There is no apple here
I opened cabinet 1 and found it empty. There is no apple here
-
[7]
</think> <action>go to cabinet 2</action> Environment (Step 4) You arrive at cabinet 2
I need to check the next cabinet, cabinet 2, to continue searching for the apple. </think> <action>go to cabinet 2</action> Environment (Step 4) You arrive at cabinet 2. The cabinet 2 is closed. Agent (Step 4) <think>
-
[9]
This means I need to open the cabinet to check if there is a dirty apple inside
I am currently at cabinet 2, and the cabinet is closed. This means I need to open the cabinet to check if there is a dirty apple inside
-
[11]
</think> <action>open cabinet 2</action> Environment (Step 5) You open the cabinet 2
If there is no dirty apple in the cabinet, I will need to check other locations such as other cabinets, the counter top, or the sink basin, but cabinet 2 is the most likely place to find the apple right now. </think> <action>open cabinet 2</action> Environment (Step 5) You open the cabinet 2. The cabinet 2 is open. In it, you see nothing. Agent (Step 5) <think>
-
[12]
The task is to put the dirty apple on the dining table. 2. I opened cabinet 1 and 2 and found nothing. I need to continue searching other cabinets. 3. I will go to cabinet 3 next to check for the apple. 20 </think> <action>go to cabinet 3</action> Environment (Step 6) You arrive at cabinet 3. The cabinet 3 is closed. Agent (Step 6) <think>
-
[14]
This means I need to open the cabinet to check if there is a dirty apple inside
I am currently at cabinet 3, and the cabinet is closed. This means I need to open the cabinet to check if there is a dirty apple inside
-
[15]
If there is a dirty apple in the cabinet, I will take it out and then go to the dining table to place it
-
[16]
</think> <action>open cabinet 3</action> Environment (Step 7) You open the cabinet 3
If there is no dirty apple in the cabinet, I will need to check other locations such as other cabinets, the counter top, or the sink basin, but cabinet 3 is the most likely place to find the apple right now. </think> <action>open cabinet 3</action> Environment (Step 7) You open the cabinet 3. The cabinet 3 is open. In it, you see a dirty apple 1 and a for...
-
[18]
This is the item I need
I have just opened the cabinet 3 and found a dirty apple 1 inside. This is the item I need
-
[19]
Since I have found the dirty apple, my next step is to pick it up from the cabinet
-
[20]
</think> <action>pick up dirty apple 1 from cabinet 3</action> Environment (Step 8) You pick up the dirty apple 1 from the cabinet 3
After picking up the apple, I will need to go to the dining table to place it. </think> <action>pick up dirty apple 1 from cabinet 3</action> Environment (Step 8) You pick up the dirty apple 1 from the cabinet 3. Agent (Step 8) <think>
-
[22]
I have just picked up the dirty apple 1 from cabinet 3
-
[23]
My next step is to go to the dining table to place the apple
-
[24]
</think> <action>go to dining table 1</action> Environment (Step 9) You arrive at dining table 1
The dining table is dining table 1 in the environment, so I will head there now. </think> <action>go to dining table 1</action> Environment (Step 9) You arrive at dining table 1. 21 Agent (Step 9) <think>
-
[25]
The task is to put the dirty apple on the dining table
-
[26]
I am now at dining table 1, holding the dirty apple 1
-
[27]
</think> <action>place dirty apple 1 on dining table 1</action> Environment (Step 10) You place the dirty apple 1 on the dining table 1
My next step is to place the dirty apple 1 on the dining table to complete the task. </think> <action>place dirty apple 1 on dining table 1</action> Environment (Step 10) You place the dirty apple 1 on the dining table 1. The task is completed. 22
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.