Recognition: unknown
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
Pith reviewed 2026-05-09 14:41 UTC · model grok-4.3
The pith
Treating coherent reasoning steps as the units of policy update improves accuracy and stability in multi-modal reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing a step-wise Markov decision process abstraction over reasoning segments and using segment-level mechanisms for value estimation, advantage computation, and importance sampling, the proposed approach achieves superior performance compared to token-level and sequence-level policy optimization methods on multi-modal reasoning benchmarks, with notable improvements in accuracy, training stability, and value estimation consistency.
What carries the argument
The segment-level Markov decision process abstraction that aligns policy updates, value estimation, and advantage computation with the boundaries of coherent reasoning steps.
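Neither the core claim nor the summary spells out the update rule, so the following is a minimal sketch of what a segment-aligned clipped objective could look like: per-token log-probabilities are grouped into segments, the segment importance ratio is taken as the product of token ratios, and a PPO-style clip is applied per segment. The function name, the ratio aggregation, the per-segment advantage input, and the clip parameter eps are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a segment-level PPO-style clipped objective (illustrative,
# not the authors' implementation). Assumes per-token log-probs, explicit
# (start, end) segment boundaries, and one advantage estimate per segment.
import torch

def segment_clipped_objective(logp_new, logp_old, boundaries, seg_advantages, eps=0.2):
    """logp_new, logp_old: (T,) per-token log-probs for one sampled response.
    boundaries: list of (start, end) token index pairs, one per reasoning segment.
    seg_advantages: (S,) advantages, one per segment (S == len(boundaries)).
    Returns the negated clipped surrogate, averaged over segments."""
    losses = []
    for (start, end), adv in zip(boundaries, seg_advantages):
        # Segment importance ratio as the product of token ratios, computed in
        # log space for numerical stability.
        log_ratio = (logp_new[start:end] - logp_old[start:end]).sum()
        ratio = torch.exp(log_ratio)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
        losses.append(torch.minimum(unclipped, clipped))
    return -torch.stack(losses).mean()

# Toy usage: a 6-token response split into two segments.
logp_old = torch.log(torch.tensor([0.5, 0.4, 0.6, 0.3, 0.7, 0.5]))
logp_new = logp_old + 0.05 * torch.randn(6)
loss = segment_clipped_objective(
    logp_new, logp_old,
    boundaries=[(0, 3), (3, 6)],
    seg_advantages=torch.tensor([0.8, -0.2]),
)
print(float(loss))
```

Aggregating the ratio in log space keeps the segment-level weight numerically stable; clipping each token ratio independently instead would reproduce exactly the token-level granularity the paper argues against.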
If this is right
- Significant accuracy improvements on representative reasoning benchmarks.
- Better training stability than existing token-level and sequence-level methods.
- More consistent value estimation during training.
- More semantically grounded policy optimization for complex reasoning tasks.
Where Pith is reading between the lines
- This alignment approach might extend naturally to single-modal or other structured reasoning problems if step identification remains reliable.
- Future work could explore automatic segment detection methods to reduce reliance on manual or heuristic boundaries.
- Combining segment alignment with other RL techniques could address remaining credit assignment issues in very long reasoning chains.
Load-bearing premise
Coherent reasoning steps can be reliably identified as distinct units without losing important token-level details or creating new credit assignment problems.
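The premise presupposes some boundary detector, and the reviewed text never specifies one. The sketch below is one plausible heuristic, splitting a generated chain of thought on explicit 'Step n' markers and falling back to blank-line paragraphs; the regular expression and the fallback rule are assumptions for illustration, not the paper's segmentation method.

```python
# One plausible heuristic segment detector (illustrative only; the paper's
# actual segmentation rule is not given in the reviewed text).
import re

STEP_MARKER = re.compile(r"(?:^|\n)\s*Step\s*\d+\s*[:.)]", re.IGNORECASE)

def split_into_segments(reasoning_text: str) -> list[str]:
    """Split a chain-of-thought string into candidate reasoning segments.
    Prefers explicit 'Step n' markers; falls back to blank-line paragraphs."""
    starts = [m.start() for m in STEP_MARKER.finditer(reasoning_text)]
    if len(starts) >= 2:
        starts.append(len(reasoning_text))
        return [reasoning_text[a:b].strip() for a, b in zip(starts, starts[1:])]
    # Fallback: treat blank-line-separated paragraphs as segments.
    parts = [p.strip() for p in reasoning_text.split("\n\n") if p.strip()]
    return parts or [reasoning_text.strip()]

example = ("Step 1: Read the lengths from the diagram.\n"
           "Step 2: Set up 2x + 3 = 11.\n"
           "Step 3: Solve, so x = 4.")
print(split_into_segments(example))
```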
What would settle it
A controlled experiment in which reasoning segments are deliberately misidentified or ambiguously defined: if SAPO then shows no performance gains, or degraded stability, relative to baselines, the claimed benefit of semantic alignment does not hold.
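Concretely, such a test could train otherwise-identical SAPO runs whose segment boundaries are randomly corrupted and compare accuracy and stability curves against the uncorrupted run. The helper below shows only the corruption step; the shift range and merge probability are invented knobs, and the surrounding training loop is what the paper would have to supply.

```python
# Illustrative boundary-corruption helper for a misidentification ablation.
# The corruption scheme (random shifts plus random merges) is an assumption.
import random

def perturb_boundaries(cuts, seq_len, max_shift=5, merge_prob=0.3, seed=0):
    """cuts: sorted interior segment cut points in token indices, e.g. [12, 30, 47].
    Each cut is shifted by up to max_shift tokens and dropped (merging the two
    adjacent segments) with probability merge_prob."""
    rng = random.Random(seed)
    corrupted = []
    for cut in cuts:
        if rng.random() < merge_prob:
            continue  # drop this cut: the two neighbouring segments merge
        shifted = cut + rng.randint(-max_shift, max_shift)
        corrupted.append(min(max(shifted, 1), seq_len - 1))
    return sorted(set(corrupted))

print(perturb_boundaries([12, 30, 47], seq_len=60))
```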
Original abstract
Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks demonstrate that SAPO consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability and value estimation consistency. Our work underscores the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning, paving the way for more efficient and semantically grounded policy optimization in complex reasoning tasks. Codes and models will be released to ensure full reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Segment-Aligned Policy Optimization (SAPO), a reinforcement learning framework for multi-modal reasoning in LLMs that abstracts the problem as a step-wise MDP over coherent reasoning segments rather than tokens or full sequences. It introduces segment-level value estimation, advantage computation, and importance sampling aligned with reasoning boundaries, and claims that this yields consistent outperformance over token-level and sequence-level baselines in accuracy, training stability, and value estimation consistency on reasoning benchmarks.
Significance. If the empirical results hold and segment boundaries can be identified reliably, SAPO could meaningfully improve credit assignment in RL for complex reasoning by aligning updates with the intrinsic step structure of multi-modal outputs. The promised release of code and models would support reproducibility and allow direct testing of the segment-aligned mechanisms.
major comments (2)
- [Abstract] The central claim that SAPO 'consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability' is asserted without any quantitative results, baselines, error bars, benchmark names, or statistical details. This prevents evaluation of whether the reported gains are substantial or robust.
- [Abstract] In the abstract and methods description, the proposal depends on treating 'coherent reasoning segments' as the fundamental MDP units, with 'segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries.' No detection rule (heuristic, learned, or LLM-prompted), no robustness analysis under boundary noise, and no alignment metric against human step annotations are provided. If boundary errors are frequent, any stability or accuracy gains could arise from changed effective horizons rather than from the claimed semantic alignment.
minor comments (1)
- [Abstract] The abstract states that 'Codes and models will be released to ensure full reproducibility,' which is a positive commitment, but the experimental section should include at least high-level details on the multi-modal benchmarks, base models, and training hyperparameters to allow immediate assessment.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments on our paper. We address each of the major comments point by point below, providing clarifications and indicating where revisions will be made to the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central claim that SAPO 'consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability' is asserted without any quantitative results, baselines, error bars, benchmark names, or statistical details. This prevents evaluation of whether the reported gains are substantial or robust.
Authors: We acknowledge the referee's point that the abstract presents claims without supporting quantitative details. While the full experimental results with specific accuracy numbers, baselines, error bars, and benchmark names are detailed in the body of the paper (Experiments section), we agree that incorporating a brief mention of key quantitative outcomes in the abstract would enhance its informativeness. We will revise the abstract accordingly in the next version. revision: yes
-
Referee: [Abstract] In the abstract and methods description, the proposal depends on treating 'coherent reasoning segments' as the fundamental MDP units, with 'segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries.' No detection rule (heuristic, learned, or LLM-prompted), no robustness analysis under boundary noise, and no alignment metric against human step annotations are provided. If boundary errors are frequent, any stability or accuracy gains could arise from changed effective horizons rather than from the claimed semantic alignment.
Authors: The referee correctly identifies a gap in the current description. The manuscript does not specify the exact method for detecting coherent reasoning segments nor provide robustness analysis or alignment metrics. We will add a detailed subsection in the Methods describing the segment detection procedure, along with experiments analyzing sensitivity to boundary noise and correlation with human-annotated steps. This will strengthen the claim that the benefits arise from semantic alignment. revision: yes
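If the promised alignment analysis is added, one simple form it could take is boundary precision, recall, and F1 of predicted segment cuts against human step annotations within a small token tolerance. The metric and the tolerance window below are suggestions, not something the paper commits to.

```python
# Sketch of a boundary-agreement metric: precision/recall/F1 of predicted
# segment cuts against human-annotated cuts within a +/- token tolerance.
# The tolerance value is an assumed evaluation choice, not taken from the paper.
def boundary_f1(predicted, reference, tolerance=2):
    matched, used = 0, set()
    for p in predicted:
        for i, r in enumerate(reference):
            if i not in used and abs(p - r) <= tolerance:
                matched += 1
                used.add(i)
                break
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

print(boundary_f1(predicted=[11, 29, 50], reference=[12, 30, 47]))
```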
Circularity Check
No circularity: SAPO introduces independent segment-aligned mechanisms
full rationale
The paper proposes a new MDP abstraction over reasoning segments with accompanying value estimation and advantage rules. No equations or derivations in the provided text reduce a claimed prediction or first-principles result to a fitted parameter or self-definition from the same work. The central claim rests on empirical comparisons to token- and sequence-level baselines rather than tautological re-derivation. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear. Boundary detection is left unspecified, but that is a methodological gap, not a circular reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- [2] Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330.
- [3] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [4] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
- [5] Guo, Y., Xu, L., Liu, J., Ye, D., and Qiu, S. Segment policy optimization: Effective segment-level credit assignment in RL for large language models.
- [6] Liu, J., Liu, H., Zhang, S., and Chen, K. Rectifying LLM thought from lens of optimization. arXiv preprint arXiv:2512.01925.
- [7] Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., and Zhu, S.-C. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165.
- [8] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.
- [9] Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., and Zhang, D. WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
- [10] Pal, A., Karkhanis, D., Dooley, S., Roberts, M., Naidu, S., and White, C. Smaug: Fixing failure modes of preference optimisation with DPO-positive. arXiv preprint arXiv:2402.13228.
- [11] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- [12] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [13] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.
- [14] Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., Liu, P. J., Harrison, J., Lee, J., Xu, K., et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.
- [15] Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999. doi: 10.1016/S0004-3702(99)00052-1.
- [16] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, 2nd ed. MIT Press, 2018.
- [17] Xiao, Y., Sun, E., Liu, T., and Wang, W. LogicVista: Multi-modal LLM logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973.
- [18] Single-stream policy optimization. arXiv preprint arXiv:2509.13232.
- [19] Yuan, Y., Yue, Y., Zhu, R., Fan, T., and Yan, L. What's behind PPO's collapse in long-CoT? Value optimization holds the secret. arXiv preprint arXiv:2503.01491.
- [20] Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
- [21] VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118.
- [22] Zhang, H., Cui, H., Bao, G., Yang, L., Wang, J., and Zhang, Y. Direct value optimization: Improving chain-of-thought reasoning in LLMs with refined values. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2025. URL https://aclanthology.org/2025.emnlp-main.668/.
- [23] Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K.-W., Qiao, Y., et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, pp. 169–186. Springer, 2025.
- [24] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
- [25] Zou, C., Guo, X., Yang, R., Zhang, J., Hu, B., and Zhang, H. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836.
Appendix excerpts (paper material, not separate works)
- Algorithm 1 (Segment-Aligned Policy Optimization) takes a dataset D, an initial policy πθ, a value function Vϕ, the number of epochs E, the number of mini-batches B, a clip parameter ϵ, a discount factor γ, a decay parameter λ, and a segmentation hyperparameter k, and outputs the optimized policy πθ and value function Vϕ.
- SAPO consistently outperforms PPO and GRPO, achieving the best performance, and demonstrates strong generalization to advanced reasoning models; the entropy-based segmentation is further compared with other segmentation strategies across multiple cross-domain benchmarks.
- Table 7 (Geo3K training hyperparameters for SAPO): 200 total training steps, train batch size 512, mini-batch size 128, max prompt length 1024, max response length 2048, λ = 0.99, γ = 1.0, actor LR 1e-6, critic LR 2e-6, KL penalty in reward (K1) with KL coefficient 0.001, eval temperature 0.6, eval top-p 0.95; a corresponding GRPO column is also listed.