pith. machine review for the scientific record.

arxiv: 2601.22776 · v2 · submitted 2026-01-30 · 💻 cs.AI

Recognition: unknown

TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

classification 💻 cs.AI
keywords: tspo, homogenization, models, optimization, policy, reasoning, reward, rewards

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely predominantly on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards often make intra-group advantage estimation inefficient during sampling in methods such as Group Relative Policy Optimization (GRPO). To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which allocates partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively. Code is available at https://github.com/Flipped-May/TSPO.
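The intra-group homogenization point can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the reward values, the matching rule, or the normalization, so the substring match, the `partial=0.5` value, and the helper names `first_occurrence_turn`, `folr_style_reward`, and `group_advantages` are all assumptions made for illustration. It shows how a group of rollouts that all fail on the final answer gets identical outcome rewards (and therefore zero group-relative advantages), while a partial reward tied to the first turn containing the ground-truth answer restores variance within the group.

```python
from statistics import mean, pstdev


def first_occurrence_turn(turns, answer):
    """Index of the first turn whose text contains the answer, else None.

    Assumption: a case-insensitive substring match stands in for
    whatever matching rule the paper actually uses.
    """
    needle = answer.lower()
    for i, text in enumerate(turns):
        if needle in text.lower():
            return i
    return None


def folr_style_reward(turns, final_answer, answer, partial=0.5):
    """Hypothetical shaped reward: 1.0 for a correct final answer,
    `partial` if the answer surfaced in some turn, otherwise 0.0."""
    if final_answer.strip().lower() == answer.lower():
        return 1.0
    if first_occurrence_turn(turns, answer) is not None:
        return partial
    return 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three rollouts for one question; every final answer is wrong.
answer = "Paris"
rollouts = [
    (["search: capital of France", "doc: Paris is the capital"], "Lyon"),
    (["search: France cities", "doc: Lyon is a large city"], "Lyon"),
    (["search: Europe"], "Berlin"),
]

# Outcome-only rewards are identical, so every advantage is zero.
outcome = [1.0 if f.strip().lower() == answer.lower() else 0.0
           for _, f in rollouts]
outcome_adv = group_advantages(outcome)

# The shaped reward separates the rollout that retrieved the answer.
shaped = [folr_style_reward(t, f, answer) for t, f in rollouts]
shaped_adv = group_advantages(shaped)
```

In this toy group, only the first rollout receives a positive advantage under the shaped reward, because it is the only one whose retrieved text contains the answer. Under outcome-only rewards, the group gives no learning signal at all.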

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dissecting Failure Dynamics in Large Language Model Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM reasoning failures cluster at early entropy-spike transitions; the GUARD inference-time framework redirects them for more reliable results.