Towards flash thinking via decoupled advantage policy optimization

Zezhong Tan, Hang Gao, Xinhong Ma, Feng Zhang, Ziqiang Dong · 2025 · arXiv 2510.15374

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

representative citing papers

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.

citing papers explorer


Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

cs.LG · 2026-05-13 · unverdicted · none · ref 13 · 2 links
