pith. sign in

arxiv: 2605.29198 · v2 · pith:PH55DKONnew · submitted 2026-05-28 · 💻 cs.CV

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Pith reviewed 2026-06-29 08:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords GCPOtoken credit assignmentcontrastive promptsdiscrete policy optimizationtext-to-image generationchain-of-thought reasoningreinforcement learningpolicy optimization
0
0 comments X

The pith

GCPO assigns token-level advantages proportional to prediction differences under positive and negative prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Group-advantage RL methods assign the same reward signal to every token in a sample, which blurs which parts of the output actually matter. GCPO replaces that uniform broadcast with a per-token advantage computed from the difference in the model's own predictions when the prompt is flipped from positive to negative. The resulting signal highlights tokens that shift the output most under the contrast, producing learning that focuses on prompt-matched image areas and key words in reasoning chains. This change yields higher scores than GRPO and DAPO on both text-to-image and chain-of-thought benchmarks.

Core claim

Rather than broadcasting sample-level advantages uniformly, GCPO assigns token-level advantages proportional to the difference between model predictions under positive and negative prompts, supplying finer-grained credit assignment for discrete policy optimization.

What carries the argument

The difference between contrastive model predictions under positive and negative prompts, used to scale token-level advantages.

If this is right

  • Token-level advantages become more precise and informative than uniform sample-level rewards.
  • GCPO consistently outperforms GRPO and DAPO on text-to-image generation benchmarks.
  • GCPO consistently outperforms GRPO and DAPO on chain-of-thought reasoning benchmarks.
  • The method emphasizes semantically relevant visual regions aligned with prompts and critical keywords in reasoning traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrast mechanism could be tested on other discrete sequence tasks such as code generation.
  • Performance may depend on careful design of the positive and negative prompt pairs in ways the experiments do not fully vary.
  • GCPO could be combined with existing advantage estimators to further refine the token signals.

Load-bearing premise

The difference in model predictions under positive and negative prompts accurately isolates each token's contribution to the sample reward without systematic bias from prompt construction.

What would settle it

An ablation that replaces the positive and negative prompts with unrelated prompts and measures whether the performance gains over GRPO and DAPO disappear.

Figures

Figures reproduced from arXiv: 2605.29198 by Aditya Grover, Akash Gokul, Konstantinos Kallidromitis, Shufan Li, Yuta Kyuragi.

Figure 1
Figure 1. Figure 1: GCPO enables fine-grained token credit assignment via contrastive guidance. GCPO assigns per-token advantages by contrasting the likelihood of each token in a sampled sequence under positive (conditional) and negative (unconditional) prompts. Tokens in semantically critical image regions show high divergence between the two and are amplified, while background regions are down-weighted. The resulting per-to… view at source ↗
Figure 2
Figure 2. Figure 2: GCPO computes per-token importance weights via contrastive guidance. For text-to￾image generation (left), a sampled image sequence is scored by πθ under both positive (conditional, text prompt) and negative (unconditional, empty prompt) inputs, like in CFG. For multimodal reasoning (right), a sampled response is scored under positive (original question) and negative ("generate the wrong answer") prompts. I… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of normalization strategies. Min-max and softmax are sensitive to outliers. Histogram equalization produces a smooth distribution of weights regardless of the absolute scale, ensuring consistent optimization behavior across different samples and prompts. 3 Method In this work, we explore a novel token credit assignment algorithm based on classifier-free guidance. During text-to-image inference, … view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of GCPO vs GRPO vs Base Model (Janus-Pro-7B). We evaluate the models on GenEval to observe quality and text conditioning. GCPO is able to produce higher quality outputs (Clock, Boats, Scissors) that are more consistent with the text instructions ("apple above tv") [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training Dynamics of GCPO. We visualize the progression of MM12K validation accuracy at different training steps for GCPO, DAPO, and GRPO experiments. D Qualitative Results for multimodal reasoning. We provide output samples of GCPO-finetuned Qwen2.5-VL-7B-Instruct model in figure 9, figure 10, and figure 11, highlighting the strong multimodal reasoning capabilities of GCPO-enhanced VLMs. E Limitation Desp… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative Examples of Heatmap Guidance 15 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualizations of per token weighting under GCPO framework. Darker colors indicates higher weighting. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: More visualization of per token weighting under GCPO framework. Darker colors indicates higher weighting. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative Examples of Multimodal Reasoning (1/3). 18 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative Examples of Multimodal Reasoning (2/3). 19 [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative Examples of Multimodal Reasoning (3/3). 20 [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
read the original abstract

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Guidance Contrastive Policy Optimization (GCPO) as an extension to group-advantage RL methods such as GRPO and DAPO. GCPO enables per-token credit assignment by defining token-level advantages as proportional to the difference in model predictions (or logits) conditioned on positive versus negative prompts, rather than broadcasting uniform sample-level advantages. The authors claim this yields more precise learning signals, emphasizes semantically relevant tokens (e.g., prompt-aligned visual regions or critical keywords in CoT traces), and produces consistent empirical gains over the baselines on text-to-image generation and chain-of-thought reasoning benchmarks.

Significance. If the core construction is shown to isolate token contributions without systematic bias and the reported gains are reproducible with standard statistical controls, GCPO would address a recognized limitation of sample-level reward propagation in discrete policy optimization. The approach is general and could scale to other generation tasks, but the current manuscript supplies no quantitative metrics, dataset sizes, variance estimates, or protocol details, so the practical significance cannot yet be assessed.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim that GCPO 'consistently outperforms GRPO and DAPO' is unsupported because the abstract (and visible text) contains no metrics, dataset sizes, statistical tests, ablation results, or experimental protocol; this prevents any evaluation of whether the data-to-claim link holds.
  2. [Abstract / §3] Method definition (abstract and §3): the token advantage is defined as proportional to the contrastive prediction difference under positive/negative prompts, yet no derivation, identity, or bound is supplied showing that this difference recovers the true marginal contribution of each token to the sample reward or is free of prompt-construction bias and global effects (attention, KV cache). This assumption is load-bearing for the credit-assignment claim.
minor comments (1)
  1. [Abstract] Notation for the contrastive difference operator and the exact proportionality constant should be introduced with an equation rather than prose only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support in the abstract and a formal justification for the token advantage. We will revise the manuscript to address both points directly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim that GCPO 'consistently outperforms GRPO and DAPO' is unsupported because the abstract (and visible text) contains no metrics, dataset sizes, statistical tests, ablation results, or experimental protocol; this prevents any evaluation of whether the data-to-claim link holds.

    Authors: We agree that the abstract should be self-contained with quantitative support. The full manuscript reports results in Section 4 on text-to-image and CoT benchmarks, but to resolve this, we will revise the abstract to include representative metrics (e.g., relative improvements), dataset sizes, and a brief note on the evaluation protocol and variance. revision: yes

  2. Referee: [Abstract / §3] Method definition (abstract and §3): the token advantage is defined as proportional to the contrastive prediction difference under positive/negative prompts, yet no derivation, identity, or bound is supplied showing that this difference recovers the true marginal contribution of each token to the sample reward or is free of prompt-construction bias and global effects (attention, KV cache). This assumption is load-bearing for the credit-assignment claim.

    Authors: We acknowledge that the current version motivates the advantage via contrastive guidance but lacks a formal derivation or bound. In revision we will add a subsection in §3 deriving the form under an independence assumption on token contributions, plus explicit discussion of prompt-construction bias and global effects from attention/KV cache, supported by sensitivity analysis. revision: yes

Circularity Check

0 steps flagged

No circularity: GCPO contrastive advantage defined independently of fitted inputs or self-citations

full rationale

The abstract and provided excerpts define GCPO token advantages directly via the contrastive difference between positive/negative prompt predictions, without any equation reducing this quantity to a fitted parameter, a renamed baseline, or a self-cited uniqueness result. GRPO and DAPO are referenced only as performance baselines, not as load-bearing premises whose validity is imported via author overlap. No self-definitional loop, ansatz smuggling, or renaming of empirical patterns is exhibited. The construction is presented as an empirical proposal whose justification rests on downstream benchmarks rather than tautological identity with its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, invented entities, or formal axioms; the central mechanism rests on an unstated domain assumption about contrastive prediction differences.

axioms (1)
  • domain assumption Difference in model predictions under positive versus negative prompts isolates token-level contributions to sample reward
    This premise is required for the proportional advantage assignment described in the abstract.

pith-pipeline@v0.9.1-grok · 5728 in / 1257 out tokens · 30676 ms · 2026-06-29T08:56:57.770251+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 38 canonical work pages · 29 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Understanding the impact of negative prompts: When and how do they take effect? Ineuropean conference on computer vision, pages 190–206

    Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, and Cho-Jui Hsieh. Understanding the impact of negative prompts: When and how do they take effect? Ineuropean conference on computer vision, pages 190–206. Springer, 2024

  4. [4]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

  5. [5]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  6. [6]

    Christiano, Jan Leike, Tom B

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.NeurIPS, 2017

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  9. [9]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  10. [10]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  11. [11]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  12. [12]

    Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

    Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

  13. [13]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  14. [14]

    Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation

    Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244, 2025

  15. [15]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  16. [16]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025. 10

  17. [17]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

  18. [18]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol and Prafulla Dhariwal. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021

  19. [19]

    Contrastive decoding improves reasoning in large language models.arXiv preprint arXiv:2309.09117, 2023

    Sean O’Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models.arXiv preprint arXiv:2309.09117, 2023

  20. [20]

    Steering language generation: Harnessing contrastive expert guidance and negative prompting.arXiv preprint arXiv:2308.07645, 2023

    Charles O’Neill et al. Steering language generation: Harnessing contrastive expert guidance and negative prompting.arXiv preprint arXiv:2308.07645, 2023

  21. [21]

    Dall·e 3.https://openai.com/index/dall-e-3/, 2023

    OpenAI. Dall·e 3.https://openai.com/index/dall-e-3/, 2023

  22. [22]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  23. [23]

    Janus-pro-r1: Advancing collaborative visual comprehension and generation via reinforcement learning.arXiv preprint arXiv:2506.01480, 2025

    Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, et al. Janus-pro-r1: Advancing collaborative visual comprehension and generation via reinforcement learning.arXiv preprint arXiv:2506.01480, 2025

  24. [24]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  25. [25]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models.arXiv preprint arXiv:2112.10752, 2022

  26. [26]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia et al. Photorealistic text-to-image diffusion models with deep language understanding.arXiv preprint arXiv:2205.11487, 2022

  27. [27]

    Stay on topic with classifier-free guidance.arXiv preprint arXiv:2306.17806, 2023

    Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammana- manchi, and Stella Biderman. Stay on topic with classifier-free guidance.arXiv preprint arXiv:2306.17806, 2023

  28. [28]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High- dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438, 2016

  29. [29]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  31. [31]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

  32. [32]

    Learning to summarize with human feedback.NeurIPS, 2020

    Nisan Stiennon et al. Learning to summarize with human feedback.NeurIPS, 2020

  33. [33]

    Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

    Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, and Jiajun Zhang. Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

  34. [34]

    Joty, and Nikhil Naik

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq R. Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, 2023

  35. [35]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025. 11

  36. [36]

    Measuring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  37. [37]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025

  38. [38]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  39. [39]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  40. [40]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  41. [41]

    LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

    Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

  42. [42]

    Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

    Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning.arXiv preprint arXiv:2510.10649, 2025

  43. [43]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

  44. [44]

    MMaDA: Multimodal Large Diffusion Language Models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025

  45. [45]

    Data-regularized reinforcement learning for diffusion models at scale.arXiv preprint arXiv:2512.04332, 2025

    Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, et al. Data-regularized reinforcement learning for diffusion models at scale.arXiv preprint arXiv:2512.04332, 2025

  46. [46]

    Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

    En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jian- jian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

  47. [47]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  48. [48]

    Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025

  49. [49]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?arXiv preprint arXiv:2403.14624, 2024

  50. [50]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  51. [51]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025. 12 A Additional Implementation Details Text-to-Image.We employ Janus-Pro-7B as the base model and train GCPO for 1,600 ste...

  52. [52]

    generate a wrong answer

    as our base model. We followed the setup of VPPO and train the model for two epochs, which amounts to 202 steps on ViRL39K dataset. The specific hyperparameter are listed in Table 6 Table 6:Training configurations for Text-to-Image tasks. Learning Rate1×10 −6 Steps 202 β1 0.99 β2 0.999 optimizer AdamW Learning Rate Schedule Constant Model Size 8B Global B...

  53. [53]

    Rearrange the equation to isolate x: x + 4 + 6 = 3x implies x + 10 = 3x implies 10 = 2x implies x = 5

    For the sides, the equation derived from the equality of opposite sides will be: x + 4 = 3x - 6 Solving for x: x + 4 = 3x - 6. Rearrange the equation to isolate x: x + 4 + 6 = 3x implies x + 10 = 3x implies 10 = 2x implies x = 5

  54. [54]

    Specifically, the angles at the point of intersection where the diagonals split the parallelogram will help in finding y

    For the angles, the sum of the interior angles on the same side of a transversal is supplementary (180 degrees). Specifically, the angles at the point of intersection where the diagonals split the parallelogram will help in finding y. However, the value of y can be found directly from the side lengths equality. So, the value of x that makes the quadrilatera...

  55. [55]

    The triangle is in the top-left corner

    The first image has a triangle in the top-left, a square in the top-right, and a circle in the bottom-left. The triangle is in the top-left corner

  56. [56]

    The second image shows a square in the top-right, a circle in the bottom-left, and the triangle is missing from the top-left position

  57. [57]

    The circle is in the top-left corner

    The third image has a circle in the top-left, a triangle in the top-right, and a square in the bottom-left. The circle is in the top-left corner

  58. [58]

    Now, observing the position changes: - In the first image, the triangle is in the top-left

    The fourth image shows a square in the top-right, a circle in the bottom-left, and the triangle is missing from the top-left position. Now, observing the position changes: - In the first image, the triangle is in the top-left. - In the second image, it moves to the top-right, and the circle goes to the bottom-left. - In the third image, the circle is in th...

  59. [59]

    replaces all squares with stars,

    The first operation is "replaces all squares with stars," meaning any square symbol (\(\text{⬛ }\)) is replaced by a star (\(\star\)). So, \(\text{⬛ }\) becomes \(\star\). The set now is: \(\text{O} \, \star \, \star\)

  60. [60]

    changes the shading on the symbols

    The second operation is "changes the shading on the symbols." This operation will change the shading of each symbol. In the initial set, the symbols are: - O (circle) - \(\star\) (star) - \(\text{⬛ }\) (black square, which becomes \(\star\)) After replacing the square with a star, the set is \(\text{O} \, \star \, \star\). Now applying the shading change ...