Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Aditya Grover; Akash Gokul; Konstantinos Kallidromitis; Shufan Li; Yuta Kyuragi

arxiv: 2605.29198 · v2 · pith:PH55DKONnew · submitted 2026-05-28 · 💻 cs.CV

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Shufan Li , Konstantinos Kallidromitis , Akash Gokul , Yuta Kyuragi , Aditya Grover This is my paper

Pith reviewed 2026-06-29 08:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords GCPOtoken credit assignmentcontrastive promptsdiscrete policy optimizationtext-to-image generationchain-of-thought reasoningreinforcement learningpolicy optimization

0 comments

The pith

GCPO assigns token-level advantages proportional to prediction differences under positive and negative prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Group-advantage RL methods assign the same reward signal to every token in a sample, which blurs which parts of the output actually matter. GCPO replaces that uniform broadcast with a per-token advantage computed from the difference in the model's own predictions when the prompt is flipped from positive to negative. The resulting signal highlights tokens that shift the output most under the contrast, producing learning that focuses on prompt-matched image areas and key words in reasoning chains. This change yields higher scores than GRPO and DAPO on both text-to-image and chain-of-thought benchmarks.

Core claim

Rather than broadcasting sample-level advantages uniformly, GCPO assigns token-level advantages proportional to the difference between model predictions under positive and negative prompts, supplying finer-grained credit assignment for discrete policy optimization.

What carries the argument

The difference between contrastive model predictions under positive and negative prompts, used to scale token-level advantages.

If this is right

Token-level advantages become more precise and informative than uniform sample-level rewards.
GCPO consistently outperforms GRPO and DAPO on text-to-image generation benchmarks.
GCPO consistently outperforms GRPO and DAPO on chain-of-thought reasoning benchmarks.
The method emphasizes semantically relevant visual regions aligned with prompts and critical keywords in reasoning traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contrast mechanism could be tested on other discrete sequence tasks such as code generation.
Performance may depend on careful design of the positive and negative prompt pairs in ways the experiments do not fully vary.
GCPO could be combined with existing advantage estimators to further refine the token signals.

Load-bearing premise

The difference in model predictions under positive and negative prompts accurately isolates each token's contribution to the sample reward without systematic bias from prompt construction.

What would settle it

An ablation that replaces the positive and negative prompts with unrelated prompts and measures whether the performance gains over GRPO and DAPO disappear.

Figures

Figures reproduced from arXiv: 2605.29198 by Aditya Grover, Akash Gokul, Konstantinos Kallidromitis, Shufan Li, Yuta Kyuragi.

**Figure 1.** Figure 1: GCPO enables fine-grained token credit assignment via contrastive guidance. GCPO assigns per-token advantages by contrasting the likelihood of each token in a sampled sequence under positive (conditional) and negative (unconditional) prompts. Tokens in semantically critical image regions show high divergence between the two and are amplified, while background regions are down-weighted. The resulting per-to… view at source ↗

**Figure 2.** Figure 2: GCPO computes per-token importance weights via contrastive guidance. For text-toimage generation (left), a sampled image sequence is scored by πθ under both positive (conditional, text prompt) and negative (unconditional, empty prompt) inputs, like in CFG. For multimodal reasoning (right), a sampled response is scored under positive (original question) and negative ("generate the wrong answer") prompts. I… view at source ↗

**Figure 3.** Figure 3: Comparison of normalization strategies. Min-max and softmax are sensitive to outliers. Histogram equalization produces a smooth distribution of weights regardless of the absolute scale, ensuring consistent optimization behavior across different samples and prompts. 3 Method In this work, we explore a novel token credit assignment algorithm based on classifier-free guidance. During text-to-image inference, … view at source ↗

**Figure 4.** Figure 4: Comparison of GCPO vs GRPO vs Base Model (Janus-Pro-7B). We evaluate the models on GenEval to observe quality and text conditioning. GCPO is able to produce higher quality outputs (Clock, Boats, Scissors) that are more consistent with the text instructions ("apple above tv") [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Training Dynamics of GCPO. We visualize the progression of MM12K validation accuracy at different training steps for GCPO, DAPO, and GRPO experiments. D Qualitative Results for multimodal reasoning. We provide output samples of GCPO-finetuned Qwen2.5-VL-7B-Instruct model in figure 9, figure 10, and figure 11, highlighting the strong multimodal reasoning capabilities of GCPO-enhanced VLMs. E Limitation Desp… view at source ↗

**Figure 6.** Figure 6: Qualitative Examples of Heatmap Guidance 15 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Visualizations of per token weighting under GCPO framework. Darker colors indicates higher weighting. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: More visualization of per token weighting under GCPO framework. Darker colors indicates higher weighting. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative Examples of Multimodal Reasoning (1/3). 18 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative Examples of Multimodal Reasoning (2/3). 19 [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative Examples of Multimodal Reasoning (3/3). 20 [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

read the original abstract

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GCPO tries to fix uniform token credit in GRPO/DAPO via contrastive pos/neg prompt differences, but the abstract gives no derivation or numbers and the isolation claim looks unproven.

read the letter

The main thing to know is that GCPO replaces the flat sample-level advantage with a per-token one computed from how much the model's output distribution shifts between a positive prompt and a negative one. The paper correctly flags that GRPO and DAPO broadcast the same reward signal to every token, which wastes information in long sequences for image generation or reasoning traces.

What it does reasonably is name a real limitation and offer a concrete, easy-to-implement alternative that tries to spotlight the tokens that actually move the reward. The description of it highlighting prompt-aligned image regions or key reasoning words is at least directionally plausible.

The soft spots are the ones the stress-test note flags. Nothing in the abstract shows why the raw prediction difference should be proportional to a token's marginal contribution rather than picking up global effects, attention bleed, or artifacts from how the negative prompt was written. There is also no equation, bound, or even pseudocode to make the proportionality claim checkable. On top of that, the abstract asserts consistent wins over the baselines but supplies zero metrics, dataset sizes, run counts, or statistical details, so the empirical part cannot be assessed at all.

This is aimed at people already working on RL fine-tuning of discrete generative policies, especially those who have hit the credit-assignment wall with group-advantage methods. A reader who wants to experiment with contrastive signals might borrow the high-level idea, but they would have to fill in the justification and run their own controls.

I would send it to peer review. The problem it targets matters and the proposed fix is simple enough that referees could give useful feedback on whether the assumption holds up once the experiments and any derivation are in place.

Referee Report

2 major / 1 minor

Summary. The paper proposes Guidance Contrastive Policy Optimization (GCPO) as an extension to group-advantage RL methods such as GRPO and DAPO. GCPO enables per-token credit assignment by defining token-level advantages as proportional to the difference in model predictions (or logits) conditioned on positive versus negative prompts, rather than broadcasting uniform sample-level advantages. The authors claim this yields more precise learning signals, emphasizes semantically relevant tokens (e.g., prompt-aligned visual regions or critical keywords in CoT traces), and produces consistent empirical gains over the baselines on text-to-image generation and chain-of-thought reasoning benchmarks.

Significance. If the core construction is shown to isolate token contributions without systematic bias and the reported gains are reproducible with standard statistical controls, GCPO would address a recognized limitation of sample-level reward propagation in discrete policy optimization. The approach is general and could scale to other generation tasks, but the current manuscript supplies no quantitative metrics, dataset sizes, variance estimates, or protocol details, so the practical significance cannot yet be assessed.

major comments (2)

[Abstract] Abstract: the central empirical claim that GCPO 'consistently outperforms GRPO and DAPO' is unsupported because the abstract (and visible text) contains no metrics, dataset sizes, statistical tests, ablation results, or experimental protocol; this prevents any evaluation of whether the data-to-claim link holds.
[Abstract / §3] Method definition (abstract and §3): the token advantage is defined as proportional to the contrastive prediction difference under positive/negative prompts, yet no derivation, identity, or bound is supplied showing that this difference recovers the true marginal contribution of each token to the sample reward or is free of prompt-construction bias and global effects (attention, KV cache). This assumption is load-bearing for the credit-assignment claim.

minor comments (1)

[Abstract] Notation for the contrastive difference operator and the exact proportionality constant should be introduced with an equation rather than prose only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support in the abstract and a formal justification for the token advantage. We will revise the manuscript to address both points directly.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claim that GCPO 'consistently outperforms GRPO and DAPO' is unsupported because the abstract (and visible text) contains no metrics, dataset sizes, statistical tests, ablation results, or experimental protocol; this prevents any evaluation of whether the data-to-claim link holds.

Authors: We agree that the abstract should be self-contained with quantitative support. The full manuscript reports results in Section 4 on text-to-image and CoT benchmarks, but to resolve this, we will revise the abstract to include representative metrics (e.g., relative improvements), dataset sizes, and a brief note on the evaluation protocol and variance. revision: yes
Referee: [Abstract / §3] Method definition (abstract and §3): the token advantage is defined as proportional to the contrastive prediction difference under positive/negative prompts, yet no derivation, identity, or bound is supplied showing that this difference recovers the true marginal contribution of each token to the sample reward or is free of prompt-construction bias and global effects (attention, KV cache). This assumption is load-bearing for the credit-assignment claim.

Authors: We acknowledge that the current version motivates the advantage via contrastive guidance but lacks a formal derivation or bound. In revision we will add a subsection in §3 deriving the form under an independence assumption on token contributions, plus explicit discussion of prompt-construction bias and global effects from attention/KV cache, supported by sensitivity analysis. revision: yes

Circularity Check

0 steps flagged

No circularity: GCPO contrastive advantage defined independently of fitted inputs or self-citations

full rationale

The abstract and provided excerpts define GCPO token advantages directly via the contrastive difference between positive/negative prompt predictions, without any equation reducing this quantity to a fitted parameter, a renamed baseline, or a self-cited uniqueness result. GRPO and DAPO are referenced only as performance baselines, not as load-bearing premises whose validity is imported via author overlap. No self-definitional loop, ansatz smuggling, or renaming of empirical patterns is exhibited. The construction is presented as an empirical proposal whose justification rests on downstream benchmarks rather than tautological identity with its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, invented entities, or formal axioms; the central mechanism rests on an unstated domain assumption about contrastive prediction differences.

axioms (1)

domain assumption Difference in model predictions under positive versus negative prompts isolates token-level contributions to sample reward
This premise is required for the proportional advantage assignment described in the abstract.

pith-pipeline@v0.9.1-grok · 5728 in / 1257 out tokens · 30676 ms · 2026-06-29T08:56:57.770251+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 38 canonical work pages · 29 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Understanding the impact of negative prompts: When and how do they take effect? Ineuropean conference on computer vision, pages 190–206

Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, and Cho-Jui Hsieh. Understanding the impact of negative prompts: When and how do they take effect? Ineuropean conference on computer vision, pages 190–206. Springer, 2024

2024
[4]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.NeurIPS, 2017

2017
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024
[10]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023
[11]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

work page arXiv 2025
[13]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[14]

Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244, 2025

work page arXiv 2025
[15]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol and Prafulla Dhariwal. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Contrastive decoding improves reasoning in large language models.arXiv preprint arXiv:2309.09117, 2023

Sean O’Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models.arXiv preprint arXiv:2309.09117, 2023

work page arXiv 2023
[20]

Steering language generation: Harnessing contrastive expert guidance and negative prompting.arXiv preprint arXiv:2308.07645, 2023

Charles O’Neill et al. Steering language generation: Harnessing contrastive expert guidance and negative prompting.arXiv preprint arXiv:2308.07645, 2023

work page arXiv 2023
[21]

Dall·e 3.https://openai.com/index/dall-e-3/, 2023

OpenAI. Dall·e 3.https://openai.com/index/dall-e-3/, 2023

2023
[22]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[23]

Janus-pro-r1: Advancing collaborative visual comprehension and generation via reinforcement learning.arXiv preprint arXiv:2506.01480, 2025

Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, et al. Janus-pro-r1: Advancing collaborative visual comprehension and generation via reinforcement learning.arXiv preprint arXiv:2506.01480, 2025

work page arXiv 2025
[24]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

2023
[25]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models.arXiv preprint arXiv:2112.10752, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia et al. Photorealistic text-to-image diffusion models with deep language understanding.arXiv preprint arXiv:2205.11487, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Stay on topic with classifier-free guidance.arXiv preprint arXiv:2306.17806, 2023

Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammana- manchi, and Stella Biderman. Stay on topic with classifier-free guidance.arXiv preprint arXiv:2306.17806, 2023

work page arXiv 2023
[28]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High- dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[29]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Learning to summarize with human feedback.NeurIPS, 2020

Nisan Stiennon et al. Learning to summarize with human feedback.NeurIPS, 2020

2020
[33]

Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, and Jiajun Zhang. Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

work page arXiv 2025
[34]

Joty, and Nikhil Naik

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq R. Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, 2023

2024
[35]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Measuring multimodal mathematical reasoning with math-vision dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

2024
[37]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[40]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning.arXiv preprint arXiv:2510.10649, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

MMaDA: Multimodal Large Diffusion Language Models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Data-regularized reinforcement learning for diffusion models at scale.arXiv preprint arXiv:2512.04332, 2025

Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, et al. Data-regularized reinforcement learning for diffusion models at scale.arXiv preprint arXiv:2512.04332, 2025

work page arXiv 2025
[46]

Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jian- jian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

work page arXiv 2025
[47]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025

2025
[49]

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?arXiv preprint arXiv:2403.14624, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025. 12 A Additional Implementation Details Text-to-Image.We employ Janus-Pro-7B as the base model and train GCPO for 1,600 ste...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

generate a wrong answer

as our base model. We followed the setup of VPPO and train the model for two epochs, which amounts to 202 steps on ViRL39K dataset. The specific hyperparameter are listed in Table 6 Table 6:Training configurations for Text-to-Image tasks. Learning Rate1×10 −6 Steps 202 β1 0.99 β2 0.999 optimizer AdamW Learning Rate Schedule Constant Model Size 8B Global B...
[53]

Rearrange the equation to isolate x: x + 4 + 6 = 3x implies x + 10 = 3x implies 10 = 2x implies x = 5

For the sides, the equation derived from the equality of opposite sides will be: x + 4 = 3x - 6 Solving for x: x + 4 = 3x - 6. Rearrange the equation to isolate x: x + 4 + 6 = 3x implies x + 10 = 3x implies 10 = 2x implies x = 5
[54]

Speciﬁcally, the angles at the point of intersection where the diagonals split the parallelogram will help in ﬁnding y

For the angles, the sum of the interior angles on the same side of a transversal is supplementary (180 degrees). Speciﬁcally, the angles at the point of intersection where the diagonals split the parallelogram will help in ﬁnding y. However, the value of y can be found directly from the side lengths equality. So, the value of x that makes the quadrilatera...
[55]

The triangle is in the top-left corner

The ﬁrst image has a triangle in the top-left, a square in the top-right, and a circle in the bottom-left. The triangle is in the top-left corner
[56]

The second image shows a square in the top-right, a circle in the bottom-left, and the triangle is missing from the top-left position
[57]

The circle is in the top-left corner

The third image has a circle in the top-left, a triangle in the top-right, and a square in the bottom-left. The circle is in the top-left corner
[58]

Now, observing the position changes: - In the ﬁrst image, the triangle is in the top-left

The fourth image shows a square in the top-right, a circle in the bottom-left, and the triangle is missing from the top-left position. Now, observing the position changes: - In the ﬁrst image, the triangle is in the top-left. - In the second image, it moves to the top-right, and the circle goes to the bottom-left. - In the third image, the circle is in th...
[59]

replaces all squares with stars,

The ﬁrst operation is "replaces all squares with stars," meaning any square symbol ($\text{⬛ }$) is replaced by a star ($\star$). So, $\text{⬛ }$ becomes $\star$. The set now is: $\text{O} \, \star \, \star$
[60]

changes the shading on the symbols

The second operation is "changes the shading on the symbols." This operation will change the shading of each symbol. In the initial set, the symbols are: - O (circle) - $\star$ (star) - $\text{⬛ }$ (black square, which becomes $\star$) After replacing the square with a star, the set is $\text{O} \, \star \, \star$. Now applying the shading change ...

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Understanding the impact of negative prompts: When and how do they take effect? Ineuropean conference on computer vision, pages 190–206

Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, and Cho-Jui Hsieh. Understanding the impact of negative prompts: When and how do they take effect? Ineuropean conference on computer vision, pages 190–206. Springer, 2024

2024

[4] [4]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.NeurIPS, 2017

2017

[7] [7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024

[10] [10]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023

[11] [11]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

work page arXiv 2025

[13] [13]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024

[14] [14]

Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244, 2025

work page arXiv 2025

[15] [15]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol and Prafulla Dhariwal. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[19] [19]

Contrastive decoding improves reasoning in large language models.arXiv preprint arXiv:2309.09117, 2023

Sean O’Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models.arXiv preprint arXiv:2309.09117, 2023

work page arXiv 2023

[20] [20]

Steering language generation: Harnessing contrastive expert guidance and negative prompting.arXiv preprint arXiv:2308.07645, 2023

Charles O’Neill et al. Steering language generation: Harnessing contrastive expert guidance and negative prompting.arXiv preprint arXiv:2308.07645, 2023

work page arXiv 2023

[21] [21]

Dall·e 3.https://openai.com/index/dall-e-3/, 2023

OpenAI. Dall·e 3.https://openai.com/index/dall-e-3/, 2023

2023

[22] [22]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[23] [23]

Janus-pro-r1: Advancing collaborative visual comprehension and generation via reinforcement learning.arXiv preprint arXiv:2506.01480, 2025

Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, et al. Janus-pro-r1: Advancing collaborative visual comprehension and generation via reinforcement learning.arXiv preprint arXiv:2506.01480, 2025

work page arXiv 2025

[24] [24]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

2023

[25] [25]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models.arXiv preprint arXiv:2112.10752, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[26] [26]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia et al. Photorealistic text-to-image diffusion models with deep language understanding.arXiv preprint arXiv:2205.11487, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[27] [27]

Stay on topic with classifier-free guidance.arXiv preprint arXiv:2306.17806, 2023

Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammana- manchi, and Stella Biderman. Stay on topic with classifier-free guidance.arXiv preprint arXiv:2306.17806, 2023

work page arXiv 2023

[28] [28]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High- dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[29] [29]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Learning to summarize with human feedback.NeurIPS, 2020

Nisan Stiennon et al. Learning to summarize with human feedback.NeurIPS, 2020

2020

[33] [33]

Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, and Jiajun Zhang. Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

work page arXiv 2025

[34] [34]

Joty, and Nikhil Naik

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq R. Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, 2023

2024

[35] [35]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Measuring multimodal mathematical reasoning with math-vision dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

2024

[37] [37]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022

[40] [40]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning.arXiv preprint arXiv:2510.10649, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

MMaDA: Multimodal Large Diffusion Language Models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Data-regularized reinforcement learning for diffusion models at scale.arXiv preprint arXiv:2512.04332, 2025

Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, et al. Data-regularized reinforcement learning for diffusion models at scale.arXiv preprint arXiv:2512.04332, 2025

work page arXiv 2025

[46] [46]

Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jian- jian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

work page arXiv 2025

[47] [47]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025

2025

[49] [49]

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?arXiv preprint arXiv:2403.14624, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025. 12 A Additional Implementation Details Text-to-Image.We employ Janus-Pro-7B as the base model and train GCPO for 1,600 ste...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

generate a wrong answer

as our base model. We followed the setup of VPPO and train the model for two epochs, which amounts to 202 steps on ViRL39K dataset. The specific hyperparameter are listed in Table 6 Table 6:Training configurations for Text-to-Image tasks. Learning Rate1×10 −6 Steps 202 β1 0.99 β2 0.999 optimizer AdamW Learning Rate Schedule Constant Model Size 8B Global B...

[53] [53]

Rearrange the equation to isolate x: x + 4 + 6 = 3x implies x + 10 = 3x implies 10 = 2x implies x = 5

For the sides, the equation derived from the equality of opposite sides will be: x + 4 = 3x - 6 Solving for x: x + 4 = 3x - 6. Rearrange the equation to isolate x: x + 4 + 6 = 3x implies x + 10 = 3x implies 10 = 2x implies x = 5

[54] [54]

Speciﬁcally, the angles at the point of intersection where the diagonals split the parallelogram will help in ﬁnding y

For the angles, the sum of the interior angles on the same side of a transversal is supplementary (180 degrees). Speciﬁcally, the angles at the point of intersection where the diagonals split the parallelogram will help in ﬁnding y. However, the value of y can be found directly from the side lengths equality. So, the value of x that makes the quadrilatera...

[55] [55]

The triangle is in the top-left corner

The ﬁrst image has a triangle in the top-left, a square in the top-right, and a circle in the bottom-left. The triangle is in the top-left corner

[56] [56]

The second image shows a square in the top-right, a circle in the bottom-left, and the triangle is missing from the top-left position

[57] [57]

The circle is in the top-left corner

The third image has a circle in the top-left, a triangle in the top-right, and a square in the bottom-left. The circle is in the top-left corner

[58] [58]

Now, observing the position changes: - In the ﬁrst image, the triangle is in the top-left

The fourth image shows a square in the top-right, a circle in the bottom-left, and the triangle is missing from the top-left position. Now, observing the position changes: - In the ﬁrst image, the triangle is in the top-left. - In the second image, it moves to the top-right, and the circle goes to the bottom-left. - In the third image, the circle is in th...

[59] [59]

replaces all squares with stars,

The ﬁrst operation is "replaces all squares with stars," meaning any square symbol (\(\text{⬛ }\)) is replaced by a star (\(\star\)). So, \(\text{⬛ }\) becomes \(\star\). The set now is: \(\text{O} \, \star \, \star\)

[60] [60]

changes the shading on the symbols

The second operation is "changes the shading on the symbols." This operation will change the shading of each symbol. In the initial set, the symbols are: - O (circle) - \(\star\) (star) - \(\text{⬛ }\) (black square, which becomes \(\star\)) After replacing the square with a star, the set is \(\text{O} \, \star \, \star\). Now applying the shading change ...