GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Chengqi Duan; Hongsheng Li; Kun Wang; LinJiang Huang; Rongyao Fang; Xihui Liu; Xingyu Zeng; Yuqing Wang

arXiv: 2505.17022 · v2 · submitted 2025-05-22 · cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, LinJiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu

Pith reviewed 2026-05-22 12:58 UTC · model grok-4.3

keywords: reinforcement learning, text-to-image generation, multimodal large language models, compositional generation, spatial reasoning, Generation Chain-of-Thought, attribute binding

The pith

GoT-R1 uses reinforcement learning to let visual generation models discover their own reasoning strategies for complex spatial and attribute prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GoT-R1 as a way to fix the weakness of text-to-image models on prompts that require multiple objects in exact positions with specific properties. It starts from the Generation Chain-of-Thought idea but replaces fixed templates with reinforcement learning so the model can find better reasoning steps on its own. A dual-stage reward system lets multimodal large language models judge both the thinking process and the finished image on semantic fit, spatial correctness, and visual quality at the same time. If this works, models should produce more accurate images for detailed compositional descriptions without needing hand-crafted reasoning scripts. This matters for any application where users want reliable control over layout and object details in generated pictures.

Core claim

GoT-R1 shows that reinforcement learning applied to the Generation Chain-of-Thought process allows multimodal models to autonomously develop effective reasoning strategies for visual generation. The key mechanism is a dual-stage multi-dimensional reward framework in which MLLMs provide supervision on both the intermediate reasoning steps and the final image output, scoring semantic alignment, spatial accuracy, and visual quality together. Experiments on the T2I-CompBench benchmark confirm clear gains, especially on tasks that test precise spatial relationships and attribute binding.

What carries the argument

The dual-stage multi-dimensional reward framework that uses MLLMs to score both the reasoning chain and the generated image across semantic, spatial, and quality dimensions.
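To make the shape of such a reward concrete, here is a minimal Python sketch. It is not the paper's formula: the abstract does not say how the per-dimension scores are combined, so the weighted mean, the `reasoning_share` parameter, and the function names `stage_score` and `dual_stage_reward` are all assumptions made for illustration. The per-dimension scores would come from an MLLM judge, which is left out here.

```python
from typing import Mapping


def stage_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean of per-dimension scores for one stage.

    `scores` maps a dimension name (e.g. "semantic", "spatial", "quality")
    to a value in [0, 1]. The weighting scheme is an assumption.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1], got {value}")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights for the scored dimensions must sum to a positive value")
    return sum(weights[name] * value for name, value in scores.items()) / total_weight


def dual_stage_reward(
    reasoning_scores: Mapping[str, float],
    image_scores: Mapping[str, float],
    weights: Mapping[str, float],
    reasoning_share: float = 0.5,  # hypothetical mixing coefficient
) -> float:
    """Blend the reasoning-stage score with the image-stage score."""
    if not 0.0 <= reasoning_share <= 1.0:
        raise ValueError("reasoning_share must lie in [0, 1]")
    reasoning = stage_score(reasoning_scores, weights)
    image = stage_score(image_scores, weights)
    return reasoning_share * reasoning + (1.0 - reasoning_share) * image
```

A single scalar keeps the sketch short, but it also shows why the evaluator matters: any systematic error in one dimension's score flows straight into the signal the policy is optimized against.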

If this is right

Clear gains on compositional image generation tasks that involve precise object placement and attribute binding.
Models learn reasoning paths that go beyond any fixed Chain-of-Thought templates supplied at training time.
Unified supervision across the full pipeline from text reasoning to final pixel output.
Transfer of advanced language-model reasoning techniques directly into the visual generation setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reward design could be tested on video or 3D generation where temporal and depth relations add extra spatial demands.
Autonomous discovery of reasoning steps may reduce the need for human-written templates across other multimodal tasks.
If the MLLM evaluator can be replaced by lighter models, the method becomes more practical for large-scale training.
Similar reinforcement approaches might improve text-to-image alignment even when the base model is not an MLLM.

Load-bearing premise

The MLLM-based dual-stage rewards give reliable and unbiased feedback on reasoning quality and image correctness.

What would settle it

A new benchmark of spatial and attribute prompts where human raters consistently disagree with the MLLM reward scores or where the reported gains on T2I-CompBench do not appear.
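One way to quantify "consistently disagree" is a chance-corrected agreement statistic between human and MLLM verdicts on the same images. The sketch below computes Cohen's kappa for two raters in plain Python. Treating each judgment as a categorical label (for example pass or fail on a spatial prompt) is an assumption of this illustration, as is the function name `cohens_kappa`; the paper does not describe such a test.

```python
import math
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    Returns a value of at most 1.0, where 1.0 means perfect agreement and
    0.0 means agreement no better than chance. Returns NaN when both raters
    use one identical label for every item, since kappa is undefined there.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled independently,
    # each following their own label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return math.nan
    return (observed - expected) / (1.0 - expected)
```

A kappa near zero on spatial prompts, alongside high reward scores, would point toward the policy exploiting the evaluator rather than improving layout.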

Original abstract

Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GoT-R1 adds RL on top of generation chain-of-thought to improve compositional text-to-image results, but the MLLM reward lacks any reported human correlation check for spatial scoring.

The punchline is that GoT-R1 shows how reinforcement learning can help a multimodal model develop its own reasoning steps for better text-to-image generation, especially on prompts that require precise object placement and attributes. The gains on T2I-CompBench look promising on paper, but the evaluation setup has a clear weakness.

What is new here is combining generation chain-of-thought with RL and a dual reward that checks both the thought process and the image. The reward covers semantic alignment, spatial accuracy, and visual quality using an MLLM as judge. This moves beyond fixed templates and tries to supervise the entire pipeline. Releasing the code and models publicly is a solid move that lets others test and extend the work. The paper does a decent job highlighting a real limitation in current visual generation models. Complex prompts with spatial relations still trip up most systems, and this framework aims to address that through learned reasoning.

The soft spot is the lack of validation for the MLLM reward. The approach assumes the MLLM can reliably score spatial accuracy and attribute binding, but MLLMs are known to be inconsistent on predicates like "left of" or "above". Without a correlation study or an agreement check against human raters on the same images, it is possible the improvements come from exploiting quirks in the evaluator rather than from genuine reasoning gains. If the full results section has detailed ablations or human comparisons, that would address this, but the abstract alone does not show it.

This paper is for people working on improving controllability in diffusion or autoregressive image models. A reader who follows work on chain-of-thought for vision or RL in generation would find the framework description useful. It deserves a serious referee because the central claim is testable and the code is out there for verification.

I would recommend sending it to peer review, with the expectation that reviewers will push on the reward reliability and ask for more quantitative details on the improvements.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes GoT-R1, a reinforcement learning framework that extends the Generation Chain-of-Thought approach to enhance semantic-spatial reasoning in MLLMs for text-to-image generation. It introduces a dual-stage multi-dimensional reward system that uses an MLLM to score both reasoning traces and final images on semantic alignment, spatial accuracy, and visual quality. The paper claims significant improvements on the T2I-CompBench benchmark, especially in compositional tasks involving precise spatial relationships and attribute binding, and releases code and models publicly.

Significance. If the results hold after addressing the evaluator validation gap, the work could advance visual generation by showing how RL enables autonomous discovery of reasoning strategies beyond fixed templates. The public code and model release supports reproducibility and extensions in the multimodal generation community.

major comments (1)

[§3.2–3.3] The headline claim of significant gains on T2I-CompBench compositional tasks depends on the dual-stage reward framework supplying reliable supervision for spatial accuracy and attribute binding. The framework prompts an MLLM to score reasoning traces and final images, yet no section reports inter-rater agreement or Pearson correlation between MLLM scores and human spatial ratings on the same outputs. Given known MLLM inconsistencies on fine-grained spatial predicates, this leaves open whether the policy acquires robust reasoning or exploits evaluator idiosyncrasies; the issue is load-bearing for the central result.

minor comments (1)

[Abstract] The abstract asserts benchmark gains without any quantitative results, ablation details, or error analysis, which makes it harder to gauge the magnitude and robustness of the claimed improvements from the summary alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The concern regarding validation of the MLLM reward model is well-taken and directly relevant to the reliability of our central claims. We address it point-by-point below.

Referee: [§3.2–3.3] The headline claim of significant gains on T2I-CompBench compositional tasks depends on the dual-stage reward framework supplying reliable supervision for spatial accuracy and attribute binding. The framework prompts an MLLM to score reasoning traces and final images, yet no section reports inter-rater agreement or Pearson correlation between MLLM scores and human spatial ratings on the same outputs. Given known MLLM inconsistencies on fine-grained spatial predicates, this leaves open whether the policy acquires robust reasoning or exploits evaluator idiosyncrasies; the issue is load-bearing for the central result.

Authors: We agree that direct validation of the MLLM evaluator against human judgments is important for substantiating the reliability of the dual-stage reward. The current manuscript does not include such an analysis, focusing instead on end-to-end benchmark gains. In the revised version we will add a dedicated human study subsection. We will sample 200 reasoning traces and corresponding images (balanced across methods), collect ratings from at least three independent human annotators on semantic alignment, spatial accuracy, and visual quality using the same rubric as the MLLM, and report (i) inter-annotator agreement via Krippendorff’s alpha and (ii) Pearson correlation between the averaged human scores and the MLLM scores, with particular emphasis on the spatial-accuracy dimension. This addition will directly address whether the reward model aligns with human perception or merely exploits model-specific biases.

Revision: yes
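The second statistic in the proposed study, Pearson correlation between averaged human scores and MLLM scores, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' analysis code: the helper names `mean_human_scores` and `pearson` are hypothetical, ratings are assumed to be numeric on a shared rubric, and Krippendorff's alpha is not covered here.

```python
from math import sqrt
from statistics import fmean
from typing import Sequence


def mean_human_scores(ratings: Sequence[Sequence[float]]) -> list[float]:
    """Average the annotators' ratings for each item.

    `ratings[i]` holds every annotator's score for item i on one dimension,
    for example spatial accuracy.
    """
    return [fmean(item_ratings) for item_ratings in ratings]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length, at least 2 items")
    mean_x, mean_y = fmean(x), fmean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_x * var_y)
```

In use, one would call `pearson(mean_human_scores(human_ratings), mllm_scores)` once per reward dimension. Reporting the spatial dimension separately matters because a high pooled correlation could hide weak agreement on exactly the predicates the referee flags.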

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external benchmarks and independent MLLM evaluation

The paper describes an RL framework (GoT-R1) that builds on Generation Chain-of-Thought and uses a dual-stage MLLM-based reward to supervise semantic alignment, spatial accuracy, and visual quality. Central claims of improvement are demonstrated via performance on the external T2I-CompBench benchmark rather than any fitted parameter or self-defined quantity being relabeled as a prediction. No equations, derivations, or load-bearing self-citations reduce the reported gains to tautological inputs by construction. The reward model and the benchmark evaluation are external to the trained policy, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the assumption that MLLM-based evaluation is a faithful proxy for human judgment of semantics and spatial correctness; no explicit numerical free parameters or new physical entities are introduced in the abstract.

axioms (1)

domain assumption: MLLMs can serve as reliable judges for semantic alignment, spatial accuracy, and visual quality in generated images and reasoning traces.
The dual-stage reward system is built directly on this evaluation capability.

pith-pipeline@v0.9.0 · 5776 in / 1111 out tokens · 104542 ms · 2026-05-22T12:58:33.360037+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output... semantic alignment, spatial accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

B-GRTO extends GRPO by reusing rollouts to optimize auxiliary segmentation decoder objectives, yielding substantial gains over plain GRPO on referring segmentation tasks.

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...

Meta-CoT: Enhancing Granularity and Generalization in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
cs.CV 2026-04 unverdicted novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 5.0

GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
cs.CV 2025-12 conditional novelty 5.0

A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL 2025-09 accept novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.