SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Aoxue Li; Chengqi Duan; Jiaqi Liao; Shenghua Gao; Weiyang Jin; Xihui Liu; Yuwei Niu

arxiv: 2510.12784 · v2 · pith:GBVNWFALnew · submitted 2025-10-14 · 💻 cs.CV · cs.CL

Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu

keywords generation reward SRUM understanding module model models

Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a model's strong visual understanding often fails to transfer to visual generation: it may correctly judge prompt-image alignment while failing to generate a faithful image from the same prompt. This raises a compelling question: Can a model improve itself by using its understanding module to reward its generation module? We introduce SRUM, a self-rewarding post-training framework directly applicable to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal "evaluator", providing corrective signals to improve generation without additional human-labeled data or external reward models. To provide comprehensive feedback, SRUM uses a global-local dual reward system: a global reward ensures overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM shows strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
cs.CV 2025-03 unverdicted novelty 7.0

Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement
cs.CV 2026-05 unverdicted novelty 6.0

DIVA factorizes visual representations in unified multimodal models into shared and unique components via complementary information flows and mutual information estimation to convert representation divergence into mut...

LatentUMM: Dual Latent Alignment for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
cs.CV 2025-10 unverdicted novelty 6.0

UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.

Seeing Touch from Motion: A Unified Modality-Aware Visuo-Tactile Policy with Tactile Motion Correlation
cs.RO 2026-06 unverdicted novelty 5.0

A visuo-tactile policy learning method that exploits tactile motion correlation for contact state distinction and Mixture-of-Transformers for cross-modal fusion.