arxiv: 2510.12784 · v2 · pith:GBVNWFALnew · submitted 2025-10-14 · 💻 cs.CV · cs.CL

SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

classification 💻 cs.CV cs.CL
keywords generation reward srum understanding module model models

Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a model's strong visual understanding often fails to transfer to visual generation: it may correctly judge prompt-image alignment while failing to generate a faithful image from the same prompt. This raises a compelling question: Can a model improve itself by using its understanding module to reward its generation module? We introduce SRUM, a self-rewarding post-training framework directly applicable to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal "evaluator", providing corrective signals to improve generation without additional human-labeled data or external reward models. To provide comprehensive feedback, SRUM uses a global-local dual reward system: a global reward ensures overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM shows strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.
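
The global-local dual reward described in the abstract can be illustrated with a toy sketch. The abstract does not say how the two rewards are computed or combined, so everything here is an assumption made for illustration: the `Candidate` type, the `combined_reward` and `rank_candidates` helpers, the [0, 1] score range, and the weighted-average blend are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One generated image, scored by the model's own understanding module.

    All fields are hypothetical placeholders, not the paper's data format.
    """
    name: str
    global_reward: float            # assumed: overall semantics/layout score in [0, 1]
    local_rewards: Sequence[float]  # assumed: one score per object or region in [0, 1]


def combined_reward(c: Candidate, local_weight: float = 0.5) -> float:
    """Blend the global reward with the mean local reward.

    The linear blend and the default weight are assumptions for illustration.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must be in [0, 1]")
    # With no object-level scores available, treat the local term as 0.
    local = sum(c.local_rewards) / len(c.local_rewards) if c.local_rewards else 0.0
    return (1.0 - local_weight) * c.global_reward + local_weight * local


def rank_candidates(cands: Sequence[Candidate], local_weight: float = 0.5) -> List[Candidate]:
    """Return candidates sorted from highest to lowest combined reward."""
    return sorted(cands, key=lambda c: combined_reward(c, local_weight), reverse=True)


if __name__ == "__main__":
    samples = [
        Candidate("good_layout_wrong_objects", 0.9, [0.2, 0.4]),
        Candidate("fair_layout_right_objects", 0.7, [0.8, 0.8]),
    ]
    for cand in rank_candidates(samples):
        print(cand.name, round(combined_reward(cand), 3))
```

In this toy setting, an image with a plausible overall layout but poorly rendered objects can rank below one with a slightly weaker layout and accurate objects, which is the kind of fine-grained distinction a local reward is meant to capture.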

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  2. DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement

    cs.CV 2026-05 unverdicted novelty 6.0

    DIVA factorizes visual representations in unified multimodal models into shared and unique components via complementary information flows and mutual information estimation to convert representation divergence into mut...

  3. LatentUMM: Dual Latent Alignment for Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 6.0

    LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.

  4. Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

    cs.CV 2025-10 unverdicted novelty 6.0

    UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.

  5. Seeing Touch from Motion: A Unified Modality-Aware Visuo-Tactile Policy with Tactile Motion Correlation

    cs.RO 2026-06 unverdicted novelty 5.0

    A visuo-tactile policy learning method that exploits tactile motion correlation for contact state distinction and Mixture-of-Transformers for cross-modal fusion.