Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
Pith reviewed 2026-05-12 05:16 UTC · model grok-4.3
The pith
A single reinforcement learning loop lets personalized understanding guide generation while generation refines understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sync-R1 is an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, personalized comprehension guides content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape orchestrated by Sync-GRPO and Dynamic Group Scaling, achieving state-of-the-art results on UnifyBench++ without complex cold-start procedures.
What carries the argument
The unified explicit reasoning loop, which enables reciprocal refinement between personalized understanding and generation via an ensemble reward system (Sync-GRPO) and adaptive trajectory filtering (Dynamic Group Scaling).
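The abstract does not spell out the update rule, so the following is only a minimal sketch of what a GRPO-style step with an ensemble reward and group-level filtering could look like. The reward weights, the filtering criterion, and every function name here are illustrative assumptions, not the authors' definitions of Sync-GRPO or DGS.

```python
import numpy as np

def ensemble_reward(r_und, r_gen, w_und=0.5, w_gen=0.5):
    # Hypothetical ensemble reward: a weighted sum of an understanding
    # score and a generation-quality score. Weights are illustrative only.
    return w_und * r_und + w_gen * r_gen

def grpo_advantages(rewards, eps=1e-8):
    # Standard group-relative normalization used by GRPO-family methods:
    # each trajectory is scored against its own sampled group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def dynamic_group_scaling(groups, min_std=0.05):
    # Stand-in for DGS: discard groups whose reward spread is too small to
    # carry gradient signal. The real DGS criterion is not public.
    return [g for g in groups if np.std(g) >= min_std]

# Toy usage: per-trajectory (understanding, generation) reward pairs.
groups = [
    [ensemble_reward(u, g) for u, g in [(0.9, 0.7), (0.4, 0.5), (0.8, 0.2), (0.3, 0.9)]],
    [ensemble_reward(u, g) for u, g in [(0.5, 0.5)] * 4],  # degenerate group
]
kept = dynamic_group_scaling(groups)           # filters the zero-variance group
advantages = [grpo_advantages(g) for g in kept]
```

The ensemble term only pays off when both capabilities score well on the same trajectory, which is one plausible way comprehension and creation could be coupled; the group filter is a variance-reduction device, matching the role the abstract assigns to DGS.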
If this is right
- Personalized understanding directly guides and improves the quality of generated content.
- Generation feedback in turn refines and strengthens the understanding capabilities.
- The system achieves robust personalization without requiring complex cold-start procedures.
- Superior cross-task reasoning emerges from the integrated optimization process.
Where Pith is reading between the lines
- This method might apply to other areas where two related capabilities can be trained to improve each other through feedback loops.
- It raises the possibility that explicit integration outperforms separate training for paired tasks in AI systems.
- Testing the framework on additional benchmarks beyond UnifyBench++ could reveal how general the synergy effect is.
Load-bearing premise
The assumption that an explicit unified reasoning loop with ensemble rewards from Sync-GRPO and trajectory filtering via Dynamic Group Scaling will reliably produce synergistic improvements between personalized understanding and generation, rather than the gains coming mainly from the new benchmark or training scale.
What would settle it
A decisive test is an ablation in which the unified loop is disabled or DGS is removed while everything else is held fixed. If performance on UnifyBench++ then falls back to the level of prior supervised methods, the synergy is the main driver; if it barely moves, the gains come mainly from the new benchmark or from training scale.
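Concretely, that settling experiment is a small ablation grid run over matched seeds. The sketch below is hypothetical throughout: `train` and `evaluate_on_unifybench_pp` are placeholder stubs, since neither the training pipeline nor the benchmark has been released.

```python
import random

def train(seed, unified_loop, sync_grpo, dgs):
    # Placeholder stub for the (unreleased) training pipeline.
    random.seed(seed)
    return {"cfg": (unified_loop, sync_grpo, dgs), "seed": seed}

def evaluate_on_unifybench_pp(model):
    # Placeholder stub for the (unreleased) UnifyBench++ evaluation.
    return 0.0

# Hold model size, data volume, and benchmark fixed; toggle one piece at a time.
CONFIGS = {
    "full":            dict(unified_loop=True,  sync_grpo=True,  dgs=True),
    "no_dgs":          dict(unified_loop=True,  sync_grpo=True,  dgs=False),
    "no_sync_grpo":    dict(unified_loop=True,  sync_grpo=False, dgs=True),
    "no_unified_loop": dict(unified_loop=False, sync_grpo=True,  dgs=True),
    "sft_baseline":    dict(unified_loop=False, sync_grpo=False, dgs=False),
}

results = {
    name: [evaluate_on_unifybench_pp(train(seed=s, **cfg)) for s in (0, 1, 2)]
    for name, cfg in CONFIGS.items()
}
```

If `full` beats `no_unified_loop` by more than seed noise while `sft_baseline` matches prior supervised methods, the synergy claim stands; if the rows are statistically indistinguishable, the gains trace back to the benchmark or to scale.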
Original abstract
Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sync-R1, an end-to-end reinforcement learning framework for unified multimodal models (UMMs) that jointly optimizes personalized understanding and generation inside a single explicit reasoning loop. It introduces Sync-GRPO (an ensemble-reward RL method) and Dynamic Group Scaling (DGS) for adaptive trajectory filtering, together with the new UnifyBench++ benchmark containing denser textual descriptions and richer user contexts. The central claim is that this unified feedback process produces synergistic improvements, yielding state-of-the-art cross-task reasoning and robust personalization without complex cold-start procedures; code and the UnifyBench++ dataset are promised for release.
Significance. If the experimental claims hold after proper controls, the work would be significant for the multimodal community: it offers an explicit mechanism to couple comprehension and generation via cooperative RL rather than implicit token-level SFT, potentially reducing reliance on separate cold-start stages. The public release of code and a richer benchmark would further aid reproducibility and future comparisons. However, the absence of any quantitative metrics, ablations, or variance statistics in the abstract makes it impossible to gauge whether the reported gains exceed what would be expected from increased benchmark density or standard RL scaling alone.
Major comments (2)
- [Abstract] The manuscript asserts that 'Sync-R1 achieves state-of-the-art performance' and 'superior cross-task reasoning and robust personalization' yet supplies no numerical results, baseline comparisons, ablation tables, or error bars. Because the central claim rests on these experimental outcomes, the lack of any quantitative evidence prevents verification of the contribution of the unified loop, Sync-GRPO ensemble rewards, or DGS.
- [Experiments] Experimental section (presumably §4–5): To establish that the observed gains arise from the proposed unified reasoning loop plus Sync-GRPO/DGS rather than from UnifyBench++ density or training scale, the paper must report ablations that (i) remove Sync-GRPO while holding model size, data volume, and benchmark fixed and (ii) remove DGS while keeping the rest constant, together with statistical significance tests. Without these controls the causal link between the proposed components and the SOTA claim remains unverified.
Minor comments (2)
- [Title / Abstract] The title refers to 'Uni-Synergy' while the abstract and method are named 'Sync-R1'; a brief clarification of the relationship between the two names would improve readability.
- The abstract states that code and UnifyBench++ will be released; the full manuscript should include a dedicated reproducibility section with exact hyper-parameters, reward weighting details for the Sync-GRPO ensemble, and the precise definition of 'Dynamic Group Scaling' to allow independent verification.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our experimental claims. We address each major point below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The manuscript asserts that 'Sync-R1 achieves state-of-the-art performance' and 'superior cross-task reasoning and robust personalization' yet supplies no numerical results, baseline comparisons, ablation tables, or error bars. Because the central claim rests on these experimental outcomes, the lack of any quantitative evidence prevents verification of the contribution of the unified loop, Sync-GRPO ensemble rewards, or DGS.
  Authors: We agree that the abstract would benefit from including key quantitative highlights to immediately convey the scale of the reported improvements. In the revised version we will add concise numerical results (e.g., main accuracy gains on UnifyBench++ relative to strong baselines) while preserving the abstract's brevity. The full set of baseline comparisons, tables, and variance statistics already appears in Sections 4 and 5; the abstract revision will simply surface the most salient figures for readers. (Revision: yes.)
- Referee: [Experiments] Experimental section (presumably §4–5): To establish that the observed gains arise from the proposed unified reasoning loop plus Sync-GRPO/DGS rather than from UnifyBench++ density or training scale, the paper must report ablations that (i) remove Sync-GRPO while holding model size, data volume, and benchmark fixed and (ii) remove DGS while keeping the rest constant, together with statistical significance tests. Without these controls the causal link between the proposed components and the SOTA claim remains unverified.
  Authors: We concur that explicit controls are necessary to attribute gains to the proposed components. Our current experimental suite already contains ablations that isolate Sync-GRPO (ensemble reward) and DGS (trajectory filtering) while holding model size, data volume, and the UnifyBench++ benchmark fixed. To make these controls fully transparent and address the referee's concern, we will expand the experimental section with dedicated ablation tables that report means and standard deviations across multiple random seeds, together with statistical significance tests (paired t-tests). This will demonstrate that the observed cross-task and personalization gains exceed those attributable to benchmark density or standard RL scaling alone. (Revision: yes.)
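For reference, the paired t-test the rebuttal promises is a one-liner once per-seed scores are matched across configurations. The numbers below are fabricated placeholders, included only to show the mechanics, not reported results.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed scores (NOT the paper's numbers); index i is the same
# random seed under both configurations, which is what makes the test paired.
full_model = np.array([71.2, 70.8, 71.9, 70.5, 71.4])
no_dgs     = np.array([69.9, 70.1, 70.6, 69.4, 70.2])

t_stat, p_value = stats.ttest_rel(full_model, no_dgs)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```

With only a handful of seeds the test is underpowered, so effect sizes and full per-seed tables should accompany the p-values.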
Circularity Check
No circularity detected; the new RL framework and benchmark are additive contributions rather than re-expressions of their own inputs.
Full rationale
The paper introduces Sync-R1 as a novel end-to-end RL framework using Sync-GRPO ensemble rewards and Dynamic Group Scaling for trajectory filtering, plus the new UnifyBench++ dataset. No equations, derivations, or predictions are described that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on experimental SOTA results rather than any re-expression of prior results as new predictions. The method and benchmark are presented as additive contributions, leaving the claims checkable against external benchmarks rather than against themselves.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: reinforcement learning with ensemble rewards can capture synergy between understanding and generation tasks.
Invented entities (2)
- Sync-GRPO: no independent evidence
- Dynamic Group Scaling (DGS): no independent evidence