pith. machine review for the scientific record.

arxiv: 2605.10445 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Hao Liang, Ming Lu, Renrui Zhang, Ruichuan An, Sihan Yang, Wentao Zhang, Zijun Shen, Ziyu Guo


Pith reviewed 2026-05-12 05:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords personalized multimodal models · reinforcement learning · understanding-generation synergy · Sync-R1 · ensemble rewards · Dynamic Group Scaling · UnifyBench++

The pith

A single reinforcement learning loop lets personalized understanding guide generation while generation refines understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that unified multimodal models can bridge personalized understanding and generation by using an explicit reinforcement learning loop instead of implicit alignment methods. This matters because current models handle general tasks well but struggle with user-specific personalization, and capturing the mutual benefits between comprehension and creation could lead to more capable systems. Sync-R1 implements this through a joint optimization process in which understanding guides creation and creation quality improves understanding via shared rewards. The method introduces Sync-GRPO for ensemble rewards and Dynamic Group Scaling, which trains efficiently by filtering out low-potential trajectories, along with a new benchmark, UnifyBench++, meant to reflect real-world complexity. If the approach holds, it enables superior cross-task performance without extra cold-start steps.
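To make the loop concrete, here is a minimal sketch of a Sync-GRPO-style update under stated assumptions: the ensemble reward is taken as a weighted mix of an understanding score and a generation score, and advantages are normalized within a sampled group, GRPO-style. Every name below (`policy.sample`, `score_understanding`, the mixing weight `w`) is a hypothetical stand-in, not the paper's API.

```python
# A minimal sketch of the co-operative update described above, under the
# assumption that Sync-GRPO mixes task rewards and normalizes advantages
# within a sampled group, GRPO-style. All names are hypothetical stand-ins.
import numpy as np

def ensemble_reward(r_und: float, r_gen: float, w: float = 0.5) -> float:
    """Shared reward coupling the two tasks; w is an assumed mixing weight."""
    return w * r_und + (1.0 - w) * r_gen

def group_advantages(rewards: list[float]) -> np.ndarray:
    """Normalize rewards within one sampled group of trajectories."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def sync_step(policy, prompt: str, group_size: int = 8) -> None:
    """One co-operative step: each trajectory interleaves a personalized
    understanding answer with a conditioned generation, and both are scored."""
    trajectories, rewards = [], []
    for _ in range(group_size):
        traj = policy.sample(prompt)                # hypothetical rollout API
        r_und = policy.score_understanding(traj)    # comprehension reward
        r_gen = policy.score_generation(traj)       # generation-quality reward
        trajectories.append(traj)
        rewards.append(ensemble_reward(r_und, r_gen))
    policy.update(trajectories, group_advantages(rewards))
```

Under this reading, the "reciprocal refinement" is simply that both task scores enter one reward, so gradient credit flows to whichever capability is limiting the group.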

Core claim

Sync-R1 is an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, personalized comprehension guides content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape orchestrated by Sync-GRPO and Dynamic Group Scaling, achieving state-of-the-art results on UnifyBench++ without complex cold-start procedures.

What carries the argument

The unified explicit reasoning loop that enables reciprocal refinement between personalized understanding and generation using an ensemble reward system and adaptive trajectory filtering.

If this is right

  • Personalized understanding directly guides and improves the quality of generated content.
  • Generation feedback in turn refines and strengthens the understanding capabilities.
  • The system achieves robust personalization without requiring complex cold-start procedures.
  • Superior cross-task reasoning emerges from the integrated optimization process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This method might apply to other areas where two related capabilities can be trained to improve each other through feedback loops.
  • It raises the possibility that explicit integration outperforms separate training for paired tasks in AI systems.
  • Testing the framework on additional benchmarks beyond UnifyBench++ could reveal how general the synergy effect is.

Load-bearing premise

The assumption that an explicit unified reasoning loop with ensemble rewards from Sync-GRPO and trajectory filtering via Dynamic Group Scaling will reliably produce synergistic improvements between personalized understanding and generation, rather than the gains coming mainly from the new benchmark or training scale.

What would settle it

An ablation in which the unified loop is disabled or DGS is removed would settle it: if performance on UnifyBench++ falls back to the level of prior supervised methods, the synergy is the main driver of the gains; if performance holds up, the gains come from elsewhere, such as the benchmark or training scale. A sketch of such an ablation grid follows.
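The arm names and flags below are invented for illustration, and the outcome logic in the comments is this review's inference, not a reported result.

```python
# Illustrative ablation arms for the experiment described above; the arm
# names and flags are invented for this sketch, not taken from the paper.
ABLATION_ARMS = {
    "full":         {"unified_loop": True,  "sync_grpo": True,  "dgs": True},
    "no_loop":      {"unified_loop": False, "sync_grpo": True,  "dgs": True},
    "no_dgs":       {"unified_loop": True,  "sync_grpo": True,  "dgs": False},
    "sft_baseline": {"unified_loop": False, "sync_grpo": False, "dgs": False},
}
# If "no_loop" or "no_dgs" falls back toward "sft_baseline" on UnifyBench++,
# the synergy is doing the work; if "full" matches "no_loop", the gains
# likely come from the benchmark or training scale instead.
```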

Figures

Figures reproduced from arXiv: 2605.10445 by Hao Liang, Ming Lu, Renrui Zhang, Ruichuan An, Sihan Yang, Wentao Zhang, Zijun Shen, Ziyu Guo.

Figure 1: Capability Overview of Sync-R1. Our framework jointly optimizes personalized understanding and generation within a unified reasoning trajectory via Sync-GRPO. By establishing an explicit synergistic loop, Sync-R1 leverages the reciprocal enhancement between comprehension and creation to achieve precise integration of personalized concepts. Beyond standard personalization, Sync-R1 excels in challenging scen…
Figure 2: Overview of the Sync-R1 Framework. We introduce a novel integration of personalized understanding and generation reasoning. Through our proposed Sync-GRPO, these two processes are coupled to mutually reinforce each other, jointly contributing to parameter updates. This mechanism significantly enhances the model’s reasoning capabilities regarding personalized concepts and achieves high-fidelity information…
Figure 3: Illustration of Dynamic Group Scaling (DGS). To improve efficiency, we employ a preliminary assessment at the 10th step of the Show-o generation process. The generation continues only if the intermediate score exceeds a dynamically updated threshold. This threshold is adjusted adaptively to maintain the probability of high-quality generation selection at a target level, effectively filtering out suboptimal…
Figure 4: Qualitative Comparison. We present a visual comparison between Sync-R1 and baseline methods. The underlined text highlights specific information that requires the model to infer from the personalized context correctly.
Figure 5: Efficiency and Dynamics of DGS. (a) Convergence Efficiency: Sync-R1 (Red) reaches 98% of the baseline’s peak performance approx. 1.9× faster in wall-clock time, confirming that high sample efficiency outweighs the selection overhead. (b) Adaptive Dynamics: The threshold T (Purple) dynamically adapts to pass rate fluctuations (Orange) to anchor the selection ratio at TPR, ensuring robust training stability…
Figure 6: Snapshot of UnifyBench++. We present illustrative examples of concept entries within our constructed dataset. The italicized segments within the extra_info and Rea. Gen. columns denote the specific data subsets utilized for training Sync-R1, while the non-italicized text is reserved for evaluation.
Figure 7: Qualitative Visualization of Concept ⟨f_h⟩. name: <butin> info: <butin> is a Siberian Husky dog with light brown and white fur, blue eyes, and expressive facial features often captured in humorous or relaxed poses. extra_info: 1. <butin> drinks from a large blue water bowl. 2. <butin> owns a bright orange rubber toy ball. 3. <butin> owns a crocheted bib. A photo of <butin>. A photo of <butin> drinking from hi…
Figure 8: Qualitative Visualization of Concept ⟨butin⟩.
read the original abstract

Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.
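The abstract's description of DGS, read together with the Figure 3 and Figure 5 captions (preliminary scoring at the 10th generation step, a threshold adapted to anchor the selection ratio at a target pass rate), suggests a simple control loop. The sketch below assumes a proportional threshold update with step size `eta`; the scoring function and constants are illustrative, not the paper's.

```python
# A minimal sketch of Dynamic Group Scaling as the captions describe it:
# score each rollout early, finish generation only if it beats a threshold,
# and nudge the threshold so the pass rate tracks a target (TPR). The
# update rule and step size `eta` are assumptions, not paper values.
def dgs_filter(rollouts, preliminary_score, threshold,
               target_pass_rate=0.5, eta=0.05):
    kept = []
    for rollout in rollouts:
        score = preliminary_score(rollout)   # e.g., assessed at the 10th step
        passed = score >= threshold
        if passed:
            kept.append(rollout)             # only these complete generation
        # Proportional control: raise the threshold when too many pass,
        # lower it when too few, anchoring the selection ratio at TPR.
        threshold += eta * ((1.0 if passed else 0.0) - target_pass_rate)
    return kept, threshold
```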

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Sync-R1, an end-to-end reinforcement learning framework for unified multimodal models (UMMs) that jointly optimizes personalized understanding and generation inside a single explicit reasoning loop. It introduces Sync-GRPO (an ensemble-reward RL method) and Dynamic Group Scaling (DGS) for adaptive trajectory filtering, together with the new UnifyBench++ benchmark containing denser textual descriptions and richer user contexts. The central claim is that this unified feedback process produces synergistic improvements, yielding state-of-the-art cross-task reasoning and robust personalization without complex cold-start procedures; code and the UnifyBench++ dataset are promised for release.

Significance. If the experimental claims hold after proper controls, the work would be significant for the multimodal community: it offers an explicit mechanism to couple comprehension and generation via cooperative RL rather than implicit token-level SFT, potentially reducing reliance on separate cold-start stages. The public release of code and a richer benchmark would further aid reproducibility and future comparisons. However, the absence of any quantitative metrics, ablations, or variance statistics in the abstract makes it impossible to gauge whether the reported gains exceed what would be expected from increased benchmark density or standard RL scaling alone.

major comments (2)
  1. [Abstract] The manuscript asserts that 'Sync-R1 achieves state-of-the-art performance' and 'superior cross-task reasoning and robust personalization' yet supplies no numerical results, baseline comparisons, ablation tables, or error bars. Because the central claim rests on these experimental outcomes, the lack of any quantitative evidence prevents verification of the contribution of the unified loop, Sync-GRPO ensemble rewards, or DGS.
  2. [Experiments, presumably §4–5] To establish that the observed gains arise from the proposed unified reasoning loop plus Sync-GRPO/DGS rather than from UnifyBench++ density or training scale, the paper must report ablations that (i) remove Sync-GRPO while holding model size, data volume, and benchmark fixed and (ii) remove DGS while keeping the rest constant, together with statistical significance tests. Without these controls the causal link between the proposed components and the SOTA claim remains unverified.
minor comments (2)
  1. [Title / Abstract] The title refers to 'Uni-Synergy' while the abstract and method are named 'Sync-R1'; a brief clarification of the relationship between the two names would improve readability.
  2. The abstract states that code and UnifyBench++ will be released; the full manuscript should include a dedicated reproducibility section with exact hyper-parameters, reward weighting details for the Sync-GRPO ensemble, and the precise definition of 'Dynamic Group Scaling' to allow independent verification.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our experimental claims. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The manuscript asserts that 'Sync-R1 achieves state-of-the-art performance' and 'superior cross-task reasoning and robust personalization' yet supplies no numerical results, baseline comparisons, ablation tables, or error bars. Because the central claim rests on these experimental outcomes, the lack of any quantitative evidence prevents verification of the contribution of the unified loop, Sync-GRPO ensemble rewards, or DGS.

    Authors: We agree that the abstract would benefit from including key quantitative highlights to immediately convey the scale of the reported improvements. In the revised version we will add concise numerical results (e.g., main accuracy gains on UnifyBench++ relative to strong baselines) while preserving the abstract's brevity. The full set of baseline comparisons, tables, and variance statistics already appears in Sections 4 and 5; the abstract revision will simply surface the most salient figures for readers. revision: yes

  2. Referee: [Experiments, presumably §4–5] To establish that the observed gains arise from the proposed unified reasoning loop plus Sync-GRPO/DGS rather than from UnifyBench++ density or training scale, the paper must report ablations that (i) remove Sync-GRPO while holding model size, data volume, and benchmark fixed and (ii) remove DGS while keeping the rest constant, together with statistical significance tests. Without these controls the causal link between the proposed components and the SOTA claim remains unverified.

    Authors: We concur that explicit controls are necessary to attribute gains to the proposed components. Our current experimental suite already contains ablations that isolate Sync-GRPO (ensemble reward) and DGS (trajectory filtering) while holding model size, data volume, and the UnifyBench++ benchmark fixed. To make these controls fully transparent and address the referee's concern, we will expand the experimental section with dedicated ablation tables that report means and standard deviations across multiple random seeds together with statistical significance tests (paired t-tests). This will demonstrate that the observed cross-task and personalization gains exceed those attributable to benchmark density or standard RL scaling alone. revision: yes
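A sketch of the seed-level test the response promises, using SciPy's paired t-test across matched random seeds; the score arrays are illustrative placeholders, not the paper's numbers.

```python
# Paired t-test across matched random seeds, as promised in the response.
# The scores below are illustrative placeholders, not reported results.
import numpy as np
from scipy.stats import ttest_rel

full_model = np.array([0.81, 0.79, 0.83, 0.80, 0.82])  # full Sync-R1, 5 seeds
ablated    = np.array([0.76, 0.75, 0.78, 0.74, 0.77])  # e.g., DGS removed

t_stat, p_value = ttest_rel(full_model, ablated)
delta = full_model - ablated
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"mean gain = {delta.mean():.3f} ± {delta.std(ddof=1):.3f}")
```

Pairing by seed is the right choice here: it removes between-seed variance from the comparison, so the test asks only whether the ablated run is consistently worse under identical conditions.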

Circularity Check

0 steps flagged

No circularity detected; new RL framework and benchmark are independent of inputs

full rationale

The paper introduces Sync-R1 as a novel end-to-end RL framework using Sync-GRPO ensemble rewards and Dynamic Group Scaling for trajectory filtering, plus the new UnifyBench++ dataset. No equations, derivations, or predictions are described that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on experimental SOTA results rather than any re-expression of prior results as new predictions. The method and benchmark are presented as additive contributions, so the evidential chain is grounded in benchmark outcomes rather than in the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented physical entities; the new method names (Sync-R1, Sync-GRPO, DGS) are algorithmic constructs rather than postulated entities with independent evidence.

axioms (1)
  • domain assumption: Reinforcement learning with ensemble rewards can capture synergy between understanding and generation tasks.
    Invoked implicitly when stating that the unified feedback process enables reciprocal refinement.
invented entities (2)
  • Sync-GRPO: no independent evidence
    purpose: Ensemble reward system for joint optimization of dual tasks
    New RL variant introduced to orchestrate the synergy; no independent evidence outside the paper
  • Dynamic Group Scaling (DGS): no independent evidence
    purpose: Adaptive filtering of low-potential trajectories to reduce variance
    New technique proposed to accelerate convergence; no external validation provided

pith-pipeline@v0.9.0 · 5545 in / 1427 out tokens · 36398 ms · 2026-05-12T05:16:41.992506+00:00 · methodology

discussion (0)

