Show-o2: Improved Native Unified Multimodal Models
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-12 18:45 UTC · model grok-4.3
The pith
Show-o2 enables unified multimodal understanding and generation by combining autoregressive modeling and flow matching on shared visual representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unified visual representations are constructed in a 3D causal variational autoencoder space through a dual-path spatial-temporal fusion. On top of a language model, autoregressive modeling on the language head handles text token prediction, while flow matching on the flow head handles image and video generation. A two-stage training recipe supports learning and scaling, allowing the Show-o2 models to handle a wide range of multimodal understanding and generation tasks across text, images, and videos.
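The pairing of objectives can be made concrete with a small sketch. The PyTorch-style fragment below is a minimal reading of the description above, assuming a language head trained with next-token cross-entropy on text positions and a flow head trained with a rectified-flow velocity target on visual latent positions; module names and shapes are illustrative assumptions, not the Show-o2 implementation.

```python
# Minimal sketch of the dual-objective setup: one shared backbone output,
# two heads, two losses. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, latent_dim: int):
        super().__init__()
        self.language_head = nn.Linear(d_model, vocab_size)  # autoregressive text prediction
        self.flow_head = nn.Linear(d_model, latent_dim)      # velocity prediction for flow matching

    def forward(self, hidden, text_targets, clean_latents, noise, text_mask, visual_mask):
        # Autoregressive loss on text positions: predict the next text token.
        logits = self.language_head(hidden[text_mask])        # [num_text_tokens, vocab_size]
        ar_loss = F.cross_entropy(logits, text_targets)

        # Flow-matching loss on visual positions: regress the velocity that
        # transports noise toward the clean VAE latents (rectified-flow target).
        velocity_pred = self.flow_head(hidden[visual_mask])   # [num_visual_tokens, latent_dim]
        velocity_target = clean_latents - noise
        fm_loss = F.mse_loss(velocity_pred, velocity_target)

        return ar_loss + fm_loss
```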
What carries the argument
Dual-path spatial-temporal fusion in the 3D causal variational autoencoder space, which yields unified visual representations that scale from images to videos for both understanding and generation.
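One plausible way to read "dual-path spatial(-temporal) fusion" is a per-frame spatial path plus a cross-frame temporal path over the 3D VAE latents, combined by a learned gate. The sketch below is a hypothetical module under that assumption; the actual Show-o2 design may fuse the paths differently.

```python
# Hypothetical dual-path fusion over 3D-VAE latents: a per-frame (spatial) path
# and an across-frame (temporal) path, merged by a learned gate. Not the paper's code.
import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    def __init__(self, latent_dim: int, d_model: int):
        super().__init__()
        self.spatial_proj = nn.Linear(latent_dim, d_model)            # per-frame (image) path
        self.temporal_mix = nn.Conv1d(d_model, d_model,               # across-frame (video) path
                                      kernel_size=3, padding=1)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, z):                          # z: [batch, frames, tokens, latent_dim]
        b, t, n, _ = z.shape
        spatial = self.spatial_proj(z)              # [b, t, n, d]
        # Temporal path: mix information along the frame axis for each spatial token.
        temporal = spatial.permute(0, 2, 3, 1).reshape(b * n, -1, t)              # [b*n, d, t]
        temporal = self.temporal_mix(temporal).reshape(b, n, -1, t).permute(0, 3, 1, 2)
        # Gated fusion of the two paths; for a single image (t == 1) the temporal
        # path degenerates to a local transform of the spatial features.
        g = torch.sigmoid(self.gate(torch.cat([spatial, temporal], dim=-1)))
        return g * spatial + (1 - g) * temporal
```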
If this is right
- The two-stage training allows scaling to larger models while maintaining performance across modalities.
- Models can perform both understanding and generation tasks without modality-specific components.
- Scalability to video is achieved by extending the fusion path to include temporal information.
Where Pith is reading between the lines
- Such unified models might simplify the deployment of multimodal AI systems by reducing the number of separate components needed.
- Extending the dual-path fusion idea could apply to other sequential data like audio or 3D scenes.
- Combining autoregressive and flow-based methods in this way may lead to better coherence between text understanding and visual generation.
Load-bearing premise
The dual-path fusion of spatial and temporal information in the 3D causal variational autoencoder produces representations that work well for both image and video modalities in understanding and generation tasks.
What would settle it
Observing whether Show-o2 models achieve competitive performance on standard video generation and understanding benchmarks compared to specialized models, while also excelling in text-based tasks.
read the original abstract
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Show-o2, an improved native unified multimodal model that combines autoregressive modeling and flow matching. It is built on a 3D causal variational autoencoder to create unified visual representations via a dual-path spatial (-temporal) fusion mechanism, which is intended to scale across image and video modalities. Autoregressive modeling is applied to the language head for text token prediction, while flow matching is used in the flow head for image/video generation. A two-stage training recipe is employed to learn and scale the models, with the resulting Show-o2 claimed to handle a wide range of multimodal understanding and generation tasks across text, images, and videos. Code and models are released publicly.
Significance. If the empirical results and ablations confirm the claims, this work could advance unified multimodal architectures by demonstrating a native integration of AR and flow-based objectives in a shared 3D causal VAE latent space, potentially offering better scalability than modality-specific models. The public release of code and models supports reproducibility and further research in the field.
major comments (3)
- [Method (architecture and dual-path fusion description)] The central claim of versatility and scalability rests on the dual-path spatial (-temporal) fusion in the 3D causal VAE space, yet the manuscript provides no ablation studies, quantitative comparisons to single-path or concatenation baselines, or metrics (e.g., FID, CLIP scores, or temporal consistency measures) showing that this fusion avoids temporal artifacts on pure images or spatial degradation on videos. This is load-bearing for the versatility assertion.
- [Method (3D causal VAE and fusion)] No equations or implementation details are supplied for how the dual paths are fused in the 3D causal VAE latent space or how distribution shift is prevented between image-only and video inputs; without these, it is unclear whether the shared representation truly supports both AR text prediction and flow-based generation without modality-specific compromises.
- [Training recipe and experiments] The two-stage training recipe is described as enabling effective learning and scaling, but the paper lacks analysis of how the stages interact with the dual-path fusion (e.g., stage-wise loss curves, ablation on stage ordering, or scaling behavior with model size) to support the claim of improved performance over prior unified models.
minor comments (2)
- The parenthetical notation 'spatial (-temporal)' is ambiguous; explicit definitions or diagrams distinguishing the image-only path from the video path would improve clarity.
- [Abstract] The abstract and introduction would benefit from a brief table summarizing key quantitative results (e.g., benchmark scores on understanding and generation tasks) to ground the versatility claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below. We will incorporate the requested clarifications, details, and analyses into the revised manuscript to strengthen the presentation of the dual-path fusion and training recipe.
read point-by-point responses
-
Referee: [Method (architecture and dual-path fusion description)] The central claim of versatility and scalability rests on the dual-path spatial (-temporal) fusion in the 3D causal VAE space, yet the manuscript provides no ablation studies, quantitative comparisons to single-path or concatenation baselines, or metrics (e.g., FID, CLIP scores, or temporal consistency measures) showing that this fusion avoids temporal artifacts on pure images or spatial degradation on videos. This is load-bearing for the versatility assertion.
Authors: We agree that targeted ablations would provide stronger support for the versatility claims. While the original manuscript demonstrates overall performance gains across understanding and generation benchmarks, it does not include direct comparisons to single-path or concatenation baselines. In the revision, we will add ablation studies that compare the dual-path spatial-temporal fusion against these alternatives, reporting quantitative metrics including FID and CLIP scores for images as well as temporal consistency measures for videos. This will explicitly show the fusion's role in avoiding artifacts and degradation. revision: yes
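As a concrete illustration of the kind of temporal-consistency measure promised here, the sketch below computes the mean cosine similarity between embeddings of consecutive generated frames; the choice of encoder and the exact metric used in the revision are assumptions, not the authors' protocol.

```python
# Sketch of a simple temporal-consistency score: mean cosine similarity of
# adjacent-frame embeddings (e.g., from a frozen CLIP image encoder).
import torch
import torch.nn.functional as F

def temporal_consistency(frame_embeddings: torch.Tensor) -> float:
    """frame_embeddings: [num_frames, dim] features of consecutive video frames (num_frames >= 2)."""
    prev, nxt = frame_embeddings[:-1], frame_embeddings[1:]
    sims = F.cosine_similarity(prev, nxt, dim=-1)   # similarity of each adjacent frame pair
    return sims.mean().item()                       # 1.0 = perfectly static, lower = more drift
```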
-
Referee: [Method (3D causal VAE and fusion)] No equations or implementation details are supplied for how the dual paths are fused in the 3D causal VAE latent space or how distribution shift is prevented between image-only and video inputs; without these, it is unclear whether the shared representation truly supports both AR text prediction and flow-based generation without modality-specific compromises.
Authors: We acknowledge that the manuscript lacks explicit equations and implementation details on the fusion process. In the revised version, we will add the mathematical formulation of the dual-path fusion within the 3D causal VAE latent space and describe the mechanisms used to mitigate distribution shift between image-only and video inputs. These additions will clarify how the shared representation enables native integration of autoregressive text prediction and flow-based generation without modality-specific compromises. revision: yes
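To make explicit what such a formulation might look like, here is one hypothetical way to write a gated dual-path fusion together with the standard conditional flow-matching objective in the style of Lipman et al.; the notation is illustrative only and should not be read as the paper's actual equations.

```latex
% Hypothetical notation: x_1 is the clean 3D-VAE latent, x_0 ~ N(0, I), t ~ U[0,1];
% u(x) and v(x) denote the spatial and temporal paths, f_theta the flow head.
\begin{align}
  h &= \sigma\!\big(W_g[\,u(x);\,v(x)\,]\big) \odot u(x)
       + \big(1 - \sigma(W_g[\,u(x);\,v(x)\,])\big) \odot v(x), \\
  x_t &= (1 - t)\,x_0 + t\,x_1, \qquad
  \mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{t,\,x_0,\,x_1}
      \big\| f_\theta(x_t, t, h) - (x_1 - x_0) \big\|^2 .
\end{align}
```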
-
Referee: [Training recipe and experiments] The two-stage training recipe is described as enabling effective learning and scaling, but the paper lacks analysis of how the stages interact with the dual-path fusion (e.g., stage-wise loss curves, ablation on stage ordering, or scaling behavior with model size) to support the claim of improved performance over prior unified models.
Authors: We agree that further analysis of the two-stage training's interaction with the dual-path fusion would strengthen the claims. The original manuscript describes the recipe at a high level but does not provide stage-wise loss curves, ordering ablations, or scaling behavior with model size. In the revision, we will include these analyses to demonstrate how the stages enable effective learning and scaling, supporting the performance improvements over prior unified models. revision: yes
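A generic two-stage loop of the sort described is sketched below; which parameters each stage trains, and the choice to freeze the language model in stage one, are assumptions for illustration rather than the paper's actual recipe.

```python
# Hedged sketch of a two-stage recipe: adapt new visual components first, then
# train everything jointly. `model.compute_loss` is a hypothetical helper.
def train_two_stage(model, stage1_loader, stage2_loader, optimizer_fn):
    # Stage 1: train the new components (fusion + flow head) with the pretrained
    # language model frozen, preserving its understanding ability.
    for p in model.language_model.parameters():
        p.requires_grad = False
    opt = optimizer_fn([p for p in model.parameters() if p.requires_grad])
    for batch in stage1_loader:
        opt.zero_grad()
        model.compute_loss(batch).backward()
        opt.step()

    # Stage 2: unfreeze everything and train jointly at larger scale so the
    # autoregressive and flow-matching objectives co-adapt.
    for p in model.parameters():
        p.requires_grad = True
    opt = optimizer_fn(model.parameters())
    for batch in stage2_loader:
        opt.zero_grad()
        model.compute_loss(batch).backward()
        opt.step()
```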
Circularity Check
No circularity: claims rest on empirical results, not self-referential definitions or fitted predictions.
full rationale
The paper describes an architectural design (3D causal VAE with dual-path spatial-temporal fusion, AR + flow heads, two-stage training) and reports empirical versatility on multimodal tasks. No equations or derivations are presented that reduce by construction to their own inputs. The central claim of scalability and effectiveness is framed as an outcome of training rather than a tautological redefinition. Self-citations to prior Show-o work, if present, are not load-bearing for the core assertions, which depend on experimental validation instead of imported uniqueness theorems or ansatzes.
Axiom & Free-Parameter Ledger
free parameters (1)
- model scale and training hyperparameters
axioms (2)
- domain assumption: A 3D causal variational autoencoder produces unified visual representations suitable for both images and videos.
- domain assumption: Autoregressive modeling on the language head and flow matching on the flow head can be applied natively within the same model.
Forward citations
Cited by 28 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
-
Exploring Spatial Intelligence from a Generative Perspective
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
-
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
-
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
StepSTEM benchmark and step-level DP evaluation show top MLLMs achieve only 38.29% accuracy on fine-grained multimodal STEM reasoning, relying primarily on textual cues.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Nucleus-Image: Sparse MoE for Image Generation
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
-
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
-
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
-
Qwen-Image Technical Report
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
-
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.