Show-o2: Improved Native Unified Multimodal Models
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-12 18:45 UTC · model grok-4.3
The pith
Show-o2 enables unified multimodal understanding and generation by combining autoregressive modeling and flow matching on shared visual representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unified visual representations are constructed in a 3D causal variational autoencoder space through a dual-path spatial-temporal fusion. On top of a language model, autoregressive modeling on the language head handles text token prediction, while flow matching on the flow head handles image and video generation. A two-stage training recipe supports learning and scaling, allowing the Show-o2 models to handle a wide range of multimodal understanding and generation tasks across text, images, and videos.
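The pairing of objectives can be made concrete with a small sketch. The PyTorch-style fragment below is a minimal reading of the description above, assuming a language head trained with next-token cross-entropy on text positions and a flow head trained with a rectified-flow velocity target on visual latent positions; module names and shapes are illustrative assumptions, not the Show-o2 implementation.

```python
# Minimal sketch of the dual-objective setup: one shared backbone output,
# two heads, two losses. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, latent_dim: int):
        super().__init__()
        self.language_head = nn.Linear(d_model, vocab_size)  # autoregressive text prediction
        self.flow_head = nn.Linear(d_model, latent_dim)      # velocity prediction for flow matching

    def forward(self, hidden, text_targets, clean_latents, noise, text_mask, visual_mask):
        # Autoregressive loss on text positions: predict the next text token.
        logits = self.language_head(hidden[text_mask])        # [num_text_tokens, vocab_size]
        ar_loss = F.cross_entropy(logits, text_targets)

        # Flow-matching loss on visual positions: regress the velocity that
        # transports noise toward the clean VAE latents (rectified-flow target).
        velocity_pred = self.flow_head(hidden[visual_mask])   # [num_visual_tokens, latent_dim]
        velocity_target = clean_latents - noise
        fm_loss = F.mse_loss(velocity_pred, velocity_target)

        return ar_loss + fm_loss
```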
What carries the argument
Dual-path spatial-temporal fusion in the 3D causal variational autoencoder space, which yields unified visual representations that scale from images to videos for both understanding and generation.
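One plausible way to read "dual-path spatial(-temporal) fusion" is a per-frame spatial path plus a cross-frame temporal path over the 3D VAE latents, combined by a learned gate. The sketch below is a hypothetical module under that assumption; the actual Show-o2 design may fuse the paths differently.

```python
# Hypothetical dual-path fusion over 3D-VAE latents: a per-frame (spatial) path
# and an across-frame (temporal) path, merged by a learned gate. Not the paper's code.
import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    def __init__(self, latent_dim: int, d_model: int):
        super().__init__()
        self.spatial_proj = nn.Linear(latent_dim, d_model)            # per-frame (image) path
        self.temporal_mix = nn.Conv1d(d_model, d_model,               # across-frame (video) path
                                      kernel_size=3, padding=1)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, z):                          # z: [batch, frames, tokens, latent_dim]
        b, t, n, _ = z.shape
        spatial = self.spatial_proj(z)              # [b, t, n, d]
        # Temporal path: mix information along the frame axis for each spatial token.
        temporal = spatial.permute(0, 2, 3, 1).reshape(b * n, -1, t)              # [b*n, d, t]
        temporal = self.temporal_mix(temporal).reshape(b, n, -1, t).permute(0, 3, 1, 2)
        # Gated fusion of the two paths; for a single image (t == 1) the temporal
        # path degenerates to a local transform of the spatial features.
        g = torch.sigmoid(self.gate(torch.cat([spatial, temporal], dim=-1)))
        return g * spatial + (1 - g) * temporal
```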
If this is right
- The two-stage training allows scaling to larger models while maintaining performance across modalities.
- Models can perform both understanding and generation tasks without modality-specific components.
- Scalability to video is achieved by extending the fusion path to include temporal information.
Where Pith is reading between the lines
- Such unified models might simplify the deployment of multimodal AI systems by reducing the number of separate components needed.
- Extending the dual-path fusion idea could apply to other sequential data like audio or 3D scenes.
- Combining autoregressive and flow-based methods in this way may lead to better coherence between text understanding and visual generation.
Load-bearing premise
The dual-path fusion of spatial and temporal information in the 3D causal variational autoencoder produces representations that work well for both image and video modalities in understanding and generation tasks.
What would settle it
Observing whether Show-o2 models achieve competitive performance on standard video generation and understanding benchmarks compared to specialized models, while also excelling in text-based tasks.
read the original abstract
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Show-o2, an improved native unified multimodal model that combines autoregressive modeling and flow matching. It is built on a 3D causal variational autoencoder to create unified visual representations via a dual-path spatial (-temporal) fusion mechanism, which is intended to scale across image and video modalities. Autoregressive modeling is applied to the language head for text token prediction, while flow matching is used in the flow head for image/video generation. A two-stage training recipe is employed to learn and scale the models, with the resulting Show-o2 claimed to handle a wide range of multimodal understanding and generation tasks across text, images, and videos. Code and models are released publicly.
Significance. If the empirical results and ablations confirm the claims, this work could advance unified multimodal architectures by demonstrating a native integration of AR and flow-based objectives in a shared 3D causal VAE latent space, potentially offering better scalability than modality-specific models. The public release of code and models supports reproducibility and further research in the field.
major comments (3)
- [Method (architecture and dual-path fusion description)] The central claim of versatility and scalability rests on the dual-path spatial (-temporal) fusion in the 3D causal VAE space, yet the manuscript provides no ablation studies, quantitative comparisons to single-path or concatenation baselines, or metrics (e.g., FID, CLIP scores, or temporal consistency measures) showing that this fusion avoids temporal artifacts on pure images or spatial degradation on videos. This is load-bearing for the versatility assertion.
- [Method (3D causal VAE and fusion)] No equations or implementation details are supplied for how the dual paths are fused in the 3D causal VAE latent space or how distribution shift is prevented between image-only and video inputs; without these, it is unclear whether the shared representation truly supports both AR text prediction and flow-based generation without modality-specific compromises.
- [Training recipe and experiments] The two-stage training recipe is described as enabling effective learning and scaling, but the paper lacks analysis of how the stages interact with the dual-path fusion (e.g., stage-wise loss curves, ablation on stage ordering, or scaling behavior with model size) to support the claim of improved performance over prior unified models.
minor comments (2)
- The parenthetical notation 'spatial (-temporal)' is ambiguous; explicit definitions or diagrams distinguishing the image-only path from the video path would improve clarity.
- [Abstract] The abstract and introduction would benefit from a brief table summarizing key quantitative results (e.g., benchmark scores on understanding and generation tasks) to ground the versatility claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below. We will incorporate the requested clarifications, details, and analyses into the revised manuscript to strengthen the presentation of the dual-path fusion and training recipe.
read point-by-point responses
-
Referee: [Method (architecture and dual-path fusion description)] The central claim of versatility and scalability rests on the dual-path spatial (-temporal) fusion in the 3D causal VAE space, yet the manuscript provides no ablation studies, quantitative comparisons to single-path or concatenation baselines, or metrics (e.g., FID, CLIP scores, or temporal consistency measures) showing that this fusion avoids temporal artifacts on pure images or spatial degradation on videos. This is load-bearing for the versatility assertion.
Authors: We agree that targeted ablations would provide stronger support for the versatility claims. While the original manuscript demonstrates overall performance gains across understanding and generation benchmarks, it does not include direct comparisons to single-path or concatenation baselines. In the revision, we will add ablation studies that compare the dual-path spatial-temporal fusion against these alternatives, reporting quantitative metrics including FID and CLIP scores for images as well as temporal consistency measures for videos. This will explicitly show the fusion's role in avoiding artifacts and degradation. revision: yes
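As a concrete illustration of the kind of temporal-consistency measure promised here, the sketch below computes the mean cosine similarity between embeddings of consecutive generated frames; the choice of encoder and the exact metric used in the revision are assumptions, not the authors' protocol.

```python
# Sketch of a simple temporal-consistency score: mean cosine similarity of
# adjacent-frame embeddings (e.g., from a frozen CLIP image encoder).
import torch
import torch.nn.functional as F

def temporal_consistency(frame_embeddings: torch.Tensor) -> float:
    """frame_embeddings: [num_frames, dim] features of consecutive video frames (num_frames >= 2)."""
    prev, nxt = frame_embeddings[:-1], frame_embeddings[1:]
    sims = F.cosine_similarity(prev, nxt, dim=-1)   # similarity of each adjacent frame pair
    return sims.mean().item()                       # 1.0 = perfectly static, lower = more drift
```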
-
Referee: [Method (3D causal VAE and fusion)] No equations or implementation details are supplied for how the dual paths are fused in the 3D causal VAE latent space or how distribution shift is prevented between image-only and video inputs; without these, it is unclear whether the shared representation truly supports both AR text prediction and flow-based generation without modality-specific compromises.
Authors: We acknowledge that the manuscript lacks explicit equations and implementation details on the fusion process. In the revised version, we will add the mathematical formulation of the dual-path fusion within the 3D causal VAE latent space and describe the mechanisms used to mitigate distribution shift between image-only and video inputs. These additions will clarify how the shared representation enables native integration of autoregressive text prediction and flow-based generation without modality-specific compromises. revision: yes
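To make explicit what such a formulation might look like, here is one hypothetical way to write a gated dual-path fusion together with the standard conditional flow-matching objective in the style of Lipman et al.; the notation is illustrative only and should not be read as the paper's actual equations.

```latex
% Hypothetical notation: x_1 is the clean 3D-VAE latent, x_0 ~ N(0, I), t ~ U[0,1];
% u(x) and v(x) denote the spatial and temporal paths, f_theta the flow head.
\begin{align}
  h &= \sigma\!\big(W_g[\,u(x);\,v(x)\,]\big) \odot u(x)
       + \big(1 - \sigma(W_g[\,u(x);\,v(x)\,])\big) \odot v(x), \\
  x_t &= (1 - t)\,x_0 + t\,x_1, \qquad
  \mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{t,\,x_0,\,x_1}
      \big\| f_\theta(x_t, t, h) - (x_1 - x_0) \big\|^2 .
\end{align}
```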
-
Referee: [Training recipe and experiments] The two-stage training recipe is described as enabling effective learning and scaling, but the paper lacks analysis of how the stages interact with the dual-path fusion (e.g., stage-wise loss curves, ablation on stage ordering, or scaling behavior with model size) to support the claim of improved performance over prior unified models.
Authors: We agree that further analysis of the two-stage training's interaction with the dual-path fusion would strengthen the claims. The original manuscript describes the recipe at a high level but does not provide stage-wise loss curves, ordering ablations, or scaling behavior with model size. In the revision, we will include these analyses to demonstrate how the stages enable effective learning and scaling, supporting the performance improvements over prior unified models. revision: yes
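A generic two-stage loop of the sort described is sketched below; which parameters each stage trains, and the choice to freeze the language model in stage one, are assumptions for illustration rather than the paper's actual recipe.

```python
# Hedged sketch of a two-stage recipe: adapt new visual components first, then
# train everything jointly. `model.compute_loss` is a hypothetical helper.
def train_two_stage(model, stage1_loader, stage2_loader, optimizer_fn):
    # Stage 1: train the new components (fusion + flow head) with the pretrained
    # language model frozen, preserving its understanding ability.
    for p in model.language_model.parameters():
        p.requires_grad = False
    opt = optimizer_fn([p for p in model.parameters() if p.requires_grad])
    for batch in stage1_loader:
        opt.zero_grad()
        model.compute_loss(batch).backward()
        opt.step()

    # Stage 2: unfreeze everything and train jointly at larger scale so the
    # autoregressive and flow-matching objectives co-adapt.
    for p in model.parameters():
        p.requires_grad = True
    opt = optimizer_fn(model.parameters())
    for batch in stage2_loader:
        opt.zero_grad()
        model.compute_loss(batch).backward()
        opt.step()
```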
Circularity Check
No circularity: claims rest on empirical results, not self-referential definitions or fitted predictions.
full rationale
The paper describes an architectural design (3D causal VAE with dual-path spatial-temporal fusion, AR + flow heads, two-stage training) and reports empirical versatility on multimodal tasks. No equations or derivations are presented that reduce by construction to their own inputs. The central claim of scalability and effectiveness is framed as an outcome of training rather than a tautological redefinition. Self-citations to prior Show-o work, if present, are not load-bearing for the core assertions, which depend on experimental validation instead of imported uniqueness theorems or ansatzes.
Axiom & Free-Parameter Ledger
free parameters (1)
- model scale and training hyperparameters
axioms (2)
- domain assumption: A 3D causal variational autoencoder produces unified visual representations suitable for both images and videos.
- domain assumption: Autoregressive modeling on the language head and flow matching on the flow head can be applied natively within the same model.
Forward citations
Cited by 28 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
-
Exploring Spatial Intelligence from a Generative Perspective
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
-
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
-
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
StepSTEM benchmark and step-level DP evaluation show top MLLMs achieve only 38.29% accuracy on fine-grained multimodal STEM reasoning, relying primarily on textual cues.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Nucleus-Image: Sparse MoE for Image Generation
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
-
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
-
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
-
Qwen-Image Technical Report
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
-
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.