pith. machine review for the scientific record.

arxiv: 2506.15564 · v3 · submitted 2025-06-18 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Show-o2: Improved Native Unified Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 18:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal models · autoregressive modeling · flow matching · 3D causal variational autoencoder · multimodal generation · image/video understanding

The pith

Show-o2 enables unified multimodal understanding and generation by combining autoregressive modeling and flow matching on shared visual representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Show-o2, a native unified multimodal model that handles understanding and generation across text, images, and videos in one system. It builds on a language model, applying autoregressive modeling to text and flow matching to visuals. A 3D causal variational autoencoder with dual-path fusion of spatial and temporal features yields representations that work for both still images and video. A two-stage training recipe helps the model learn effectively and scale to larger sizes. If successful, this shows that multimodal tasks do not require entirely separate models for each data type or task.
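As a concrete reading of this setup, the two training objectives can be sketched in a few lines of numpy. Everything here is illustrative: the tensor shapes, the rectified-flow form of the interpolation path, and the stand-in flow head are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ar_text_loss(logits, targets):
    """Next-token cross-entropy on the language head."""
    probs = softmax(logits).reshape(-1, logits.shape[-1])
    idx = targets.reshape(-1)
    return float(-np.log(probs[np.arange(idx.size), idx]).mean())

def flow_matching_loss(flow_head, x0, x1, t):
    """Flow matching on the flow head: regress the velocity (x1 - x0)
    along the straight interpolation path x_t = (1 - t) * x0 + t * x1."""
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = flow_head(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))

# Toy shapes: (batch, seq, vocab) text logits; (batch, tokens, dim) visual latents.
logits = rng.normal(size=(2, 8, 100))
targets = rng.integers(0, 100, size=(2, 8))
x0 = rng.normal(size=(2, 4, 16))   # noise latents
x1 = rng.normal(size=(2, 4, 16))   # data latents from the VAE space
t = rng.random(size=(2, 1, 1))     # per-sample time in [0, 1]

flow_head = lambda xt, t: xt       # stand-in for the learned flow head
total = ar_text_loss(logits, targets) + flow_matching_loss(flow_head, x0, x1, t)
```

The point of the sketch is only that the two losses are additive over one shared backbone: text tokens supervise the language head, visual latents supervise the flow head.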

Core claim

Built on a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial-temporal fusion. This enables autoregressive modeling on the language head for text token prediction and flow matching on the flow head for image and video generation, all on top of a language model. A two-stage training recipe supports learning and scaling, allowing the Show-o2 models to handle a wide range of multimodal understanding and generation tasks across text, images, and videos.

What carries the argument

Dual-path spatial-temporal fusion in the 3D causal variational autoencoder space, which creates unified visual representations scalable from images to videos for both understanding and generation.
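One minimal way to picture a dual-path fusion is a per-frame (spatial) transform added to a causal cross-frame (temporal) mix, with a single image recovered as the one-frame case. This is a toy sketch under those assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def spatial_path(latents):
    """Per-frame projection: mixes channels within each frame independently."""
    T, H, W, C = latents.shape
    W_s = rng.normal(size=(C, C)) / np.sqrt(C)   # toy learned projection
    return latents @ W_s

def temporal_path(latents):
    """Cross-frame mixing: a causal running mean over the time axis,
    so frame t only sees frames <= t (mirroring the causal VAE)."""
    out = np.cumsum(latents, axis=0)
    counts = np.arange(1, latents.shape[0] + 1).reshape(-1, 1, 1, 1)
    return out / counts

def dual_path_fusion(latents):
    """Fuse the two paths; a single image is the T == 1 case, where the
    temporal path reduces to the identity."""
    return spatial_path(latents) + temporal_path(latents)

video = rng.normal(size=(8, 4, 4, 16))   # (frames, H, W, channels) toy latents
image = video[:1]                        # an image is a one-frame video
fused_video = dual_path_fusion(video)
fused_image = dual_path_fusion(image)
```

The design point the sketch isolates: because the temporal path degenerates gracefully at T = 1, one fusion module can serve images and videos without a modality switch.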

If this is right

  • The two-stage training allows scaling to larger models while maintaining performance across modalities.
  • Models can perform both understanding and generation tasks without modality-specific components.
  • Scalability to video is achieved by extending the fusion path to include temporal information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such unified models might simplify the deployment of multimodal AI systems by reducing the number of separate components needed.
  • Extending the dual-path fusion idea could apply to other sequential data like audio or 3D scenes.
  • Combining autoregressive and flow-based methods in this way may lead to better coherence between text understanding and visual generation.

Load-bearing premise

The dual-path fusion of spatial and temporal information in the 3D causal variational autoencoder produces representations that work well for both image and video modalities in understanding and generation tasks.
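The "causal" part of this premise can be illustrated with a toy temporal convolution that only looks backward in time. The kernel, replicate padding, and shapes below are invented for illustration and are not taken from the paper:

```python
import numpy as np

def causal_temporal_conv(x, kernel):
    """Causal convolution along the time axis: the output at frame t depends
    only on frames t-k+1 .. t, enforced by left (replicate) padding."""
    k = len(kernel)
    padded = np.concatenate([np.repeat(x[:1], k - 1, axis=0), x], axis=0)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        window = padded[t : t + k]              # frames t-k+1 .. t after padding
        out[t] = np.tensordot(kernel, window, axes=(0, 0))
    return out

rng = np.random.default_rng(2)
video = rng.normal(size=(6, 4, 4, 8))           # (frames, H, W, channels)
kernel = np.array([0.2, 0.3, 0.5])              # toy 3-tap temporal kernel

y = causal_temporal_conv(video, kernel)

# Causality check: changing a future frame must not affect earlier outputs.
perturbed = video.copy()
perturbed[4] += 10.0
y2 = causal_temporal_conv(perturbed, kernel)
```

The causality check is the property that lets a single latent space serve both modalities: an image encoded alone produces the same latent as the first frame of a video, because nothing downstream leaks backward.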

What would settle it

Observing whether Show-o2 models achieve competitive performance on standard video generation and understanding benchmarks compared to specialized models, while also excelling in text-based tasks.

read the original abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Show-o2, an improved native unified multimodal model that combines autoregressive modeling and flow matching. It is built on a 3D causal variational autoencoder to create unified visual representations via a dual-path spatial (-temporal) fusion mechanism, which is intended to scale across image and video modalities. Autoregressive modeling is applied to the language head for text token prediction, while flow matching is used in the flow head for image/video generation. A two-stage training recipe is employed to learn and scale the models, with the resulting Show-o2 claimed to handle a wide range of multimodal understanding and generation tasks across text, images, and videos. Code and models are released publicly.

Significance. If the empirical results and ablations confirm the claims, this work could advance unified multimodal architectures by demonstrating a native integration of AR and flow-based objectives in a shared 3D causal VAE latent space, potentially offering better scalability than modality-specific models. The public release of code and models supports reproducibility and further research in the field.

major comments (3)
  1. [Method (architecture and dual-path fusion description)] The central claim of versatility and scalability rests on the dual-path spatial (-temporal) fusion in the 3D causal VAE space, yet the manuscript provides no ablation studies, quantitative comparisons to single-path or concatenation baselines, or metrics (e.g., FID, CLIP scores, or temporal consistency measures) showing that this fusion avoids temporal artifacts on pure images or spatial degradation on videos. This is load-bearing for the versatility assertion.
  2. [Method (3D causal VAE and fusion)] No equations or implementation details are supplied for how the dual paths are fused in the 3D causal VAE latent space or how distribution shift is prevented between image-only and video inputs; without these, it is unclear whether the shared representation truly supports both AR text prediction and flow-based generation without modality-specific compromises.
  3. [Training recipe and experiments] The two-stage training recipe is described as enabling effective learning and scaling, but the paper lacks analysis of how the stages interact with the dual-path fusion (e.g., stage-wise loss curves, ablation on stage ordering, or scaling behavior with model size) to support the claim of improved performance over prior unified models.
minor comments (2)
  1. The parenthetical notation 'spatial (-temporal)' is ambiguous; explicit definitions or diagrams distinguishing the image-only path from the video path would improve clarity.
  2. [Abstract] The abstract and introduction would benefit from a brief table summarizing key quantitative results (e.g., benchmark scores on understanding and generation tasks) to ground the versatility claim.
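To make the requested temporal-consistency metrics concrete, here is one common proxy: mean cosine similarity between embeddings of consecutive frames. Real evaluations would use learned (e.g. CLIP-style) frame embeddings rather than raw vectors; this sketch is not a metric the paper reports.

```python
import numpy as np

def temporal_consistency(frame_embeddings):
    """Mean cosine similarity between consecutive frame embeddings,
    shape (frames, dim). Higher means smoother frame-to-frame content."""
    a = frame_embeddings[:-1]
    b = frame_embeddings[1:]
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(num / den))

# A perfectly static clip scores 1.0; unrelated random frames score near 0.
static = np.tile(np.ones((1, 64)), (8, 1))
rng = np.random.default_rng(3)
noisy = rng.normal(size=(8, 64))
```

An ablation of the kind the referee asks for would report this (or a stronger variant) for dual-path fusion versus single-path and concatenation baselines, alongside FID and CLIP scores on images.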

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below. We will incorporate the requested clarifications, details, and analyses into the revised manuscript to strengthen the presentation of the dual-path fusion and training recipe.

read point-by-point responses
  1. Referee: [Method (architecture and dual-path fusion description)] The central claim of versatility and scalability rests on the dual-path spatial (-temporal) fusion in the 3D causal VAE space, yet the manuscript provides no ablation studies, quantitative comparisons to single-path or concatenation baselines, or metrics (e.g., FID, CLIP scores, or temporal consistency measures) showing that this fusion avoids temporal artifacts on pure images or spatial degradation on videos. This is load-bearing for the versatility assertion.

    Authors: We agree that targeted ablations would provide stronger support for the versatility claims. While the original manuscript demonstrates overall performance gains across understanding and generation benchmarks, it does not include direct comparisons to single-path or concatenation baselines. In the revision, we will add ablation studies that compare the dual-path spatial-temporal fusion against these alternatives, reporting quantitative metrics including FID and CLIP scores for images as well as temporal consistency measures for videos. This will explicitly show the fusion's role in avoiding artifacts and degradation. revision: yes

  2. Referee: [Method (3D causal VAE and fusion)] No equations or implementation details are supplied for how the dual paths are fused in the 3D causal VAE latent space or how distribution shift is prevented between image-only and video inputs; without these, it is unclear whether the shared representation truly supports both AR text prediction and flow-based generation without modality-specific compromises.

    Authors: We acknowledge that the manuscript lacks explicit equations and implementation details on the fusion process. In the revised version, we will add the mathematical formulation of the dual-path fusion within the 3D causal VAE latent space and describe the mechanisms used to mitigate distribution shift between image-only and video inputs. These additions will clarify how the shared representation enables native integration of autoregressive text prediction and flow-based generation without modality-specific compromises. revision: yes

  3. Referee: [Training recipe and experiments] The two-stage training recipe is described as enabling effective learning and scaling, but the paper lacks analysis of how the stages interact with the dual-path fusion (e.g., stage-wise loss curves, ablation on stage ordering, or scaling behavior with model size) to support the claim of improved performance over prior unified models.

    Authors: We agree that further analysis of the two-stage training's interaction with the dual-path fusion would strengthen the claims. The original manuscript describes the recipe at a high level but does not provide stage-wise loss curves, ordering ablations, or scaling behavior with model size. In the revision, we will include these analyses to demonstrate how the stages enable effective learning and scaling, supporting the performance improvements over prior unified models. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical results, not self-referential definitions or fitted predictions.

full rationale

The paper describes an architectural design (3D causal VAE with dual-path spatial-temporal fusion, AR + flow heads, two-stage training) and reports empirical versatility on multimodal tasks. No equations or derivations are presented that reduce by construction to their own inputs. The central claim of scalability and effectiveness is framed as an outcome of training rather than a tautological redefinition. Self-citations to prior Show-o work, if present, are not load-bearing for the core assertions, which depend on experimental validation instead of imported uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim depends on the empirical success of standard machine-learning components whose effectiveness is not derived from first principles within the paper; free parameters include all learned weights and the many hyperparameters of the two-stage training process.

free parameters (1)
  • model scale and training hyperparameters
    The two-stage training recipe requires choosing model sizes, learning rates, and schedule details that are fitted or selected to achieve the reported versatility.
axioms (2)
  • domain assumption: A 3D causal variational autoencoder produces unified visual representations suitable for both images and videos.
    Invoked in the abstract as the foundation for the dual-path fusion without further justification.
  • domain assumption: Autoregressive modeling on the language head and flow matching on the flow head can be applied natively within the same model.
    Stated as the core mechanism for text prediction and visual generation.
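The kind of free parameters the ledger points to can be made explicit with a hypothetical two-stage recipe. Every module name, learning rate, and data mixture below is invented for illustration; the paper's actual recipe is not specified here.

```python
# Hypothetical two-stage recipe: stage 1 aligns the new heads on frozen
# language-model features; stage 2 fine-tunes everything at a lower rate.
two_stage_recipe = {
    "stage_1": {
        "trainable": ["flow_head", "fusion_projectors"],
        "frozen": ["language_model"],
        "lr": 1e-4,
        "data": ["image_text_pairs"],
    },
    "stage_2": {
        "trainable": ["language_model", "flow_head", "fusion_projectors"],
        "frozen": [],
        "lr": 2e-5,
        "data": ["image_text_pairs", "video_text_pairs", "instruction_data"],
    },
}

def stage_param_count(recipe, stage, sizes):
    """Sum the (hypothetical) parameter counts trained in a given stage."""
    return sum(sizes[m] for m in recipe[stage]["trainable"])

# Toy per-module parameter counts, in millions.
sizes = {"language_model": 1_500, "flow_head": 300, "fusion_projectors": 50}
```

Each of these choices is a fitted or selected quantity, which is what places them in the free-parameter column rather than among derived results.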

pith-pipeline@v0.9.0 · 5442 in / 1432 out tokens · 67880 ms · 2026-05-12T18:45:53.462795+00:00 · methodology

discussion (0)


Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  3. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  4. What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

  5. Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

    cs.AI 2026-05 unverdicted novelty 7.0

    A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

  6. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  7. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  8. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  9. Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.

  10. Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

  11. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  12. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  13. Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.

  14. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  15. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  16. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  17. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  18. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  19. Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    StepSTEM benchmark and step-level DP evaluation show top MLLMs achieve only 38.29% accuracy on fine-grained multimodal STEM reasoning, relying primarily on textual cues.

  20. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  21. Nucleus-Image: Sparse MoE for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

  22. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  23. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  24. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  25. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  26. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  27. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI 2026-04 unverdicted novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

  28. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

147 extracted references · 147 canonical work pages · cited by 27 Pith papers · 32 internal anchors

  1. [1]

    Accessed September 25, 2023 [Online] https://research.runwayml.com/gen2, 2023

    Gen-2. Accessed September 25, 2023 [Online] https://research.runwayml.com/gen2, 2023

  2. [2]

    Accessed December 28, 2023 [Online]https://www.pika.art/, 2023

    Pika 1.0. Accessed December 28, 2023 [Online]https://www.pika.art/, 2023

  3. [3]

    Accessed June 17, 2024 [Online] https://runwayml.com/research/ introducing-gen-3-alpha, 2024

    Gen-3. Accessed June 17, 2024 [Online] https://runwayml.com/research/ introducing-gen-3-alpha, 2024

  4. [4]

    Accessed June 6, 2024 [Online]https://klingai.kuaishou.com/, 2024

    Kling. Accessed June 6, 2024 [Online]https://klingai.kuaishou.com/, 2024

  5. [5]

    Ming-omni: A unified multimodal model for perception and generation, 2025

    Inclusion AI. Ming-omni: A unified multimodal model for perception and generation, 2025

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

  7. [7]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  8. [8]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InIEEE International Conference on Computer Vision, 2021

  9. [9]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InCVPR, 2023

  10. [10]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, † TimBrooks, Jianfeng Wang, Linjie Li, † LongOuyang, † Jun- tangZhuang, † JoyceLee, † YufeiGuo, † WesamManassra, † PrafullaDhariwal, † CaseyChu, † YunxinJiao, and Aditya Ramesh. Improving image generation with better captions

  11. [11]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  12. [12]

    Coyo-700m: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

  13. [13]

    Oneig-bench: Omni-dimensional nuanced evaluation for image generation

    Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation.arXiv preprint arxiv:2506.07977, 2025

  14. [14]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InCVPR, pages 3558–3568, 2021

  15. [15]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation.arXiv preprint arXiv:2310.19512, 2023

  16. [16]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024

  17. [17]

    Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation, 2024

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation, 2024

  18. [18]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

  19. [19]

    Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InICLR. OpenReview.net, 2024

  20. [20]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023

  21. [21]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024

  22. [22]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. InICML, pages 1691–1703, 2020

  23. [23]

    Panda-70m: Cap- tioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Cap- tioning 70m videos with multiple cross-modality teachers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  24. [24]

    Comm: A coherent interleaved image-text dataset for multimodal understanding and generation.arXiv preprint arXiv:2406.10462, 2024

    Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, and Long Chen. Comm: A coherent interleaved image-text dataset for multimodal understanding and generation.arXiv preprint arXiv:2406.10462, 2024. 16

  25. [25]

    Seine: Short-to-long video diffusion model for generative transition and prediction.arXiv preprint arXiv:2310.20700, 2023

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction.arXiv preprint arXiv:2310.20700, 2023

  26. [27]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  27. [28]

    Semhitok: A unified image tokenizer via semantic-guided hierarchical codebook for multimodal understanding and generation.arXiv preprint arXiv:2503.06764, 2025

    Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, and Xiaodan Liang. Semhitok: A unified image tokenizer via semantic-guided hierarchical codebook for multimodal understanding and generation.arXiv preprint arXiv:2503.06764, 2025

  28. [29]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  29. [30]

    Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms, 2024

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms, 2024

  30. [31]

    Fine- grained open domain image animation with motion guidance.arXiv preprint arXiv:2311.12886, 2023

    Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Fine- grained open domain image animation with motion guidance.arXiv preprint arXiv:2311.12886, 2023

  31. [32]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  32. [33]

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009

  33. [34]

    Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024

  34. [35]

    Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. arXiv preprint arXiv:2502.06788, 2025

  35. [36]

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. In ICLR, 2024

  36. [37]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  37. [38]

    Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. Unified autoregressive visual generation and understanding with continuous tokens. arXiv preprint arXiv:2503.13436, 2025

  38. [39]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. CoRR, abs/2306.13394, 2023

  39. [40]

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

  40. [41]

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS, 2023

  41. [42]

    Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, et al. Ming-lite-uni: Advancements in unified architecture for natural multimodal interaction. arXiv preprint arXiv:2505.02471, 2025

  42. [43]

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. International Conference on Learning Representations, 2024

  43. [44]

    HiDream-ai. Hidream-i1. https://github.com/HiDream-ai/HiDream-I1, 2025

  44. [45]

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

  45. [46]

    Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, and Hang Xu. Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934, 2025

  46. [47]

    Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016), 2016

  47. [48]

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  48. [49]

    Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709. Computer Vision Foundation / IEEE, 2019

  49. [50]

    Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. arXiv preprint arXiv:2504.04423, 2025

  50. [51]

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016

  51. [52]

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

  52. [53]

    Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, and Zhijie Deng. Orthus: Autoregressive interleaved image-text generation with modality-specific heads. arXiv preprint arXiv:2412.00127, 2024

  53. [54]

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  54. [55]

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  55. [56]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  56. [57]

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. ArXiv, abs/2402.17245, 2024

  57. [58]

    Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, and Jifeng Dai. Synergen-vl: Towards synergistic image understanding and generation with vision experts and token folding. arXiv preprint arXiv:2412.09604, 2024

  58. [59]

    Haopeng Li, Jinyue Yang, Guoqi Li, and Huan Wang. Autoregressive image generation with randomized parallel decoding. arXiv preprint arXiv:2503.10568, 2025

  59. [60]

    Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text. In The Thirteenth International Conference on Learning Representations, 2025

  60. [61]

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024

  61. [62]

    Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Ling-Yu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception. arXiv preprint arXiv:2407.08303, 2024

  62. [63]

    Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. arXiv preprint arXiv:2501.00289, 2024

  63. [64]

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  64. [65]

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025

  65. [66]

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

  66. [67]

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

  67. [68]

    Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, and Ying Shan. Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422, 2025

  68. [69]

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024

  69. [70]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023

  70. [71]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024

  71. [72]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024

  72. [73]

    Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, and Juan-Manuel Pérez-Rúa. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024

  73. [74]

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  74. [75]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024

  75. [76]

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023

  76. [77]

    Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321, 2025

  77. [78]

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang...

  78. [79]

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024

  79. [80]

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024

  80. [81]

    OpenAI. Gpt-4v. https://openai.com/index/gpt-4v-system-card/, 2023

Showing first 80 references.