pith. machine review for the scientific record.

arxiv: 2410.13848 · v1 · submitted 2024-10-17 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 3 theorem links · Lean Theorem

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords multimodal · understanding · generation · janus · unified · visual encoding · decoupling

The pith

Decoupling the visual encoder into separate pathways lets a single transformer handle both multimodal understanding and generation without performance trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that prior unified models suffer because one visual encoder cannot simultaneously serve the fine-grained details needed for understanding and the coarser requirements of generation. Janus fixes this by routing the two tasks through independent visual encoders while keeping a single autoregressive transformer that processes the combined outputs. This split removes the conflict at the encoding stage and gives each task freedom to pick its preferred encoding method. Experiments indicate the resulting model beats earlier unified systems and reaches or exceeds the accuracy of models built for only one task. The approach matters because it shows a practical route to versatile multimodal systems that do not force compromises between seeing and creating.

Core claim

Janus is an autoregressive framework that unifies multimodal understanding and generation. Prior work such as Chameleon relies on a single visual encoder for both tasks, which creates suboptimal performance due to differing information granularity. Janus decouples visual encoding into separate pathways while retaining a single unified transformer for processing. The separation alleviates the role conflict for the visual encoder and increases flexibility, allowing each component to select its most suitable encoding method independently. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models.

What carries the argument

Decoupled visual encoding pathways that feed a shared autoregressive transformer.
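A minimal sketch of this carrier, assuming a PyTorch-style implementation (illustrative only, not the authors' code): the understanding pathway projects continuous image features into the shared embedding space, the generation pathway embeds discrete image-token ids, and both are concatenated with text embeddings before one shared transformer. Encoder choices, dimensions, module names, and the omitted causal mask are assumptions made for brevity.

```python
# Sketch only: two independent visual pathways, one shared autoregressive transformer.
import torch
import torch.nn as nn

class DecoupledVisualFrontend(nn.Module):
    def __init__(self, d_model=2048, und_dim=1024, gen_vocab=16384):
        super().__init__()
        self.und_proj = nn.Linear(und_dim, d_model)        # understanding features -> shared space
        self.gen_embed = nn.Embedding(gen_vocab, d_model)  # discrete image tokens -> shared space

    def encode_for_understanding(self, feats):   # feats: [B, N, und_dim] from a semantic encoder
        return self.und_proj(feats)

    def embed_for_generation(self, token_ids):   # token_ids: [B, M] from a discrete tokenizer
        return self.gen_embed(token_ids)

class UnifiedAutoregressiveModel(nn.Module):
    def __init__(self, d_model=2048, text_vocab=32000, img_vocab=16384, layers=12):
        super().__init__()
        self.frontend = DecoupledVisualFrontend(d_model, gen_vocab=img_vocab)
        self.text_embed = nn.Embedding(text_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, layers)  # stands in for the shared LLM
        self.text_head = nn.Linear(d_model, text_vocab)          # next text token
        self.image_head = nn.Linear(d_model, img_vocab)          # next image token

    def forward(self, text_ids, und_feats=None, gen_token_ids=None):
        parts = []
        if und_feats is not None:                 # multimodal understanding input
            parts.append(self.frontend.encode_for_understanding(und_feats))
        parts.append(self.text_embed(text_ids))
        if gen_token_ids is not None:             # image tokens produced so far (generation)
            parts.append(self.frontend.embed_for_generation(gen_token_ids))
        h = self.transformer(torch.cat(parts, dim=1))  # causal masking omitted for brevity
        return self.text_head(h), self.image_head(h)
```

Swapping either pathway, for example a different semantic encoder for understanding or a different discrete tokenizer for generation, leaves the shared transformer untouched, which is the flexibility claim made above.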

If this is right

  • Unified models can now reach parity with specialized understanding and generation systems without one task degrading the other.
  • Each task can adopt the visual encoder best suited to its granularity needs while the transformer remains shared.
  • The architecture remains simple and autoregressive, preserving the ability to generate sequences that mix text and images.
  • Flexibility increases because understanding and generation pipelines can be tuned or swapped independently.
  • The same decoupling pattern could be applied to additional modalities without redesigning the core transformer.
  • The split may reduce training conflicts and allow faster convergence when scaling to larger models.
  • Similar encoder separation could resolve granularity mismatches in other multimodal settings such as video or audio.

Load-bearing premise

The main performance bottleneck in unified multimodal models is a conflict inside a single visual encoder caused by mismatched granularity needs for understanding versus generation.

What would settle it

A direct comparison in which a single-encoder unified model matches or exceeds Janus on multimodal understanding benchmarks while keeping generation quality intact would show that decoupling is not required.

read the original abstract

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Janus, an autoregressive framework for unified multimodal understanding and generation. It decouples visual encoding into separate pathways for the two tasks (to resolve conflicts from differing information granularity) while retaining a single shared transformer. The approach is claimed to improve flexibility by allowing independent encoder selection per task, with experiments showing it surpasses prior unified models (e.g., Chameleon) and matches or exceeds task-specific models.

Significance. If the performance gains hold under controlled conditions, the work offers a simple, flexible architecture for next-generation unified multimodal models. The explicit separation of visual pathways while preserving a shared transformer is a pragmatic design choice that could reduce task interference without requiring fully separate models.

major comments (1)
  1. [Experiments] Experiments section: the central claim that decoupling resolves the granularity conflict and drives the reported gains requires an ablation that holds encoder capacity, training data, and transformer size fixed while toggling only the single-vs-dual visual pathway. Current comparisons to Chameleon and task-specific baselines do not isolate this factor, so improvements could arise from stronger per-task encoders or training details rather than removal of the hypothesized conflict.
minor comments (1)
  1. [Abstract] Abstract: 'surpasses previous unified model' should read 'surpasses previous unified models' for grammatical correctness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The suggestion to include a controlled ablation isolating the dual visual pathways is valuable and will strengthen the central claim. We address the major comment below and will revise the experiments section accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that decoupling resolves the granularity conflict and drives the reported gains requires an ablation that holds encoder capacity, training data, and transformer size fixed while toggling only the single-vs-dual visual pathway. Current comparisons to Chameleon and task-specific baselines do not isolate this factor, so improvements could arise from stronger per-task encoders or training details rather than removal of the hypothesized conflict.

    Authors: We agree that the current comparisons do not fully isolate the contribution of the dual visual pathways. In the revised manuscript we will add a controlled ablation that trains a single-visual-encoder baseline using the same total encoder capacity, identical training data mixture, and the same transformer size and training schedule as Janus. This will directly measure the effect of toggling between single and dual pathways while holding all other factors fixed, thereby providing clearer evidence that the observed gains arise from reduced task interference rather than encoder strength or training details alone. revision: yes
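As a reading aid, here is a small hypothetical sketch of how the two arms of the promised ablation could be pinned down so that only the visual-pathway setting varies; every field name and number below is a placeholder, not a value taken from the paper or the rebuttal.

```python
# Hypothetical ablation spec: toggle single vs. dual visual pathways, hold everything else fixed.
from dataclasses import dataclass, replace, fields

@dataclass(frozen=True)
class AblationConfig:
    visual_pathways: str              # "single" or "dual" -- the only factor under test
    encoder_params_m: int = 600       # total visual-encoder parameter budget (millions), matched
    transformer_params_b: float = 1.3 # shared transformer size (billions), matched
    data_mixture: str = "same_training_mixture"
    train_steps: int = 300_000
    lr_schedule: str = "cosine"

single_encoder_baseline = AblationConfig(visual_pathways="single")
dual_pathway_janus_style = replace(single_encoder_baseline, visual_pathways="dual")

# Sanity check: every controlled factor is identical across the two runs.
for f in fields(AblationConfig):
    if f.name != "visual_pathways":
        assert getattr(single_encoder_baseline, f.name) == getattr(dual_pathway_janus_style, f.name)
assert single_encoder_baseline.visual_pathways != dual_pathway_janus_style.visual_pathways
```

Any benchmark gap measured between these two matched runs can then be attributed to the pathway split rather than to encoder strength or training details.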

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an architectural proposal to decouple visual encoders for multimodal understanding and generation within a shared transformer, motivated by differing granularity requirements. No mathematical derivation, equations, or first-principles predictions are provided that reduce by construction to fitted inputs, self-citations, or renamed known results. The central claim rests on empirical comparisons to prior models rather than any self-referential logic or load-bearing self-citation chain. This is a standard empirical architecture paper whose reasoning chain is independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the premise that separate visual pathways can be integrated into one transformer without performance loss; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Decoupling visual encoding alleviates the conflict between understanding and generation roles.
    Stated as the core motivation and solution in the abstract.

pith-pipeline@v0.9.0 · 5491 in / 1070 out tokens · 27726 ms · 2026-05-15T22:06:37.001843+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    both the multimodal understanding and generation components can independently select their most suitable encoding methods

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Unified and Controllable Framework for Layered Image Generation with Visual Effects

    cs.CV 2026-01 unverdicted novelty 7.0

    LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.

  2. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  3. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  4. MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

    cs.CL 2026-04 unverdicted novelty 6.0

    MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.

  5. MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

  6. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  7. Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.

  8. Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models

    cs.LG 2026-03 unverdicted novelty 6.0

    Cornserve introduces a task abstraction and record-and-replay runtime for Any-to-Any multimodal models, achieving up to 3.81x higher throughput and 5.79x lower tail latency through component disaggregation and direct ...

  9. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  10. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  11. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  12. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  13. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  14. Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems

    cs.RO 2026-03 unverdicted novelty 5.0

    Fine-tuning Qwen-Math-7B with LoRA and GRPO on BlueSky simulator data improves LLM accuracy and consistency in cooperative sUAS tactical deconfliction, reducing near mid-air collisions.

  15. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  16. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  17. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    cs.CV 2025-05 conditional novelty 5.0

    BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

  18. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  19. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

  20. When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

    cs.CV 2026-05 unverdicted novelty 4.0

    Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.

  21. When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

    cs.CV 2026-05 unverdicted novelty 4.0

    Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.

  22. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI 2026-04 unverdicted novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

  23. OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

    cs.CV 2026-02 unverdicted novelty 4.0

    OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

  24. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 23 Pith papers · 36 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024

  3. [3]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  4. [4]

    Y. Bai, X. Wang, Y.-p. Cao, Y. Ge, C. Yuan, and Y. Shan. Dreamdiffusion: Generating high-quality images from brain eeg signals. arXiv preprint arXiv:2306.16934, 2023

  5. [5]

    X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

  6. [6]

    T. B. Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  7. [7]

    Maskgit: Masked generative image transformer

    H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

  8. [8]

    Muse: Text-to-image generation via masked generative transformers

    H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023

  9. [9]

    J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  10. [10]

    L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023

  11. [11]

    X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  12. [12]

    Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024

  13. [13]

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  14. [14]

    X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023

  15. [15]

    X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024

  16. [16]

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  17. [17]

    Laion-aesthetics-umap

    dclure. Laion-aesthetics-umap. https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap, 2022

  18. [18]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  19. [19]

    J. Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  20. [20]

    Diffusion models beat gans on image synthesis

    P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  21. [21]

    R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023

  22. [22]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  23. [23]

    Detailed caption dataset

    Echo840. Detailed caption dataset. https://huggingface.co/datasets/echo840/Detailed_Caption, 2023

  24. [24]

    Taming transformers for high-resolution image synthesis

    P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  25. [25]

    C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  26. [26]

    Make-a-scene: Scene-based text-to-image generation with human priors

    O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022

  27. [27]

    Y. Ge, Y. Ge, Z. Zeng, X. Wang, and Y. Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023

  28. [28]

    Y. Ge, S. Zhao, Z. Zeng, Y. Ge, C. Li, X. Wang, and Y. Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023

  29. [29]

    Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

  30. [30]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024

  31. [31]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  32. [32]

    Hai-llm: Efficient and lightweight training tool for large models, 2023

    High-flyer. Hai-llm: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm

  33. [33]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  34. [34]

    Screenqa: Large-scale question-answer pairs over mobile app screenshots

    Y.-C. Hsiao, F. Zubach, M. Wang, et al. Screenqa: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199, 2022

  35. [35]

    D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  36. [36]

    Y. Jin, K. Xu, L. Chen, C. Liao, J. Tan, B. Chen, C. Lei, A. Liu, C. Song, X. Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023

  37. [37]

    Scaling up gans for text-to-image synthesis

    M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124–10134, 2023

  38. [38]

    Segment anything

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

  39. [39]

    WikiHow: A Large Scale Text Summarization Dataset

    M. Koupaee and W. Y. Wang. Wikihow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018

  40. [40]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020

  41. [41]

    Introducing idefics: An open reproduction of state-of-the-art visual language model

    H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, and et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics

  42. [42]

    B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  43. [43]

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  44. [44]

    D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

  45. [45]

    L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231, 2024

  46. [46]

    T. Li, Y. Tian, H. Li, M. Deng, and K. He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024

  47. [47]

    X. Li, F. Zhang, H. Diao, Y. Wang, X. Wang, and L.-Y. Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception. arXiv preprint arXiv:2407.08303, 2024

  48. [48]

    Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

  49. [49]

    Z. Li, X. Yang, K. Choi, W. Zhu, R. Hsieh, H. Kim, J. H. Lim, S. Ji, B. Lee, X. Yan, et al. Mmsci: A multimodal multi-discipline dataset for phd-level scientific comprehension. arXiv preprint arXiv:2407.04903, 2024

  50. [50]

    H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  51. [51]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  52. [52]

    H. Liu, W. Yan, M. Zaharia, and P. Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024

  53. [53]

    M. Liu, R. Shi, K. Kuang, Y. Zhu, X. Li, S. Han, H. Cai, F. Porikli, and H. Su. Openshape: Scaling up 3d shape representation towards open-world understanding. Advances in neural information processing systems, 36, 2024

  54. [54]

    Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  55. [55]

    H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, Y. Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

  56. [56]

    P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, and S.-C. Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021

  57. [57]

    Megalith-huggingface

    madebyollin. Megalith-huggingface. https://huggingface.co/datasets/madebyollin/megalith-10m, 2024

  58. [58]

    Yfcc-huggingface

    mehdidc. Yfcc-huggingface. https://huggingface.co/datasets/mehdidc/yfcc15m, 2024

  59. [59]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  60. [60]

    J. Pan, K. Sun, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, J. Dai, Y. Qiao, and H. Li. Journeydb: A benchmark for generative image understanding, 2023

  61. [61]

    Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022

  62. [62]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  63. [63]

    Dalle3-high-quality-captions

    ProGamerGov. Dalle3-high-quality-captions. https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions, 2024

  64. [64]

    A. Radford. Improving language understanding by generative pre-training. 2018

  65. [65]

    Zero-shot text-to-image generation

    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021

  66. [66]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  67. [67]

    High-resolution image synthesis with latent diffusion models

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  68. [68]

    Photorealistic text-to-image diffusion models with deep language understanding

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  69. [69]

    S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar. Kvqa: Knowledge-aware visual question answering. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8876–8884, 2019

  70. [70]

    From pixels to prose: A large dataset of dense image captions

    V. Singla, K. Yue, S. Paul, R. Shirkavand, M. Jayawardhana, A. Ganjdanesh, H. Huang, A. Bhatele, G. Somepalli, and T. Goldstein. From pixels to prose: A large dataset of dense image captions. arXiv preprint arXiv:2406.10328, 2024

  71. [71]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  72. [72]

    Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning

    K. Srinivasan, K. Raman, J. Chen, M. Bendersky, and M. Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pages 2443–2449, 2021

  73. [73]

    P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  74. [74]

    Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  75. [75]

    Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023

  76. [76]

    Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Y. Wang, Y. Rao, J. Liu, T. Huang, and X. Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024

  77. [77]

    C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  78. [78]

    G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  79. [79]

    K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024

  80. [80]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Showing first 80 references.