pith. machine review for the scientific record.

arxiv: 2404.14396 · v2 · submitted 2024-04-22 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal foundation models · image comprehension · image generation · multi-granularity · arbitrary size images · vision-language models · SEED-X

The pith

SEED-X is a single multimodal model that comprehends arbitrary-sized images and generates at multiple levels of detail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SEED-X as a unified foundation model that combines comprehension of images in any size or ratio with generation of images at varying granularities. This builds on prior work like SEED-LLaMA by addressing limits in handling diverse visual data and user instructions. The approach shows competitive results on public benchmarks and improved performance in real-world applications after instruction tuning. If successful, it narrows the gap between current multimodal capabilities and practical use across domains.

Core claim

SEED-X is a unified and versatile foundation model that models multi-granularity visual semantics for both comprehension and generation tasks, specifically by integrating arbitrary-size and arbitrary-ratio image comprehension with multi-granularity image generation.

What carries the argument

SEED-X, the unified model architecture that handles multi-granularity visual semantics for both comprehension and generation.

If this is right

  • The model handles images of arbitrary sizes and ratios during comprehension.
  • It supports image generation at multiple levels of granularity.
  • It achieves competitive performance on existing public benchmarks.
  • After instruction tuning it works effectively across various real-world domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This unification may reduce reliance on separate models for different visual tasks in applications like editing or design.
  • It opens a path for testing whether similar multi-granularity approaches improve other modalities such as video or audio.
  • Future work could measure whether the unified approach scales to more complex interactive scenarios without separate fine-tuning.

Load-bearing premise

That adding arbitrary-size image comprehension and multi-granularity generation will close the real-world applicability gap without instruction tuning introducing new performance limits.

What would settle it

A direct comparison showing SEED-X underperforms specialized models on arbitrary-size images or multi-granularity generation tasks, or fails to improve outcomes in real-world instruction-tuned applications.

read the original abstract

The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, codes, and datasets are released in https://github.com/AILab-CVC/SEED-X.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SEED-X as a unified multimodal foundation model extending prior SEED-LLaMA work. It integrates two features—comprehension of arbitrary-size and arbitrary-ratio images plus multi-granularity image generation—to model multi-granularity visual semantics for both comprehension and generation tasks. The paper reports competitive results on public benchmarks and real-world effectiveness across domains after instruction tuning, with models, code, and datasets released.

Significance. If the central claims hold, the work is significant for advancing multimodal models toward real-world applicability by addressing limitations in handling diverse visual inputs and instructions. The open release of models and code is a clear strength that supports reproducibility and further research on versatile foundation models.

major comments (2)
  1. [§4 (Experiments)] The claim that SEED-X preserves prior SEED-LLaMA capabilities after adding arbitrary-size comprehension and multi-granularity generation via instruction tuning is load-bearing for the unified-model thesis, yet no explicit before/after comparisons or ablations isolating the tuning stage's impact on standard VQA, captioning, or generation metrics are shown. This leaves open the possibility that the new features introduce trade-offs.
  2. [§3 (Method)] The description of how the architecture unifies multi-granularity visual semantics for both comprehension and generation lacks sufficient detail on parameter sharing, resolution handling, and any mechanisms that prevent degradation of core performance when scaling to arbitrary image sizes.
minor comments (2)
  1. [Abstract] The abstract states 'competitive results on public benchmarks' without naming the specific benchmarks or providing quantitative deltas versus SEED-LLaMA; adding these would improve clarity.
  2. Figure captions and table headers could more explicitly label which metrics correspond to comprehension versus generation tasks to aid cross-referencing with the claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments help clarify how to strengthen the presentation of SEED-X's unified capabilities. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The claim that SEED-X preserves prior SEED-LLaMA capabilities after adding arbitrary-size comprehension and multi-granularity generation via instruction tuning is load-bearing for the unified-model thesis, yet no explicit before/after comparisons or ablations isolating the tuning stage's impact on standard VQA, captioning, or generation metrics are shown. This leaves open the possibility that the new features introduce trade-offs.

    Authors: We agree that explicit before/after comparisons would strengthen the claim. The current manuscript shows competitive results on public benchmarks after instruction tuning but does not include direct ablations against SEED-LLaMA on standard VQA and captioning tasks. In the revision we will add a new table (and corresponding text in §4) reporting SEED-X versus SEED-LLaMA performance on VQAv2, OK-VQA, COCO captioning, and text-to-image generation metrics to quantify any trade-offs introduced by the new features. revision: yes

  2. Referee: [§3 (Method)] The description of how the architecture unifies multi-granularity visual semantics for both comprehension and generation lacks sufficient detail on parameter sharing, resolution handling, and any mechanisms that prevent degradation of core performance when scaling to arbitrary image sizes.

    Authors: We acknowledge the description in §3 is concise and will expand it. The core language model and vision encoder parameters are shared with SEED-LLaMA; arbitrary-resolution handling is achieved via an adaptive positional embedding layer and multi-scale feature pooling before the LLM; multi-granularity generation uses a single unified decoder with granularity-specific tokens. To mitigate degradation we freeze the base SEED-LLaMA weights during the first stage of instruction tuning and only update the newly added modules. The revised §3 will include a detailed diagram, equations for the resolution-adaptive module, and a paragraph explaining the staged training schedule. revision: yes
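To make the rebuttal's description concrete, the sketch below shows one way a resolution-adaptive visual front end of this kind could be wired: positional embeddings interpolated to an arbitrary patch grid, multi-scale pooling of patch features, and a shared projection into the LLM embedding space. The module name, dimensions, and pooling scales are illustrative assumptions, not details taken from the paper or from the simulated rebuttal.

```python
# Illustrative sketch only (assumed names and shapes, not the authors' implementation):
# a resolution-adaptive visual front end that interpolates positional embeddings to an
# arbitrary patch grid and pools patch features at several scales before the LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveVisualFrontEnd(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096, base_grid=16, scales=(1, 2, 4)):
        super().__init__()
        # Positional embeddings learned at a fixed base grid (e.g. 16x16 patches).
        self.pos_embed = nn.Parameter(torch.zeros(1, vit_dim, base_grid, base_grid))
        self.scales = scales
        # Shared projection from the vision-encoder feature space into the LLM embedding space.
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, patch_feats, grid_h, grid_w):
        # patch_feats: (B, grid_h * grid_w, vit_dim) from a (frozen) vision encoder.
        B, _, C = patch_feats.shape
        feats = patch_feats.transpose(1, 2).reshape(B, C, grid_h, grid_w)
        # Resize positional embeddings to the actual, possibly non-square, grid.
        pos = F.interpolate(self.pos_embed, size=(grid_h, grid_w),
                            mode="bicubic", align_corners=False)
        feats = feats + pos
        # Multi-scale pooling: a coarse-to-fine pyramid of visual tokens for the LLM.
        tokens = []
        for s in self.scales:
            pooled = F.adaptive_avg_pool2d(feats, output_size=(s, s))   # (B, C, s, s)
            tokens.append(pooled.flatten(2).transpose(1, 2))            # (B, s*s, C)
        visual_tokens = torch.cat(tokens, dim=1)                        # (B, sum(s^2), C)
        return self.proj(visual_tokens)                                 # (B, T, llm_dim)

# Example: a 24x14 patch grid from an arbitrary-ratio image.
frontend = AdaptiveVisualFrontEnd()
feats = torch.randn(2, 24 * 14, 1024)
print(frontend(feats, 24, 14).shape)  # torch.Size([2, 21, 4096])
```

Under this reading, multi-granularity generation would be steered by granularity-specific control tokens fed to a single decoder, and the staged schedule described above would freeze the SEED-LLaMA backbone while only the new front-end and token parameters are updated; how closely this matches the released code is something the revised §3 would have to confirm.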

Circularity Check

0 steps flagged

Minor self-citation to SEED-LLaMA baseline; extension claims remain independent

full rationale

The paper positions SEED-X as an extension of prior SEED-LLaMA work by adding arbitrary-size image comprehension and multi-granularity generation, followed by instruction tuning. The abstract and description reference the prior model only as a starting point for capability gaps, without any equations, fitted parameters, or predictions that reduce to self-cited values by construction. Competitive benchmark results and real-world application claims are presented as independent evidence, so the derivation chain is self-contained with only non-load-bearing self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions that large multimodal models can be trained to acquire the described capabilities and that instruction tuning will translate benchmark gains into real-world utility; no new entities are postulated.

free parameters (1)
  • training hyperparameters
    Typical large-model training requires many fitted values such as learning rates and batch sizes, though none are specified in the abstract.
axioms (1)
  • domain assumption: Multimodal foundation models can be extended to handle arbitrary image sizes and multi-granularity tasks through architectural and training modifications.
    Invoked when claiming the new features bridge the applicability gap.

pith-pipeline@v0.9.0 · 5511 in / 1270 out tokens · 85244 ms · 2026-05-15T22:45:38.031890+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Transfer between Modalities with MetaQueries

    cs.CV 2025-04 unverdicted novelty 7.0

    MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.

  2. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  3. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    cs.CV 2024-10 unverdicted novelty 7.0

    Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

  4. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  5. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  6. Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

    cs.CV 2026-04 unverdicted novelty 6.0

    IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

  7. Multimodal Large Language Models for Multi-Subject In-Context Image Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods o...

  8. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  9. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  10. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  11. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  12. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  13. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  14. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  15. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    cs.CV 2025-05 conditional novelty 5.0

    BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

  16. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  17. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  18. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 18 Pith papers · 23 internal anchors

  1. [1]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML, 2023

  2. [2]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  3. [3]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  4. [4]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

  5. [5]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  6. [6]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  7. [7]

    Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition

    Pan Zhang, Xiaoyi Dong Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023

  8. [8]

    Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023

  9. [9]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  11. [11]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  12. [13]

    Scaling autoregressive multi-modal models: Pretraining and instruction tuning

    Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023

  13. [14]

    Planting a seed of vision in large language model

    Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023

  14. [15]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023

  15. [17]

    Dreamllm: Synergistic multimodal comprehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023

  16. [18]

    Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023

  17. [19]

    Vl-gpt: A generative pre-trained transformer for vision and language understanding and generation

    Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, and Ying Shan. Vl-gpt: A generative pre-trained transformer for vision and language understanding and generation. arXiv preprint arXiv:2312.09251, 2023

  18. [20]

    Unified language-vision pretraining with dynamic discrete visual tokenization

    Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023

  19. [21]

    Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023

  20. [22]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  21. [23]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  22. [24]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

  23. [25]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023

  24. [26]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  25. [27]

    Journeydb: A benchmark for generative image understanding

    Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36, 2024

  26. [28]

    Laion-aesthetics

    Christoph Schuhmann and Romain Beaumont. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022

  27. [29]

    Unsplash. https://github.com/unsplash/datasets, 2023

    Zahid Ali, Chesser Luke, and Carbone Timothy. Unsplash. https://github.com/unsplash/datasets, 2023

  28. [30]

    Laion-coco: 600m synthetic captions from laion2b-en

    Christoph Schuhmann, Andreas Köpf, Richard Vencu, Theo Coombes, and Romain Beaumont. Laion-coco: 600m synthetic captions from laion2b-en. https://laion.ai/blog/laion-coco/, 2023

  29. [31]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  30. [32]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. arXiv preprint arXiv:2306.10012, 2023

  31. [33]

    Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023

  32. [34]

    Mobilevlm v2: Faster and stronger baseline for vision language model

    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024

  33. [35]

    Llava-phi: Efficient multi-modal assistant with small language model

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024

  34. [36]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024

  35. [37]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  36. [38]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  37. [39]

    Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023

    Hugo Laurençon, Daniel van Strien, Stas Bekman, Leo Tronchon, Lucile Saulnier, Thomas Wang, Siddharth Karamcheti, Amanpreet Singh, Giada Pistilli, Yacine Jernite, and et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023

  38. [40]

    Next-gpt: Any-to-any multimodal llm

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023

  39. [41]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  40. [42]

    World model on million-length video and language with ringattention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024

  41. [43]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  42. [44]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  43. [45]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

  44. [46]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  45. [47]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  46. [48]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  47. [49]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

  48. [50]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  49. [51]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024

  50. [52]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  51. [53]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  52. [54]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  53. [55]

    Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024

  54. [56]

    Laion coco: 600m synthetic captions from laion2b-en

    Schuhmann Christoph, Köpf Andreas, Vencu Richard, Coombes Theo, and Beaumont Romain. Laion coco: 600m synthetic captions from laion2b-en. [EB/OL], 2022. https://laion.ai/blog/laion-coco/

  55. [57]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

  56. [58]

    Journeydb: A benchmark for generative image understanding

    Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. arXiv preprint arXiv:2307.00716, 2023

  57. [59]

    Capsfusion: Rethinking image-text data at scale

    Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. arXiv preprint arXiv:2310.20550, 2023

  58. [60]

    Multimodal c4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023

  59. [61]

    Obelics: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023

  60. [62]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023

  61. [63]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding

    Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023

  62. [64]

    Mimic-it: Multi-modal in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023

  63. [65]

    MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019

  64. [66]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022

  65. [67]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016

  66. [68]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022

  67. [69]

    Kvqa: Knowledge-aware visual question answering

    Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. Kvqa: Knowledge-aware visual question answering. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8876–8884, 2019

  68. [70]

    Dvqa: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2018

  69. [71]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023

  70. [72]

    Vision-language instruction tuning: A review and analysis

    Chen Li, Yixiao Ge, Dian Li, and Ying Shan. Vision-language instruction tuning: A review and analysis. arXiv preprint arXiv:2311.08172, 2023

  71. [73]

    To see is to believe: Prompting gpt-4v for better visual instruction tuning

    Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023

  72. [74]

    Vision-flan: Scaling human-labeled tasks in visual instruction tuning

    Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, and Lifu Huang. Vision-flan: Scaling human-labeled tasks in visual instruction tuning. arXiv preprint arXiv:2402.11690, 2024

  73. [75]

    Allava: Harnessing gpt4v-synthesized data for a lite vision-language model, 2024

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model, 2024

  74. [76]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981, 2020

  75. [77]

    Visual storytelling

    Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, pages 1233–1239, 2016

  76. [78]

    Viton-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization

    Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2021