LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Pith reviewed 2026-05-11 05:56 UTC · model grok-4.3
The pith
Treating multi-image, video, and 3D inputs as one interleaved format lets a single model handle them all without losing single-image performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating the interleaved data format as a general template and training on the M4-Instruct dataset of 1,177.6k samples across four domains, LLaVA-NeXT-Interleave reaches leading results on multi-image, video, and 3D benchmarks while retaining single-image performance and gaining the capacity to transfer tasks across settings and modalities.
What carries the argument
The interleaved data format, used as a single template to represent multi-image, multi-frame video, multi-view 3D, and multi-patch single-image inputs uniformly.
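As a rough illustration of what that single template buys, the sketch below shows one way an interleaved training record could be laid out so the same structure covers all four scenarios. The class and field names here are hypothetical, not the paper's actual schema; the only assumption carried over from the paper is that every visual input reduces to an ordered list of images referenced by placeholders in the text.

```python
# Minimal sketch (not the paper's actual schema): one record shape that could
# cover multi-image, video, 3D, and multi-patch single-image inputs. Field
# names and the <image> placeholder convention are illustrative assumptions.

from dataclasses import dataclass, field
from typing import List


@dataclass
class InterleavedSample:
    # Ordered visual inputs: N separate images, N video frames,
    # N 3D views, or N patches cropped from one high-resolution image.
    images: List[str] = field(default_factory=list)  # paths or tensors
    # Conversation turns; each "<image>" consumes the next item in `images`.
    conversation: List[dict] = field(default_factory=list)


# Multi-image example: compare two photos.
multi_image = InterleavedSample(
    images=["photo_a.jpg", "photo_b.jpg"],
    conversation=[
        {"role": "user",
         "content": "<image>\n<image>\nWhat changed between these two photos?"},
        {"role": "assistant",
         "content": "The second photo adds a red car in the driveway."},
    ],
)

# Video example: the same template, with sampled frames standing in for images.
video = InterleavedSample(
    images=[f"clip_frame_{i:03d}.jpg" for i in range(8)],
    conversation=[
        {"role": "user",
         "content": "<image>" * 8 + "\nDescribe the action in this clip."},
        {"role": "assistant",
         "content": "A person picks up a mug and sets it on a shelf."},
    ],
)
```

Under this framing, video is just "many frames as images", a 3D scene is "many views as images", and a high-resolution photo is "many patches as images", which is why one training recipe can serve all of them.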
If this is right
- The model achieves leading performance on multi-image benchmarks.
- It maintains prior levels of accuracy on single-image tasks.
- It records strong results on video and 3D benchmarks.
- It gains the ability to transfer learned tasks between different input settings and modalities.
Where Pith is reading between the lines
- The same training approach might reduce the need to maintain separate models for each visual input type.
- Longer sequences that mix still images with video clips or 3D views could become practical to handle in one forward pass.
- Real applications that combine reference images with video or 3D data, such as scene reconstruction from multiple views, would become simpler to implement.
Load-bearing premise
That training one model on the combined set of interleaved examples will let it perform well on every scenario without any drop in accuracy for the original single-image case.
What would settle it
A clear drop in single-image benchmark scores when the model is compared against a version trained only on single-image data would show that the unified approach creates a trade-off.
read the original abstract
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks; their applications to multi-image scenarios remain less explored. Additionally, prior LMM research tackles different scenarios separately, leaving it impossible to generalize across scenarios with newly emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaVA-NeXT-Interleave, an LMM that unifies multi-image, video (multi-frame), 3D (multi-view), and single-image (multi-patch) scenarios by treating them as instances of an interleaved data format. It compiles the M4-Instruct dataset (1,177.6k samples across 4 domains, 14 tasks, and 41 datasets) and the LLaVA-Interleave Bench. Through training and experiments, the model is claimed to achieve leading results on multi-image, video, and 3D benchmarks while preserving single-image performance and exhibiting emerging capabilities such as cross-scenario and cross-modality task transfer.
Significance. If the empirical results hold, the work is significant for advancing unified multimodal models beyond single-image focus. The M4-Instruct dataset and LLaVA-Interleave Bench provide reusable resources for studying generalization across visual input types. Demonstrating no performance trade-offs and cross-setting transfer would support broader applicability of interleaved training paradigms in LMMs.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The central claim that the model 'achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks' is load-bearing but unsupported by any quantitative metrics, baselines, ablation tables, or error analysis in the abstract. The full experimental section must supply these (e.g., specific benchmark scores vs. prior LMMs, per-domain breakdowns) to allow assessment of whether the interleaved format truly avoids trade-offs.
- [§3 and §4] §3 (Method) and §4: The weakest assumption—that a single interleaved training regime suffices for generalization across domains without degradation—requires explicit validation. An ablation comparing the unified model against separately trained domain-specific variants (or against LLaVA-NeXT baselines) is needed to confirm the 'no trade-off' result; without it, the cross-scenario transfer claims rest on untested design choices.
minor comments (2)
- [Abstract] Abstract: The dataset size '1,177.6k' should be accompanied by a per-domain breakdown (e.g., how many samples per multi-image vs. video) to clarify coverage.
- [Introduction] Introduction: Long sentences describing prior LMM limitations could be split for readability; consider adding a figure illustrating the interleaved format template.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. These suggestions have helped us identify areas for improvement in clarity and validation. We address each major comment point by point below and outline the revisions we will make.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that the model 'achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks' is load-bearing but unsupported by any quantitative metrics, baselines, ablation tables, or error analysis in the abstract. The full experimental section must supply these (e.g., specific benchmark scores vs. prior LMMs, per-domain breakdowns) to allow assessment of whether the interleaved format truly avoids trade-offs.
Authors: We agree that the abstract would be strengthened by including specific quantitative results to support the central claim. In the revised version, we will update the abstract to report key benchmark scores (e.g., leading performance on multi-image and video tasks relative to prior LMMs, and preserved single-image accuracy). The experimental section (§4) already contains extensive quantitative support, including tables with direct comparisons to prior LMMs such as LLaVA-NeXT and other specialized models, per-domain and per-task breakdowns across the 14 tasks and 41 datasets in M4-Instruct, ablation studies on training configurations, and analysis confirming no degradation on single-image benchmarks. We will expand the section with additional baseline results, explicit cross-references to the tables, and further error analysis to make the evidence for no trade-offs fully transparent. revision: yes
-
Referee: [§3 and §4] §3 (Method) and §4: The weakest assumption—that a single interleaved training regime suffices for generalization across domains without degradation—requires explicit validation. An ablation comparing the unified model against separately trained domain-specific variants (or against LLaVA-NeXT baselines) is needed to confirm the 'no trade-off' result; without it, the cross-scenario transfer claims rest on untested design choices.
Authors: We appreciate the emphasis on rigorously validating the unified interleaved training paradigm. Our current experiments in §4 already include comparisons of the unified LLaVA-NeXT-Interleave model against the LLaVA-NeXT baseline on single-image tasks, where performance is maintained or improved, and against domain-specialized models on video and 3D benchmarks. To provide the requested direct validation, we will add a new ablation study in the revised manuscript: we will train separate domain-specific variants on the corresponding subsets of M4-Instruct (e.g., video-only and 3D-only) and compare their performance to the unified model, particularly on cross-scenario transfer tasks. This will explicitly demonstrate the benefits of the single interleaved regime without degradation. revision: yes
Circularity Check
No significant circularity detected; empirical results rest on new data and training
full rationale
The paper's core contribution is empirical: it compiles a new M4-Instruct dataset (1,177.6k interleaved samples) and LLaVA-Interleave Bench, trains LLaVA-NeXT-Interleave on them, and reports benchmark numbers. No equations, predictions, or first-principles derivations are present that reduce to fitted inputs or self-citations by construction. The unifying 'interleaved format' is an explicit design choice and data-compilation strategy, not a tautology. Self-citations to prior LLaVA work describe the base model but do not bear the load of the new multi-domain results, which are externally evaluated on held-out benchmarks. This is a standard non-circular empirical ML paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the interleaved data format serves as a general template enabling generalization across multi-image, video, 3D, and single-image scenarios
Forward citations
Cited by 44 Pith papers
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG
FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
-
Don't Pause! Every prediction matters in a streaming video
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
-
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
-
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...
-
DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
DUALVISION is a new lightweight fusion module using localized cross-attention to integrate infrared with RGB data in MLLMs, improving robustness to degradations and supported by the new DV-204K training dataset and DV...
-
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
-
See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
LVSpec introduces the first training-free loosely speculative decoding framework for Video-LLMs that identifies sparse visual-relevant tokens for strict verification while tolerating position shifts for semantic fille...
-
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
-
Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
Event-Causal RAG segments videos into events represented as SES graphs, merges them into a causal knowledge graph, and uses bidirectional retrieval to supply relevant event chains to a video foundation model for impro...
-
MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG
MEG-RAG defines a new MEG metric based on Semantic Certainty Anchoring and trains a multimodal reranker to select evidence aligned with ground-truth semantic anchors, yielding higher accuracy and consistency on the M²...
-
ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding
ChangeQuery is a new multimodal framework for semantic disaster change analysis that combines optical and SAR data with a custom dataset and annotation pipeline to support interactive damage assessment.
-
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.
-
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
-
Mitigating Multimodal Hallucination via Phase-wise Self-reward
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
-
Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models
Vid-LLMs exhibit pervasive spatiotemporal sycophancy by reversing visually grounded judgments and fabricating justifications under negation-based gaslighting.
-
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
-
PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
-
SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding
SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
-
Towards Design Compositing
GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
SMART: When is it Actually Worth Expanding a Speculative Tree?
SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
-
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
-
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
-
Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.
-
SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment
SynerMedGen introduces generation-aligned understanding tasks and a two-stage training strategy that enables strong zero-shot medical image synthesis performance and outperforms specialized models when generation trai...
-
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
-
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
-
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
-
Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint, 2022
work page 2022
-
[2]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023
work page internal anchor Pith review arXiv 2023
-
[3]
Scanqa: 3d question answering for spatial scene understanding
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022
work page 2022
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Visual question answering on image sets
Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 51–67. Springer, 2020
work page 2020
-
[7]
Videollm: Modeling video sequence with large language models
Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023
-
[8]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[9]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[10]
Blink: Multimodal large language models can see but not perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024
-
[11]
Sphinx-x: Scaling data and parameters for a family of multi-modal large language models
Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024
-
[12]
Gemini: A Family of Highly Capable Multimodal Models
Google Gemini Team. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [13]
-
[14]
Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following
Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023
-
[15]
Imagebind-llm: Multi-modality instruction tuning
Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905, 2023
-
[16]
3d-llm: Injecting the 3d world into large language models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023
work page 2023
-
[17]
3d-llm: Injecting the 3d world into large language models, 2023
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models, 2023
work page 2023
-
[18]
Language is not all you need: Aligning perception with language models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models. ArXiv, abs/2302.14045, 2023
-
[19]
Mantis: Interleaved multi-image instruction tuning
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024
-
[20]
Many-shot in-context learning in multimodal foundation models
Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H Chen, and Andrew Y Ng. Many-shot in-context learning in multimodal foundation models. arXiv preprint arXiv:2405.09798, 2024
-
[21]
Remi: A dataset for reasoning with multiple images
Mehran Kazemi, Nishanth Dikkala, Ankit Anand, Petar Devic, Ishita Dasgupta, Fangyu Liu, Bahare Fatemi, Pranjal Awasthi, Dee Guo, Sreenivas Gollapudi, et al. Remi: A dataset for reasoning with multiple images. arXiv preprint arXiv:2406.09175, 2024
-
[22]
Obelics: An open web-scale filtered dataset of interleaved image-text documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[23]
What matters when building vision-language models?, 2024
Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024
-
[24]
Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024
Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024
work page 2024
-
[25]
Mimic-it: Multi-modal in-context instruction tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023
-
[26]
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022
work page 2022
-
[27]
Fine-tuning multimodal llms to follow zero-shot demonstrative instructions
Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[28]
Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024
Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024
work page 2024
-
[29]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023
work page internal anchor Pith review arXiv 2023
-
[30]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024
work page 2024
-
[31]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023
-
[32]
Video-llava: Learning united visual representation by alignment before projection, 2023
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023
work page 2023
-
[33]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024
work page 2024
-
[34]
Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models
Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023
-
[35]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023
work page 2023
-
[36]
Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
work page 2024
-
[37]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023
work page 2023
-
[38]
Vista-llama: Reliable video narrator via equal distance to visual tokens, 2023
Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reliable video narrator via equal distance to visual tokens, 2023
work page 2023
-
[39]
Video-chatgpt: Towards detailed video understanding via large vision and language models, 2024
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2024
work page 2024
-
[40]
Video-chatgpt: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024
work page 2024
-
[41]
Mm1: Methods, analysis & insights from multimodal llm pre-training
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024
- [42]
- [43]
-
[44]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[46]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021
work page internal anchor Pith review arXiv 2021
-
[47]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018
work page 2018
-
[48]
Alfred: A benchmark for interpreting grounded instructions for everyday tasks
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020
work page 2020
-
[49]
Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning
Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning. arXiv preprint arXiv:2310.00647, 2023
-
[50]
A corpus for reasoning about natural language grounded in photographs
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018
-
[51]
Generative multimodal models are in-context learners
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024
work page 2024
-
[52]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Muirbench: A comprehensive benchmark for robust multi-image understanding
Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024
-
[55]
Star: A benchmark for situated reasoning in real-world videos
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. arXiv preprint arXiv:2405.09711, 2024
-
[56]
Q-bench: A benchmark for general-purpose foundation models on low-level vision
Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023
-
[57]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021
work page 2021
-
[58]
Pointllm: Empowering large language models to understand point clouds
Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911, 2023
-
[59]
Activitynet-qa: A dataset for understanding complex web videos via question answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127–9134, 2019
work page 2019
-
[60]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
work page 2024
-
[61]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for e...
-
[62]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023
work page 2023
-
[63]
Direct preference optimization of video large multimodal models from language model reward
Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. arXiv preprint arXiv:2404.01258, 2024
-
[64]
LLaMA-adapter: Efficient fine-tuning of large language models with zero-initialized attention
Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[65]
Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024
- [66]
-
[67]
Llava-next: A strong zero-shot video understanding model, April 2024
Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024
work page 2024
-
[68]
Multimodal c4: An open, billion-scale corpus of images interleaved with text
Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[69]