pith. machine review for the scientific record.

arxiv: 2407.07895 · v2 · submitted 2024-07-10 · 💻 cs.CV · cs.CL · cs.LG

Recognition: no theorem link

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Bo Li, Chunyuan Li, Feng Li, Hao Zhang, Renrui Zhang, Wei Li, Yuanhan Zhang, Zejun Ma

Pith reviewed 2026-05-11 05:56 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.LG
keywords large multimodal models · multi-image understanding · video understanding · 3D vision · interleaved data · visual instruction tuning · cross-scenario transfer · LLaVA-NeXT-Interleave

The pith

Treating multi-image, video, and 3D inputs as one interleaved format lets a single model handle them all without losing single-image performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that large multimodal models can move beyond a single-image focus by unifying multi-image, video, and 3D scenarios under one shared data structure. It does this by building a dataset that mixes these inputs in a single interleaved format and training a model to process them uniformly. If correct, this would enable more flexible systems that reason across different kinds of visual information in the same session. The authors test the idea with a new model that tops specialized benchmarks for multi-image, video, and 3D tasks while matching prior results on ordinary single-image problems. They also report that the unified training yields an emerging ability to transfer skills from one input type to another.

Core claim

By treating the interleaved data format as a general template and training on the M4-Instruct dataset of 1,177.6k samples across four domains, LLaVA-NeXT-Interleave reaches leading results on multi-image, video, and 3D benchmarks while retaining single-image performance and gaining the capacity to transfer tasks across settings and modalities.

What carries the argument

The interleaved data format, used as a single template to represent multi-image, multi-frame video, multi-view 3D, and multi-patch single-image inputs uniformly.
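As a rough sketch of that template (illustrative only; the helper and names below are invented for this review, not taken from the paper's code), each of the four scenarios reduces to one ordered list of visual inputs plus a prompt carrying one image placeholder per input:

```python
# Hypothetical sketch of the interleaved template: every scenario becomes a
# list of image inputs plus a text prompt whose <image> placeholders mark
# where each visual input belongs. `Sample` and `interleave` are illustrative.
from dataclasses import dataclass

IMAGE_TOKEN = "<image>"

@dataclass
class Sample:
    images: list  # one image/frame/view/patch per entry, in order
    prompt: str   # text with one <image> placeholder per visual input

def interleave(images, question):
    """Render any multi-visual scenario as a single interleaved sample."""
    placeholders = " ".join(IMAGE_TOKEN for _ in images)
    return Sample(images=list(images), prompt=f"{placeholders}\n{question}")

# Multi-image, multi-frame (video), multi-view (3D), and multi-patch
# (single image) all pass through the same code path:
multi_image = interleave(["img_a", "img_b"], "What changed between the two photos?")
video       = interleave([f"frame_{t}" for t in range(8)], "Describe the action.")
views_3d    = interleave([f"view_{v}" for v in range(4)], "How many chairs are in the scene?")
patches     = interleave([f"patch_{p}" for p in range(5)], "Read the small text.")

assert multi_image.prompt.count(IMAGE_TOKEN) == 2
assert video.prompt.count(IMAGE_TOKEN) == 8
```

The point of the sketch is that the model never needs to know which scenario produced the sample; the interleaved sequence is the only interface.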

If this is right

  • The model achieves leading performance on multi-image benchmarks.
  • It maintains prior levels of accuracy on single-image tasks.
  • It records strong results on video and 3D benchmarks.
  • It gains the ability to transfer learned tasks between different input settings and modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training approach might reduce the need to maintain separate models for each visual input type.
  • Longer sequences that mix still images with video clips or 3D views could become practical to handle in one forward pass.
  • Real applications that combine reference images with video or 3D data, such as scene reconstruction from multiple views, would become simpler to implement.

Load-bearing premise

That training one model on the combined set of interleaved examples will let it perform well on every scenario without any drop in accuracy for the original single-image case.

What would settle it

A clear drop in single-image benchmark scores when the model is compared against a version trained only on single-image data would show that the unified approach creates a trade-off.
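That settling test can be stated operationally. The sketch below is hypothetical: the benchmark names, scores, and tolerance are placeholders, not figures from the paper:

```python
# Hypothetical operationalization of the settling test: does the unified
# (interleaved-trained) model drop clearly below a single-image-only variant
# on any single-image benchmark? All numbers here are placeholders.
def shows_tradeoff(unified_scores, single_only_scores, tolerance=0.5):
    """True if the unified model falls more than `tolerance` points below
    the single-image-only baseline on any single-image benchmark."""
    return any(
        single_only_scores[bench] - unified_scores[bench] > tolerance
        for bench in single_only_scores
    )

# Placeholder scores, not from the paper:
unified     = {"MMBench": 72.0, "ScienceQA": 70.5}
single_only = {"MMBench": 72.3, "ScienceQA": 70.2}

assert shows_tradeoff(unified, single_only) is False  # within tolerance
```

A clear `True` result on real numbers would establish the trade-off; consistent `False` results support the paper's no-degradation claim.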

read the original abstract

Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLaVA-NeXT-Interleave, an LMM that unifies multi-image, video (multi-frame), 3D (multi-view), and single-image (multi-patch) scenarios by treating them as instances of an interleaved data format. It compiles the M4-Instruct dataset (1,177.6k samples across 4 domains, 14 tasks, and 41 datasets) and the LLaVA-Interleave Bench. Through training and experiments, the model is claimed to achieve leading results on multi-image, video, and 3D benchmarks while preserving single-image performance and exhibiting emerging capabilities such as cross-scenario and cross-modality task transfer.

Significance. If the empirical results hold, the work is significant for advancing unified multimodal models beyond single-image focus. The M4-Instruct dataset and LLaVA-Interleave Bench provide reusable resources for studying generalization across visual input types. Demonstrating no performance trade-offs and cross-setting transfer would support broader applicability of interleaved training paradigms in LMMs.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central claim that the model 'achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks' is load-bearing but unsupported by any quantitative metrics, baselines, ablation tables, or error analysis in the abstract. The full experimental section must supply these (e.g., specific benchmark scores vs. prior LMMs, per-domain breakdowns) to allow assessment of whether the interleaved format truly avoids trade-offs.
  2. [§3 and §4] §3 (Method) and §4: The weakest assumption—that a single interleaved training regime suffices for generalization across domains without degradation—requires explicit validation. An ablation comparing the unified model against separately trained domain-specific variants (or against LLaVA-NeXT baselines) is needed to confirm the 'no trade-off' result; without it, the cross-scenario transfer claims rest on untested design choices.
minor comments (2)
  1. [Abstract] Abstract: The dataset size '1,177.6k' should be accompanied by a per-domain breakdown (e.g., how many samples per multi-image vs. video) to clarify coverage.
  2. [Introduction] Introduction: Long sentences describing prior LMM limitations could be split for readability; consider adding a figure illustrating the interleaved format template.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. These suggestions have helped us identify areas for improvement in clarity and validation. We address each major comment point by point below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that the model 'achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks' is load-bearing but unsupported by any quantitative metrics, baselines, ablation tables, or error analysis in the abstract. The full experimental section must supply these (e.g., specific benchmark scores vs. prior LMMs, per-domain breakdowns) to allow assessment of whether the interleaved format truly avoids trade-offs.

    Authors: We agree that the abstract would be strengthened by including specific quantitative results to support the central claim. In the revised version, we will update the abstract to report key benchmark scores (e.g., leading performance on multi-image and video tasks relative to prior LMMs, and preserved single-image accuracy). The experimental section (§4) already contains extensive quantitative support, including tables with direct comparisons to prior LMMs such as LLaVA-NeXT and other specialized models, per-domain and per-task breakdowns across the 14 tasks and 41 datasets in M4-Instruct, ablation studies on training configurations, and analysis confirming no degradation on single-image benchmarks. We will expand the section with additional baseline results, explicit cross-references to the tables, and further error analysis to make the evidence for no trade-offs fully transparent. revision: yes

  2. Referee: [§3 and §4] §3 (Method) and §4: The weakest assumption—that a single interleaved training regime suffices for generalization across domains without degradation—requires explicit validation. An ablation comparing the unified model against separately trained domain-specific variants (or against LLaVA-NeXT baselines) is needed to confirm the 'no trade-off' result; without it, the cross-scenario transfer claims rest on untested design choices.

    Authors: We appreciate the emphasis on rigorously validating the unified interleaved training paradigm. Our current experiments in §4 already include comparisons of the unified LLaVA-NeXT-Interleave model against the LLaVA-NeXT baseline on single-image tasks, where performance is maintained or improved, and against domain-specialized models on video and 3D benchmarks. To provide the requested direct validation, we will add a new ablation study in the revised manuscript: we will train separate domain-specific variants on the corresponding subsets of M4-Instruct (e.g., video-only and 3D-only) and compare their performance to the unified model, particularly on cross-scenario transfer tasks. This will explicitly demonstrate the benefits of the single interleaved regime without degradation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; empirical results rest on new data and training

full rationale

The paper's core contribution is empirical: it compiles a new M4-Instruct dataset (1,177.6k interleaved samples) and LLaVA-Interleave Bench, trains LLaVA-NeXT-Interleave on them, and reports benchmark numbers. No equations, predictions, or first-principles derivations are present that reduce to fitted inputs or self-citations by construction. The unifying 'interleaved format' is an explicit design choice and data-compilation strategy, not a tautology. Self-citations to prior LLaVA work describe the base model but do not bear the load of the new multi-domain results, which are externally evaluated on held-out benchmarks. This is a standard non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that interleaved format generalizes across visual scenarios; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption: The interleaved data format serves as a general template enabling generalization across multi-image, video, 3D, and single-image scenarios.
    Explicitly stated in the abstract as the mechanism that enables the listed capabilities.

pith-pipeline@v0.9.0 · 5565 in / 1366 out tokens · 39685 ms · 2026-05-11T05:56:55.199353+00:00 · methodology

discussion (0)


Forward citations

Cited by 44 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.

  2. HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

    cs.CV 2026-04 accept novelty 8.0

    HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

  3. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  4. UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.

  5. VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.

  6. Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

    cs.IR 2026-04 unverdicted novelty 7.0

    FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.

  7. Don't Pause! Every prediction matters in a streaming video

    cs.CV 2026-04 unverdicted novelty 7.0

    SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.

  8. X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

    cs.CV 2026-04 unverdicted novelty 7.0

    X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.

  9. SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...

  10. DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    DUALVISION is a new lightweight fusion module using localized cross-attention to integrate infrared with RGB data in MLLMs, improving robustness to degradations and supported by the new DV-204K training dataset and DV...

  11. ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

    cs.LG 2026-04 unverdicted novelty 7.0

    ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.

  12. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  13. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  14. See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    LVSpec introduces the first training-free loosely speculative decoding framework for Video-LLMs that identifies sparse visual-relevant tokens for strict verification while tolerating position shifts for semantic fille...

  15. Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

  16. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  17. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  18. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  19. NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.

  20. Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

    cs.AI 2026-05 unverdicted novelty 6.0

    Event-Causal RAG segments videos into events represented as SES graphs, merges them into a causal knowledge graph, and uses bidirectional retrieval to supply relevant event chains to a video foundation model for impro...

  21. MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG

    cs.CL 2026-04 unverdicted novelty 6.0

    MEG-RAG defines a new MEG metric based on Semantic Certainty Anchoring and trains a multimodal reranker to select evidence aligned with ground-truth semantic anchors, yielding higher accuracy and consistency on the M²...

  22. ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    ChangeQuery is a new multimodal framework for semantic disaster change analysis that combines optical and SAR data with a custom dataset and annotation pipeline to support interactive damage assessment.

  23. Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

  24. V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.

  25. Mitigating Multimodal Hallucination via Phase-wise Self-reward

    cs.CV 2026-04 unverdicted novelty 6.0

    PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

  26. Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Vid-LLMs exhibit pervasive spatiotemporal sycophancy by reversing visually grounded judgments and fabricating justifications under negation-based gaslighting.

  27. Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

  28. PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging

    cs.CV 2026-04 unverdicted novelty 6.0

    PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.

  29. SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding

    cs.CV 2026-04 unverdicted novelty 6.0

    SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.

  30. Towards Design Compositing

    cs.CV 2026-04 unverdicted novelty 6.0

    GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.

  31. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  32. SMART: When is it Actually Worth Expanding a Speculative Tree?

    cs.DC 2026-04 unverdicted novelty 6.0

    SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.

  33. CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

    cs.DC 2026-04 unverdicted novelty 6.0

    CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

  34. Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

    cs.CV 2026-04 conditional novelty 6.0

    Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.

  35. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  36. Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

    cs.CV 2026-05 unverdicted novelty 5.0

    ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.

  37. SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

    cs.CV 2026-05 unverdicted novelty 5.0

    SynerMedGen introduces generation-aligned understanding tasks and a two-stage training strategy that enables strong zero-shot medical image synthesis performance and outperforms specialized models when generation trai...

  38. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  39. EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

    cs.CV 2026-04 unverdicted novelty 5.0

    EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.

  40. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  41. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  42. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  43. Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

    cs.CV 2026-05 unverdicted novelty 4.0

    A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.

  44. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 43 Pith papers · 11 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint, 2022

  2. [2]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023

  3. [3]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  5. [5]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  6. [6]

    Visual question answering on image sets

    Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 51–67. Springer, 2020

  7. [7]

    Videollm: Modeling video sequence with large language models

    Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023

  8. [8]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  10. [10]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024

  11. [11]

    Sphinx-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

    Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024

  12. [12]

    Gemini: A Family of Highly Capable Multimodal Models

    Google Gemini Team. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  13. [13]

    Sciverse

    Ziyu Guo, Renrui Zhang, Hao Chen, Jialin Gao, Peng Gao, Hongsheng Li, and Pheng-Ann Heng. Sciverse. https://sciverse-cuhk.github.io, 2024

  14. [14]

    Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023

  15. [15]

    ImageBind-LLM: Multi-modality Instruction Tuning

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905, 2023

  16. [16]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023

    [17] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models, 2023.

    [18] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.

    [19] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024.

    [20] Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H Chen, and Andrew Y Ng. Many-shot in-context learning in multimodal foundation models. arXiv preprint arXiv:2405.09798, 2024.

    [21] Mehran Kazemi, Nishanth Dikkala, Ankit Anand, Petar Devic, Ishita Dasgupta, Fangyu Liu, Bahare Fatemi, Pranjal Awasthi, Dee Guo, Sreenivas Gollapudi, et al. Remi: A dataset for reasoning with multiple images. arXiv preprint arXiv:2406.09175, 2024.

    [22] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024.

    [23] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.

    [24] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024.

    [25] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023.

    [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.

    [27] Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In The Twelfth International Conference on Learning Representations, 2023.

    [28] Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024.

    [29] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.

    [30] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

    [31] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.

    [32] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023.

    [33] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024.

    [34] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.

    [35] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.

    [36] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.

    [37] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.

    [38] Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reliable video narrator via equal distance to visual tokens, 2023.

    [39] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2024.

    [40] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.

    [41] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024.

    [42] OpenAI. Gpt-4 technical report, 2023.

    [43] OpenAI. GPT-4V(ision) system card, 2023.

    [44] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

    [45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

    [46] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

    [47] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.

    [48] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020.

    [49] Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning. arXiv preprint arXiv:2310.00647, 2023.

    [50] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018.

    [51] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024.

    [52] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

    [53] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

    [54] Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024.

    [55] Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. arXiv preprint arXiv:2405.09711, 2024.

    [56] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.

    [57] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.

    [58] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911, 2023.

    [59] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127–9134, 2019.

    [60] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

    [61] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.

    [62] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.

    [63] Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. arXiv preprint arXiv:2404.01258, 2024.

    [64] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In The Twelfth International Conference on Learning Representations, 2024.

    [65] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024.

    [66] Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, et al. Mavis: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739, 2024.

    [67] Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024.

    [68] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024.

A. Data Statistics

The detailed data statistics of M4-Instruct is s…

…during inference improves performance.

B.2. Impact of video DPO training on other tasks

In Table 14, we compare the results of doing video DPO on other tasks. Though DPO significantly improves the video performance as shown in Table 2, it slightly impacts the performance of other tasks.

(Table 14 header, truncated: Training / Inference / #frames / # Image tokens / Act / Avg / VDD / VideoChatGPT / C…)