
arxiv: 2408.04840 · v2 · pith:EUK7U6NQ · submitted 2024-08-09 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Pith reviewed 2026-05-20 06:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords multi-modal large language models · long image sequence understanding · hyper attention blocks · video benchmarks · multi-image tasks · distraction resistance

The pith

mPLUG-Owl3 uses hyper attention blocks to process long sequences of images and videos in multi-modal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents mPLUG-Owl3 as a multi-modal large language model focused on understanding long image sequences in tasks like video analysis and interleaved image-text data. It introduces hyper attention blocks that combine visual and language information into a shared semantic space guided by language. This design helps manage extended inputs without major increases in computation or loss of detail. The work also introduces a Distractor Resistance test to evaluate focus on key elements amid irrelevant images. Results show strong performance across single-image, multi-image, and video benchmarks, particularly for very long sequences.

Core claim

mPLUG-Owl3 shows that hyper attention blocks allow efficient integration of vision and language into a common language-guided semantic space, supporting state-of-the-art results among similarly sized models on single-image, multi-image, and video tasks while excelling on ultra-long visual sequences.

What carries the argument

Hyper attention blocks that integrate vision and language into a common semantic space.
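This page does not reproduce the block's equations, so the following is only a generic illustration of what "language-guided" fusion could look like: text states act as queries over visual features, and the attended result is mixed back into the text states. The function names, the scalar `gate`, and the toy dimensions are all assumptions made for illustration, not the paper's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def gated_fusion(text, visual, gate):
    """Mix text states with text-queried visual features.

    `gate` is a hypothetical scalar in [0, 1]; 0 keeps the text states
    unchanged, 1 replaces them with the attended visual features.
    """
    attended = cross_attend(text, visual, visual)
    return [[(1 - gate) * t + gate * a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text, attended)]
```

In this sketch the output always has one row per text token, however many visual tokens are supplied, which is the property that makes a query-side formulation attractive for long image sequences.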

If this is right

  • Models can handle retrieved image-text knowledge and lengthy videos more effectively.
  • Performance remains high even as the number of images in a sequence increases significantly.
  • New evaluations like Distractor Resistance highlight the importance of maintaining focus in long contexts.
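The Distractor Resistance idea in the last bullet can be sketched as a small evaluation loop: hide one relevant image among a growing number of irrelevant ones and track accuracy per distractor count. The exact protocol is not described on this page, so the data layout, the function names, and the `answer_fn` callback standing in for a model are hypothetical.

```python
import random


def build_sequence(key_image, distractors, n_distractors, rng):
    """Place the key image at a random position among sampled distractors."""
    chosen = rng.sample(distractors, n_distractors)
    position = rng.randrange(n_distractors + 1)
    return chosen[:position] + [key_image] + chosen[position:], position


def accuracy_by_distractor_count(examples, distractors, counts, answer_fn, seed=0):
    """Return {distractor_count: accuracy} for a model wrapped by answer_fn.

    examples: list of (key_image, question, gold_answer) tuples (hypothetical layout).
    answer_fn: callable(images, question) -> answer; stands in for the model.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    results = {}
    for n in counts:
        correct = 0
        for key_image, question, gold in examples:
            images, _ = build_sequence(key_image, distractors, n, rng)
            if answer_fn(images, question) == gold:
                correct += 1
        results[n] = correct / len(examples)
    return results
```

A model that stays focused on the key image would show a flat accuracy curve across counts, while one that is distracted would show accuracy falling as the count grows.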

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might apply to other long-context multimodal tasks beyond images and text.
  • Future work could test these blocks on even longer sequences or different data types to confirm scalability.

Load-bearing premise

The hyper attention blocks integrate vision and language efficiently without losing information or requiring too much computation for long sequences.

What would settle it

Observe whether performance on long sequence benchmarks drops or computation costs rise sharply when sequence length exceeds the tested limits.
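As a rough way to reason about the compute side of this test, one can count attention score entries under two simplified layouts. The sketch assumes, purely for illustration, that the alternative to concatenating every visual token into the language sequence is to let text queries cross-attend to the visual tokens. It counts only score-matrix entries and ignores projections, memory traffic, and the vision encoder, and the token counts are hypothetical rather than taken from the paper.

```python
def self_attention_pairs(n_text, n_visual):
    """Score entries when text and visual tokens share one concatenated sequence."""
    n = n_text + n_visual
    return n * n


def cross_attention_pairs(n_text, n_visual):
    """Score entries when text self-attends and text queries attend to visual tokens."""
    return n_text * n_text + n_text * n_visual


def scaling_table(n_text, tokens_per_image, image_counts):
    """Return (image_count, concatenated_cost, cross_attention_cost) rows."""
    rows = []
    for k in image_counts:
        n_visual = k * tokens_per_image
        rows.append((k,
                     self_attention_pairs(n_text, n_visual),
                     cross_attention_pairs(n_text, n_visual)))
    return rows
```

With 100 text tokens and 10 images of 100 tokens each, the concatenated layout has 1,210,000 score entries against 110,000 for the cross-attention layout. The first grows quadratically in the number of images and the second linearly, which is the kind of curve a measured latency or memory study would need to confirm.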

Original abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces mPLUG-Owl3, a multi-modal large language model that incorporates novel hyper attention blocks to integrate vision and language into a shared semantic space. This architecture is intended to support long image-sequence understanding in settings with retrieved image-text knowledge, interleaved image-text, and lengthy videos. The work claims state-of-the-art results among similarly sized models on single-image, multi-image, and video benchmarks, introduces a Distractor Resistance evaluation to test focus amid distractions, and reports outstanding performance on ultra-long visual sequence inputs.

Significance. If the efficiency and information-preservation properties of the hyper attention blocks are substantiated, the model would represent a meaningful step toward practical long-context multimodal reasoning, with the Distractor Resistance benchmark providing a useful new diagnostic for evaluating distraction robustness. The SOTA claims on standard benchmarks, if accompanied by rigorous controls, would strengthen the case for the architecture's advantages over prior MLLMs of comparable scale.

major comments (2)
  1. [Architecture / Methods] Architecture section describing hyper attention blocks: the manuscript introduces these blocks to enable efficient processing of extended sequences without information loss or prohibitive compute, yet provides no complexity analysis (attention cost as a function of sequence length), no ablation isolating their contribution to long-context retention, and no memory or latency scaling measurements beyond standard benchmarks. This directly bears on the central claim of outstanding performance on ultra-long visual sequences.
  2. [Experiments] Experimental results section: SOTA performance and Distractor Resistance results are reported without error bars, full ablation studies on the hyper attention components, or details on hyperparameter sensitivity, leaving open whether post-hoc choices affect the performance claims.
minor comments (2)
  1. [Abstract] The abstract states that results 'suggest' SOTA performance; a more precise statement of the exact metrics and number of benchmarks would improve clarity.
  2. [Architecture] Notation for the hyper attention block inputs/outputs could be defined more explicitly when first introduced to aid readers in following the integration mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of the hyper attention blocks and the experimental results. We address each major comment below and will incorporate revisions to provide the requested analyses and details.

Point-by-point responses
  1. Referee: [Architecture / Methods] Architecture section describing hyper attention blocks: the manuscript introduces these blocks to enable efficient processing of extended sequences without information loss or prohibitive compute, yet provides no complexity analysis (attention cost as a function of sequence length), no ablation isolating their contribution to long-context retention, and no memory or latency scaling measurements beyond standard benchmarks. This directly bears on the central claim of outstanding performance on ultra-long visual sequences.

    Authors: We agree that a formal complexity analysis and targeted ablations would better substantiate the efficiency and information-preservation properties of the hyper attention blocks. The design integrates vision and language in a shared semantic space to support extended sequences, but the initial submission focused on empirical results rather than explicit scaling derivations. In the revised manuscript, we will add a dedicated analysis of attention cost as a function of sequence length, along with memory and latency measurements on ultra-long inputs. We will also include new ablations that isolate the hyper attention blocks' contribution to long-context retention.

    revision: yes

  2. Referee: [Experiments] Experimental results section: SOTA performance and Distractor Resistance results are reported without error bars, full ablation studies on the hyper attention components, or details on hyperparameter sensitivity, leaving open whether post-hoc choices affect the performance claims.

    Authors: We acknowledge that including error bars, expanded ablations, and hyperparameter details would increase confidence in the reported results. The current experiments demonstrate SOTA performance among comparable models and strong results on the Distractor Resistance benchmark, but additional statistical reporting was not included. In the revision, we will add error bars from multiple runs for key benchmarks, provide fuller ablations on the hyper attention components, and include details on hyperparameter choices along with sensitivity analysis. These updates will clarify the robustness of the findings.

    revision: yes
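The "error bars from multiple runs" promised in the second response are commonly reported as a mean with its standard error. A minimal sketch follows, assuming each run yields one benchmark score; the function names and the example scores are hypothetical and are not results from the paper.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for per-run benchmark scores."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        # A single run gives no estimate of run-to-run variation.
        return mean, float("nan")
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def report(scores):
    """Format scores as 'mean +/- stderr' with one decimal place."""
    mean, stderr = mean_and_stderr(scores)
    return f"{mean:.1f} +/- {stderr:.1f}"
```

For three hypothetical runs scoring 70.0, 72.0, and 74.0, `report` returns "72.0 +/- 1.2". Two models whose intervals overlap heavily cannot be ranked with confidence from the means alone.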

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmarks

Full rationale

The paper introduces an architecture with hyper attention blocks and reports empirical results on standard single-image, multi-image, video, and custom long-sequence benchmarks. No equations, derivations, or first-principles results are presented that reduce any claimed capability to fitted parameters or self-referential definitions by construction. Claims of SOTA performance and ultra-long sequence handling are supported by measured outcomes on held-out evaluation sets rather than by renaming or fitting inputs. Prior mPLUG-Owl citations exist but are not load-bearing for the new architectural or performance assertions, which remain independently verifiable through the reported experiments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the proposed hyper attention blocks and standard training assumptions for MLLMs; no new physical entities or unproven mathematical axioms are introduced beyond typical deep learning design choices.

free parameters (1)
  • hyper attention block hyperparameters
    Block dimensions, number of layers, and fusion ratios chosen to enable long-sequence processing.
axioms (1)
  • domain assumption: Standard transformer attention can be extended to multi-image inputs via language-guided semantic space integration
    Invoked when describing how hyper attention blocks facilitate extended multi-image scenarios.
invented entities (1)
  • hyper attention blocks (no independent evidence)
    purpose: Efficiently integrate vision and language for long sequences
    New architectural component introduced to address limitations in prior MLLMs.

pith-pipeline@v0.9.0 · 5764 in / 1224 out tokens · 33236 ms · 2026-05-20T06:14:39.033742+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks while demonstrating outstanding performance on ultra-long visual sequence inputs.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

  2. FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    FineBench is a new dense VQA benchmark for fine-grained human activity understanding in long videos, revealing weaknesses in open VLMs and showing that FineAgent improves them via localization and description modules.

  3. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  4. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  5. Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

    cs.CV 2026-03 unverdicted novelty 7.0

    SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

  6. TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

    cs.CV 2025-09 unverdicted novelty 7.0

    Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.

  7. SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

    cs.CV 2025-06 conditional novelty 7.0

    SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.

  8. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  9. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  10. OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    cs.CV 2024-12 accept novelty 7.0

    OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

  11. LVBench: An Extreme Long Video Understanding Benchmark

    cs.CV 2024-06 accept novelty 7.0

    LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

  12. Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.

  13. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

  14. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  15. VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

    cs.CV 2026-04 unverdicted novelty 6.0

    VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

  16. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  17. StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback

    cs.CV 2025-10 conditional novelty 6.0

    StableSketcher improves text-to-sketch generation by fine-tuning a diffusion VAE and adding a VQA-based RL reward, while releasing the SketchDUO dataset of sketches with captions and QA pairs.

  18. HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding

    cs.LG 2025-06 unverdicted novelty 6.0

    HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.

  19. EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

    cs.CV 2026-04 unverdicted novelty 5.0

    EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.

  20. Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

    cs.CV 2026-04 unverdicted novelty 5.0

    MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.

  21. InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    cs.CV 2025-01 unverdicted novelty 5.0

    InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...

  22. Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 4.0

    A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on aver...

Reference graph

Works this paper leans on

238 extracted references · 238 canonical work pages · cited by 21 Pith papers · 48 internal anchors

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [4]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Sigmoid loss for language image pre-training , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  5. [5]

    CoRR , volume =

    Jiabo Ye and Anwen Hu and Haiyang Xu and Qinghao Ye and Ming Yan and Yuhao Dan and Chenlin Zhao and Guohai Xu and Chenliang Li and Junfeng Tian and Qian Qi and Ji Zhang and Fei Huang , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2307.02499 , eprinttype =. 2307.02499 , timestamp =

  6. [7]

    ArXiv , year=

    Language Models are Few-Shot Learners , author=. ArXiv , year=

  7. [8]

    ArXiv , year=

    GPT-4 Technical Report , author=. ArXiv , year=

  8. [9]

    2023 , url=

    GPT-4V(ision) System Card , author=. 2023 , url=

  9. [10]

    ArXiv , year=

    LLaMA: Open and Efficient Foundation Language Models , author=. ArXiv , year=

  10. [11]

    ArXiv , year=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. ArXiv , year=

  11. [12]

    ArXiv , year=

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , author=. ArXiv , year=

  12. [13]

    ArXiv , year=

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. ArXiv , year=

  13. [15]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  14. [16]

    ArXiv , year=

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model , author=. ArXiv , year=

  15. [17]

    ArXiv , year=

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models , author=. ArXiv , year=

  16. [18]

    ArXiv , year=

    Visual Instruction Tuning , author=. ArXiv , year=

  17. [19]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  18. [20]

    ArXiv , year=

    Aligning Large Multimodal Models with Factually Augmented RLHF , author=. ArXiv , year=

  19. [24]

    ArXiv , year=

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic , author=. ArXiv , year=

  20. [25]

    International Conference on Machine Learning , year=

    PaLM-E: An Embodied Multimodal Language Model , author=. International Conference on Machine Learning , year=

  21. [26]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  22. [27]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites , author=. arXiv preprint arXiv:2404.16821 , year=

  23. [28]

    International Conference on Machine Learning , year=

    mPLUG-2: A modularized multi-modal foundation model across text, image and video , author=. International Conference on Machine Learning , year=

  24. [29]

    Advances in Neural Information Processing Systems , volume=

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning , author=. Advances in Neural Information Processing Systems , volume=

  25. [30]

    PaLM: Scaling Language Modeling with Pathways , author=. J. Mach. Learn. Res. , year=

  26. [31]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Git: A generative image-to-text transformer for vision and language , author=. arXiv preprint arXiv:2205.14100 , year=

  27. [32]

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Pali: A jointly-scaled multilingual language-image model , author=. arXiv preprint arXiv:2209.06794 , year=

  28. [33]

    ArXiv , year=

    Otter: A Multi-Modal Model with In-Context Instruction Tuning , author=. ArXiv , year=

  29. [34]

    ArXiv , year=

    Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks , author=. ArXiv , year=

  30. [35]

    Advances in Neural Information Processing Systems , volume=

    Flamingo: a visual language model for few-shot learning , author=. Advances in Neural Information Processing Systems , volume=

  31. [36]

    Advances in neural information processing systems , volume=

    Im2text: Describing images using 1 million captioned photographs , author=. Advances in neural information processing systems , volume=

  32. [37]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Tap: Text-aware pre-training for text-vqa and text-caption , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  33. [39]

    Advances in Neural Information Processing Systems , volume=

    Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering , author=. Advances in Neural Information Processing Systems , volume=

  34. [40]

    arXiv preprint arXiv:2211.12561 , year=

    Retrieval-augmented multimodal language modeling , author=. arXiv preprint arXiv:2211.12561 , year=

  35. [42]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Document understanding dataset and evaluation (dude) , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  36. [43]

    NeurIPS , year =

    Zero-Shot Video Question Answering via Frozen Bidirectional Language Models , author =. NeurIPS , year =

  37. [45]

    ArXiv , year=

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. ArXiv , year=

  38. [46]

    Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

    Collecting highly parallel data for paraphrase evaluation , author=. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

  39. [48]

    ArXiv , year=

    Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities , author=. ArXiv , year=

  40. [49]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Hitea: Hierarchical temporal-aware video-language pre-training , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  41. [51]

    ArXiv , year=

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding , author=. ArXiv , year=

  42. [52]

    ArXiv , year=

    Language Is Not All You Need: Aligning Perception with Language Models , author=. ArXiv , year=

  43. [53]

    ArXiv , year=

    Kosmos-2: Grounding Multimodal Large Language Models to the World , author=. ArXiv , year=

  44. [54]

    European conference on computer vision , pages=

    End-to-end object detection with transformers , author=. European conference on computer vision , pages=. 2020 , organization=

  45. [55]

    GLU Variants Improve Transformer

    Glu variants improve transformer , author=. arXiv preprint arXiv:2002.05202 , year=

  46. [56]

    ArXiv , year=

    WizardLM: Empowering Large Language Models to Follow Complex Instructions , author=. ArXiv , year=

  47. [57]

    2023 , eprint=

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

  48. [58]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Msr-vtt: A large video description dataset for bridging video and language , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  49. [61]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Aligning Large Multi-Modal Model with Robust Instruction Tuning , author=. arXiv preprint arXiv:2306.14565 , year=

  50. [62]

    arXiv preprint arXiv:2307.04087 , year=

    Svit: Scaling up visual instruction tuning , author=. arXiv preprint arXiv:2307.04087 , year=

  51. [63]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  52. [64]

    Hashimoto , title =

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

  53. [65]

    2023 , publisher =

    SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification , author =. 2023 , publisher =

  54. [66]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models , author=. arXiv preprint arXiv:2306.13394 , year=

  55. [67]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Seed-bench: Benchmarking multimodal llms with generative comprehension , author=. arXiv preprint arXiv:2307.16125 , year=

  56. [68]

    Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part IV 14 , pages=

    A diagram is worth a dozen images , author=. Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part IV 14 , pages=. 2016 , organization=

  57. [71]

    Q-bench: A benchmark for general-purpose foundation models on low-level vision.arXiv preprint arXiv:2309.14181,

    Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision , author=. arXiv preprint arXiv:2309.14181 , year=

  58. [75]

    ArXiv , year=

    Evaluating Object Hallucination in Large Vision-Language Models , author=. ArXiv , year=

  59. [76]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

  60. [77]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Challenging big-bench tasks and whether chain-of-thought can solve them , author=. arXiv preprint arXiv:2210.09261 , year=

  61. [78]

    AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    Agieval: A human-centric benchmark for evaluating foundation models , author=. arXiv preprint arXiv:2304.06364 , year=

  62. [79]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

  63. [80]

    Video Question Answering via Gradually Refined Attention over Appearance and Motion

    Video question answering via gradually refined attention over appearance and motion. Proceedings of the 25th ACM International Conference on Multimedia.

  64. [81]

    TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

    TGIF-QA: Toward spatio-temporal reasoning in visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  65. [82]

    OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

    OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. 2023.

  66. [83]

    VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

    VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  67. [85]

    UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

    UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model. The 2023 Conference on Empirical Methods in Natural Language Processing.

  68. [86]

    Masked Autoencoders Are Scalable Vision Learners

    Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  69. [87]

    Fixing Weight Decay Regularization in Adam

  70. [88]

    Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  71. [89]

    LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models

    LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems.

  72. [91]

    COYO-700M: Image-Text Pair Dataset

    COYO-700M: Image-Text Pair Dataset. 2022.

  73. [92]

    Microsoft COCO: Common Objects in Context

    Microsoft COCO: Common objects in context. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V. 2014.

  74. [93]

    Making the

    Yash Goyal and Tejas Khot and Douglas Summers. Making the. Conference on Computer Vision and Pattern Recognition (CVPR).

  75. [94]

    OCR-VQA: Visual Question Answering by Reading Text in Images

    OCR-VQA: Visual question answering by reading text in images. 2019 International Conference on Document Analysis and Recognition (ICDAR). 2019.

  76. [95]

    GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

    GQA: A new dataset for real-world visual reasoning and compositional question answering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  77. [96]

    Towards VQA Models That Can Read

    Towards VQA models that can read. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  78. [97]

    TextCaps: A Dataset for Image Captioning with Reading Comprehension

    TextCaps: A dataset for image captioning with reading comprehension. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II. 2020.

  79. [98]

    OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

    OK-VQA: A visual question answering benchmark requiring external knowledge. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  80. [99]

    A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge

    A-OKVQA: A benchmark for visual question answering using world knowledge. European Conference on Computer Vision. 2022.

Showing first 80 references.