mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

arXiv: 2408.04840 · v2 · pith:EUK7U6NQ · submitted 2024-08-09 · cs.CV · cs.AI · cs.CL · cs.LG

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Pith reviewed 2026-05-20 06:14 UTC · model grok-4.3

keywords: multi-modal large language models · long image sequence understanding · hyper attention blocks · video benchmarks · multi-image tasks · distraction resistance

The pith

mPLUG-Owl3 uses hyper attention blocks to process long sequences of images and videos in multi-modal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents mPLUG-Owl3 as a multi-modal large language model focused on understanding long image sequences in tasks like video analysis and interleaved image-text data. It introduces hyper attention blocks that combine visual and language information into a shared semantic space guided by language. This design helps manage extended inputs without major increases in computation or loss of detail. The work also introduces a Distractor Resistance test to evaluate focus on key elements amid irrelevant images. Results show strong performance across single-image, multi-image, and video benchmarks, particularly for very long sequences.

Core claim

mPLUG-Owl3 shows that hyper attention blocks allow efficient integration of vision and language into a common language-guided semantic space, supporting state-of-the-art results on single-image, multi-image, and video tasks while excelling on ultra-long visual sequences.

What carries the argument

Hyper attention blocks that integrate vision and language into a common semantic space.

If this is right

Models can handle retrieved image-text knowledge and lengthy videos more effectively.
Performance remains high even as the number of images in a sequence increases significantly.
New evaluations like Distractor Resistance highlight the importance of maintaining focus in long contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might apply to other long-context multimodal tasks beyond images and text.
Future work could test these blocks on even longer sequences or different data types to confirm scalability.

Load-bearing premise

The hyper attention blocks integrate vision and language efficiently without losing information or requiring too much computation for long sequences.

What would settle it

Observe whether performance on long sequence benchmarks drops or computation costs rise sharply when sequence length exceeds the tested limits.
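
One way to run that check is a length sweep: hide a single relevant image among a growing number of irrelevant ones and record accuracy and wall-clock time at each length. The sketch below is illustrative only. `build_sequence`, `sweep`, and the `answer_fn` callback are hypothetical names, not the paper's Distractor Resistance protocol, and the stand-in model exists only so the harness runs end to end.

```python
import random
import time
from typing import Callable, Dict, List, Sequence


def build_sequence(key_image: str, distractors: Sequence[str],
                   n_distractors: int, rng: random.Random) -> List[str]:
    """Place one relevant image at a random position among sampled distractors."""
    seq = rng.sample(list(distractors), n_distractors)
    seq.insert(rng.randrange(n_distractors + 1), key_image)
    return seq


def sweep(answer_fn: Callable[[List[str], str], str],
          samples: Sequence[Dict[str, str]],
          distractors: Sequence[str],
          lengths: Sequence[int],
          seed: int = 0) -> List[Dict[str, float]]:
    """Return accuracy and mean seconds per query for each distractor count."""
    rng = random.Random(seed)
    rows = []
    for n in lengths:
        correct, elapsed = 0, 0.0
        for s in samples:
            seq = build_sequence(s["image"], distractors, n, rng)
            start = time.perf_counter()
            pred = answer_fn(seq, s["question"])
            elapsed += time.perf_counter() - start
            correct += int(pred == s["answer"])
        rows.append({"n_distractors": n,
                     "accuracy": correct / len(samples),
                     "sec_per_query": elapsed / len(samples)})
    return rows


if __name__ == "__main__":
    # Stand-in "model" (hypothetical): it only finds the key image in the first 4 slots.
    def toy_answer(seq, question):
        return "yes" if "key.png" in seq[:4] else "unknown"

    data = [{"image": "key.png", "question": "Is the key image present?", "answer": "yes"}]
    pool = [f"noise_{i}.png" for i in range(64)]
    for row in sweep(toy_answer, data, pool, lengths=[0, 8, 32]):
        print(row)
```

Replacing `toy_answer` with a call into a real model would show whether accuracy falls or latency climbs sharply past the sequence lengths tested in the paper.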

Original abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

mPLUG-Owl3 adds hyper attention blocks for long visual sequences and reports SOTA results among similar-sized models, but the efficiency story lacks scaling data.

The main thing to know is that mPLUG-Owl3 introduces hyper attention blocks to let multi-modal models handle longer runs of images and video, and it reports top results among comparable models on single-image, multi-image, and video benchmarks plus a new Distractor Resistance test for staying focused amid noise. The model also holds up on ultra-long inputs in the experiments shown. This targets practical cases like interleaved image-text, retrieved knowledge, and lengthy videos rather than single shots. The blocks are meant to pull vision and language into one guided semantic space without the usual heavy costs for extended sequences.

The new evaluation method is a straightforward addition that measures a real failure mode in current systems. Overall, the results look plausible for the size class they target. The architecture stays close to standard transformer patterns, which keeps it easy to follow and build on.

What is weaker is the supporting analysis for the efficiency claim. The description does not include explicit scaling measurements for attention cost, memory use, or latency as sequence length increases, nor ablations that isolate what the hyper attention blocks add over baseline fusion methods. Without those, the gains rest mainly on the benchmark numbers, and it is hard to judge whether the blocks truly avoid quadratic blowup or information loss on very long inputs. The abstract also omits error bars and full training details, so the SOTA claims need the full experimental section to land solidly.

This paper is aimed at people working on multi-modal models for video and multi-image tasks. Readers who want concrete architecture tweaks and a new way to test long-context robustness will find it useful. It shows honest engagement with the practical limits of current MLLMs and has enough new pieces to deserve a serious referee, even if the methods section needs more rigor on the scaling side. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper introduces mPLUG-Owl3, a multi-modal large language model that incorporates novel hyper attention blocks to integrate vision and language into a shared semantic space. This architecture is intended to support long image-sequence understanding in settings with retrieved image-text knowledge, interleaved image-text, and lengthy videos. The work claims state-of-the-art results among similarly sized models on single-image, multi-image, and video benchmarks, introduces a Distractor Resistance evaluation to test focus amid distractions, and reports outstanding performance on ultra-long visual sequence inputs.

Significance. If the efficiency and information-preservation properties of the hyper attention blocks are substantiated, the model would represent a meaningful step toward practical long-context multimodal reasoning, with the Distractor Resistance benchmark providing a useful new diagnostic for evaluating distraction robustness. The SOTA claims on standard benchmarks, if accompanied by rigorous controls, would strengthen the case for the architecture's advantages over prior MLLMs of comparable scale.

major comments (2)

[Architecture / Methods] Architecture section describing hyper attention blocks: the manuscript introduces these blocks to enable efficient processing of extended sequences without information loss or prohibitive compute, yet provides no complexity analysis (attention cost as a function of sequence length), no ablation isolating their contribution to long-context retention, and no memory or latency scaling measurements beyond standard benchmarks. This directly bears on the central claim of outstanding performance on ultra-long visual sequences.
[Experiments] Experimental results section: SOTA performance and Distractor Resistance results are reported without error bars, full ablation studies on the hyper attention components, or details on hyperparameter sensitivity, leaving open whether post-hoc choices affect the performance claims.
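
The scaling question in the first major comment can be made concrete by counting query-key score entries. The sketch below compares two generic fusion strategies: placing every visual token in the language model's self-attention sequence, versus letting only text tokens query the visual tokens through cross-attention. Whether either matches the hyper attention blocks is an assumption; this page does not describe their internals, and the function names and token counts are hypothetical.

```python
def concat_self_attention_pairs(text_tokens: int, images: int, tokens_per_image: int) -> int:
    """Score entries when text and visual tokens share one self-attention sequence."""
    total = text_tokens + images * tokens_per_image
    return total * total


def cross_attention_pairs(text_tokens: int, images: int, tokens_per_image: int) -> int:
    """Score entries when text self-attends and text queries attend to visual keys."""
    visual = images * tokens_per_image
    return text_tokens * text_tokens + text_tokens * visual


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the paper. Counts are per head and per
    # layer, and ignore causal masking.
    text, per_image = 256, 729
    for n_images in (1, 8, 64, 512):
        a = concat_self_attention_pairs(text, n_images, per_image)
        b = cross_attention_pairs(text, n_images, per_image)
        print(f"{n_images:>4} images: concat={a:.3e}  cross={b:.3e}  ratio={a / b:.1f}")
```

Under these assumptions the first count grows quadratically in the number of images and the second linearly. A complexity analysis in the paper would need to confirm or refute a curve of this kind with measured memory and latency.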

minor comments (2)

[Abstract] The abstract states that results 'suggest' SOTA performance; a more precise statement of the exact metrics and number of benchmarks would improve clarity.
[Architecture] Notation for the hyper attention block inputs/outputs could be defined more explicitly when first introduced to aid readers in following the integration mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of the hyper attention blocks and the experimental results. We address each major comment below and will incorporate revisions to provide the requested analyses and details.

Referee: [Architecture / Methods] Architecture section describing hyper attention blocks: the manuscript introduces these blocks to enable efficient processing of extended sequences without information loss or prohibitive compute, yet provides no complexity analysis (attention cost as a function of sequence length), no ablation isolating their contribution to long-context retention, and no memory or latency scaling measurements beyond standard benchmarks. This directly bears on the central claim of outstanding performance on ultra-long visual sequences.

Authors: We agree that a formal complexity analysis and targeted ablations would better substantiate the efficiency and information-preservation properties of the hyper attention blocks. The design integrates vision and language in a shared semantic space to support extended sequences, but the initial submission focused on empirical results rather than explicit scaling derivations. In the revised manuscript, we will add a dedicated analysis of attention cost as a function of sequence length, along with memory and latency measurements on ultra-long inputs. We will also include new ablations that isolate the hyper attention blocks' contribution to long-context retention.

Revision: yes

Referee: [Experiments] Experimental results section: SOTA performance and Distractor Resistance results are reported without error bars, full ablation studies on the hyper attention components, or details on hyperparameter sensitivity, leaving open whether post-hoc choices affect the performance claims.

Authors: We acknowledge that including error bars, expanded ablations, and hyperparameter details would increase confidence in the reported results. The current experiments demonstrate SOTA performance among comparable models and strong results on the Distractor Resistance benchmark, but additional statistical reporting was not included. In the revision, we will add error bars from multiple runs for key benchmarks, provide fuller ablations on the hyper attention components, and include details on hyperparameter choices along with sensitivity analysis. These updates will clarify the robustness of the findings.

Revision: yes
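
As an illustration of the statistical reporting requested here, a benchmark score can be summarized as a mean with a standard error over repeated runs. This is a minimal sketch using Python's standard library; `mean_and_stderr` is a hypothetical helper, and the run scores are placeholders rather than results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the sample mean and the standard error of the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


if __name__ == "__main__":
    # Placeholder accuracies from five hypothetical seeds; not results from the paper.
    runs = [70.0, 71.0, 69.0, 70.5, 69.5]
    mean, stderr = mean_and_stderr(runs)
    print(f"{mean:.2f} ± {stderr:.2f} (n={len(runs)})")
```

With only a handful of seeds the standard error is itself noisy, so stating the number of runs next to each interval matters as much as the interval.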

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmarks

The paper introduces an architecture with hyper attention blocks and reports empirical results on standard single-image, multi-image, video, and custom long-sequence benchmarks. No equations, derivations, or first-principles results are presented that reduce any claimed capability to fitted parameters or self-referential definitions by construction. Claims of SOTA performance and ultra-long sequence handling are supported by measured outcomes on held-out evaluation sets rather than by renaming or fitting inputs. Prior mPLUG-Owl citations exist but are not load-bearing for the new architectural or performance assertions, which remain independently verifiable through the reported experiments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the proposed hyper attention blocks and standard training assumptions for MLLMs; no new physical entities or unproven mathematical axioms are introduced beyond typical deep learning design choices.

free parameters (1)

hyper attention block hyperparameters
Block dimensions, number of layers, and fusion ratios chosen to enable long-sequence processing.

axioms (1)

domain assumption: Standard transformer attention can be extended to multi-image inputs via language-guided semantic space integration
Invoked when describing how hyper attention blocks facilitate extended multi-image scenarios.

invented entities (1)

hyper attention blocks (no independent evidence)
purpose: Efficiently integrate vision and language for long sequences
New architectural component introduced to address limitations in prior MLLMs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing · dimension_forced
Relation between the paper passage and the cited Recognition theorem: unclear.

mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks while demonstrating outstanding performance on ultra-long visual sequence inputs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 7.0
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
cs.CV · 2026-05 · unverdicted · novelty 7.0
FineBench is a new dense VQA benchmark for fine-grained human activity understanding in long videos, revealing weaknesses in open VLMs and showing that FineAgent improves them via localization and description modules.

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
cs.CV · 2026-05 · unverdicted · novelty 7.0
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
cs.CV · 2026-04 · unverdicted · novelty 7.0
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV · 2026-03 · unverdicted · novelty 7.0
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
cs.CV · 2025-09 · unverdicted · novelty 7.0
Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.

SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
cs.CV · 2025-06 · conditional · novelty 7.0
SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
cs.CV · 2025-05 · unverdicted · novelty 7.0
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV · 2025-02 · unverdicted · novelty 7.0
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
cs.CV · 2024-12 · accept · novelty 7.0
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

LVBench: An Extreme Long Video Understanding Benchmark
cs.CV · 2024-06 · accept · novelty 7.0
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
cs.CV · 2026-05 · unverdicted · novelty 6.0
Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
cs.CV · 2026-05 · unverdicted · novelty 6.0
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
cs.CV · 2026-04 · unverdicted · novelty 6.0
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
cs.CV · 2026-04 · unverdicted · novelty 6.0
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
cs.CV · 2026-04 · unverdicted · novelty 6.0
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
cs.CV · 2025-10 · conditional · novelty 6.0
StableSketcher improves text-to-sketch generation by fine-tuning a diffusion VAE and adding a VQA-based RL reward, while releasing the SketchDUO dataset of sketches with captions and QA pairs.

HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding
cs.LG · 2025-06 · unverdicted · novelty 6.0
HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
cs.CV · 2026-04 · unverdicted · novelty 5.0
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.

Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
cs.CV · 2026-04 · unverdicted · novelty 5.0
MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
cs.CV · 2025-01 · unverdicted · novelty 5.0
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...

Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
cs.CV · 2025-07 · unverdicted · novelty 4.0
A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on aver...

Reference graph

Works this paper leans on

238 extracted references · 238 canonical work pages · cited by 21 Pith papers · 48 internal anchors

[79]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[80]

Proceedings of the 25th ACM international conference on Multimedia , pages=

Video question answering via gradually refined attention over appearance and motion , author=. Proceedings of the 25th ACM international conference on Multimedia , pages=

work page
[81]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Tgif-qa: Toward spatio-temporal reasoning in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[82]

2023 , eprint=

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents , author=. 2023 , eprint=

work page 2023
[83]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Vatex: A large-scale, high-quality multilingual dataset for video-and-language research , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[85]

The 2023 Conference on Empirical Methods in Natural Language Processing , year=

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model , author=. The 2023 Conference on Empirical Methods in Natural Language Processing , year=

work page 2023
[86]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[87]

Fixing weight decay regularization in adam , author=

work page
[88]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[89]

Advances in Neural Information Processing Systems , volume=

Laion-5b: An open large-scale dataset for training next generation image-text models , author=. Advances in Neural Information Processing Systems , volume=

work page
[91]

2022 , howpublished =

COYO-700M: Image-Text Pair Dataset , author =. 2022 , howpublished =

work page 2022
[92]

Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages=

Microsoft coco: Common objects in context , author=. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages=. 2014 , organization=

work page 2014
[93]

Making the

Yash Goyal and Tejas Khot and Douglas Summers. Making the. Conference on Computer Vision and Pattern Recognition (CVPR) , year =

work page
[94]

2019 international conference on document analysis and recognition (ICDAR) , pages=

Ocr-vqa: Visual question answering by reading text in images , author=. 2019 international conference on document analysis and recognition (ICDAR) , pages=. 2019 , organization=

work page 2019
[95]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[96]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[97]

Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part II 16 , pages=

Textcaps: a dataset for image captioning with reading comprehension , author=. Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part II 16 , pages=. 2020 , organization=

work page 2020
[98]

Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=

Ok-vqa: A visual question answering benchmark requiring external knowledge , author=. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=

work page
[99]

European Conference on Computer Vision , pages=

A-okvqa: A benchmark for visual question answering using world knowledge , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022
