mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Pith reviewed 2026-05-20 06:14 UTC · model grok-4.3
The pith
mPLUG-Owl3 uses hyper attention blocks to process long sequences of images and videos in multi-modal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
mPLUG-Owl3 shows that hyper attention blocks allow efficient integration of vision and language into a common language-guided semantic space, supporting state-of-the-art results on single-image, multi-image, and video tasks while excelling on ultra-long visual sequences.
What carries the argument
Hyper attention blocks that integrate vision and language into a common semantic space.
If this is right
- Models can handle retrieved image-text knowledge and lengthy videos more effectively.
- Performance remains high even as the number of images in a sequence increases significantly.
- New evaluations like Distractor Resistance highlight the importance of maintaining focus in long contexts.
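One way to probe the kind of focus a Distractor Resistance style evaluation targets is to hide a single relevant image among a growing number of irrelevant ones and track accuracy as the count rises. The sketch below assumes that setup. The benchmark's actual construction is not described on this page, and `build_distractor_sequence`, `accuracy_by_distractor_count`, and the `answer_fn` callable are hypothetical names introduced for illustration only.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical sample format: (target_image_id, question, gold_answer).
Sample = Tuple[str, str, str]


def build_distractor_sequence(
    target: str,
    distractors: Sequence[str],
    n_distractors: int,
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Hide one relevant image among n_distractors irrelevant ones.

    Returns the image sequence and the index at which the target sits.
    """
    if n_distractors > len(distractors):
        raise ValueError("not enough distractor images in the pool")
    seq = rng.sample(list(distractors), n_distractors)
    pos = rng.randint(0, n_distractors)  # both ends inclusive
    seq.insert(pos, target)
    return seq, pos


def accuracy_by_distractor_count(
    answer_fn: Callable[[List[str], str], str],
    samples: Sequence[Sample],
    distractors: Sequence[str],
    counts: Sequence[int],
    seed: int = 0,
) -> Dict[int, float]:
    """Accuracy of answer_fn at each distractor count (assumed protocol)."""
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for n in counts:
        correct = 0
        for target, question, gold in samples:
            seq, _ = build_distractor_sequence(target, distractors, n, rng)
            if answer_fn(seq, question) == gold:
                correct += 1
        results[n] = correct / len(samples)
    return results
```

In practice `answer_fn` would wrap a real model call. A flat accuracy curve across counts would be consistent with the claim that performance holds as sequences grow, while a falling curve would indicate sensitivity to distractors.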
Where Pith is reading between the lines
- This approach might apply to other long-context multimodal tasks beyond images and text.
- Future work could test these blocks on even longer sequences or different data types to confirm scalability.
Load-bearing premise
The hyper attention blocks integrate vision and language efficiently without losing information or requiring too much computation for long sequences.
What would settle it
Observe whether performance on long sequence benchmarks drops or computation costs rise sharply when sequence length exceeds the tested limits.
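A simple way to check whether cost "rises sharply" is to measure it at several sequence lengths and estimate the growth exponent from a log-log fit. The sketch below is a generic measurement aid, not something taken from the paper. The function name `scaling_exponent` is hypothetical, and the cost values would have to come from timing or memory profiling of the actual model.

```python
import math
from typing import Sequence


def scaling_exponent(lengths: Sequence[float], costs: Sequence[float]) -> float:
    """Least-squares slope of log(cost) against log(length).

    A slope near 1 suggests roughly linear growth; near 2, quadratic.
    """
    if len(lengths) != len(costs) or len(lengths) < 2:
        raise ValueError("need at least two (length, cost) pairs")
    xs = [math.log(n) for n in lengths]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("lengths must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

If the exponent estimated beyond the tested sequence lengths is clearly larger than the one estimated within them, that would be evidence against the efficiency premise.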
Original abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces mPLUG-Owl3, a multi-modal large language model that incorporates novel hyper attention blocks to integrate vision and language into a shared semantic space. This architecture is intended to support long image-sequence understanding in settings with retrieved image-text knowledge, interleaved image-text, and lengthy videos. The work claims state-of-the-art results among similarly sized models on single-image, multi-image, and video benchmarks, introduces a Distractor Resistance evaluation to test focus amid distractions, and reports outstanding performance on ultra-long visual sequence inputs.
Significance. If the efficiency and information-preservation properties of the hyper attention blocks are substantiated, the model would represent a meaningful step toward practical long-context multimodal reasoning, with the Distractor Resistance benchmark providing a useful new diagnostic for evaluating distraction robustness. The SOTA claims on standard benchmarks, if accompanied by rigorous controls, would strengthen the case for the architecture's advantages over prior MLLMs of comparable scale.
major comments (2)
- [Architecture / Methods] Architecture section describing hyper attention blocks: the manuscript introduces these blocks to enable efficient processing of extended sequences without information loss or prohibitive compute, yet provides no complexity analysis (attention cost as a function of sequence length), no ablation isolating their contribution to long-context retention, and no memory or latency scaling measurements beyond standard benchmarks. This directly bears on the central claim of outstanding performance on ultra-long visual sequences.
- [Experiments] Experimental results section: SOTA performance and Distractor Resistance results are reported without error bars, full ablation studies on the hyper attention components, or details on hyperparameter sensitivity, leaving open whether post-hoc choices affect the performance claims.
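The complexity analysis requested in the first major comment can be illustrated with a back-of-the-envelope count of query-key pairs per attention layer. The sketch below compares two common generic designs: self-attention over text and visual tokens concatenated into one sequence, and text self-attention combined with text-to-vision cross-attention. This page does not state which of these, if either, matches the hyper attention blocks, so both formulas are assumptions for illustration. The count also ignores causal masking, the number of heads, and the hidden dimension.

```python
def concat_attention_pairs(text_tokens: int, visual_tokens: int) -> int:
    """Query-key pairs when visual tokens are concatenated into the sequence."""
    total = text_tokens + visual_tokens
    return total * total


def cross_attention_pairs(text_tokens: int, visual_tokens: int) -> int:
    """Query-key pairs for text self-attention plus text-to-vision cross-attention."""
    return text_tokens * text_tokens + text_tokens * visual_tokens


if __name__ == "__main__":
    text = 100
    for visual in (900, 9_900):
        print(visual, concat_attention_pairs(text, visual),
              cross_attention_pairs(text, visual))
```

Under these assumptions the concatenated design grows quadratically in the number of visual tokens, while the cross-attention design grows linearly in it for a fixed text length. A formal analysis in the paper would need to state which regime applies to its own blocks.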
minor comments (2)
- [Abstract] The abstract states that results 'suggest' SOTA performance; a more precise statement of the exact metrics and number of benchmarks would improve clarity.
- [Architecture] Notation for the hyper attention block inputs/outputs could be defined more explicitly when first introduced to aid readers in following the integration mechanism.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of the hyper attention blocks and the experimental results. We address each major comment below and will incorporate revisions to provide the requested analyses and details.
Point-by-point responses
- Referee: [Architecture / Methods] Architecture section describing hyper attention blocks: the manuscript introduces these blocks to enable efficient processing of extended sequences without information loss or prohibitive compute, yet provides no complexity analysis (attention cost as a function of sequence length), no ablation isolating their contribution to long-context retention, and no memory or latency scaling measurements beyond standard benchmarks. This directly bears on the central claim of outstanding performance on ultra-long visual sequences.
  Authors: We agree that a formal complexity analysis and targeted ablations would better substantiate the efficiency and information-preservation properties of the hyper attention blocks. The design integrates vision and language in a shared semantic space to support extended sequences, but the initial submission focused on empirical results rather than explicit scaling derivations. In the revised manuscript, we will add a dedicated analysis of attention cost as a function of sequence length, along with memory and latency measurements on ultra-long inputs. We will also include new ablations that isolate the hyper attention blocks' contribution to long-context retention.
  revision: yes
- Referee: [Experiments] Experimental results section: SOTA performance and Distractor Resistance results are reported without error bars, full ablation studies on the hyper attention components, or details on hyperparameter sensitivity, leaving open whether post-hoc choices affect the performance claims.
  Authors: We acknowledge that including error bars, expanded ablations, and hyperparameter details would increase confidence in the reported results. The current experiments demonstrate SOTA performance among comparable models and strong results on the Distractor Resistance benchmark, but additional statistical reporting was not included. In the revision, we will add error bars from multiple runs for key benchmarks, provide fuller ablations on the hyper attention components, and include details on hyperparameter choices along with sensitivity analysis. These updates will clarify the robustness of the findings.
  revision: yes
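Error bars from multiple runs, as promised in the second response, are commonly reported as a mean with a sample standard deviation or standard error. The sketch below shows one minimal way to compute them. The function name `mean_with_error` is hypothetical and the scores in the usage line are made-up placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_error(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std, n - 1 in the denominator
    sem = std / math.sqrt(len(scores))
    return mean, std, sem


if __name__ == "__main__":
    # Placeholder accuracies from three hypothetical seeds.
    mean, std, sem = mean_with_error([70.0, 72.0, 74.0])
    print(f"{mean:.1f} +/- {std:.1f} (SEM {sem:.2f})")
```

Reporting the number of runs alongside the interval lets readers judge whether a gap between two models exceeds run-to-run variation.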
Circularity Check
No circularity: performance claims rest on external benchmarks
Full rationale
The paper introduces an architecture with hyper attention blocks and reports empirical results on standard single-image, multi-image, video, and custom long-sequence benchmarks. No equations, derivations, or first-principles results are presented that reduce any claimed capability to fitted parameters or self-referential definitions by construction. Claims of SOTA performance and ultra-long sequence handling are supported by measured outcomes on held-out evaluation sets rather than by renaming or fitting inputs. Prior mPLUG-Owl citations exist but are not load-bearing for the new architectural or performance assertions, which remain independently verifiable through the reported experiments.
Axiom & Free-Parameter Ledger
free parameters (1)
- hyper attention block hyperparameters
axioms (1)
- domain assumption: Standard transformer attention can be extended to multi-image inputs via language-guided semantic space integration
invented entities (1)
- hyper attention blocks: no independent evidence
Lean theorems connected to this paper
- Foundation.DimensionForcing dimension_forced
  Relation between the paper passage and the cited Recognition theorem: unclear
  mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks while demonstrating outstanding performance on ultra-long visual sequence inputs.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
  Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
- FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
  FineBench is a new dense VQA benchmark for fine-grained human activity understanding in long videos, revealing weaknesses in open VLMs and showing that FineAgent improves them via localization and description modules.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
- CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
  CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
  SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
- TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
  Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.
- SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
  SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
  DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
  WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
  OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
- LVBench: An Extreme Long Video Understanding Benchmark
  LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
- Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
  Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
  XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
  VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
- Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
  RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
- StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
  StableSketcher improves text-to-sketch generation by fine-tuning a diffusion VAE and adding a VQA-based RL reward, while releasing the SketchDUO dataset of sketches with captions and QA pairs.
- HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding
  HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.
- EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
  EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
- Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
  MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
  InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...
- Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
  A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on aver...