{"work":{"id":"6fde0828-1494-4a4e-8aec-3747de70602e","openalex_id":null,"doi":null,"arxiv_id":"2408.04840","raw_key":null,"title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","authors":null,"authors_text":"Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian","year":2024,"venue":"cs.CV","abstract":"Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.","external_url":"https://arxiv.org/abs/2408.04840","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:43:28.637607+00:00","pith_arxiv_id":"2408.04840","created_at":"2026-05-10T07:01:49.122950+00:00","updated_at":"2026-06-29T13:43:28.637607+00:00","title_quality_ok":true,"display_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","render_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models"},"hub":{"state":{"work_id":"6fde0828-1494-4a4e-8aec-3747de70602e","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":24,"external_cited_by_count":null,"distinct_field_count":2,"first_pith_cited_at":"2024-06-12T09:36:52+00:00","last_pith_cited_at":"2026-05-27T04:52:42+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-30T06:39:29.763643+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"baseline","n":4},{"context_role":"background","n":2}],"polarity_counts":[{"context_polarity":"baseline","n":4},{"context_polarity":"background","n":2}],"runs":{},"summary":{},"graph":{},"authors":[]}}