hub

Imagebind-llm: Multi-modality instruction tuning

Imagebind-llm: Multi-modality instruction tuning , author= · 2023 · arXiv 2309.03905

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 1

citation-polarity summary

background 1

representative citing papers

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

Exploring Audio Hallucination in Egocentric Video Understanding

cs.CV · 2026-04-26 · unverdicted · novelty 6.0

AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

cs.AI · 2026-04-13 · unverdicted · novelty 6.0 · 2 refs

EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

cs.CV · 2026-03-29 · unverdicted · novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

cs.CV · 2023-06-23 · unverdicted · novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

cs.CV · 2026-04-03 · unverdicted · novelty 5.0

Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.

Hallucination of Multimodal Large Language Models: A Survey

cs.CV · 2024-04-29 · accept · novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

A Survey on Hallucination in Large Vision-Language Models

cs.CV · 2024-02-01 · unverdicted · novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

citing papers explorer

Showing 10 of 10 citing papers.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 15
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention cs.CV · 2023-03-28 · conditional · none · ref 110
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Exploring Audio Hallucination in Egocentric Video Understanding cs.CV · 2026-04-26 · unverdicted · none · ref 17
AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models cs.AI · 2026-04-13 · unverdicted · none · ref 22 · 2 links
EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM cs.CV · 2026-03-29 · unverdicted · none · ref 51
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 83
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models cs.CV · 2023-06-23 · unverdicted · none · ref 18
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs cs.CV · 2026-04-03 · unverdicted · none · ref 20
Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 57
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
A Survey on Hallucination in Large Vision-Language Models cs.CV · 2024-02-01 · unverdicted · none · ref 14
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Imagebind-llm: Multi-modality instruction tuning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer