Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
35 Pith papers cite this work.
abstract
In human conversations, individuals can indicate relevant regions within a scene while addressing others, and the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra that can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. The design is deliberately simple: no extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models are needed, and all inputs and outputs are expressed in natural language. Because referential dialogue is a superset of many vision-language (VL) tasks, Shikra naturally handles location-related tasks such as REC and PointQA as well as conventional VL tasks such as image captioning and VQA. Experimental results showcase Shikra's promising performance. It also enables many compelling applications, such as providing the coordinates of mentioned objects within chains of thought and comparing the similarity of user-pointed regions. The code, model, and dataset are available at https://github.com/shikras/shikra.
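A minimal sketch of what "coordinates in natural language" can look like in practice, assuming a normalized [x1,y1,x2,y2] text format; the exact precision, brackets, prompt wording, and helper names below are illustrative assumptions, not the released model's interface.

```python
import re

def box_to_text(box, precision=3):
    """Serialize a normalized [x1, y1, x2, y2] box as plain prompt text.
    Assumed format for illustration: no special tokens or extra vocabulary."""
    return "[" + ",".join(f"{v:.{precision}f}" for v in box) + "]"

def boxes_from_text(reply):
    """Recover any [x1,y1,x2,y2] spans the model writes in its reply."""
    pattern = r"\[(\d*\.\d+),(\d*\.\d+),(\d*\.\d+),(\d*\.\d+)\]"
    return [tuple(float(v) for v in m) for m in re.findall(pattern, reply)]

# Referential-dialogue style exchange: the user points at a region in text,
# and the answer may itself contain coordinates for the objects it mentions.
user_box = [0.120, 0.340, 0.480, 0.790]
prompt = f"What is the animal in {box_to_text(user_box)} doing?"
reply = ("The dog [0.120,0.340,0.480,0.790] is chasing "
         "a ball [0.610,0.550,0.720,0.660].")
print(prompt)
print(boxes_from_text(reply))  # -> two (x1, y1, x2, y2) tuples
```

Keeping coordinates in the ordinary text stream is what lets REC, PointQA, and grounded chains of thought share a single natural-language interface.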
citing papers explorer
- VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
  VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
- Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
  Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
- From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
  RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
- CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
  CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
- PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
  The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
- BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
  By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks (see the sketch after this list).
- STORM: End-to-End Referring Multi-Object Tracking in Videos
  STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
- Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
  Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
- Evaluating Object Hallucination in Large Vision-Language Models
  Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
  GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
- Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
  A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
- Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
  ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
  LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
  MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
- One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition
  CineMEC performs multimodal entity coreference by clustering visual entities and aligning them with text role mentions to boost captioning and grounding performance on an extended VidSitu dataset.
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
  CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
  A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
  Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
- Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
  ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.
- StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning
  StateVLM uses an Auxiliary Regression Loss on box decoder outputs to boost VLMs' accuracy on object and state localization for robotic affordance reasoning, with gains of 1.6% on RefCOCO variants and 5.2% on the new OSAR benchmark.
- Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
  An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
  Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.
- Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
  VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
  DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding, and grounding with 1.0B to 4.5B activated parameters.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
  VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- Improved Baselines with Visual Instruction Tuning
  Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
- LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
  This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
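As a companion to the BoxTuning entry above, the following hedged sketch contrasts the two ways a region can be handed to a multimodal model: written into the prompt as text coordinates (the Shikra-style interface) or rendered directly onto the frame as pixels. The frame, box, and prompt strings are hypothetical and purely for illustration, and it assumes Pillow is available.

```python
from PIL import Image, ImageDraw

# Hypothetical frame and pixel-space box, purely for illustration.
frame = Image.new("RGB", (640, 360), color="gray")
box = (80, 60, 300, 240)  # (x1, y1, x2, y2) in pixels

# Style 1 (text coordinates): the region enters the model as prompt tokens.
w, h = frame.size
norm = (box[0] / w, box[1] / h, box[2] / w, box[3] / h)
text_prompt = ("What is happening in the region "
               f"[{norm[0]:.3f},{norm[1]:.3f},{norm[2]:.3f},{norm[3]:.3f}]?")

# Style 2 (visual overlay): the region enters the model as pixels by
# drawing the box onto the frame before it is encoded, so the text
# prompt needs no coordinate tokens at all.
overlay = frame.copy()
ImageDraw.Draw(overlay).rectangle(box, outline="red", width=3)
visual_prompt = "What is happening in the highlighted region?"

print(text_prompt)
print(visual_prompt, overlay.size)
```

The overlay style keeps the text prompt short regardless of how many boxes or trails are shown, which is the token-cost argument the BoxTuning summary makes against serializing coordinates as text.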