pith. sign in

hub Canonical reference

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Canonical reference. 78% of citing Pith papers cite this work as background.

70 Pith papers citing it
Background 78% of classified citations
abstract

In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and a LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoder, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, like providing mentioned objects' coordinates in chains of thoughts and comparing user-pointed regions similarities. Our code, model and dataset are accessed at https://github.com/shikras/shikra.

hub tools

citation-role summary

background 22 baseline 4 dataset 1

citation-polarity summary

claims ledger

  • abstract In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and a LLM. It is designed to be straightforward and simple, without the need for extra

co-cited works

representative citing papers

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

STORM: End-to-End Referring Multi-Object Tracking in Videos

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.

Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems

cs.CV · 2026-01-05 · unverdicted · novelty 7.0

The paper delivers the first comprehensive review and unified taxonomy of agentic AI in remote sensing, covering single-agent copilots, multi-agent systems, planning mechanisms, benchmarks, and a roadmap while noting limitations in grounding and safety.

A More Word-like Image Tokenization for MLLMs

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal benchmarks.

WOW-Seg: A Word-free Open World Segmentation Model

cs.CV · 2026-05-16 · conditional · novelty 6.0

WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.

citing papers explorer

Showing 50 of 70 citing papers.