DeepGaze3.5-VL treats visual scanpaths as discrete token sequences predicted autoregressively by vision-language models, achieving 2.18 bits IG on MIT1003 and outperforming prior specialized models even with matched encoders.
super hub Canonical reference
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Canonical reference. 79% of citing Pith papers cite this work as background.
abstract
In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and a LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoder, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, like providing mentioned objects' coordinates in chains of thoughts and comparing user-pointed regions similarities. Our code, model and dataset are accessed at https://github.com/shikras/shikra.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and a LLM. It is designed to be straightforward and simple, without the need for extra
authors
co-cited works
representative citing papers
HG-Bench supplies 500 human-annotated homework samples and a page-aware protocol that measures complete-answer localization (FA) and step-level decomposition (FSm), exposing that no zero-shot VLM exceeds 55% on either metric.
Earth-OneVision is a unified 2B-parameter RS-MLLM supporting six modalities and nine tasks via FGVLA, SLIS, and PCMA mechanisms plus a 34M QA-pair dataset, reporting competitive or superior benchmark results versus larger models.
Video MLLMs show higher jailbreak rates with multi-clip videos than images or static videos, with success increasing alongside clip count and contextual diversity.
CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.
IQA-Spider unifies reasoning, grounding, and referring for multi-granularity image quality assessment via a four-task paradigm and two-stage LMM training with training-free text-to-point mapping.
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.
The paper delivers the first comprehensive review and unified taxonomy of agentic AI in remote sensing, covering single-agent copilots, multi-agent systems, planning mechanisms, benchmarks, and a roadmap while noting limitations in grounding and safety.
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
GeoSearcher introduces anchor-centric reasoning supervised fine-tuning and process-faithful group relative policy optimization to improve MLLM-based remote sensing visual grounding.
SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
MotionAtlas supplies a 2,073-question benchmark, a self-bootstrap pipeline yielding 159k captions, and fine-tuned Video-MLLMs that deliver 5.2-point gains over Qwen3-VL-4B on motion tasks.
citing papers explorer
-
Omni-Perception Policy Optimization for Multimodal Emotion Reasoning
OPPO applies RL with an Omni-Perception Reward and masked-input KL loss to boost cue utilization and suppress hallucinations in emotion reasoning MLLMs, claiming SOTA results on MER-UniBench, MME-Emotion, and MEP-Bench.
-
RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought
Introduces PinCoT paradigm with visual reasoning anchors, builds PIN-170K dataset via automated pipeline, and trains 4B RoboPIN model via three-stage post-training to outperform 7B baselines by 12% on embodied reasoning benchmarks.
-
Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.
-
Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration
A training-free region-aware attention recalibration strategy reduces object hallucinations in LVLMs on CHAIR, POPE, and MME benchmarks while preserving fluency.
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models