V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
11 Pith papers cite this work.
Citing papers
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
  GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning (a minimal code sketch follows the list).
- Improving Vision-language Models with Perception-centric Process Reward Models
  Perceval is a perception-centric process reward model (PRM) that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
- SketchVLM: Vision language models can annotate images to explain thoughts and guide users
  SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines (a minimal code sketch follows the list).
- AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
  AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
- Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
  Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures (a minimal code sketch follows the list).
- Multimodal Latent Reasoning via Predictive Embeddings
  Pearl learns predictive embeddings from multimodal tool trajectories in latent space to enable efficient reasoning that matches or exceeds supervised fine-tuning and reconstruction-based methods without explicit tool invocation at inference.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-resolution benchmarks (a minimal code sketch follows the list).
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
  Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.
- MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
  MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that transfers strongly to nearly 40 diverse tasks, including standard VLM benchmarks as well as remote-sensing and segmentation.
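To make the GazeVLM entry above more concrete, here is a minimal sketch of one way "internal gaze tokens" could gate visual features: a learned token attends over the visual tokens and emits per-patch gates that suppress irrelevant ones. The module name, the single learned token, and the sigmoid gating are illustrative assumptions, not GazeVLM's actual architecture.

```python
import torch
import torch.nn as nn

class GazeGate(nn.Module):
    """Hypothetical gaze-token gate (assumed design, not GazeVLM's code)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.gaze_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_gate = nn.Linear(dim, 1)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) from the vision encoder.
        gaze = self.gaze_token.expand(visual_tokens.size(0), -1, -1)
        # The gaze token queries the patches: a crude stand-in for deciding
        # where to "look". attn_weights has shape (batch, 1, num_patches).
        _, attn_weights = self.attn(gaze, visual_tokens, visual_tokens)
        # Per-patch gate in (0, 1): combine a content cue with the gaze cue.
        gates = torch.sigmoid(self.to_gate(visual_tokens)
                              + attn_weights.transpose(1, 2))
        # Suppressed patches are scaled toward zero before reaching the LLM,
        # mimicking a foveated view of the high-resolution input.
        return visual_tokens * gates
```

In a real VLM such a gate would sit between the vision encoder and the projector, and the gates could additionally be conditioned on the text query.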
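For the SketchVLM entry, the part that is easy to illustrate is "non-destructive" annotation: the explanation lives in an SVG overlay that references the image rather than editing its pixels. The function name and box format below are illustrative, not the paper's output format.

```python
def svg_overlay(image_path: str, width: int, height: int,
                box: tuple, label: str) -> str:
    """Build an SVG that draws a labeled box over the image without
    modifying the underlying pixels (the image is only referenced)."""
    x, y, w, h = box
    return f"""<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">
  <image href="{image_path}" width="{width}" height="{height}"/>
  <rect x="{x}" y="{y}" width="{w}" height="{h}"
        fill="none" stroke="red" stroke-width="3"/>
  <text x="{x}" y="{y - 6}" fill="red" font-size="16">{label}</text>
</svg>"""

# Example: highlight a region the model claims to have used for its answer.
print(svg_overlay("scene.jpg", 1024, 768, (412, 230, 180, 120), "small clock"))
```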
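The entropy-gradient grounding entry describes a training-free mechanism that can be sketched directly from its name: rank visual tokens by how strongly they influence the entropy of the model's answer distribution. The model interface below (a callable from an embedding sequence to vocabulary logits) is an assumption for illustration, not a specific library API.

```python
import torch

def rank_patches_by_entropy_gradient(model, visual_tokens, text_tokens):
    """model: callable mapping (1, seq, dim) embeddings to (1, seq, vocab)
    logits -- an assumed interface. visual_tokens: (num_patches, dim);
    text_tokens: (num_text, dim)."""
    visual_tokens = visual_tokens.detach().clone().requires_grad_(True)
    inputs = torch.cat([visual_tokens, text_tokens], dim=0).unsqueeze(0)

    logits = model(inputs)                         # (1, seq, vocab)
    probs = torch.softmax(logits[0, -1], dim=-1)   # next-token distribution
    # Entropy of the answer distribution = the model's uncertainty.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()

    # Backprop the uncertainty into the patch embeddings; patches whose
    # embeddings most strongly move the entropy are treated as evidence
    # regions (e.g. to re-crop or re-attend at higher resolution).
    entropy.backward()
    scores = visual_tokens.grad.norm(dim=-1)       # (num_patches,)
    return scores.argsort(descending=True)
```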
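Finally, a rough reading of the Q-Zoom entry: a query-aware gate scores cheap low-resolution tile features against the text query, and only the top-scoring tiles are encoded at high resolution, which is the kind of pruning that would produce the reported speedup. The names, the cosine score, and the keep ratio are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def select_regions(query_emb: torch.Tensor, tile_feats: torch.Tensor,
                   keep_ratio: float = 0.25):
    """query_emb: (dim,) pooled text-query embedding.
    tile_feats: (num_tiles, dim) features of low-resolution image tiles."""
    scores = F.cosine_similarity(tile_feats, query_emb.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * tile_feats.size(0)))
    keep = scores.topk(k).indices        # tiles worth "zooming" into
    return keep, scores

def forward_with_zoom(llm, hires_encoder, tile_feats, tiles, query_emb):
    # llm, hires_encoder, and tiles are hypothetical stand-ins. Only the
    # selected tiles pay for high-resolution encoding; the rest are
    # represented by their cheap low-resolution features.
    keep, _ = select_regions(query_emb, tile_feats)
    hires_feats = hires_encoder(tiles[keep])
    visual = torch.cat([tile_feats, hires_feats], dim=0)
    return llm(visual, query_emb)
```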