hub Baseline reference

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie · 2023 · arXiv 2312.14135

Baseline reference. 62% of citing Pith papers use this work as a benchmark or comparison.

19 Pith papers citing it

Baseline 62% of classified citations

citation-role summary

dataset 4 · background 3 · baseline 1

citation-polarity summary

use dataset 4 · background 3 · baseline 1

representative citing papers

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

Improving Vision-language Models with Perception-centric Process Reward Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

cs.CV · 2026-02-27 · unverdicted · novelty 7.0

DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

cs.CV · 2026-02-10 · unverdicted · novelty 7.0

Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

Multimodal Latent Reasoning via Predictive Embeddings

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Pearl learns predictive embeddings from multimodal tool trajectories in latent space to enable efficient reasoning that matches or exceeds supervised fine-tuning and reconstruction-based methods without explicit tool invocation at inference.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.

What's Holding Back Latent Visual Reasoning?

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful representations.

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

cs.AI · 2025-09-26 · unverdicted · novelty 5.0

MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

PaliGemma: A versatile 3B VLM for transfer

cs.CV · 2024-07-10 · unverdicted · novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

cs.AI · 2026-04-11

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

cs.CV · 2025-09-28

citing papers explorer

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

cs.AI · 2025-09-26 · unverdicted · none · ref 48

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

cs.AI · 2026-04-11 · unreviewed · ref 82
