v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

· 2025 · cs.CL · arXiv 2505.18842

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

When thinking with images, humans rarely rely on a single glance: they revisit visual evidence while reasoning. In contrast, most Multimodal Language Models encode an image once to key-value cache and then reason purely in text, making it hard to re-ground intermediate steps. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. We introduce v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Crucially, our point-and-copy mechanism retrieves patches using their semantic representations as keys, ensuring perceptual evidence remains aligned with the reasoning space. To train this behavior, we build v1g, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations. Across multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines.

representative citing papers

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

citing papers explorer

Showing 1 of 1 citing paper.

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 18 · internal anchor
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

fields

years

verdicts

representative citing papers

citing papers explorer