pith. machine review for the scientific record.

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

1 Pith paper cites this work.
abstract

When thinking with images, humans rarely rely on a single glance: they revisit visual evidence while reasoning. In contrast, most Multimodal Language Models encode an image once into the key-value cache and then reason purely in text, making it hard to re-ground intermediate steps. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. We introduce v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Crucially, our point-and-copy mechanism retrieves patches using their semantic representations as keys, ensuring perceptual evidence remains aligned with the reasoning space. To train this behavior, we build v1g, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations. Across multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines.
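The point-and-copy step described in the abstract can be pictured as a small retrieval head over the patch embeddings: the current reasoning state acts as a query, patch representations act as keys, and the selected patches' embeddings are copied back into the sequence rather than re-encoded. Below is a minimal sketch under those assumptions, not the paper's implementation; the class name PointAndCopy, the q_proj/k_proj heads, and top_k are hypothetical names introduced only for illustration.

# Minimal sketch of a point-and-copy step (illustrative, not the authors' code).
# Assumes the backbone exposes per-patch embeddings in the model's hidden size
# and the hidden state of the current reasoning token.
import torch
import torch.nn as nn

class PointAndCopy(nn.Module):
    """Select top-k image patches by key similarity and copy their embeddings."""

    def __init__(self, hidden_size: int, top_k: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)  # query from reasoning state
        self.k_proj = nn.Linear(hidden_size, hidden_size)  # keys from patch embeddings
        self.top_k = top_k

    def forward(self, query_state: torch.Tensor, patch_embeds: torch.Tensor):
        # query_state:  (batch, hidden)               current reasoning-token state
        # patch_embeds: (batch, num_patches, hidden)  semantic patch representations
        q = self.q_proj(query_state).unsqueeze(1)           # (batch, 1, hidden)
        k = self.k_proj(patch_embeds)                       # (batch, patches, hidden)
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5       # scaled dot-product scores
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)

        # "Copy": gather the selected patch embeddings themselves, so the visual
        # evidence re-enters the sequence in the same representation space.
        idx = top_idx.unsqueeze(-1).expand(-1, -1, patch_embeds.shape[-1])
        copied = patch_embeds.gather(1, idx)                 # (batch, top_k, hidden)
        return copied, top_idx, top_scores.softmax(-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    layer = PointAndCopy(hidden_size=64, top_k=4)
    patches = torch.randn(2, 196, 64)   # e.g. a 14x14 patch grid
    query = torch.randn(2, 64)
    copied, idx, weights = layer(query, patches)
    print(copied.shape, idx.shape)      # (2, 4, 64) and (2, 4)

In the full model, the copied embeddings would be spliced into the interleaved reasoning sequence at the point of reference, and the pointing behavior would be supervised with the grounding annotations in v1g; this sketch only illustrates the key-based selection and the copy operation.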

fields: cs.CV (1)

years: 2026 (1)

verdicts: UNVERDICTED (1)


citing papers

Showing 1 of 1 citing paper after filters.