v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

Jaeyoung Lee; Jiwan Chung; Junhyeok Kim; Min Soo Kim; Siyeol Kim; Youngjae Yu

arxiv: 2505.18842 · v6 · submitted 2025-05-24 · 💻 cs.CL · cs.CV

Pith reviewed 2026-05-19 12:32 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords multimodal language models · visual grounding · point and copy · image patches · reasoning chains · mathematical reasoning · grounding dataset

The pith

By learning to point to and copy relevant image patches during reasoning, multimodal models stay grounded and outperform baselines on math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current multimodal language models encode an image once and then reason only in text, causing them to lose focus on relevant visual regions as reasoning chains lengthen. The paper introduces v1, which adds a point-and-copy mechanism allowing the model to select important image patches and insert their embeddings back into the reasoning process. Patches are retrieved using semantic representations as keys to preserve alignment with the reasoning space. A new dataset called v1g provides 300,000 multimodal reasoning traces that include interleaved grounding annotations for training this behavior. On multimodal mathematical reasoning benchmarks, v1 shows consistent improvements over comparable models.

Core claim

The core discovery is that active visual referencing through point-and-copy of semantically keyed patches enables multimodal models to re-ground their reasoning steps on visual evidence, preventing progressive loss of focus in long chains.

What carries the argument

The point-and-copy mechanism, which retrieves image patches via their semantic representations as keys and copies the corresponding embeddings into the reasoning stream.
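
A minimal sketch of one point-and-copy decoding step, assuming NumPy and following only the description above and the Figure 2 caption: a copy head scores the last hidden state against image patch representations, a language head scores the vocabulary, and the two logit vectors are concatenated into a single distribution. The names `point_and_copy_step`, `W_q`, `W_k`, and `W_lm`, the dot-product scoring, and the greedy choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def point_and_copy_step(h, patch_keys, patch_embeds, W_q, W_k, W_lm):
    """One decoding step over the joint (vocabulary + image patch) space.

    h            : (d,)   hidden state of the last generated token
    patch_keys   : (n, d) semantic representations of the patches (keys)
    patch_embeds : (n, d) embeddings copied back into the stream when chosen
    W_q, W_k     : (d, k) hypothetical linear pointing heads (query / key)
    W_lm         : (d, v) language-model head
    """
    vocab_logits = h @ W_lm              # (v,) scores over the vocabulary
    query = h @ W_q                      # (k,) projected pointing query
    keys = patch_keys @ W_k              # (n, k) projected patch keys
    copy_logits = keys @ query           # (n,) one score per image patch
    probs = softmax(np.concatenate([vocab_logits, copy_logits]))
    choice = int(np.argmax(probs))       # greedy choice, for illustration only
    vocab_size = vocab_logits.shape[0]
    if choice < vocab_size:
        return "token", choice, None, probs
    patch_idx = choice - vocab_size
    # The copied embedding would be fed back as the next input.
    return "copy", patch_idx, patch_embeds[patch_idx], probs
```

Treating patches as extra entries in the output distribution means a single softmax decides between emitting text and pointing, so no separate gating decision is needed in this sketch.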

If this is right

Reasoning chains can be extended without losing visual grounding.
Intermediate steps in multimodal reasoning become directly tied to specific image regions.
Training with interleaved grounding annotations produces more reliable visual referencing behavior.
Performance gains appear across various multimodal mathematical reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar mechanisms might help in other domains requiring iterative visual inspection, such as visual question answering with complex scenes.
Providing explicit grounding annotations could become a standard way to train models for better interpretability in multimodal tasks.
Testing on even longer reasoning sequences could reveal the limits of this alignment approach.

Load-bearing premise

Copying patch embeddings retrieved by semantic key matching will preserve the necessary alignment between visual perception and the text-based reasoning space.

What would settle it

Measure whether models using the point-and-copy mechanism maintain higher attention or relevance scores on the correct image regions throughout extended reasoning traces compared to standard models.
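
One way to make this measurement concrete, sketched under assumptions: given per-step attention over a patch grid and a reference bounding box in patch coordinates, compute the fraction of visual attention that falls inside the box at each generation step. The function name, array layout, and half-open box convention are hypothetical; the paper's exact metric is not specified in this review.

```python
import numpy as np

def region_attention_ratio(attn, box):
    """Fraction of visual attention inside a reference region, per step.

    attn : (T, H, W) non-negative attention over image patches at each of
           T generation steps (e.g. averaged over heads for one layer)
    box  : (row0, col0, row1, col1), half-open, in patch coordinates
    """
    attn = np.asarray(attn, dtype=float)
    r0, c0, r1, c1 = box
    inside = attn[:, r0:r1, c0:c1].sum(axis=(1, 2))
    total = attn.sum(axis=(1, 2))
    # Steps with zero total attention get ratio 0 instead of NaN.
    return np.divide(inside, total, out=np.zeros_like(total), where=total > 0)
```

Plotting this ratio against the generation step for a baseline and for a point-and-copy model would show whether the latter keeps more attention on the relevant region as the trace grows.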

Figures

Figures reproduced from arXiv: 2505.18842 by Jaeyoung Lee, Jiwan Chung, Junhyeok Kim, Min Soo Kim, Siyeol Kim, Youngjae Yu.

**Figure 1.** Pure text-based reasoning vs. v1 during inference. Our v1 can actively re-access visual context by pointing to and copying relevant image regions throughout the reasoning process. […] retrieval steps from the traces using an LLM-guided decomposition process, and (3) grounding each visual reference by associating it with a bounding box in the input image. The pipeline is fully automated, leveraging the generat…

**Figure 2.** Inference process of v1. At each step, the MLLM encodes the multimodal context and generation history into token representations. For the last token (e.g., "<region>"), (a) a copy head projects its representation and computes logits against image patch embeddings, (b) a language head produces logits over the vocabulary, and (c) the two are concatenated to form the final distribution. If a patch is chosen, …

**Figure 3.** Cumulative attention across all visual tokens, showing a gradual decrease in overall attention to the input image tokens. [Plot residue: x-axis "Generation step" (0 to 500); y-axis "Ratio" (0.6 to 1.0); panel title "Ratio of bounded region to full image"; legend: layer 2, layer 14, layer 27.]

**Figure 4.** Attention dynamics during reasoning, showing that semantically important visual regions receive disproportionately low attention, suggesting inefficient grounding during reasoning. […] tokens or visual features) the model is trained to autoregressively predict the discrete next token $x_t$ conditioned on the input $c$ and previously generated tokens $x_{<t}$: $p(x_1, \dots, x_T \mid c) = \prod_{t=1}^{T} p(x_t \mid c, x_1, \dots, x_{t-1})$ …

**Figure 5.** Qualitative comparison on MathVision. v1’s dynamic grounding helps to solve both bar graph and spatial reasoning tasks, while LLaVA-CoT misinterprets visual content in both cases.

**Figure 6.** Comparison of attention to copy tokens vs. original visual tokens. Layer-wise sum of attention scores directed to copy tokens and their corresponding original visual input tokens from a v1 output on a MathVision example. Copy token intervals are highlighted in yellow. […] valid candidates. These examples illustrate how active visual reference supports more precise and interpretable grounded reasoning than text…

**Figure 7.** v1g dataset construction pipeline. […] D. Human Evaluation of Grounding Quality. D.1. Evaluation of v1g Dataset Quality. To validate the quality of our automatically generated visual grounding annotations in the v1g dataset, we conducted a human evaluation comparing our attention-based grounding approach against GroundingDINO (Liu et al., 2024), a widely used open-set object detector. Methodology. We randomly sa…

**Figure 8.** Qualitative example of v1 tackling an attribute-based counting task in a synthetic domain.

**Figure 9.** Qualitative example of v1 performing comparative reasoning on a chart comprehension task.

Original abstract

When thinking with images, humans rarely rely on a single glance: they revisit visual evidence while reasoning. In contrast, most Multimodal Language Models encode an image once to key-value cache and then reason purely in text, making it hard to re-ground intermediate steps. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. We introduce v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Crucially, our point-and-copy mechanism retrieves patches using their semantic representations as keys, ensuring perceptual evidence remains aligned with the reasoning space. To train this behavior, we build v1g, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations. Across multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The point-and-copy mechanism is a reasonable way to re-insert visual patches during long chains, but the reported gains are hard to credit to it rather than the new 300K dataset.

The main contribution is that v1 adds a point-and-copy step so the model can pull relevant image patches back into the reasoning stream, using semantic representations as keys. That addresses the documented drop in visual focus as chains lengthen, which is a practical problem in multimodal math and diagram tasks. The paper shows that this loss occurs in standard models, then trains on v1g, its set of 300K traces with interleaved grounding labels. The mechanism is new in how it retrieves and copies embeddings without re-encoding the full image each time, and it keeps the copied tokens aligned to the text space by design. That part is a clean, lightweight extension rather than a full architecture change, and the dataset is a usable byproduct for others working on grounded reasoning.

The results claim consistent gains over comparable baselines on the benchmarks. The soft spot is the missing control: the baselines appear not to have been trained on equivalent grounded data, so the outperformance could come from the extra 300K examples rather than from the retrieval-and-copy operation. Without an ablation that holds the data fixed and varies only the mechanism, the central causal claim stays untested. The abstract also says little about exact baseline definitions, statistical tests, or how the semantic keys are implemented, which makes robustness difficult to judge. The alignment assumption is plausible but is not measured directly beyond the end-task numbers.

This paper is for researchers building or extending multimodal models that need sustained visual reference in reasoning chains, such as those working on scientific diagrams or educational math tools. A reader already in that area would get a workable idea and a new dataset to experiment with. I would send it for peer review once the authors add the data-controlled ablations; the core direction is worth referee time even if the current evidence is preliminary.

Referee Report

3 major / 2 minor

Summary. The paper proposes v1 as a lightweight extension to multimodal language models to enable active visual referencing during reasoning. It identifies that MLLMs lose focus on relevant image regions as reasoning chains lengthen. The v1 model uses a 'point-and-copy' mechanism to select relevant image patches via semantic representations as keys and copy their embeddings into the reasoning stream. It is trained on the newly introduced v1g dataset containing 300K multimodal reasoning traces with interleaved grounding annotations. The paper reports that v1 consistently outperforms comparable baselines across multimodal mathematical reasoning benchmarks.

Significance. If the results are substantiated, this approach could significantly advance multimodal grounded reasoning by allowing models to dynamically re-reference visual evidence without losing alignment. The point-and-copy mechanism addresses a practical limitation in current MLLMs. The v1g dataset may also serve as a useful resource for future work on training models for interleaved reasoning and grounding. The significance hinges on demonstrating that the performance improvements are attributable to the proposed mechanism.

major comments (3)

The abstract claims empirical confirmation of focus loss in models as reasoning chains lengthen, but no details are provided on the experimental controls, baseline definitions, statistical tests, or how focus loss was quantified. This makes the foundational observation difficult to verify.
The central claim of outperformance is undermined by the lack of a controlled ablation study. Specifically, it is unclear if the comparable baselines were trained on the v1g dataset or an equivalent volume of grounded traces. Without holding the training data fixed and varying only the point-and-copy component, the gains cannot be confidently attributed to the semantic-key retrieval and copy operation rather than the additional data.
The claim that using semantic representations as keys ensures perceptual evidence remains aligned with the reasoning space is a key assumption. The manuscript should provide evidence or analysis showing that this retrieval does not introduce misalignment or dilution of focus, as this is load-bearing for the mechanism's validity.

minor comments (2)

The paper would benefit from formalizing the point-and-copy operation with mathematical notation or pseudocode to improve clarity and reproducibility.
Ensure that all baselines are clearly defined, including their training procedures and any differences in model size or architecture.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment point by point below, clarifying our experimental design and committing to revisions that will strengthen the manuscript without altering its core claims.

Referee: The abstract claims empirical confirmation of focus loss in models as reasoning chains lengthen, but no details are provided on the experimental controls, baseline definitions, statistical tests, or how focus loss was quantified. This makes the foundational observation difficult to verify.

Authors: We appreciate this observation. Section 3.1 of the manuscript presents an empirical analysis of focus loss using attention maps and a metric that tracks the fraction of attention allocated to ground-truth relevant image regions as reasoning depth increases. However, we acknowledge that the description of controls, exact quantification procedure, and any statistical testing was insufficiently detailed. In the revised manuscript we will add a dedicated subsection that specifies the controls (fixed image encoder and prompt templates), defines the focus-loss metric explicitly, reports results across multiple model scales, and includes statistical significance tests. revision: yes
Referee: The central claim of outperformance is undermined by the lack of a controlled ablation study. Specifically, it is unclear if the comparable baselines were trained on the v1g dataset or an equivalent volume of grounded traces. Without holding the training data fixed and varying only the point-and-copy component, the gains cannot be confidently attributed to the semantic-key retrieval and copy operation rather than the additional data.

Authors: This concern is well-founded for causal attribution. Our current baselines are standard MLLMs fine-tuned on comparable volumes of multimodal reasoning data that lack the interleaved grounding annotations present in v1g. To isolate the contribution of the point-and-copy mechanism, we will add a controlled ablation in which a base model is trained on the identical v1g traces but without the point-and-copy module enabled. The revised paper will report this direct comparison, holding data and training compute fixed while varying only the presence of the retrieval-and-copy operation. revision: yes
Referee: The claim that using semantic representations as keys ensures perceptual evidence remains aligned with the reasoning space is a key assumption. The manuscript should provide evidence or analysis showing that this retrieval does not introduce misalignment or dilution of focus, as this is load-bearing for the mechanism's validity.

Authors: We agree that explicit validation of alignment is necessary. The design intentionally re-uses the same semantic embedding space for both keys and the reasoning stream to avoid projection-induced drift. In the revision we will insert an analysis section that quantifies alignment via cosine similarity between retrieved patch embeddings and their original semantic keys, together with a focus-retention metric measured across increasing reasoning lengths. Any observed dilution effects will be reported and discussed. revision: yes
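
The cosine-similarity check described in the last response could look like the following sketch, assuming NumPy and two row-aligned matrices (retrieved patch embeddings and their semantic keys). The function name and the `eps` guard are assumptions for illustration; no such analysis is reported in the material reviewed here.

```python
import numpy as np

def rowwise_cosine(a, b, eps=1e-8):
    """Cosine similarity between matching rows of a and b, both (n, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # eps avoids division by zero for all-zero rows.
    return num / np.maximum(den, eps)
```

Values near 1 for most retrieved patches would be consistent with the alignment claim; a long tail of low values would indicate drift between the copied embeddings and the key space.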

Circularity Check

0 steps flagged

No circularity: empirical claims rest on new dataset and mechanism without reduction to inputs

full rationale

The paper introduces an empirical extension (point-and-copy via semantic keys) trained on a newly constructed 300K v1g dataset of grounded reasoning traces. No equations, derivations, or parameter-fitting steps are described that would reduce the claimed outperformance to a self-referential definition or fitted input renamed as prediction. The central result is benchmark comparison after training, which is externally falsifiable and does not rely on a self-citation chain or uniqueness theorem imported from prior author work. While the skeptic correctly notes that data volume could explain gains absent an ablation holding data fixed, this is an experimental-control issue rather than circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review provides minimal technical detail; the core premise rests on an empirical observation of focus loss and the assumption that semantic-key retrieval preserves alignment.

axioms (1)

domain assumption: Multimodal language models progressively lose focus on relevant image regions as reasoning chains lengthen.
Stated as empirically confirmed in the abstract and used to motivate the need for active referencing.

invented entities (1)

point-and-copy mechanism (no independent evidence)
purpose: To enable dynamic selection and insertion of image patch embeddings into the reasoning stream.
New component introduced to address visual grounding loss.

pith-pipeline@v0.9.0 · 5692 in / 1261 out tokens · 48953 ms · 2026-05-19T12:32:33.732302+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
Passage: "the model selects relevant image patches and copies their embeddings back into the reasoning stream... retrieves patches using their semantic representations as keys"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Relation between the paper passage and the cited Recognition theorem.
Passage: "lightweight linear heads... pointing query head Lq and pointing key head Lk"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
cs.CV · 2026-05 · unverdicted · novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

Latent Visual Reasoning
cs.CV · 2025-09 · unverdicted · novelty 7.0

Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI · 2025-09 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL · 2025-09 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 4 Pith papers · 5 internal anchors

[1]

Teaching Metric Distance to Discrete Autoregressive Language Models

URL https://api.semanticscholar.org/CorpusID:14563301. Chung, J., Kim, S., Jo, Y., Park, J., Min, D., and Yu, Y. Teaching metric distance to discrete autoregressive language models, 2025. URL https://arxiv.org/abs/2503.02379. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hess...

[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URL https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. Gupta, T. and Kembhavi, A. Visual programming: ...

[3]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

URL https://openreview.net/forum?id=GNSMl1P5VR. Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., and Lin, S. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025. URL https://arxiv.org/abs/2503.06749. [Learning to Point Visual Tokens for Multimodal Mathematical Grounded Reasoning] Hurst, A., Lere...

[4]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

URL https://openreview.net/forum?id=KUNzEQMWU7. Ma, T., Xie, L., Tian, Y., Yang, B., and Ye, Q. Clawmachine: Learning to fetch visual tokens for referential comprehension. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TOtk9dTYGG. Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., ...

[5]

Learning Transferable Visual Models From Natural Language Supervision

URL https://qwenlm.github.io/blog/qvq-72b-preview/. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020. See, A., Liu, P. J., and Manning, C. D....

[6]

Prepare multimodal input. Concatenate I with a static visual-grounding instruction prompt and feed it to Qwen2.5-VL.

[7]

Extract attention with instruction. From the final decoding position, obtain the cross-attention map A over image tokens. Use a predefined set of layers (selected empirically) and average across heads.

[8]

Extract baseline attention. Remove the object name from the prompt, feed the modified prompt with I to the model, and extract the corresponding attention map A′ using the same layers and averaging.

[9]

Compute attention contrast. Compute the contrastive relevance for each image token: R = A / A′.

[10]

Derive bounding region. Identify the peak region in R. Sweep over multiple candidate crop ratios; for each ratio, form a bounding region around the peak. Select the bounding box maximizing contrast sharpness between inside and outside regions. Convert the selected region to the image-coordinate bounding box b.

[11]

Return. Return b.

Question: Subtract all red things. Subtract all tiny matte balls. How many objects are left? Answer: 5. (Image-Question Pair) Let me see what objects are present. I need to identify all the objects first. detect(query="yellow cube", objects=[" "]) detect(query=...

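
Steps [7] to [11] can be sketched as follows, assuming the two attention maps have already been extracted as 2-D arrays over the patch grid (the model call in step [6] is out of scope here). The source does not define "peak region" or "contrast sharpness", so centring the box on the arg-max of R and scoring by the difference of inside and outside means are assumptions, as are the function name, the candidate `ratios`, and the `eps` guard.

```python
import numpy as np

def ground_from_attention(A, A_base, image_size,
                          ratios=(0.2, 0.3, 0.4, 0.5), eps=1e-6):
    """Derive a pixel-space box from instruction vs. baseline attention.

    A, A_base  : (H, W) attention over the patch grid, with and without
                 the object name in the prompt
    image_size : (width, height) in pixels
    Returns (x0, y0, x1, y1) in pixel coordinates.
    """
    R = A / (A_base + eps)                        # contrastive relevance
    H, W = R.shape
    pr, pc = np.unravel_index(np.argmax(R), R.shape)   # peak patch
    best_box, best_score = (0, 0, H, W), -np.inf
    for ratio in ratios:                          # sweep candidate crop ratios
        bh, bw = max(1, round(H * ratio)), max(1, round(W * ratio))
        r0 = int(np.clip(pr - bh // 2, 0, H - bh))
        c0 = int(np.clip(pc - bw // 2, 0, W - bw))
        mask = np.zeros(R.shape, dtype=bool)
        mask[r0:r0 + bh, c0:c0 + bw] = True
        if mask.all():
            continue                              # nothing outside to compare
        # Assumed sharpness: mean relevance inside minus mean outside.
        score = R[mask].mean() - R[~mask].mean()
        if score > best_score:
            best_score, best_box = score, (r0, c0, r0 + bh, c0 + bw)
    r0, c0, r1, c1 = best_box
    img_w, img_h = image_size
    sx, sy = img_w / W, img_h / H                 # patch size in pixels
    return (c0 * sx, r0 * sy, c1 * sx, r1 * sy)
```

Dividing by the baseline map suppresses patches that attract attention regardless of the queried object, which is the stated purpose of the contrast in step [9].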
[12]

Start from the beginning of the reasoning and read EACH sentence.

[13]

When you think you’d better look at the object or region, use the detect() function.

[14]

Format: `detect(query="visual item that you want to find", objects=["<obj#>"])`

[15]

After detection, reference the visual element with '<obj#>' tags every time you need to look at it again immediately after mentioning the item.

[16]

Use NEW object numbers (`<obj1>`, `<obj2>`, `<obj3>`...) for EACH new detection. ### EXAMPLE: Original text: "Looking at the graph, I can see the function reaches its maximum at x = 3." Corrected: To answer the question, I need to look the graph. detect(query="function graph", objects=["<obj1>"]) Looking at the graph <obj1>, I can see the function reac...
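
To illustrate how traces in the `detect(query=..., objects=[...])` format could be consumed downstream, here is a small parsing sketch using Python's standard `re` module. The helper name `parse_detect_calls` and the exact regular expressions are assumptions based only on the format string quoted in the prompt; the authors' own tooling is not shown in the source.

```python
import re

# Matches detect(query="...", objects=[...]) as written in the prompt format.
DETECT_RE = re.compile(r'detect\(query="([^"]*)",\s*objects=\[([^\]]*)\]\)')
# Matches object tags such as <obj1>, <obj2>, ...
OBJ_RE = re.compile(r'<obj\d+>')

def parse_detect_calls(trace):
    """Return [(query, [object tags]), ...] for each detect() call in a trace."""
    calls = []
    for query, objects in DETECT_RE.findall(trace):
        calls.append((query, OBJ_RE.findall(objects)))
    return calls
```

Each parsed query could then be paired with a bounding box from a grounding step, and the object tags could be checked for uniqueness across detections, as the prompt requires.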
