arxiv: 2605.30912 · v1 · pith:KSEC4N2Snew · submitted 2026-05-29 · 💻 cs.CV · cs.CL

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

Pith reviewed 2026-06-28 23:17 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords Evidence-Anchored Spatial Attention, Multimodal RLVR, Vision-Language Models, Attention Supervision, Visual Grounding, Reinforcement Learning, Hallucination

The pith

Supervising attention to annotated evidence regions during RL training improves VLM grounding and benchmark scores without requiring annotations at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EASE to add visual-evidence process supervision to multimodal RLVR. Outcome-only rewards computed from final answers cannot ensure that a model uses the relevant image regions rather than language shortcuts. EASE converts evidence annotations into smoothed attention targets and applies them only on high-reward trajectories. The paper's diagnostics and ablations indicate closer alignment between attention and annotated evidence. The approach raises average scores over DAPO by 2.5 to 3.1 points across perception, hallucination, visual math, and multimodal reasoning benchmarks on Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B.

Core claim

EASE augments multimodal RLVR by converting annotated evidence regions into smoothed visual-token targets that supervise response-to-image attention during training, but only on high-reward trajectories; the annotations serve solely as privileged training labels, and the resulting models achieve higher average scores than DAPO on perception, hallucination, visual math, and multimodal reasoning benchmarks while requiring only the original image and question at inference.

What carries the argument

Evidence-Anchored Spatial Attention (EASE), which turns annotated evidence regions into smoothed visual-token targets to guide attention during RL training on high-reward trajectories.
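The target construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `evidence_target`, the normalized box format, the token-center rasterization, and the default values of `sigma` and `background` are all assumptions made for the example. Only the general ingredients (evidence boxes, Gaussian smoothing, background smoothing, a target over visual tokens) come from the paper's description.

```python
import numpy as np

def _gauss_matrix(n, sigma):
    # Row-normalized Gaussian weights between every pair of grid positions.
    idx = np.arange(n)
    g = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)
    return g / g.sum(axis=1, keepdims=True)

def evidence_target(boxes, grid_h, grid_w, sigma=1.0, background=0.1):
    """Turn evidence boxes into a smoothed distribution over visual tokens.

    boxes: list of (x0, y0, x1, y1) in normalized [0, 1] image coordinates
           (assumed format).
    sigma: Gaussian width in token units (hypothetical default).
    background: mass spread uniformly over all tokens (hypothetical default).
    Returns a flat array of length grid_h * grid_w that sums to 1.
    """
    ys = (np.arange(grid_h) + 0.5) / grid_h  # token-center coordinates
    xs = (np.arange(grid_w) + 0.5) / grid_w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")

    # Hard mask: 1 for tokens whose center falls inside any evidence box.
    mask = np.zeros((grid_h, grid_w))
    for x0, y0, x1, y1 in boxes:
        mask[(xx >= x0) & (xx <= x1) & (yy >= y0) & (yy <= y1)] = 1.0

    # Separable Gaussian smoothing along rows and columns.
    blurred = _gauss_matrix(grid_h, sigma) @ mask @ _gauss_matrix(grid_w, sigma).T

    peaked = blurred.ravel()
    uniform = np.full(peaked.size, 1.0 / peaked.size)
    total = peaked.sum()
    if total == 0:  # no token covered by any box: fall back to uniform
        return uniform
    # Background smoothing keeps every token at non-zero probability.
    return (1.0 - background) * peaked / total + background * uniform
```

For example, `evidence_target([(0.0, 0.0, 0.5, 0.5)], 8, 8)` concentrates most of the mass on the top-left quadrant of an 8×8 token grid while leaving a small floor everywhere else, which keeps a KL-based loss finite.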

If this is right

  • Attention supervision on high-reward trajectories produces measurable improvements in alignment with annotated evidence regions.
  • The gains appear across model sizes from 4B to 8B parameters on perception, hallucination, visual math, and multimodal reasoning tasks.
  • Training uses evidence annotations as privileged information only; inference uses the unmodified image and question.
  • Outcome rewards alone are insufficient to prevent language-prior shortcuts on visually grounded questions.
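The reward gating in the points above can be illustrated with a small sketch. Several things here are assumptions rather than details confirmed by the paper: the function name `gated_attention_loss`, the threshold `reward_threshold`, the KL direction (target relative to attention), and the idea that attention has already been aggregated into one distribution over visual tokens per trajectory. NumPy stands in for whatever autodiff framework the actual training uses.

```python
import numpy as np

def gated_attention_loss(attn, target, rewards, reward_threshold=1.0, eps=1e-8):
    """Mean KL(target || attn) over high-reward trajectories only.

    attn:    (B, V) response-to-image attention, one distribution per trajectory.
    target:  (V,) or (B, V) smoothed evidence target.
    rewards: (B,) outcome rewards from the verifier.
    reward_threshold: hypothetical cutoff for "high reward".
    """
    attn = np.asarray(attn, dtype=float)
    target = np.broadcast_to(np.asarray(target, dtype=float), attn.shape)
    rewards = np.asarray(rewards, dtype=float)

    # Per-trajectory KL divergence; eps guards against log(0).
    kl = np.sum(target * (np.log(target + eps) - np.log(attn + eps)), axis=1)

    # Only trajectories that earned a high outcome reward receive supervision.
    gate = rewards >= reward_threshold
    if not gate.any():
        return 0.0
    return float(kl[gate].mean())
```

In a full training loop this term would be scaled by an attention-loss weight and added to the RL objective. The gate means that low-reward trajectories contribute nothing to the attention term, so the model is not pushed to attend "correctly" on responses that were wrong anyway.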

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Process-level attention supervision may complement pure outcome rewards more broadly in multimodal reinforcement learning.
  • The method could be tested on tasks where evidence regions are derived automatically rather than manually annotated.
  • Similar attention anchoring might reduce hallucination rates in other vision-language settings that currently rely only on final-answer rewards.

Load-bearing premise

Annotated evidence regions provide accurate and sufficient visual justification for correct answers, and supervising attention to them only on high-reward trajectories improves grounding at inference without access to annotations.

What would settle it

A diagnostic showing that attention maps from EASE-trained models do not align more closely with annotated evidence regions than DAPO baselines, or that the 2.5-3.1 point gains disappear under noisy or incomplete evidence annotations.
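One way such a diagnostic could be computed is sketched below. The paper reports attention mass and pointing accuracy, but their exact definitions are not given in this summary, so the definitions used here are common ones adopted as assumptions: mass is the share of attention falling on evidence tokens, and pointing accuracy is the fraction of examples whose peak-attention token lies inside the evidence region. The function name `evidence_diagnostics` is hypothetical.

```python
import numpy as np

def evidence_diagnostics(attn, evidence_mask):
    """Return (mean attention mass on evidence, pointing accuracy).

    attn:          (N, V) non-negative attention over visual tokens per example.
    evidence_mask: (N, V) boolean, True for tokens inside annotated evidence.
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(evidence_mask, dtype=bool)

    # Normalize each row so the mass is a fraction of total visual attention.
    attn = attn / attn.sum(axis=1, keepdims=True)
    mass = (attn * mask).sum(axis=1)

    # Pointing: does the single most-attended token fall inside the evidence?
    peak = attn.argmax(axis=1)
    hit = mask[np.arange(attn.shape[0]), peak]
    return float(mass.mean()), float(hit.mean())
```

Running this on the same held-out examples for an EASE-trained model and a DAPO baseline, then repeating with deliberately perturbed evidence masks, would give the comparison described above.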

Figures

Figures reproduced from arXiv: 2605.30912 by Bin Yu, Chen Wang, Jionghao Bai, Kai Wang, Lai Wei, Ruina Hu, Weiran Huang, Yue Wang.

Figure 1
Figure 1: From outcome rewards to evidence acquisition. Standard RLVR may answer correctly while attending weakly to the key evidence region. EASE encourages stronger attention to the supporting region.
Adjacent text from the paper: However, outcome-only reward does not tell the model where the answer should come from in the image. …
Figure 2
Figure 2: Overview of EASE. During RL training, EASE maps dataset evidence boxes to a smoothed spatial target over visual tokens. The auxiliary objective guides response-to-vision attention on high-reward trajectories toward this target, and inference uses the standard image and question inputs.
[Plot panel (a), KL-binned hallucination: hallucination rate (%) by KL quantile, Low to High; ρ=0.34, p<0.001.]
Figure 3
Figure 3: Motivating diagnostic for outcome-only RL. Higher attention-target mismatch corresponds to increased hallucination risk in (a) and a right-shifted KL distribution for hallucinated responses in (b). Details are in Appendix A.
Adjacent text from the paper: … signals encourage visual reliance through perturbations, reward-derived proxies, checklist verification, textual traces, or point-level targets, but they do not directly match respon…
Figure 4
Figure 4: Evidence annotation pipeline. Given an image-question-answer triple, Step 1 extracts answer-relevant evidence phrases, Step 2 localizes each phrase with grounding models, and Step 3 validates the proposed boxes with semantic and geometric checks. The annotated pool contains single-evidence examples for local grounding and multi-evidence examples for cross-region reasoning. Appendix B gives implementation d…
Figure 5
Figure 5: Evidence-acquisition diagnostics. On Qwen3-VL-4B, EASE increases response-to-vision attention on annotated evidence regions compared with Base and GRPO, as measured by attention mass, pointing accuracy, and multi-evidence coverage on a held-out validation set. Error bars denote 95% CIs.
[Plot panel (a): accuracy on V* and POPE for λattn ∈ {0, 5×10⁻⁴, 10⁻³, 2×10⁻³}.]
Figure 6
Figure 6: Hyperparameter sensitivity. On the Qwen3-VL-4B backbone, we vary the attention-loss weight, background smoothing, and sampled response-token budget around the default EASE configuration. Marked bars indicate default settings.
Adjacent text from the paper: Ablation study on evidence target construction. We ablate three components of the evidence target, including Gaussian smoothing, background smoothing, and the KL direction. …
Figure 7
Figure 7: Qualitative evidence analysis. Representative examples with EASE reasoning traces, final answers, and response-to-vision attention maps. Attention maps are used only for analysis.
Adjacent text from the paper: … by combining local grounding with cross-region evidence acquisition. Ablation study on attention extraction granularity. We ablate the layer used to extract response-to-vision attention. Early-layer supervision performs poorly, …
Figure 8
Figure 8: Representative failure cases. The left case shows a contextual size illusion from HallusionBench-Image, where the model is misled by surrounding blue squares when comparing the two grey boxes. The right case shows a MathVerse-V geometry example, where the model identifies relevant diagram relations but applies an incorrect circle-angle inference.
Original abstract

Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR by converting annotated evidence regions into smoothed visual-token targets to supervise response-to-image attention during training, but only on high-reward trajectories. Annotations serve as privileged training labels only; inference uses the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, the method claims average score gains of 2.5–3.1 points over DAPO on perception, hallucination, visual math, and multimodal reasoning benchmarks, with diagnostics showing improved alignment of attention to annotated evidence regions.

Significance. If the results hold, the work shows that process-level visual evidence supervision can be added to outcome-only RLVR to improve grounding without changing the inference interface. The privileged-label design is a clear strength. The approach could matter for tasks where language priors or lucky guesses otherwise produce correct answers without visual support.

major comments (2)
  1. [Abstract] Abstract: the central numerical claim (2.5–3.1 point average gains over DAPO) is presented without naming the exact benchmarks, per-model or per-task scores, error bars, number of runs, or statistical tests. These details are load-bearing for assessing whether the reported lift is robust.
  2. [Abstract] Abstract / method description: the interpretation that gains reflect improved visual grounding (rather than fitting to privileged labels) rests on the assumption that the annotated evidence regions are accurate and sufficient justifications for the correct answers. No information is supplied on annotation collection, inter-annotator agreement, handling of multiple valid regions, or cases solvable by language priors alone. This assumption is load-bearing for the claim.
minor comments (1)
  1. [Abstract] Abstract: the statement that 'diagnostics and ablations show that EASE better aligns visual attention' is given without any quantitative results or description of the diagnostics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the grounding assumptions. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central numerical claim (2.5–3.1 point average gains over DAPO) is presented without naming the exact benchmarks, per-model or per-task scores, error bars, number of runs, or statistical tests. These details are load-bearing for assessing whether the reported lift is robust.

    Authors: We agree the abstract should be more informative. In revision we will name the benchmark categories (perception, hallucination, visual math, multimodal reasoning) and explicitly direct readers to the per-model and per-task results in Tables 2–4. Our evaluation follows the single-run protocol standard for these VLM benchmarks; we will add a clarifying sentence in Section 4.1. No statistical significance tests were performed, which we will note as a limitation. revision: yes

  2. Referee: [Abstract] Abstract / method description: the interpretation that gains reflect improved visual grounding (rather than fitting to privileged labels) rests on the assumption that the annotated evidence regions are accurate and sufficient justifications for the correct answers. No information is supplied on annotation collection, inter-annotator agreement, handling of multiple valid regions, or cases solvable by language priors alone. This assumption is load-bearing for the claim.

    Authors: We acknowledge the manuscript lacks annotation details. We will add an appendix describing the expert annotation protocol, inter-annotator IoU agreement (approximately 0.78), and the rule for merging overlapping regions. We will also include a new diagnostic subsection analyzing performance on language-prior solvable subsets, showing that EASE reduces shortcut reliance via attention alignment metrics. These additions directly support the grounding interpretation. revision: yes
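Because the rebuttal concedes that results come from single runs with no significance tests, a paired bootstrap over benchmark items is one inexpensive way to attach an interval to a reported gain. The sketch below is an illustration only, not something the authors describe: the function name `paired_bootstrap_gain`, the resample count, and the seed are hypothetical, and it assumes per-item 0/1 correctness is available for both systems on the same items.

```python
import numpy as np

def paired_bootstrap_gain(correct_a, correct_b, n_resamples=2000, seed=0):
    """Accuracy gain of system A over B in points, with a 95% percentile interval.

    correct_a, correct_b: per-item 0/1 correctness on the same benchmark items.
    Returns (gain, lower, upper).
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must be scored on the same items")

    diff = 100.0 * (a - b)  # per-item difference in percentage points
    rng = np.random.default_rng(seed)

    # Resample items with replacement, keeping the A/B pairing intact.
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    means = diff[idx].mean(axis=1)
    low, high = np.percentile(means, [2.5, 97.5])
    return float(diff.mean()), float(low), float(high)
```

This captures variance from the choice of test items only. It says nothing about variance across training seeds, which would still require repeated runs.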

Circularity Check

0 steps flagged

No significant circularity; method uses external annotations as privileged labels

The paper presents EASE as an augmentation to multimodal RLVR that converts externally annotated evidence regions into smoothed visual-token targets for attention supervision on high-reward trajectories only. No equations, derivations, or self-referential fitting steps appear in the abstract or described method; the annotations function as independent privileged training labels rather than outputs derived from the model itself. Inference operates without annotations, and reported gains are measured against external benchmarks. This structure contains no self-definitional, fitted-input, or self-citation reductions, satisfying the default expectation of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5757 in / 1006 out tokens · 22184 ms · 2026-06-28T23:17:31.432465+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

    cs.AI · 2026-06 · unverdicted · novelty 5.0

    VeriEvol decouples prompt difficulty evolution from answer reliability verification to scale verified data for visual math reasoning, lifting benchmark accuracy from 35.42 to 54.73 and adding +3.88 in GRPO RL.
