A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

Jing Li; Siqi Wang; Yi Zhao; Yushi Li; Zhe Hu

arxiv: 2605.31351 · v1 · pith:3CWRZGEEnew · submitted 2026-05-29 · 💻 cs.CL · cs.CV

Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li

Pith reviewed 2026-06-28 22:39 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords VLM-as-a-Judge · Visually Impaired Assistance · Benchmark · Evaluation Framework · Model Reliability · Failure Taxonomy · VIA-Judge-Agent

The pith

VLM judges for visually impaired assistance tasks prove unreliable, with the strongest reaching only 52.6 percent single-failure diagnostic accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VIABLE, the first benchmark for VLM-as-a-Judge evaluation in visually impaired assistance (VIA), with over 300K judgment samples across three VIA scenarios, to test whether VLMs can serve as reliable judges. It applies an Effectiveness-Impartiality-Stability framework and a 12-mode failure taxonomy to seven judges of varying scales. The results show that existing judges fall short on all axes: GPT-5.4, the strongest, reaches only 52.6 percent single-failure diagnostic accuracy while showing the highest self-preference rate at 94.2 percent, and open-source judges are strongly biased and adversarially fragile. The work also presents VIA-Judge-Agent, a model-agnostic inference-time harness that adds visual evidence extraction and a taxonomy-guided workflow, improving diagnostic accuracy and yielding downstream assistance responses that BLV users prefer.

Core claim

Existing VLM-as-a-Judge systems cannot be trusted for VIA tasks because they fail to meet effectiveness, impartiality, and stability criteria under the 12-mode taxonomy; the benchmark quantifies this unreliability across model scales, and the proposed VIA-Judge-Agent supplies a practical inference-time correction that improves both judgment quality and downstream user preference.

What carries the argument

The Effectiveness--Impartiality--Stability framework paired with a 12-mode failure taxonomy that classifies judge errors, together with the VIA-Judge-Agent harness that augments any base model via visual evidence extraction and workflow guidance.

If this is right

Reliable VIA deployment requires either new judge training or inference-time harnesses such as VIA-Judge-Agent.
Open-source VLMs remain unsuitable for impartial VIA judgment without additional safeguards.
The 300K-sample benchmark enables reproducible comparison of future judges across the three VIA scenarios.
Improved judges produce assistance responses that blind and low-vision users prefer over unassisted outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar reliability shortfalls may appear when the same judges are applied to other domain-specific assistance tasks outside general benchmarks.
Human raters will likely remain necessary for high-stakes VIA evaluation until judges clear the taxonomy thresholds.
The taxonomy could be tested for transfer to adjacent areas such as medical image interpretation or accessibility auditing.

Load-bearing premise

The Effectiveness--Impartiality--Stability framework together with the 12-mode failure taxonomy provides a valid and comprehensive basis for measuring judge reliability in VIA tasks.

What would settle it

A new VLM judge that scores above 80 percent single-failure diagnostic accuracy on the VIABLE test set while keeping self-preference below 30 percent would directly contradict the unreliability finding.
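The settling condition above is a conjunction of two thresholds, which can be written down directly. The sketch below is illustrative only: the function name and parameter names are hypothetical, and the 80 and 30 percent cut-offs are the editorial thresholds stated in the preceding sentence, not values from the paper.

```python
def contradicts_unreliability(
    single_failure_acc: float,
    self_preference: float,
    acc_floor: float = 80.0,      # editorial threshold from the text above
    pref_ceiling: float = 30.0,   # editorial threshold from the text above
) -> bool:
    """Return True if a judge clears both thresholds (all values in percent)."""
    return single_failure_acc > acc_floor and self_preference < pref_ceiling


# Reported GPT-5.4 figures: 52.6% accuracy, 94.2% self-preference.
print(contradicts_unreliability(52.6, 94.2))  # False
```

Both conditions must hold at once: a judge with high accuracy but high self-preference would still leave the impartiality concern standing.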

Figures

Figures reproduced from arXiv: 2605.31351 by Jing Li, Siqi Wang, Yi Zhao, Yushi Li, Zhe Hu.

**Figure 1.** Judgment tasks in VIABLE under the Effectiveness–Impartiality–Stability framework.

… CIDEr (Vedantam et al., 2015), which are inadequate for this human-centered setting with open-ended responses. Human evaluation is a more reliable alternative but costly to scale. It then motivates the emerging VLM-as-a-Judge paradigm (Pu et al., 2025; Laskar et al., 2025; Lee et al., 2024b), where VLMs score, rank, or di…

**Figure 2.** Overview of VIABLE. From three VIA corpora (WAD, VisAssist, and VIA-EgoDex), we construct candidate responses, build judgment tasks under the E-I-S framework, and evaluate judges along each axis.

Benchmark comparison (columns: Benchmark; Multimodal; Domain; Task; Coverage: Diagnosis, Bias, Robustness; Construction; Scale):
MT-Bench (Zheng et al., 2023): ✗; General; Score/Pair; ✗, ✓, ✗; Natural; 3K
JudgeBench (Tan et al., 2025): ✗; Logical; Pair; ✗, ✓, ✗; Natural…

**Figure 3.** Failure taxonomy and injection examples. (a) The 12-mode taxonomy; modes in red correspond to (b). (b) Top: a single-injection for C3 Internal Contradiction (“closes” → “opens again”). Bottom: a dual-injection for P4 Evidence Omission (dropping the seating area) and A1 Safety Violation (falsely asserting the path is safe).

… interaction (Yuan et al., 2025; Gao et al., 2026; Zhao et al., 2025), and (2) common…

**Figure 4.** Per-failure-type diagnostic accuracy (%). Single-injection average over corpora. Lighter is better; even API-based judges remain weak on P4, C3, and A3.

… we apply the same position-swap protocol to clean–adversarial pairs and report R_adv, the rate of selecting the adversarial response in both orders.

4 Benchmarking VLM Judges
4.1 Experimental Setup
We evaluate seven VLM judges across various scales and ac…

**Figure 5.** Overview of VIA-Judge-Agent. The agent diagnoses VIA failures in four stages: interaction check (Stage 1), visual evidence extraction (Stage 2), evidence-grounded perception/cognition/action verification (Stage 3), and confidence filtering and refinement (Stage 4), correctly recovering [P4, A1] in the example shown.

Table header (truncated): Single (Acc. ↑) | Dual (Full Acc. ↑) | Dual (Partial Acc. ↑); Method | WAD | VisAssist | VIA-EgoDex | WAD…

**Figure 6.** Human preference evaluation for feedback…

**Figure 7.** Representative examples of dual-injected sam…

**Figure 8.** Construction pipeline and representative example from VIA-EgoDex.

**Figure 9.** Per-failure-type diagnostic accuracy by corpus (%)

Taxonomy table (columns: Dim.; Failure Type; Description), Perception dimension:
P1 Entity / Attribute Error: Misidentifies the existence, category, quantity, or physical state of objects.
P2 Spatial Mapping Error: Incorrectly describes positions, orientations, or relative spatial relationships.
P3 OCR / Detail Miss: Misreads visible text or overlooks fine-grained, critical visual cues.
P4 Evidence…

**Figure 10.** Representative examples of single-injected samples in VIABLE. Each modified answer introduces one…
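The text beside Figure 4 describes a position-swap protocol: each clean/adversarial pair is judged twice with the presentation order reversed, and the reported rate counts only pairs where the adversarial response is selected in both orders. A minimal sketch of that computation follows. The `SwapTrial` record and its field names are hypothetical, and the sample data is invented purely to exercise the function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapTrial:
    """One clean/adversarial pair judged twice, with presentation order swapped."""
    picked_adv_when_first: bool   # adversarial response shown first, and chosen
    picked_adv_when_second: bool  # adversarial response shown second, and chosen


def r_adv(trials: list[SwapTrial]) -> float:
    """Fraction of pairs where the adversarial response is selected in both orders."""
    if not trials:
        raise ValueError("need at least one trial")
    both = sum(t.picked_adv_when_first and t.picked_adv_when_second for t in trials)
    return both / len(trials)


# Hypothetical judgments for four pairs; two are fooled in both orders.
trials = [
    SwapTrial(True, True),
    SwapTrial(True, False),   # order-dependent choice, not counted
    SwapTrial(False, False),
    SwapTrial(True, True),
]
print(r_adv(trials))  # 0.5
```

Requiring agreement across both orders separates consistent adversarial failures from plain position bias, which would show up as a choice that flips when the order flips.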

Original abstract

AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness--Impartiality--Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%; while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It enables positive improvements in diagnostic accuracy and downstream VIA responses more preferred by BLV users. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces the first benchmark for VLM-as-a-Judge in visually impaired assistance along with an unreliability claim, but that claim rests on their new unvalidated E-I-S framework and taxonomy.

This paper's main news is the creation of the VIABLE benchmark for VLM-as-a-Judge in visually impaired assistance tasks, plus the claim that current judges are largely unreliable, though that claim is measured through their own new framework.

The work introduces over 300K judgment samples across three scenarios, an Effectiveness--Impartiality--Stability framework with a 12-mode failure taxonomy, evaluations of seven judges from different scales, and the VIA-Judge-Agent as a model-agnostic harness that uses visual evidence extraction and taxonomy-guided workflow. It is good that they make the data and code available on GitHub. The paper does well by targeting a domain where automated evaluation could reduce costs for human review, and by providing concrete numbers on performance gaps like the 52.6% diagnostic accuracy for the strongest judge alongside high self-preference rates. This kind of structured look at failure modes in a specific application area can help move beyond general domain testing.

The soft spots are mainly around validation. The abstract reports the headline results but gives no details on data collection, on how the taxonomy was built or tested for completeness, or on the statistical methods used. It is fair to ask whether the framework exhaustively covers the failure modes relevant to BLV users, avoids circularity in its application, and correlates with actual user preferences. If the full paper does not address these points, the central unreliability finding would need more support. On the positive side, the construction does not appear to reduce to self-referential fitting.

This paper is for people interested in benchmarks for AI judges or in applications of VLMs to accessibility. A reader looking for new evaluation setups in this niche would get some value from the scale and the proposed agent, even while wanting more on the taxonomy's grounding.

It deserves a serious referee because the topic has practical importance and the resources are shared for inspection. Peer review could help sort out the framework's robustness.

Referee Report

2 major / 2 minor

Summary. The paper introduces VIABLE, the first benchmark for VLM-as-a-Judge evaluation in Visually Impaired Assistance (VIA) tasks, containing over 300K judgment samples across three scenarios. It defines an Effectiveness--Impartiality--Stability (E-I-S) framework paired with a 12-mode failure taxonomy to assess judge reliability. Evaluation of seven judges (including GPT-5.4 and open-source models) shows existing VLMs are largely unreliable: GPT-5.4 reaches only 52.6% single-failure diagnostic accuracy while showing 94.2% self-preference, and open-source judges exhibit strong bias and adversarial fragility. The authors also propose VIA-Judge-Agent, a model-agnostic inference-time harness using visual evidence extraction and taxonomy-guided workflow, which yields improvements in diagnostic accuracy and produces VIA responses preferred by BLV users. Data and code are released at the provided GitHub link.

Significance. If the E-I-S framework and taxonomy prove valid and the reported metrics hold under external validation, the work would establish a needed specialized benchmark for VIA judge evaluation, demonstrate systematic unreliability of current VLMs in this high-stakes domain, and supply a practical inference-time mitigation. The public release of the 300K-sample dataset and code is a clear strength that enables reproducibility and follow-on work in accessible AI.

major comments (2)

[Effectiveness--Impartiality--Stability framework and 12-mode failure taxonomy] The central unreliability claim (e.g., 52.6% diagnostic accuracy for GPT-5.4) is measured exclusively through the newly introduced E-I-S framework and 12-mode taxonomy. The manuscript provides no external validation—such as inter-rater reliability with BLV experts, coverage study against existing VIA error ontologies, or ablation showing correlation with downstream user preference—for the taxonomy or framework itself (see the section introducing the framework and taxonomy). Without this anchor, the measurement instrument remains the least-secured premise supporting the headline conclusion.
[VIA-Judge-Agent evaluation and user study] Table or section reporting the BLV user preference study for VIA-Judge-Agent outputs: the claim that the harness produces responses “more preferred by BLV users” is load-bearing for the practical contribution, yet no details are supplied on participant count, statistical test, or inter-user agreement. This weakens the assertion that the observed accuracy gains translate to real-world utility.
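The first major comment asks for inter-rater reliability on the taxonomy. One common way to report it is Cohen's kappa between two annotators who each assign a failure mode to the same samples. The sketch below is a plain-Python illustration under that assumption; the paper does not say which agreement statistic, if any, it would use, and the annotator labels here are invented (the mode codes P1, P4, C3, and A1 are taken from the figure captions).

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items with categorical codes."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six samples by two experts.
expert_1 = ["P4", "A1", "C3", "P4", "A1", "P1"]
expert_2 = ["P4", "A1", "P4", "P4", "A1", "P1"]
print(round(cohens_kappa(expert_1, expert_2), 2))  # 0.76
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few failure modes dominate the label distribution.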

minor comments (2)

[Abstract] The abstract states headline numbers (52.6%, 94.2%) without citing the specific tables or figures that contain the full per-judge, per-scenario breakdowns; adding these cross-references would improve readability.
[Benchmark construction] Notation for the three scenarios and the exact definition of “single-failure diagnostic accuracy” should be introduced earlier and used consistently to avoid ambiguity when comparing judges.
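To make the second minor comment concrete, here is one way the accuracy metrics could be pinned down. These definitions are assumptions for illustration, not the paper's: single-failure accuracy is taken as exact match on the one injected mode, dual "full" accuracy as exact match on the set of two injected modes, and dual "partial" accuracy as recovering at least one of them. Function names and sample data are hypothetical.

```python
def single_failure_accuracy(gold: list[str], pred: list[str]) -> float:
    """Share of single-injection samples whose predicted mode equals the injected mode."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def dual_accuracy(
    gold: list[frozenset[str]], pred: list[frozenset[str]]
) -> tuple[float, float]:
    """Return (full, partial) accuracy for dual-injection samples.

    Assumed definitions: full = predicted set equals the injected set;
    partial = predicted set shares at least one mode with the injected set.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    full = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    partial = sum(bool(g & p) for g, p in zip(gold, pred)) / len(gold)
    return full, partial


# Hypothetical predictions using mode codes that appear in the figure captions.
print(single_failure_accuracy(["C3", "P4", "A3", "P1"], ["C3", "P2", "A3", "P1"]))  # 0.75
gold_dual = [frozenset({"P4", "A1"}), frozenset({"P1", "C3"})]
pred_dual = [frozenset({"P4", "A1"}), frozenset({"P1"})]
print(dual_accuracy(gold_dual, pred_dual))  # (0.5, 1.0)
```

Stating the metric at this level of precision is what lets per-judge and per-scenario numbers be compared without ambiguity.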

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [Effectiveness--Impartiality--Stability framework and 12-mode failure taxonomy] The central unreliability claim (e.g., 52.6% diagnostic accuracy for GPT-5.4) is measured exclusively through the newly introduced E-I-S framework and 12-mode taxonomy. The manuscript provides no external validation—such as inter-rater reliability with BLV experts, coverage study against existing VIA error ontologies, or ablation showing correlation with downstream user preference—for the taxonomy or framework itself (see the section introducing the framework and taxonomy). Without this anchor, the measurement instrument remains the least-secured premise supporting the headline conclusion.

Authors: We agree that external validation would strengthen the E-I-S framework and 12-mode taxonomy. The taxonomy was developed via systematic analysis of failure modes in the three VIA scenarios, informed by prior accessibility literature, but the manuscript does not report inter-rater reliability with BLV experts, coverage against existing ontologies, or explicit correlation ablations with user preferences. We will revise the relevant section to provide a more detailed account of the taxonomy construction process and add an ablation correlating framework metrics with the reported user preference outcomes. A full inter-rater study with experts is noted as valuable future work. Revision: partial.
Referee: [VIA-Judge-Agent evaluation and user study] Table or section reporting the BLV user preference study for VIA-Judge-Agent outputs: the claim that the harness produces responses “more preferred by BLV users” is load-bearing for the practical contribution, yet no details are supplied on participant count, statistical test, or inter-user agreement. This weakens the assertion that the observed accuracy gains translate to real-world utility.

Authors: The referee is correct that the manuscript states VIA-Judge-Agent responses are more preferred by BLV users without supplying participant count, statistical tests, or inter-user agreement. We will revise the manuscript to include a dedicated subsection or table with these study details. Revision: yes.
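As an illustration of the kind of statistical detail the referee requests for the preference study, a two-sided exact sign test is one simple option for paired preferences. This is a sketch of a standard test, not the authors' analysis; the counts below are invented, since the source gives no participant numbers.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial (sign) test of H0: either option is preferred with p = 0.5.

    Ties are assumed to have been dropped before counting.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = max(wins, losses)
    # Upper-tail probability of seeing k or more successes out of n under p = 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null distribution is symmetric, so the two-sided value doubles the tail.
    return min(1.0, 2 * tail)


# Hypothetical outcome: 9 of 10 participants prefer the harness-guided response.
print(sign_test_p_value(9, 1))  # 0.021484375
```

Reporting the counts alongside the p-value also answers the participant-count question directly, since the test is meaningless without them.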

Circularity Check

0 steps flagged

No significant circularity; new benchmark framework applied to external judge outputs

full rationale

The paper introduces the E-I-S framework and 12-mode taxonomy as the evaluation instrument for VIABLE, then applies it to measure judge performance (e.g., 52.6% diagnostic accuracy). No quoted derivation shows the taxonomy or framework being fitted from judge outputs, self-defined via the results, or justified solely by author self-citation. The unreliability conclusion follows from applying the independently stated criteria to the seven judges; the measurement instrument does not reduce to its own inputs by construction. This is the standard non-circular pattern for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces a new benchmark, framework, taxonomy, and agent; these constitute the primary additions beyond prior work, with the taxonomy and agent lacking independent external validation in the abstract.

axioms (1)

domain assumption: The 12-mode failure taxonomy and Effectiveness--Impartiality--Stability framework capture the relevant failure modes for VIA judgments
Introduced as the core evaluation structure without prior citation or validation details in the abstract.

invented entities (2)

VIABLE benchmark (no independent evidence)
purpose: Provide 300K+ judgment samples across three scenarios for VLM judge evaluation in VIA
Newly constructed dataset and evaluation harness presented as the first of its kind.
VIA-Judge-Agent (no independent evidence)
purpose: Model-agnostic harness that augments judges via visual evidence extraction and taxonomy-guided workflow
Proposed inference-time system to address identified judge failures.
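The Figure 5 caption describes the harness as four stages: interaction check, visual evidence extraction, evidence-grounded verification, and confidence filtering with refinement. The skeleton below shows how such a model-agnostic wrapper might be wired together. Everything in it is hypothetical: the class names, the callable hooks, the 0.5 confidence threshold, and the stub outputs are assumptions for illustration and are not taken from the released code. The refinement part of Stage 4 is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    mode: str          # taxonomy code such as "P4" or "A1"
    confidence: float  # judge confidence in [0, 1]


@dataclass
class JudgeHarness:
    # Hypothetical hooks; each would wrap calls to whatever base VLM is used.
    check_interaction: Callable[[str, str], list[Finding]]
    extract_evidence: Callable[[str], list[str]]
    verify: Callable[[str, list[str]], list[Finding]]
    min_confidence: float = 0.5  # assumed threshold, not from the paper

    def diagnose(self, image_ref: str, question: str, response: str) -> list[str]:
        findings = list(self.check_interaction(question, response))  # Stage 1
        evidence = self.extract_evidence(image_ref)                  # Stage 2
        findings += self.verify(response, evidence)                  # Stage 3
        # Stage 4 (filtering only): drop low-confidence findings.
        kept = {f.mode for f in findings if f.confidence >= self.min_confidence}
        return sorted(kept)


# Stub hooks standing in for model calls.
harness = JudgeHarness(
    check_interaction=lambda question, response: [],
    extract_evidence=lambda image_ref: ["seating area on the left"],
    verify=lambda response, evidence: [
        Finding("P4", 0.9), Finding("A1", 0.8), Finding("C3", 0.2)
    ],
)
print(harness.diagnose("frame_001.jpg", "Is the path clear?", "The path is safe."))
# ['A1', 'P4']
```

Passing the stages in as callables is what keeps the wrapper independent of any particular base model, which is the property the ledger entry highlights.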

Reference graph

Works this paper leans on

8 extracted references · 5 canonical work pages · 3 internal anchors

[1]

A Survey on LLM-as-a-Judge

SVLTA: benchmarking vision-language temporal alignment via synthetic video situation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, pages 13798–13809. Computer Vision Foundation / IEEE. Qi Gao, Heng Li, Yixin Zhou, Meixuan Zhou, Jieqiong Chen, and Xinyu Chai. 2026. VisAssist: A visually impaired-captured video question ...

arXiv 2025
[2]

Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants

Perspective-aware reasoning in vision-language models via mental imagery simulation. In IEEE/CVF International Conference on Computer Vision, ICCV 2025, pages 9241–9251. IEEE. Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. 2024b. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In Findings of t...

arXiv 2025
[3]

In Annual Conference on Neural Information Processing Systems NeurIPS

LLM evaluators recognize and favor their own generations. In Annual Conference on Neural Information Processing Systems NeurIPS. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pa...

2002
[4]

SAM 2: Segment Anything in Images and Videos

Sam 2: Segment anything in images and videos. Preprint, arXiv:2408.00714. Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, and Tom Goldstein. 2025. ARGUS: hallucination and omission evaluation in video-llms. In IEEE/CVF International Conference on Computer Vision, ICCV 2025, pages 20280–20290. IEEE. Kayla Schroeder and Zach Wood-Doughty....

arXiv 2025
[5]

Michihiro Yasunaga, Luke Zettlemoyer, and Marjan Ghazvininejad

Egoblind: Towards egocentric visual assistance for the blind. Preprint, arXiv:2503.08221. Michihiro Yasunaga, Luke Zettlemoyer, and Marjan Ghazvininejad. 2025. Multimodal rewardbench: Holistic evaluation of reward models for vision language models. Preprint, arXiv:2502.14191. Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian ...

arXiv 2025
[6]

Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, and Jing Shao

IEEE. Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, and Jing Shao. 2025. SPA-VL: A comprehensive safety preference alignment dataset for vision language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, pages 19867–19878...

2025
[7]

Less is More

"Less is More": Reducing cognitive load and task drift in real-time multimodal assistive agents for the visually impaired. Preprint, arXiv:2511.00945. Yi Zhao, Siqi Wang, and Jing Li. 2026. Laf-grpo: In-situ navigation instruction generation for the visually impaired via GRPO with llm-as-follower reward. In Fortieth AAAI Conference on Artificial Intellige...

arXiv 2026
[8]

Can you describe the setting and overall layout?

Judgelm: Fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. A VIA-Egodex Construction. VIA-EgoDex is an egocentric manipulation corpus we construct from EgoDex, primarily targeting the Object Manipulation task category. Unlike...

2025
