pith. sign in

arxiv: 2605.30698 · v1 · pith:23CI3LEOnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI· cs.MA

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

Pith reviewed 2026-06-28 23:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.MA
keywords multi-agent VQAvisual evidence alignmentgrounded reasoningVLM consensusevidence consistencyvisual question answeringEAGLE framework
0
0 comments X

The pith

Answer-level agreement alone is insufficient for reliable multi-agent VQA; aligned visual evidence from shared image regions is required for trustworthy consensus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when multiple vision-language models collaborate on visual questions, reaching the same textual answer does not guarantee the agents are drawing from the same parts of the image. This visual mismatch leaves room for collective hallucinations even when answers align. The proposed EAGLE framework makes each agent's grounding regions explicit so the agents can verify one another's visual evidence and let consistency among those regions determine the final output. A reader would care because existing multi-agent VQA methods import text-only discussion protocols that skip this visual-alignment step, leaving the multimodal case under-served.

Core claim

The central claim is that answer-level agreement is insufficient for reliable multi-agent VQA and that aligned visual evidence—shared support from the image regions agents rely on—is essential for trustworthy consensus. EAGLE implements this by explicitly exposing each agent's grounding regions as visual evidence, enabling mutual verification over the evidence, and using evidence consistency to guide final decision-making, achieving best average performance across domains on six VQA benchmarks while remaining training-free.

What carries the argument

EAGLE (Evidence-Aligned Grounded multi-agent Reasoning), the training-free framework that exposes grounding regions for mutual verification and consistency-guided decision-making.

If this is right

  • EAGLE achieves the best average performance across domains on six VQA benchmarks.
  • The method remains training-free, lightweight, interpretable, and practical for deployment.
  • Focusing on visual evidence alignment rather than textual discussion alone mitigates individual hallucinations and blind spots more effectively than text-centric protocols.
  • Existing multi-agent VQA approaches that adapt text-only protocols are insufficient for the multimodal setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If evidence alignment is the key mechanism, similar verification of grounding regions could be added to single-agent VLM pipelines to reduce hallucinations without multi-agent overhead.
  • Gains may vary with the accuracy of region extraction, suggesting direct tests that swap different grounding modules while holding other components fixed.
  • The same evidence-consistency step could be applied to other multi-model multimodal tasks such as visual chain-of-thought or joint image-captioning systems.

Load-bearing premise

That mutual verification over exposed grounding regions can be effectively implemented in VLMs and that evidence consistency reliably guides better decision-making.

What would settle it

A controlled comparison in which multi-agent systems reach high answer agreement but show no accuracy gain when required to align on visual evidence regions would falsify the claim that aligned visual evidence is essential.

Figures

Figures reproduced from arXiv: 2605.30698 by Dongsheng Ma, Shaoxu Sun, Shuochen Chang, Wentao Zhang, Yalin Feng, Yikang Wang, Yinglong Yang, Yuanzi Li, Yufei Chen, Yuhan Wang, Zhengren Wang.

Figure 1
Figure 1. Figure 1: A case illustrating why answer-level agreement can be misleading. (A) Agents may accept the same textual rationale without verifying whether it is supported by the correct visual regions. (B) Explicit grounding makes the supporting evidence comparable, allowing agents to verify whether their agreement is visually aligned. only textual rationales without exposing the sup￾porting visual regions, limiting evi… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of EAGLE. The pipeline consists of five modules: (1) Evidence Routing: guides grounding based on question type; (2) Grounded Answer: agents generate initial answers with visual evidence, including grounding regions and visual claims explaining how the grounded regions support the answer; (3) Evidence Diagnosis: evaluates consistency of answers and visual evidence across agents; (4) Grounded Revisi… view at source ↗
Figure 3
Figure 3. Figure 3: Parameter ablations. (A) Effect of the max￾imum number of revision rounds T, showing that one grounded revision is sufficient for reliable consensus; (B) sensitivity to IoU threshold τiou, with 0.4 providing the best spatial alignment across agents. ing removes explicit grounding boxes and keeps only textual visual descriptions; w/o Arbitration re￾places evidence-guided arbitration with vote-based selectio… view at source ↗
read the original abstract

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that answer-level agreement is insufficient for reliable multi-agent VQA and that aligned visual evidence—shared support from the image regions agents rely on—is essential for trustworthy consensus. It proposes EAGLE, a training-free evidence-centered framework that exposes each agent's grounding regions, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making, reporting best average performance across domains on six VQA benchmarks.

Significance. If the results hold, the work would advance multi-agent VLM collaboration by shifting focus from textual agreement to visual evidence alignment. The training-free design is a clear strength, supporting lightweight and practical deployment without additional fine-tuning costs.

major comments (2)
  1. [Abstract] Abstract: The assertion that EAGLE 'achieves best average performance across domains' on six VQA benchmarks provides no information on baselines, statistical tests, error bars, dataset specifics, or controls for confounds, leaving the central empirical claim without verifiable support.
  2. [EAGLE framework description] EAGLE framework description: The core mechanisms for exposing grounding regions, performing mutual verification, and applying evidence consistency lack concrete details on extraction, comparison, and differentiation from prior grounding techniques, which is load-bearing for evaluating whether the approach reliably improves decision-making.
minor comments (1)
  1. [Abstract] The acronym expansion for EAGLE could be formatted more explicitly for immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, providing clarifications from the manuscript and indicating where revisions have been made to improve clarity and support for the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that EAGLE 'achieves best average performance across domains' on six VQA benchmarks provides no information on baselines, statistical tests, error bars, dataset specifics, or controls for confounds, leaving the central empirical claim without verifiable support.

    Authors: We agree the abstract is concise by design and omits granular experimental details. The full manuscript (Section 4) specifies the six benchmarks (VQA v2, GQA, OK-VQA, A-OKVQA, TextVQA, VizWiz), lists all baselines (single-agent VLMs and prior multi-agent methods), reports per-dataset and average results with error bars, and includes controls for confounds such as agent count and prompting variations. Statistical comparisons are provided via paired t-tests in the supplementary material. To better support the claim in the abstract, we have revised it to name the benchmark domains and note the consistent outperformance, while directing readers to the experiments for full details. revision: yes

  2. Referee: [EAGLE framework description] EAGLE framework description: The core mechanisms for exposing grounding regions, performing mutual verification, and applying evidence consistency lack concrete details on extraction, comparison, and differentiation from prior grounding techniques, which is load-bearing for evaluating whether the approach reliably improves decision-making.

    Authors: Section 3 of the manuscript details these components: grounding regions are extracted via each agent's output of bounding boxes aligned to reasoning tokens (using the VLM's native localization capability); mutual verification computes region overlap via IoU thresholds and semantic consistency via CLIP embeddings; evidence consistency then weights the final answer by the fraction of agents sharing supporting regions above a threshold. Differentiation from prior single-agent grounding work (e.g., attention visualization or box prediction methods) is that EAGLE uses the shared evidence for cross-agent consensus rather than individual accuracy. We have expanded this section in revision with pseudocode, explicit extraction steps, and a new comparison table against prior techniques to make the mechanisms fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents a conceptual insight (answer-level agreement insufficient; aligned visual evidence essential) and proposes the training-free EAGLE framework that exposes grounding regions for mutual verification. No equations, parameter fittings, self-definitional reductions, or load-bearing self-citations appear in the abstract or described method. The central claim is an observation used to motivate the framework rather than a derived result that collapses to its own inputs by construction. Experiments on external VQA benchmarks provide independent evaluation, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that VLMs can expose usable grounding regions and that consistency among them improves consensus; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Vision-language models can expose grounding regions as visual evidence that can be compared across agents
    The framework depends on this capability to enable mutual verification and evidence consistency checks.

pith-pipeline@v0.9.1-grok · 5786 in / 1253 out tokens · 31478 ms · 2026-06-28T23:23:11.783052+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others

  2. [2]

    Qwen3-VL Technical Report

    Qwen3-vl technical report.arXiv preprint arXiv:2511.21631. Maciej Besta, Nils Blach, Ales Kubicek, Robert Gersten- berger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Pi- otr Nyczyk, and 1 others. 2024. Graph of thoughts: Solving elaborate problems with large language mod- els. InProceedings of the AAAI conferen...

  3. [3]

    InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 19098–19107

    Grounding answers for visual questions asked by visually impaired people. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 19098–19107. Justin Chen, Swarnadeep Saha, and Mohit Bansal. 2024b. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. InPro- ceedings of the 62nd Annual ...

  4. [4]

    Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

    Global context or local detail? adaptive vi- sual grounding for hallucination mitigation.arXiv preprint arXiv:2604.24396. Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. 2025. V oting or consensus? decision-making in multi-agent debate. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11640–11...

  5. [5]

    InProceedings of the Computer Vision and Pattern Recognition Conference, pages 191–201

    Multimodal rationales for explainable visual question answering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 191–201. Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, and Jiebo Luo. 2018. Vqa-e: Explaining, elaborating, and en- hancing your answers for visual questions. InPro- ceedings of the European Conference on Compute...

  6. [6]

    Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

    Improving automatic vqa evaluation using large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4171–4179. Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, Quy Duong Dang, Satwik Ramchandre, Son Lam Phung, Zhibin Liao, and 1 others. 2025. Localizing before answering: A benchmark for...

  7. [7]

    Kimi-VL Technical Report

    Kimi-vl technical report.arXiv preprint arXiv:2504.07491. Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024. Eyes wide shut? exploring the visual shortcomings of multi- modal llms. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9568–9578. Khanh-Tung Tran, Dung Dao, Minh-Duong ...

  8. [8]

    What color is the car?

    Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822. Shixin Yi and Lin Shang. 2025. Corgi: Verified chain- of-thought reasoning with visual grounding.arXiv e-prints, pages arXiv–2508. Kepu Zhang, Weijie Yu, Sunhao Dai, and Jun Xu. 2025. Citalaw: Enhancing llm with citati...

  9. [9]

    Inspect the image independently

  10. [10]

    Follow the evidence-grounding instruction

  11. [11]

    Answer the question with concise evidence-grounded reasoning

  12. [12]

    Provide one atomic visual claim that directly supports your answer

  13. [13]

    reasoning

    Provide the grounding boxes for the image region(s) that support this visual claim. Output JSON schema: { "reasoning": "brief evidence-grounded reasoning", "visual_claim": "one atomic visual finding explaining how the grounded regions support the answer", "grounding_boxes": [ {"label": "object or region name", "box": [x1, y1, x2, y2]} ], "answer": "short ...

  14. [14]

    Keep the reasoning concise and tied to visible evidence in the image

  15. [15]

    The visual_claim must be a single atomic visual finding that directly supports the answer

  16. [16]

    The grounding_boxes must localize the region(s) that support the visual_claim

  17. [17]

    Use tight boxes around the relevant visual evidence whenever possible

  18. [18]

    Do not ground irrelevant objects, background regions, or the whole image unless the evidence focus requires global scene evidence

  19. [19]

    If no specific local region is decisive, return grounding_boxes: []

  20. [20]

    claim_aligned

    Return raw JSON only. Grounding boxes and coordinate normalization. The grounding_boxes field localizes the image region(s) that support the visual claim. Since dif- ferent VLMs may emit boxes under different co- ordinate conventions, we normalize all predicted boxes into the original image pixel coordinate sys- tem before evidence diagnosis. This shared ...

  21. [21]

    Re-read the original image independently

  22. [22]

    Use self and peer hypotheses as visual references, not as authority

  23. [23]

    Keep your previous answer if the image still supports it

  24. [24]

    Revise only if a newly verified visual observation better supports another answer

  25. [25]

    answer":

    Keep the reasoning concise and grounded in visible evidence. Output JSON: { "answer": "short final answer", "reasoning": "brief image-grounded reasoning", "grounding_boxes": [ {"label": "object name", "box": [x1, y1, x2, y2]} ], "visual_claim": "one atomic visual finding that directly supports the answer" } B.6 Evidence-Guided Arbitration When no evidence...

  26. [26]

    Use explicit multi-step reasoning grounded in the image and question

  27. [27]

    Keep the reasoning focused and concrete rather than verbose

  28. [28]

    Self-Consistency(Wang et al., 2022)

    Return raw JSON only. Self-Consistency(Wang et al., 2022). Self- Consistency samples multiple reasoning paths from a single model and aggregates their final answers by voting. For each question, we query the same backbone multiple times with the Zero-shot CoT prompt above. Each response contains its own rea- soning path and final answer. We then discard t...

  29. [29]

    Extract the final answer from each sampled response

  30. [30]

    critique

    Select the final answer by majority voting. Self-Refine(Madaan et al., 2023). Self-Refine iteratively improves a model’s own answer using self-generated feedback. For each sample, the model first generates an initial answer with the Zero-shot CoT prompt. It then critiques its own reasoning and answer, and finally produces a re- fined response conditioned ...

  31. [31]

    Focus the critique on the most important possible error in the reasoning or answer

  32. [32]

    If the current answer is still supported by the image, keep it unchanged

  33. [33]

    Revise the answer only when the image provides evidence for the change

  34. [34]

    Do not introduce information that is not visible in the image

  35. [35]

    reasoning

    Return raw JSON only. Multi-Agent Debate(Du et al., 2024; Liang et al., 2024). Multi-Agent Debate lets multiple agents exchange their answers and textual ratio- nales over multiple rounds. In the first round, each agent independently answers the question using the Zero-shot CoT prompt. In later rounds, each agent observes the other agents’ previous answer...

  36. [36]

    Consider peers, but do not follow them blindly

  37. [37]

    Explain step by step how the peer evidence changes or confirms your view

  38. [38]

    Keep the reasoning concrete and tied to the image question

  39. [39]

    Debate Judge Prompt You are given an image question and the full state of a multi-round debate among several vision-language agents

    Return raw JSON only. Debate Judge Prompt You are given an image question and the full state of a multi-round debate among several vision-language agents. Question: {question} Debate states: {debate_text} Task:

  40. [40]

    Read the image yourself

  41. [41]

    Use the debate states only as auxiliary evidence

  42. [42]

    Identify all candidate answers that appeared in the debate states

  43. [43]

    reasoning

    Select the single best final answer from these candidate answers only. Output schema: { "reasoning": "brief image-grounded adjudication that explains why the selected candidate is best", "answer": "one candidate answer copied from the debate states" } Rules:

  44. [44]

    The image is the source of truth; do not blindly follow the debaters

  45. [45]

    You must choose one answer that already appears in the debate states

  46. [46]

    If multiple candidates are plausible, choose the one best supported by the image

  47. [47]

    reasoning

    Return raw JSON only. ReConcile(Chen et al., 2024b). ReConcile is a confidence-driven multi-agent discussion frame- work. Each agent first provides an answer with a confidence score. Then, agents review grouped peer answers, justifications, and confidences before updating their predictions. After the final discus- sion round, we group semantically equival...

  48. [48]

    Base your answer on the image and question

  49. [50]

    Keep the reasoning focused and concrete

  50. [51]

    reasoning

    Return raw JSON only. [Reconcile] 21 You are in a round-table conference with other agents. Review grouped peer answers, justifications, and confidences, then update your answer and confidence. Question: {question} Previous response: {previous_text} Grouped peer views: {peer_json} Output JSON: { "reasoning": "brief evidence-grounded reasoning after review...

  51. [52]

    Review each answer group and compare the supporting justifications

  52. [53]

    Keep your answer if it remains best supported by the image

  53. [54]

    Change your answer only if another group provides more convincing visual evidence

  54. [55]

    Confidence must reflect your final belief after reviewing all groups

  55. [56]

    selected_tools

    Return raw JSON only. [Final confidence-aware aggregation] After the last discussion round, group semantically equivalent final answers. For each answer group y, compute its aggregation score as the sum of confidences from agents supporting y: score(y) = sum(confidence_i for agents whose final answer is y) Select the answer group with the highest score as...

  56. [57]

    Select only tools that are useful for resolving the disagreement

  57. [58]

    grounding

    Select "grounding" when agents disagree about where the relevant evidence is located

  58. [59]

    object_detection

    Select "object_detection" when agents disagree about the presence or identity of objects

  59. [60]

    Select "ocr" when the question depends on visible text, letters, numbers, labels, or symbols

  60. [61]

    spatial_reasoning

    Select "spatial_reasoning" when agents disagree about relative positions, directions, distances, or spatial configurations

  61. [62]

    captioning

    Select "captioning" when global scene context may resolve the disagreement

  62. [63]

    attribute_detection

    Select "attribute_detection" when agents disagree about visual attributes such as color, shape, material, state, or markings

  63. [64]

    reasoning

    Select "reasoning" when the disagreement requires additional visual reasoning beyond direct perception

  64. [65]

    tool_name

    Return raw JSON only. [Expert tool execution] Each selected tool is executed with its corresponding query. Tool implementations: - grounding: GroundingDINO. - object_detection: YOLOv11. - spatial_reasoning: SpaceLLaVA. - ocr: OCR-Qwen. - captioning / attribute_detection / reasoning: InternVL-2.5 MPO. Tool output format: { "tool_name": "tool_name", "query"...

  65. [66]

    Score each agent between 0 and 1

  66. [67]

    A high score means the agent's answer and reasoning are supported by the tool outputs

  67. [68]

    A low score means the agent's answer conflicts with or is unsupported by the tool outputs

  68. [69]

    Use the tool outputs as auxiliary evidence, not as the only criterion

  69. [70]

    reasoning

    Return raw JSON only. [Tool-assisted discussion] You are in a tool-assisted multi-agent discussion. Review the grouped agent solutions, tool outputs, and tool-agreement scores, then update your answer. Question: {question} Previous response: {previous_text} Grouped agent solutions: {grouped_json} Tool outputs: {tool_json} Agreement scores: {score_json} Ou...

  70. [71]

    Prefer answers supported by reliable tool outputs, while keeping the original image question central

  71. [72]

    Use agreement scores as auxiliary evidence, not as the only criterion

  72. [73]

    Keep your answer if it remains best supported by the image and tool evidence

  73. [74]

    Change your answer only when another candidate is better supported by visual evidence

  74. [75]

    Confidence must be a number between 0 and 1

  75. [76]

    reasoning

    Return raw JSON only. [Final aggregator] Choose the best final answer after reviewing post-discussion agent solutions, tool outputs, and tool-agreement scores. Question: {question} Post-discussion solutions: {discussion_json} Tool outputs: {tools_json} Agreement scores: {scores_json} Candidate answers: {candidate_answers} Output JSON: { "reasoning": "brie...

  76. [77]

    Select exactly one answer from Candidate answers

  77. [78]

    Do not invent a new answer or output an answer not proposed by any agent

  78. [79]

    Prefer answers supported by reliable tool outputs

  79. [80]

    Use vote counts, confidence scores, and tool-agreement scores together

  80. [81]

    Do not rely on tool scores alone if image-grounded reasoning contradicts them

Showing first 80 references.