TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

Amanda Hughes; Kaixiang Zhao; Porter Jenkins; Shawn Huang; Tianrun Yu; Yushun Dong

arxiv: 2606.00232 · v1 · pith:Y3QSIMF3new · submitted 2026-05-29 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-28 21:58 UTC · model grok-4.3

keywords: hallucination mitigation · multimodal generation · inference-time repair · graph-based evidence · fact-level repair · claim graph · observation graph

The pith

TIGER mitigates hallucinations by extracting separate graphs from input and output to score and repair unsupported claims at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TIGER as an inference-time method that fixes specific unsupported facts in multimodal outputs while leaving the base model unchanged. It builds an observation graph from the input and a claim graph from the output, then computes a risk score for each claim using detected support and conflict edges. High-risk claims are repaired selectively. A convergence analysis shows that, under mild assumptions, expected total risk falls geometrically toward an explicit asymptotic bound. Experiments on image-to-text, image+text-to-text, audio-to-text, and video-to-text paths report fewer unsupported facts with task quality preserved.

Core claim

TIGER independently extracts an observation graph from the input and a claim graph from the current output, assigns each claim a graph-conditioned risk score based on support and conflict, and repairs selected high-risk claims while keeping the backbone frozen. A convergence analysis shows the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths show TIGER reduces unsupported content while preserving task quality, with gains holding across multiple backbones and a CrisisFACTS case study indicating improved grounding in multi-source settings.

What carries the argument

Observation graph from the input and claim graph from the output, used to compute support and conflict relations for per-claim risk scores.
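
To make the mechanism concrete, here is a minimal sketch of per-claim risk scoring over (subject, predicate, object) triples. It is not the paper's scoring function, which this page does not reproduce: the exact-match support rule, the `SINGLE_VALUED` conflict rule, the 0 / 0.5 / 1 risk values, and the names `score_claim` and `select_for_repair` are all assumptions made for illustration.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]  # (subject, predicate, object)

# Assumption: predicates where a subject can take only one object value,
# so a different observed value counts as a conflict.
SINGLE_VALUED = {"count"}


@dataclass(frozen=True)
class ScoredClaim:
    claim: Triple
    supported: bool
    conflicted: bool
    risk: float


def score_claim(claim: Triple, observations: set[Triple]) -> ScoredClaim:
    """Score one output claim against the observation graph (toy rule)."""
    subject, predicate, obj = claim
    supported = claim in observations
    conflicted = predicate in SINGLE_VALUED and any(
        s == subject and p == predicate and o != obj
        for s, p, o in observations
    )
    # Toy aggregation: contradicted = 1.0, unsupported = 0.5, supported = 0.0.
    if conflicted:
        risk = 1.0
    elif supported:
        risk = 0.0
    else:
        risk = 0.5
    return ScoredClaim(claim, supported, conflicted, risk)


def select_for_repair(
    claims: list[Triple], observations: set[Triple], budget: int
) -> list[ScoredClaim]:
    """Return at most `budget` claims with nonzero risk, highest risk first."""
    scored = [score_claim(c, observations) for c in claims]
    risky = [s for s in scored if s.risk > 0.0]
    return sorted(risky, key=lambda s: s.risk, reverse=True)[:budget]
```

Because each claim carries its own score and the edges that produced it, a repair budget can be enforced per fact, which is the scheduling property the summary attributes to the graph design.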

If this is right

Reduces unsupported content across image-to-text, image-plus-text-to-text, audio-to-text, and video-to-text tasks.
Preserves task quality while operating on multiple frozen backbones.
The same repair mechanism improves grounding when evidence comes from multiple sources, as shown in the CrisisFACTS case study.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The graph-based routing could be applied to text-only generation where factuality matters.
Tracing risk scores back to specific edges might make model outputs more interpretable for debugging.
Adding temporal or causal relations to the graphs could address hallucinations in sequential or narrative generation.

Load-bearing premise

The graph extraction step from input and output accurately captures support and conflict relations without introducing systematic bias.

What would settle it

A controlled test comparing TIGER's extracted graphs against human-annotated support relations on the same inputs and outputs, checking whether hallucination reduction disappears when the automated graphs mismatch the annotations.
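
One way such a check could be scored is sketched below, assuming support relations are represented as sets of (claim, evidence) pairs and that each sample has a measured hallucination reduction. The functions `edge_agreement` and `reduction_by_agreement`, the F1 criterion, and the 0.8 threshold are hypothetical choices, not part of the paper.

```python
from statistics import mean


def edge_agreement(predicted: set, annotated: set) -> dict[str, float]:
    """Precision, recall, and F1 of automated support edges vs. human labels."""
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def reduction_by_agreement(samples, threshold=0.8):
    """Mean hallucination reduction for high- and low-agreement samples.

    `samples` is a list of (f1, reduction) pairs, one per evaluated input.
    Returns (mean_high, mean_low); a group with no samples yields None.
    """
    high = [reduction for f1, reduction in samples if f1 >= threshold]
    low = [reduction for f1, reduction in samples if f1 < threshold]
    return (mean(high) if high else None, mean(low) if low else None)
```

If the mean reduction in the low-agreement group collapses toward zero while the high-agreement group keeps its gain, that would suggest the measured improvement depends on extraction quality.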

Figures

Figures reproduced from arXiv: 2606.00232 by Amanda Hughes, Kaixiang Zhao, Porter Jenkins, Shawn Huang, Tianrun Yu, Yushun Dong.

**Figure 1.** Overview of TIGER. TIGER first generates an initial output, extracts fact graphs from the input and output, ranks claims by risk, and locally repairs selected high-risk facts while keeping the backbone frozen. Can we redesign the feedback mechanism in iterative multimodal repair so that it reduces spurious correlation during feedback generation and supports fact-level scheduling? We answer this question wi…

**Figure 2.** Free-form generation vs. atomic enumeration. Top: a qualitative example where free-form generation adds an unsupported object, while atomic enumeration avoids it. Bottom: co-occurrence hallucination rate (CHR) across cue-to-absent object pairs. … every pair, and atomic enumeration reduces CHR by about 2.6×. Prior work reports similar hallucination driven by co-occurrence on other backbones (Datta and Su…

**Figure 3.** Component ablation on COCO. … and sets $F_t = \Psi_\alpha(G_X, G_{Y_t})$ as in Eq. (5). Thus, L0→L1 tests iterative repair alone, L1→L2 isolates atomic projection, and L2→L3 isolates deterministic risk ranking.

**Figure 4.** Hyperparameter sensitivity of TIGER on COCO. Each curve varies one hyperparameter while keeping the other two fixed at the default setting. The vertical axis reports CHAIR_s; lower is better.

**Figure 5.** Mechanism analysis on COCO val2014. (a) Joint feedback channels mention the absent object …

**Figure 6.** Case-study pipeline on the Hurricane event.

**Figure 7.** Wall-clock time per sample on COCO val2014 for all image-path methods under the same Qwen2.5-Omni-7B backbone on a single GPU. TIGER at T = 5 takes 199 s per sample, about 5× Frozen (38 s); the cost can be lowered by reducing T (the sensitivity curve in …
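
The two timings quoted for Figure 7 allow a rough per-round cost estimate. The sketch below assumes that cost grows linearly in the number of repair rounds T and that the T = 0 cost equals the Frozen baseline; neither assumption is stated in the caption, so treat the result as a back-of-envelope figure only.

```python
FROZEN_SECONDS = 38.0         # Frozen baseline, from the Figure 7 caption
TIGER_SECONDS_AT_T5 = 199.0   # TIGER at T = 5, from the Figure 7 caption
REFERENCE_ROUNDS = 5


def per_round_overhead() -> float:
    """Extra seconds per repair round under the linear-cost assumption."""
    return (TIGER_SECONDS_AT_T5 - FROZEN_SECONDS) / REFERENCE_ROUNDS


def estimated_seconds(rounds: int) -> float:
    """Estimated wall-clock seconds per sample for a given number of rounds."""
    return FROZEN_SECONDS + rounds * per_round_overhead()


def slowdown(rounds: int) -> float:
    """Estimated cost relative to the Frozen baseline."""
    return estimated_seconds(rounds) / FROZEN_SECONDS
```

Under these assumptions each round adds about 32 s, and the exact ratio at T = 5 is 199 / 38, roughly 5.2×, consistent with the caption's "about 5×".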

Original abstract

We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.
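
The abstract does not state the recursion behind the convergence claim. A standard form that yields "geometric decrease to an explicit asymptotic bound" is a contraction with an additive error term. The symbols below ($R_t$ for total risk after round $t$, contraction rate $\rho$, and per-round error $\varepsilon$) are assumptions for illustration and may differ from the paper's notation and conditions.

```latex
\begin{align*}
\mathbb{E}[R_{t+1}] &\le \rho\,\mathbb{E}[R_t] + \varepsilon,
  \qquad 0 \le \rho < 1,\ \varepsilon \ge 0, \\
\mathbb{E}[R_t] &\le \rho^{t}\,\mathbb{E}[R_0]
  + \varepsilon \sum_{k=0}^{t-1} \rho^{k}
  = \rho^{t}\,\mathbb{E}[R_0] + \varepsilon\,\frac{1-\rho^{t}}{1-\rho}, \\
\limsup_{t\to\infty} \mathbb{E}[R_t] &\le \frac{\varepsilon}{1-\rho}.
\end{align*}
```

For example, with $\rho = 0.5$, $\varepsilon = 0.1$, and $\mathbb{E}[R_0] = 4$, the bound after five rounds is $4/32 + 0.1 \cdot (1 - 1/32)/0.5 = 0.31875$, and the asymptotic bound is $0.2$. Under this form the floor is set by $\varepsilon$, so the guarantee is only as meaningful as the risk measure it is stated for.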

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TIGER's graph separation for localized risk scoring on multimodal claims is a clean inference-time idea, but the abstract leaves the validation and independence claims too thin to judge yet.

The main thing to know is that TIGER pulls an observation graph from the input and a claim graph from the output, then scores each claim on support or conflict to pick which ones to repair while the backbone stays frozen. This avoids the joint conditioning that lets output hallucinations color the input reading, and it makes repairs fact-level and schedulable.

What is new in the design is the explicit separation of the two graphs combined with the graph-conditioned risk step. The paper applies it to four cross-modal settings and adds a convergence argument that expected risk drops geometrically to a bound under mild assumptions. The CrisisFACTS case study is a reasonable applied test.

The experiments are described only at the level of “reduces unsupported content while preserving task quality” across backbones, with no numbers, baselines, or error bars visible in the abstract. That makes it hard to tell how large the gains are or whether they survive different metrics.

The stress-test point about bias in the shared backbone is worth checking in the full text. If the same model extracts both graphs, its interpretive habits could leak into the support/conflict relations, so the risk scores and the measured reductions might not be fully independent. The convergence claim rests on those scores being faithful, so any circularity there would weaken the central result.

This paper is for groups working on inference-time reliability fixes for multimodal generation, especially in settings that need traceable grounding. Readers who already use graph methods for evidence tracking will see the most direct value.

It deserves a serious referee. The framing is coherent and the problem is real; the details on the math, the extraction pipeline, and the external validation need to be examined, but the work is structured enough to merit that step.

Referee Report

2 major / 2 minor

Summary. The paper introduces TIGER, an inference-time framework for fact-level repair in multimodal generation to mitigate hallucinations. It extracts an observation graph from the input and a claim graph from the output using a frozen backbone model, computes graph-conditioned risk scores based on support and conflict relations, and performs localized repair on high-risk claims. A convergence analysis is provided showing that expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments on four cross-modal tasks (image-to-text, image+text-to-text, audio-to-text, video-to-text) across multiple backbones report reduced unsupported content while preserving task quality, with an additional CrisisFACTS case study.

Significance. If the central claims hold, TIGER provides a traceable, localized alternative to joint-conditioning repair methods with theoretical convergence guarantees and demonstrated applicability across modalities without retraining. The explicit asymptotic bound and graph-based routing for interpretability are notable strengths. The cross-modal experimental design and multi-backbone consistency add to potential impact in grounded multimodal generation.

major comments (2)

§4 (Convergence Analysis): The geometric convergence of expected total risk to an explicit asymptotic bound is derived under mild assumptions on the risk scores, but the analysis does not establish independence of the graph-conditioned risk scores from the generator. Since both the observation graph and claim graph are extracted by the same frozen backbone that produced the output, any systematic interpretive bias in the backbone can propagate into the support/conflict relations and risk scores, making the measured risk reduction non-independent of the generator itself. This directly affects whether the bound represents external hallucination reduction.
§5 (Experiments): The reported reductions in unsupported content across the four cross-modal paths are measured using the same graph-extraction pipeline that defines the risk scores. Without an external validator (e.g., human annotation or an independent model) for the unsupported-content metric, it is unclear whether the gains reflect true grounding improvement or merely consistency with the backbone's own extraction biases. This is load-bearing for the empirical claim that TIGER reduces unsupported content while preserving task quality.

minor comments (2)

Abstract and §5: Dataset sizes, number of examples per path, and error bars on the unsupported-content and task-quality metrics are not stated; these details are needed to assess the scale and statistical reliability of the reported gains.
§3 (Method): The precise definition of the graph-conditioned risk score (how support and conflict edges are aggregated into a scalar) should be given as an equation rather than prose to allow direct inspection of the convergence assumptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the convergence analysis and experimental design. We respond point by point to the major comments below.

Referee: §4 (Convergence Analysis): The geometric convergence of expected total risk to an explicit asymptotic bound is derived under mild assumptions on the risk scores, but the analysis does not establish independence of the graph-conditioned risk scores from the generator. Since both the observation graph and claim graph are extracted by the same frozen backbone that produced the output, any systematic interpretive bias in the backbone can propagate into the support/conflict relations and risk scores, making the measured risk reduction non-independent of the generator itself. This directly affects whether the bound represents external hallucination reduction.

Authors: The convergence analysis establishes geometric decrease of expected total risk to the stated asymptotic bound under the given assumptions on the risk scores; it does not claim statistical independence between the risk scores and the backbone. Because the observation and claim graphs are extracted separately from input and output, the risk computation remains well-defined for the repair process. The bound therefore applies to reduction in the model's own risk measure. We will add an explicit clarification in §4 that the theoretical guarantee is internal to the defined risk function.
Revision: partial

Referee: §5 (Experiments): The reported reductions in unsupported content across the four cross-modal paths are measured using the same graph-extraction pipeline that defines the risk scores. Without an external validator (e.g., human annotation or an independent model) for the unsupported-content metric, it is unclear whether the gains reflect true grounding improvement or merely consistency with the backbone's own extraction biases. This is load-bearing for the empirical claim that TIGER reduces unsupported content while preserving task quality.

Authors: The unsupported-content metric is computed from the same graph pipeline to ensure direct correspondence with the risk scores that TIGER optimizes. This alignment allows the experiments to measure the precise effect of localized repair. Results remain consistent across four modalities and multiple backbones, and task-quality metrics are preserved, which would be unlikely under pure extraction bias. We will insert a limitations paragraph in §5 acknowledging the internal nature of the metric and the value of future external validation.
Revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain; analysis self-contained

The provided abstract and description outline TIGER's graph extraction, risk scoring, and convergence analysis under mild assumptions, but contain no equations, self-citations, or fitted parameters that reduce the claimed geometric decrease or asymptotic bound to the inputs by construction. No self-definitional steps, fitted predictions renamed as results, or load-bearing self-citations are quoted. The framework's independence from the frozen backbone is asserted without circular reduction in the text. This is the standard non-finding for papers whose central claims remain externally falsifiable via the described experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The abstract introduces two new graph constructs and a risk-scoring procedure as modeling choices; the convergence result rests on unspecified mild assumptions. No numerical free parameters are mentioned.

axioms (1)

Domain assumption: mild assumptions under which expected total risk decreases geometrically to an explicit asymptotic bound.
Invoked to support the convergence analysis stated in the abstract.

invented entities (3)

Observation graph (no independent evidence)
purpose: Independent extraction of evidence from the input
Core modeling construct of TIGER; no external validation supplied in abstract

Claim graph (no independent evidence)
purpose: Extraction of individual facts from the current output
Core modeling construct of TIGER; no external validation supplied in abstract

Graph-conditioned risk score (no independent evidence)
purpose: Quantifies support and conflict to rank claims for repair
New scoring mechanism introduced by the framework; no external validation supplied in abstract

