TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation
Pith reviewed 2026-06-28 21:58 UTC · model grok-4.3
The pith
TIGER mitigates hallucinations by extracting separate graphs from input and output to score and repair unsupported claims at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TIGER independently extracts an observation graph from the input and a claim graph from the current output, assigns each claim a graph-conditioned risk score based on support and conflict, and repairs selected high-risk claims while keeping the backbone frozen. A convergence analysis shows the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths show TIGER reduces unsupported content while preserving task quality, with gains holding across multiple backbones and a CrisisFACTS case study indicating improved grounding in multi-source settings.
What carries the argument
Observation graph from the input and claim graph from the output, used to compute support and conflict relations for per-claim risk scores.
If this is right
- Reduces unsupported content across image-to-text, image-plus-text-to-text, audio-to-text, and video-to-text tasks.
- Preserves task quality while operating on multiple frozen backbones.
- The same repair mechanism improves grounding when evidence comes from multiple sources, as shown in the CrisisFACTS case study.
Where Pith is reading between the lines
- The graph-based routing could be applied to text-only generation where factuality matters.
- Tracing risk scores back to specific edges might make model outputs more interpretable for debugging.
- Adding temporal or causal relations to the graphs could address hallucinations in sequential or narrative generation.
Load-bearing premise
The graph extraction step from input and output accurately captures support and conflict relations without introducing systematic bias.
What would settle it
A controlled test comparing TIGER's extracted graphs against human-annotated support relations on the same inputs and outputs, checking whether hallucination reduction disappears when the automated graphs mismatch the annotations.
Figures
read the original abstract
We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TIGER, an inference-time framework for fact-level repair in multimodal generation to mitigate hallucinations. It extracts an observation graph from the input and a claim graph from the output using a frozen backbone model, computes graph-conditioned risk scores based on support and conflict relations, and performs localized repair on high-risk claims. A convergence analysis is provided showing that expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments on four cross-modal tasks (image-to-text, image+text-to-text, audio-to-text, video-to-text) across multiple backbones report reduced unsupported content while preserving task quality, with an additional CrisisFACTS case study.
Significance. If the central claims hold, TIGER provides a traceable, localized alternative to joint-conditioning repair methods with theoretical convergence guarantees and demonstrated applicability across modalities without retraining. The explicit asymptotic bound and graph-based routing for interpretability are notable strengths. The cross-modal experimental design and multi-backbone consistency add to potential impact in grounded multimodal generation.
major comments (2)
- [§4] §4 (Convergence Analysis): The geometric convergence of expected total risk to an explicit asymptotic bound is derived under mild assumptions on the risk scores, but the analysis does not establish independence of the graph-conditioned risk scores from the generator. Since both the observation graph and claim graph are extracted by the same frozen backbone that produced the output, any systematic interpretive bias in the backbone can propagate into the support/conflict relations and risk scores, making the measured risk reduction non-independent of the generator itself. This directly affects whether the bound represents external hallucination reduction.
- [§5] §5 (Experiments): The reported reductions in unsupported content across the four cross-modal paths are measured using the same graph-extraction pipeline that defines the risk scores. Without an external validator (e.g., human annotation or an independent model) for the unsupported-content metric, it is unclear whether the gains reflect true grounding improvement or merely consistency with the backbone's own extraction biases. This is load-bearing for the empirical claim that TIGER reduces unsupported content while preserving task quality.
minor comments (2)
- [Abstract, §5] Abstract and §5: Dataset sizes, number of examples per path, and error bars on the unsupported-content and task-quality metrics are not stated; these details are needed to assess the scale and statistical reliability of the reported gains.
- [§3] §3 (Method): The precise definition of the graph-conditioned risk score (how support and conflict edges are aggregated into a scalar) should be given as an equation rather than prose to allow direct inspection of the convergence assumptions.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the convergence analysis and experimental design. We respond point by point to the major comments below.
read point-by-point responses
-
Referee: [§4] §4 (Convergence Analysis): The geometric convergence of expected total risk to an explicit asymptotic bound is derived under mild assumptions on the risk scores, but the analysis does not establish independence of the graph-conditioned risk scores from the generator. Since both the observation graph and claim graph are extracted by the same frozen backbone that produced the output, any systematic interpretive bias in the backbone can propagate into the support/conflict relations and risk scores, making the measured risk reduction non-independent of the generator itself. This directly affects whether the bound represents external hallucination reduction.
Authors: The convergence analysis establishes geometric decrease of expected total risk to the stated asymptotic bound under the given assumptions on the risk scores; it does not claim statistical independence between the risk scores and the backbone. Because the observation and claim graphs are extracted separately from input and output, the risk computation remains well-defined for the repair process. The bound therefore applies to reduction in the model's own risk measure. We will add an explicit clarification in §4 that the theoretical guarantee is internal to the defined risk function. revision: partial
-
Referee: [§5] §5 (Experiments): The reported reductions in unsupported content across the four cross-modal paths are measured using the same graph-extraction pipeline that defines the risk scores. Without an external validator (e.g., human annotation or an independent model) for the unsupported-content metric, it is unclear whether the gains reflect true grounding improvement or merely consistency with the backbone's own extraction biases. This is load-bearing for the empirical claim that TIGER reduces unsupported content while preserving task quality.
Authors: The unsupported-content metric is computed from the same graph pipeline to ensure direct correspondence with the risk scores that TIGER optimizes. This alignment allows the experiments to measure the precise effect of localized repair. Results remain consistent across four modalities and multiple backbones, and task-quality metrics are preserved, which would be unlikely under pure extraction bias. We will insert a limitations paragraph in §5 acknowledging the internal nature of the metric and the value of future external validation. revision: partial
Circularity Check
No circularity in derivation chain; analysis self-contained
full rationale
The provided abstract and description outline TIGER's graph extraction, risk scoring, and convergence analysis under mild assumptions, but contain no equations, self-citations, or fitted parameters that reduce the claimed geometric decrease or asymptotic bound to the inputs by construction. No self-definitional steps, fitted predictions renamed as results, or load-bearing self-citations are quoted. The framework's independence from the frozen backbone is asserted without circular reduction in the text. This is the standard non-finding for papers whose central claims remain externally falsifiable via the described experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Mild assumptions under which expected total risk decreases geometrically to an explicit asymptotic bound
invented entities (3)
-
Observation graph
no independent evidence
-
Claim graph
no independent evidence
-
Graph-conditioned risk score
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Hallucination of Multimodal Large Language Models: A Survey
Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930. Cody Buntain, Amanda Lee Hughes, Richard Mc- Creadie, Benjamin D Horne, Muhammad Imran, and Hemant Purohit. 2023. Crisisfacts 2023-overview paper. InTREC. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr- ishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325. Zuyao Chen, Jinlin Wu...
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[3]
Mitigating object hallucinations in large vision- language models through visual contrastive decod- ing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882. Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. 2023. Evaluating object hallucination in large vision-language models. In Pro...
-
[4]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Aligning large multimodal models with factu- ally augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110. Junyoung Sung, Minjun Kim, Sumin An, Seungwoo Lyu, Arsha Nagrani, and Paul Hongsuck Seo. 2025. Getting to the crux: Graph-based data generation for advancing multi-hop cross-modal reasoning. Chameleon ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
This de- sign is simple, but the current response can affect how the model interprets the input, so hallucinated claims may be reinforced rather than removed
revise an initial output by conditioning on both the input and the current response. This de- sign is simple, but the current response can affect how the model interprets the input, so hallucinated claims may be reinforced rather than removed. The feedback is also written in natural language, which makes it difficult to rank facts or enforce a repair budg...
2024
-
[6]
Output: a JSON list of (entity, attribute) pairs
Key concept extraction.A prompt extracts the list of object-level concepts mentioned in Y0 that need verification. Output: a JSON list of (entity, attribute) pairs
-
[7]
Does the image contain a {concept}?
Question formulation.For each extracted concept, a prompt formulates a yes/no verifi- cation question of the form “Does the image contain a {concept}?” or “Is the {entity} {at- tribute}?”
-
[8]
visual evidence
Visual knowledge validation.For each verifi- cation question, Grounding DINO is invoked with a text query corresponding to the concept (confidence threshold 0.35, top-5 detections per query). The detected bounding boxes and labels constitute the “visual evidence” for that question
-
[9]
GroundingDINO finds 2 boxes la- belled {label} with scores {...}
Visual claim generation.The detected ev- idence is formatted into a structured claim list, e.g. “GroundingDINO finds 2 boxes la- belled {label} with scores {...}”. This list becomes the explicit feedback F for the cor- rection stage
-
[10]
Per sample Woodpecker calls Φ four times (Pgen, key-concept extraction, question formulation, hal- lucination correction) plus one Grounding DINO call per extracted concept
Hallucination correction.The backbone Φ is called once more with the image, the original response, and the visual claim list to produce the corrected responseY T . Per sample Woodpecker calls Φ four times (Pgen, key-concept extraction, question formulation, hal- lucination correction) plus one Grounding DINO call per extracted concept. The five-stage pipe...
2025
-
[11]
Initial generation.Produce Y0 = Φ(Pgen,X) as usual, where X is the original imageI orig
-
[12]
efficient
Generative feedback via diffusion.Take the caption-like response Y0 as a prompt and syn- thesise an auxiliary image Igen = SD(Y 0) using Stable Diffusion. We use SD-Turbo (stabilityai/sd-turbo) with 1 denoising step and the default scheduler, matching the origi- nal paper’s “efficient” configuration
-
[13]
The corrected output YT is decoded greedily from the con- trastive logits ˜s(k) = (1 +α)s (k) orig −α s (k) gen, α= 0.5, 13 matching DeGF and the sameα as the original paper
Contrastive decoding.Run two forward passes of the backbone with the same text prompt: one conditioned on Iorig producing logits s(k) orig at decoding step k, the other condi- tioned on Igen producing s(k) gen. The corrected output YT is decoded greedily from the con- trastive logits ˜s(k) = (1 +α)s (k) orig −α s (k) gen, α= 0.5, 13 matching DeGF and the ...
-
[15]
FIELD ROLES: subject: head entity, written as a bare noun phrase WITHOUT attributes
(subject, predicate, object) . . . FIELD ROLES: subject: head entity, written as a bare noun phrase WITHOUT attributes. GOOD: (car, is, red). BAD: (red car, is, parked). predicate: the relation. Use the standardized forms be- low when applicable. object: tail entity, attribute value, or count. STANDARDIZED PREDICATES: Attributes (color, size, material, sh...
-
[16]
(dog, exists in, image)
-
[18]
(man, jogging on, path)
-
[19]
(man, wearing, red shirt)
-
[20]
(sky, is, blue) Example — given an audio clip of a busy street:
-
[21]
(cars, honking, loudly)
-
[22]
(people, talking, nearby)
-
[23]
(engine, running, idle)
-
[24]
(music, playing from, shop) Example — given text ‘The president announced a new policy on Tuesday’:
-
[25]
(president, announced, new policy)
-
[26]
One fact per triple
(announcement, happened on, Tuesday) Rules: Extract every object, attribute, spatial relation, action, and count you can verify. One fact per triple. Do NOT combine multiple claims into one line. Always put the bare entity in subject; never bundle ad- jectives into subject. For spatial predicates, subject is the figure positioned relative to object. Aim f...
-
[27]
(subject, predicate, object)
-
[28]
FIELD ROLES: subject: head entity, written as a bare noun WITH- OUT adjectives
(subject, predicate, object) . . . FIELD ROLES: subject: head entity, written as a bare noun WITH- OUT adjectives. GOOD: (car, is, red). BAD: (red car, is, parked). predicate: the relation. Prefer the standardized forms below. object: tail entity, attribute value, or count. STANDARDIZED PREDICATES: Attributes:is. e.g., (car, is, red). Counts:count. e.g., ...
-
[29]
(car, parked near, building)
-
[30]
(building, is, tall)
-
[31]
(day, is, sunny) Example 2 — input: ‘The fire hydrant cap is yellow.’
-
[32]
(fire hydrant cap, is, yellow) Example 3 — input: ‘There are three traffic lights in the image.’
-
[33]
(traffic lights, exists in, image)
-
[34]
(traffic lights, count, 3) Example 4 — input: ‘A man in a white shirt is cooking in the kitchen while holding a knife.’
-
[35]
(man, exists in, image)
-
[36]
(man, wearing, white shirt)
-
[37]
(man, cooking in, kitchen)
-
[38]
(man, holding, knife) Example 5 — input: ‘The dog is on top of the car.’
-
[39]
{original_text}
(dog, on, car) Rules: Extract every claim, even from a short single sentence. Keep the subject a bare noun; never bundle attributes into subject. Each claim = one triple. Do NOT combine multiple facts. For spatial predicates, subject is the figure positioned relative to object. Output ONLY the numbered triple list. No prose. Now extract from this text: {<...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.