CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

Chenyu Liu; Ruohan Zhang; Siyuan Chen; Suiyang Guang

arxiv: 2604.22274 · v7 · pith:FXYT4Y2Rnew · submitted 2026-04-24 · 💻 cs.CV

Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

Pith reviewed 2026-05-13 07:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary scene graph generation · counterfactual verification · relation grounding · visual evidence · predicate decomposition · vision-language models

The pith

Counterfactual verification grounds open-vocabulary scene graph relations in visual evidence by checking score drops after targeted removal

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that open-vocabulary scene graph generation improves when models stop accepting proposed relations at face value and instead verify each candidate by decomposing its predicate into evidence bases such as support, contact, containment, depth, and state. A relation-conditioned encoder extracts the relevant cues, after which a counterfactual verifier removes the necessary bases and checks two things: that the relation score falls, and that the score stays stable under irrelevant changes. Readers should care because current vision-language models frequently output relations driven by language priors or object co-occurrence rather than actual image content, producing unreliable scene descriptions. The approach adds contradiction-aware predicate learning and graph-level preference optimization to sharpen fine-grained distinctions and enforce overall consistency. If the verification step holds, scene graphs become more interpretable and better at generalizing to unseen predicates.

Core claim

The central claim is that replacing direct relation generation with counterfactual relation verification, built on decomposed soft evidence bases and a scorer that tests necessity by removal, yields scene graphs whose relations rest on specific visual, geometric, and contextual support rather than spurious correlations, as measured by gains in recall metrics, unseen predicate generalization, and grounding quality across conventional, open-vocabulary, and panoptic benchmarks.

What carries the argument

Counterfactual verifier that removes necessary evidence bases from a relation and confirms the score decreases while remaining stable under irrelevant perturbations
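
A minimal sketch of that check, assuming a toy scorer that takes a sigmoid of per-base evidence strengths weighted by the predicate's soft decomposition. The scorer, the zeroing-out removal, the `margin` and `tol` thresholds, and every number are illustrative assumptions, not the paper's formulation.

```python
import math

BASES = ("support", "contact", "containment", "depth", "state")

def relation_score(evidence, weights):
    """Toy scorer: sigmoid of the weighted sum of per-base evidence strengths."""
    z = sum(weights[b] * evidence[b] for b in BASES)
    return 1.0 / (1.0 + math.exp(-z))

def remove_bases(evidence, bases):
    """Counterfactual edit: zero out the chosen bases, leave the rest untouched."""
    return {b: (0.0 if b in bases else v) for b, v in evidence.items()}

def verify(evidence, weights, necessary, irrelevant, margin=0.1, tol=0.02):
    """Accept a relation only if removing necessary evidence lowers the score
    by at least `margin` and removing irrelevant evidence moves it by at most `tol`."""
    full = relation_score(evidence, weights)
    drop = full - relation_score(remove_bases(evidence, necessary), weights)
    drift = abs(full - relation_score(remove_bases(evidence, irrelevant), weights))
    return drop >= margin and drift <= tol

# Hypothetical numbers for "cup on table": support and contact matter, state does not.
weights = {"support": 2.0, "contact": 1.5, "containment": 0.0, "depth": 0.2, "state": 0.0}
evidence = {"support": 0.9, "contact": 0.8, "containment": 0.1, "depth": 0.5, "state": 0.7}
grounded = verify(evidence, weights, necessary={"support", "contact"}, irrelevant={"state"})

# Same predicate, but the image offers no support or contact evidence:
# the score cannot drop, so the candidate is rejected.
no_evidence = dict(evidence, support=0.0, contact=0.0)
prior_driven = verify(no_evidence, weights, necessary={"support", "contact"}, irrelevant={"state"})
```

In this toy setup a relation whose score never depended on the necessary bases fails the check, which is the behavior the verifier is meant to enforce.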

Load-bearing premise

Decomposing predicate phrases into soft evidence bases and removing them isolates true visual support without missing confounding factors or introducing new biases from the removal process.

What would settle it

A controlled experiment in which known ground-truth relations show no score drop when their listed evidence bases are masked, or show a score drop when only irrelevant bases are masked.
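
One way such an experiment could be tallied, sketched under the assumption that each ground-truth relation has been scored three times: unmodified, with necessary bases masked, and with only irrelevant bases masked. The `Trial` record, the thresholds, and the example scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    full: float               # score on the unmodified input
    necessary_masked: float   # score after masking the listed evidence bases
    irrelevant_masked: float  # score after masking only irrelevant bases

def failure_rates(trials, margin=0.1, tol=0.02):
    """Return (no_drop_rate, spurious_drop_rate) over ground-truth relations."""
    if not trials:
        raise ValueError("need at least one trial")
    no_drop = sum(t.full - t.necessary_masked < margin for t in trials)
    spurious = sum(t.full - t.irrelevant_masked > tol for t in trials)
    n = len(trials)
    return no_drop / n, spurious / n

trials = [
    Trial(0.9, 0.4, 0.9),   # behaves as claimed
    Trial(0.8, 0.8, 0.8),   # no drop when necessary evidence is masked
    Trial(0.9, 0.3, 0.5),   # drops under an irrelevant mask
    Trial(0.7, 0.2, 0.7),   # behaves as claimed
]
no_drop_rate, spurious_rate = failure_rates(trials)
```

High values of either rate on real ground-truth relations would count against the load-bearing premise.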

Figures

Figures reproduced from arXiv: 2604.22274 by Chenyu Liu, Ruohan Zhang, Siyuan Chen, Suiyang Guang.

**Figure 1.** Overall framework. The method first parses objects and proposes open-vocabulary relations, then models predicate-specific …

Original abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-rounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-pecific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary vidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
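
To make "soft evidence bases" concrete, here is a sketch in which a predicate phrase receives a probability-like weight on each base via a softmax over affinity scores. The abstract does not say how the weights are computed; the affinities below are hand-set placeholders standing in for whatever a learned model would produce, and `soft_decompose` is a hypothetical name.

```python
import math

BASES = ("support", "contact", "containment", "depth", "state")

def soft_decompose(affinities, temperature=1.0):
    """Softmax over per-base affinities, returning weights that sum to 1."""
    scaled = [affinities.get(b, 0.0) / temperature for b in BASES]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {b: e / total for b, e in zip(BASES, exps)}

# Hand-set, hypothetical affinities for the phrase "resting inside".
weights = soft_decompose({"containment": 3.0, "support": 2.0, "contact": 1.5})
dominant = max(weights, key=weights.get)
```

Because the weights are soft, a phrase like this one can lean on containment while still assigning some mass to support and contact, rather than being forced into a single category.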

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shifts open-vocab SGG from generation to counterfactual verification on decomposed evidence bases, which targets the language-prior problem directly but leaves the removal mechanics underspecified.

The main contribution is treating relation prediction as a verification task. They generate open-vocabulary candidates with a vision-language proposer, break predicates into soft evidence bases such as support, contact, containment, depth and state, then run a counterfactual check: remove the relevant evidence and see if the score drops while staying stable under irrelevant changes. They add contradiction-aware learning and graph-level optimization on top. This is a clear attempt to move beyond models that lean on co-occurrence or language priors, and the decomposition gives a concrete way to tie predictions to visual cues rather than just semantic plausibility. If the verification step holds up, it could improve both reliability and interpretability on unseen predicates. The abstract reports consistent gains on recall metrics, unseen generalization, and grounding quality across standard, open-vocabulary, and panoptic benchmarks, which suggests the approach is at least practically useful in current setups.

The soft spot is the lack of detail on the removal operator itself. Without knowing exactly how evidence is stripped out or seeing ablations that rule out global feature artifacts or training-signal effects from the extra losses, it is hard to tell whether the score drops reflect true visual grounding or something else. The stress-test concern about confounding factors in the removal process is worth checking against the full experiments.

This is aimed at computer-vision researchers working on grounded scene understanding and open-vocabulary models. The idea engages honestly with a known limitation in the field and is concrete enough to test, so it deserves a serious referee to examine the implementation and controls.

Referee Report

3 major / 2 minor

Summary. The paper introduces CAGE-SGG, a framework for open-vocabulary scene graph generation that replaces direct relation proposal with counterfactual verification. Candidate relations are generated via a vision-language model, predicate phrases are decomposed into soft evidence bases (support, contact, containment, depth, state), a relation-conditioned encoder extracts cues, and a verifier checks whether relation scores drop when necessary evidence is removed (while remaining stable under irrelevant perturbations). Contradiction-aware predicate learning and graph-level preference optimization are added for discrimination and consistency. The authors claim consistent gains on conventional, open-vocabulary, and panoptic SGG benchmarks for recall metrics, unseen-predicate generalization, and counterfactual grounding quality.

Significance. If the counterfactual verification step reliably isolates predicate-specific visual evidence without confounding artifacts, the work could meaningfully advance open-vocabulary SGG by reducing language-prior and co-occurrence biases, yielding more interpretable and evidence-grounded graphs. The shift from generation to verification is conceptually attractive and could influence downstream tasks that rely on scene graphs.

major comments (3)

[§3.2] Counterfactual Verifier: The removal operator is not specified. It is unclear how 'necessary evidence is removed' while leaving the remainder of the visual representation unchanged; without an explicit formulation or ablation on removal-induced artifacts (e.g., global feature shifts or sensitivity to any structured perturbation), it is impossible to confirm that score drops reflect true grounding rather than model fragility.
[§4] Experiments: The central claim that the method 'consistently improves' recall, unseen-predicate generalization, and grounding quality is load-bearing, yet the manuscript supplies no quantitative tables, baseline comparisons, ablation results on the verifier component, or error analysis. Without these data it cannot be determined whether gains arise from verification itself or from the added contradiction-aware loss and graph optimization.
[§3.1] Evidence Decomposition: The decomposition of arbitrary predicate phrases into the fixed soft bases (support/contact/containment/depth/state) risks incomplete coverage or language-prior leakage; no analysis is provided on how well this fixed set captures fine-grained or novel predicates, nor on whether the evidence encoder is trained jointly with the proposer (which could undermine the independence of the verification step).

minor comments (2)

[Abstract] Typos: 'relation-pecific' should be 'relation-specific'; 'vidence' should be 'evidence'.
[§3] Notation: The distinction between 'soft evidence bases' and the outputs of the evidence encoder is not clearly defined; a short table or diagram would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate the suggested clarifications and expansions into the revised manuscript.

Referee: [§3.2] Counterfactual Verifier: The removal operator is not specified. It is unclear how 'necessary evidence is removed' while leaving the remainder of the visual representation unchanged; without an explicit formulation or ablation on removal-induced artifacts (e.g., global feature shifts or sensitivity to any structured perturbation), it is impossible to confirm that score drops reflect true grounding rather than model fragility.

Authors: We agree that an explicit formulation of the removal operator is essential. In the revision we will define it mathematically as a targeted subtraction of the relation-specific evidence features (extracted via the conditioned encoder) from the input representation, while the global visual backbone remains unchanged. We will also add an ablation comparing score sensitivity under targeted evidence removal versus random structured perturbations and global feature noise, confirming that drops are predicate-specific rather than artifacts of model fragility. These details and results will be inserted into §3.2.
revision: yes

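
One possible reading of "targeted subtraction", written out as a sketch rather than as the authors' definition. All symbols are assumptions introduced here for illustration: $z$ is the relation representation, $e_k$ the feature extracted for evidence base $k$, $w_k(p)$ the soft weight of base $k$ for predicate $p$, $S$ a set of bases to remove, $N$ and $I$ the necessary and irrelevant sets, $s$ the relation scorer, and $\delta, \epsilon$ thresholds.

```latex
\tilde{z}_{S} = z - \sum_{k \in S} w_k(p)\, e_k ,
\qquad
\underbrace{s(z) - s(\tilde{z}_{N}) \ge \delta}_{\text{necessity}} ,
\qquad
\underbrace{\lvert s(z) - s(\tilde{z}_{I}) \rvert \le \epsilon}_{\text{stability}}
```

Under this reading, the promised ablation would compare the necessity gap for $S = N$ against the gap obtained when $S$ is replaced by random perturbations of matched magnitude.
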
Referee: [§4] Experiments: The central claim that the method 'consistently improves' recall, unseen-predicate generalization, and grounding quality is load-bearing, yet the manuscript supplies no quantitative tables, baseline comparisons, ablation results on the verifier component, or error analysis. Without these data it cannot be determined whether gains arise from verification itself or from the added contradiction-aware loss and graph optimization.

Authors: We acknowledge the need for fuller experimental reporting. The revised manuscript will expand §4 with complete quantitative tables reporting R@K and mR@K on VG, GQA, and panoptic benchmarks, including seen/unseen predicate splits and direct comparisons against recent open-vocabulary baselines. We will add dedicated ablations that isolate the counterfactual verifier (showing performance with and without it) while holding the contradiction-aware loss and graph optimization fixed. A quantitative error analysis section with grounding-quality metrics and representative failure-case breakdowns will also be included to attribute gains specifically to verification.
revision: yes

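
For reference, a sketch of R@K and one common variant of mR@K, restricted to exact-match triplets. Real SGG benchmarks also match boxes or masks by IoU, which is omitted here; the data layout and example triplets are assumptions made for illustration.

```python
from collections import defaultdict

def recall_at_k(ranked_preds, gt, k):
    """R@K for one image: share of ground-truth triplets found in the top-K predictions.
    Assumes `gt` is non-empty."""
    top = set(ranked_preds[:k])
    return sum(t in top for t in gt) / len(gt)

def mean_recall_at_k(images, k):
    """mR@K: compute recall per predicate class over all images, then average the classes."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ranked_preds, gt in images:
        top = set(ranked_preds[:k])
        for subj, pred, obj in gt:
            totals[pred] += 1
            hits[pred] += (subj, pred, obj) in top
    return sum(hits[p] / totals[p] for p in totals) / len(totals)

# Triplets are (subject, predicate, object); predictions are sorted by descending score.
images = [
    ([("cup", "on", "table"), ("man", "holding", "cup"), ("cup", "near", "plate")],
     [("cup", "on", "table"), ("man", "holding", "cup")]),
    ([("dog", "on", "sofa"), ("dog", "near", "ball")],
     [("dog", "on", "sofa"), ("dog", "lying on", "sofa")]),
]
r_at_2 = sum(recall_at_k(p, g, 2) for p, g in images) / len(images)
mr_at_2 = mean_recall_at_k(images, 2)
```

In this toy data the rare predicate "lying on" is never recovered, so mR@K falls below R@K. That gap is why mR@K is usually reported alongside R@K when predicate frequencies are long-tailed.
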
Referee: [§3.1] Evidence Decomposition: The decomposition of arbitrary predicate phrases into the fixed soft bases (support/contact/containment/depth/state) risks incomplete coverage or language-prior leakage; no analysis is provided on how well this fixed set captures fine-grained or novel predicates, nor on whether the evidence encoder is trained jointly with the proposer (which could undermine the independence of the verification step).

Authors: We appreciate the concern regarding coverage and independence. In the revision we will add an analysis subsection evaluating decomposition coverage on a held-out set of novel predicates, reporting the fraction of predicates that map cleanly to the bases and quantifying any residual language-prior leakage via controlled substitution tests. We will also clarify that the evidence encoder is trained jointly but with a stop-gradient from the proposer to preserve verification independence, and we will include an ablation comparing this setup against a fully frozen-proposer variant. These additions will appear in §3.1.
revision: partial
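
A sketch of how "the fraction of predicates that map cleanly to the bases" might be measured, assuming "cleanly" means the soft weights put most of their mass on the top few bases. That criterion, the `top_n` and `min_mass` settings, and the example weights are all hypothetical; the rebuttal does not define the measure.

```python
def maps_cleanly(weights, top_n=2, min_mass=0.8):
    """True if the top_n largest base weights together hold at least min_mass."""
    top = sorted(weights.values(), reverse=True)[:top_n]
    return sum(top) >= min_mass

def coverage(decompositions, **kwargs):
    """Fraction of held-out predicates whose decomposition is concentrated."""
    flags = [maps_cleanly(w, **kwargs) for w in decompositions.values()]
    return sum(flags) / len(flags)

# Hypothetical soft weights over (support, contact, containment, depth, state).
held_out = {
    "perched on":    {"support": 0.60, "contact": 0.30, "containment": 0.02, "depth": 0.05, "state": 0.03},
    "tucked inside": {"support": 0.05, "contact": 0.10, "containment": 0.80, "depth": 0.03, "state": 0.02},
    "celebrating":   {"support": 0.20, "contact": 0.20, "containment": 0.20, "depth": 0.20, "state": 0.20},
    "glancing at":   {"support": 0.15, "contact": 0.15, "containment": 0.10, "depth": 0.35, "state": 0.25},
}
fraction_clean = coverage(held_out)
```

Predicates with near-uniform weights, like the invented "celebrating" row, are the cases the referee worries about: the fixed base set says little about them.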

Circularity Check

0 steps flagged

No circularity: framework builds on external proposers with independent verification components

The paper's derivation chain starts from an external vision-language proposer to generate candidates, then applies a novel decomposition of predicates into evidence bases (support, contact, etc.), a relation-conditioned encoder, and a counterfactual verifier that tests score changes under removal. None of these steps reduce by construction to fitted inputs or self-citations; the verifier and optimization modules are presented as additive and independently motivated. No equations or claims in the abstract or description equate predictions to their own training signals, import uniqueness from prior self-work, or rename known results. The central improvement claims rest on experimental benchmarks rather than definitional equivalence, making the framework self-contained against external components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard domain assumptions about vision-language models and evidence decomposition without introducing new free parameters or invented entities visible in the abstract.

axioms (1)

domain assumption: Vision-language models can generate plausible open-vocabulary relation candidates
The framework begins by using such a model as the proposer.

pith-pipeline@v0.9.0 · 5550 in / 1291 out tokens · 32813 ms · 2026-05-13T07:53:41.448895+00:00 · methodology
