RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses

Bao Ngoc Le; Minh Anh Nguyen; Quang Huy Tran; Sui Yang Guang; Tuan Kiet Pham

arxiv: 2605.20823 · v2 · pith:Q7FJWPD6new · submitted 2026-05-20 · 💻 cs.CV


Pith reviewed 2026-05-22 09:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary 3D scene graph · relation witness · positive-unlabeled learning · RGB-D sequence · incomplete supervision · visual-geometric verification · multi-view consistency


The pith

Relation witnesses built from visual and geometric cues let models learn 3D scene graphs even when many true relations lack annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces relation witnesses as concrete visual-geometric signals that make object-pair relations observable in RGB-D data. These signals are used to re-label unannotated pairs as verified missing positives, reliable negatives, or uncertain cases rather than treating all absences as negatives. The resulting witness-guided positive-unlabeled objective trains an open-vocabulary scene-graph model on incomplete labels while preserving multi-view consistency. Simulated planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and fewer redundant relation phrases. The abstract states that these numbers are planning values, not reproduced measurements. The approach treats selective annotation as a verifiable supervision problem instead of an insurmountable obstacle.

Core claim

RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. The paper further introduces witness-consistent decoding and an RGB-D missing-relation audit protocol.
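The three-way assignment described above can be pictured as a thresholding rule over witness evidence. The following is a minimal sketch under stated assumptions, not the paper's verifier: the `WitnessRecord` fields, the min-based aggregation, and the thresholds `tau_pos` and `tau_neg` are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    VERIFIED_MISSING_POSITIVE = "verified_missing_positive"
    RELIABLE_NEGATIVE = "reliable_negative"
    UNCERTAIN_UNLABELED = "uncertain_unlabeled"


@dataclass
class WitnessRecord:
    # All fields are hypothetical scores in [0, 1]; the paper does not specify them.
    geometry_score: float    # contact / enclosure evidence from 3D geometry
    visual_score: float      # RGB evidence for the relation phrase
    null_view_score: float   # score of the phrase when only object priors are shown
    persistence: float       # fraction of co-visible views in which the witness holds


def triage(record: WitnessRecord, tau_pos: float = 0.8, tau_neg: float = 0.2) -> Triage:
    """Assign an unannotated relation candidate to one of three buckets."""
    # Visual evidence beyond what object priors alone would suggest.
    visual_gain = max(0.0, record.visual_score - record.null_view_score)
    # Conservative aggregate: the weakest cue bounds the overall score.
    score = min(record.geometry_score, visual_gain, record.persistence)
    if score >= tau_pos:
        return Triage.VERIFIED_MISSING_POSITIVE
    # A negative counts as reliable only when geometry contradicts the relation.
    if record.geometry_score <= tau_neg:
        return Triage.RELIABLE_NEGATIVE
    return Triage.UNCERTAIN_UNLABELED
```

The design choice worth noting is the asymmetry: weak evidence alone leaves a candidate in the uncertain bucket, and only contradicting geometry turns it into a negative, which mirrors the stated goal of not converting every missing label into a negative.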

What carries the argument

A relation witness: a concrete visual-geometric cue (contact plus vertical ordering for support, enclosure for containment, metric closeness for proximity, facing direction for orientation, and persistence across views for stability) that makes a specific relation directly observable in the captured scene.
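To make the cue definitions concrete, here is a small sketch of how three of them (support, containment, proximity) might be tested on axis-aligned 3D bounding boxes. It is an illustration only: the `Box` type, the z-up convention, and the tolerances `contact_tol` and `max_dist` are assumptions made here, and the paper's verifier also draws on RGB, depth, and multi-view evidence that this sketch ignores. Facing direction is omitted because it needs an object orientation that boxes alone do not provide.

```python
from dataclasses import dataclass
import math


@dataclass
class Box:
    """Axis-aligned 3D box in metres, z pointing up (an assumed convention)."""
    xmin: float
    ymin: float
    zmin: float
    xmax: float
    ymax: float
    zmax: float

    def center(self):
        return ((self.xmin + self.xmax) / 2,
                (self.ymin + self.ymax) / 2,
                (self.zmin + self.zmax) / 2)


def footprint_overlap(a: Box, b: Box) -> bool:
    """True when the xy footprints of a and b intersect."""
    return (a.xmin < b.xmax and b.xmin < a.xmax
            and a.ymin < b.ymax and b.ymin < a.ymax)


def supports(lower: Box, upper: Box, contact_tol: float = 0.03) -> bool:
    """Support witness: contact plus vertical ordering."""
    in_contact = abs(upper.zmin - lower.zmax) <= contact_tol
    ordered = upper.center()[2] > lower.center()[2]
    return in_contact and ordered and footprint_overlap(lower, upper)


def contains(outer: Box, inner: Box) -> bool:
    """Containment witness: the inner box is enclosed by the outer box."""
    return (outer.xmin <= inner.xmin and inner.xmax <= outer.xmax
            and outer.ymin <= inner.ymin and inner.ymax <= outer.ymax
            and outer.zmin <= inner.zmin and inner.zmax <= outer.zmax)


def near(a: Box, b: Box, max_dist: float = 0.5) -> bool:
    """Proximity witness: centre distance below a metric threshold."""
    return math.dist(a.center(), b.center()) <= max_dist
```

For a 0.75 m tall table `Box(0, 0, 0, 1, 1, 0.75)` and a cup `Box(0.4, 0.4, 0.76, 0.5, 0.5, 0.86)` whose base sits 1 cm above the tabletop, `supports(table, cup)` holds while `supports(cup, table)` and `contains(table, cup)` do not, which shows why role order matters for these cues.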

If this is right

Unseen relation predicates are recognized more accurately because verified positives supply additional training signal.
Hallucinated relations decrease because reliable negatives suppress spurious predictions.
Redundant relation phrases are reduced by witness-consistent decoding that respects multi-view persistence.
The audit protocol provides a reproducible way to quantify how many missing annotations are actually recoverable from geometry.
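One way to picture how decoding could respect multi-view persistence and cut redundant phrases is sketched below. This is a guess at the general shape, not the paper's decoder: the `Candidate` fields, the `min_persistence` threshold, and the rule of keeping one phrase per witness family are assumptions introduced for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    subject: str
    phrase: str
    obj: str
    family: str        # witness family the phrase maps to, e.g. "support"
    score: float       # model confidence for the phrase
    views_agree: int   # co-visible views in which the witness holds
    views_total: int   # views in which both objects are visible


def decode(candidates, min_persistence=0.6):
    """Keep one phrase per (subject, object, family) among persistent candidates."""
    best = {}
    for c in candidates:
        if c.views_total == 0:
            continue  # never co-visible, so no witness can be checked
        if c.views_agree / c.views_total < min_persistence:
            continue  # the witness does not persist across views
        key = (c.subject, c.obj, c.family)
        if key not in best or c.score > best[key].score:
            best[key] = c
    # Highest-confidence relations first.
    return sorted(best.values(), key=lambda c: -c.score)
```

With two support phrases for the same cup and table ("standing on" and "resting on") and a containment phrase seen in only one of five views, this sketch keeps just the higher-scoring support phrase.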

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same witness construction could be applied to other 3D perception tasks that suffer from selective labeling, such as affordance prediction or dynamic interaction modeling.
If the cues prove stable across datasets, the method offers a route to scale open-vocabulary scene graphs without exhaustive re-annotation of every new capture.

Load-bearing premise

The chosen visual-geometric cues correctly flag true relations and avoid large numbers of false positives or false negatives on real captured scenes.

What would settle it

Human inspection of a held-out set of scenes to measure whether the witness verifier's verified positives and reliable negatives match independent ground-truth relation labels at high precision and recall.
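The check described above amounts to computing precision and recall per verifier bucket against human judgements. A minimal sketch follows; the dictionary layout and the bucket names `"positive"`, `"negative"`, and `"uncertain"` are assumptions made here for illustration, not a format defined by the paper.

```python
def bucket_quality(verifier_labels, human_labels):
    """Compare verifier buckets with human judgements on audited pairs.

    verifier_labels: dict mapping triple -> "positive" | "negative" | "uncertain"
    human_labels:    dict mapping triple -> True (relation holds) | False
    """
    stats = {}
    for bucket, target in (("positive", True), ("negative", False)):
        # Only pairs that a human actually audited can be scored.
        claimed = {t for t, b in verifier_labels.items()
                   if b == bucket and t in human_labels}
        actual = {t for t, y in human_labels.items() if y is target}
        hits = len(claimed & actual)
        stats[bucket] = {
            "precision": hits / len(claimed) if claimed else 0.0,
            "recall": hits / len(actual) if actual else 0.0,
        }
    return stats
```

Pairs left in the uncertain bucket lower recall for both classes without affecting precision, so the two numbers together indicate how conservative the verifier is.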

Figures

Figures reproduced from arXiv: 2605.20823 by Bao Ngoc Le, Minh Anh Nguyen, Quang Huy Tran, Sui Yang Guang, Tuan Kiet Pham.

**Figure 1.** RelWitness overview. Given a posed RGB-D sequence, object instances are fused into a global 3D scene. Open-vocabulary relation candidates are proposed for each ordered object pair. A witness parser maps each phrase to physical witness families, and a visual-geometric verifier checks RGB, depth, 3D geometry, role order, object-prior null views, and multi-view persistence. The witness memory stores verified …

**Figure 2.** Qualitative witness cases. Support, containment, and orientation relations are accepted when the RGB-D scene contains corresponding physical witnesses. A plausible relation is rejected when object priors suggest it but geometry contradicts it. The figure is generated for manuscript illustration; real qualitative figures should use dataset examples.

Original abstract

Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RelWitness gives a workable framing for turning missing relation labels into positives or negatives via geometric cues, but the placeholder results leave the actual performance untested.


This paper tackles incomplete supervision in open-vocabulary 3D scene graphs by defining observable visual-geometric witnesses for relations such as support or containment. It builds witness records from RGB views, depth maps, reconstructed geometry, and multi-view checks, then uses a verifier to sort unannotated pairs into verified missing positives, reliable negatives, or uncertain unlabeled cases. That feeds a positive-unlabeled objective instead of defaulting missing labels to negatives. The approach also adds witness-consistent decoding and an RGB-D missing-relation audit protocol, with planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. It defines relation witnesses via visual-geometric cues (contact and vertical ordering for support, enclosure for containment, metric closeness for proximity, facing direction for orientation, and cross-view persistence for stable relations). These cues are used to build witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier then classifies unannotated relation candidates as verified missing positives, reliable negatives, or uncertain unlabeled cases, enabling a witness-guided positive-unlabeled objective. The approach also includes witness-consistent decoding and an RGB-D missing-relation audit protocol. Planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits report the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and fewer redundant phrases. All numerical results are placeholders to be replaced by actual measurements.

Significance. If the visual-geometric cues reliably map to true relations despite reconstruction noise and the positive-unlabeled objective successfully exploits verified missing positives without introducing systematic label errors, the framework could advance open-vocabulary 3D scene graph generation by mitigating the effects of selective and incomplete annotations in existing datasets.

major comments (2)

[Abstract] The abstract states that all numerical results are planning values that must be replaced by reproduced measurements before submission, with no full methods, code, or actual data available for verification. This directly prevents assessment of the claimed improvements in unseen-relation recognition and witness precision, which are central to validating the framework.
[Method (visual-geometric witness verifier)] The central claim depends on the visual-geometric cues (contact+vertical ordering, enclosure, metric closeness, facing direction, cross-view persistence) accurately identifying true relations. No analysis or ablation is provided on cue robustness to depth noise, partial occlusions, and mesh reconstruction artifacts typical in 3RScan/ScanNet, which could inject systematic errors into the witness verifier and corrupt the positive-unlabeled objective by misclassifying true positives as uncertain or reliable negatives as false positives.
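The kind of robustness ablation the second major comment asks for can be prototyped cheaply before running it on real reconstructions. The sketch below is a toy, not an experiment from the paper: it perturbs two surface heights with Gaussian noise and measures how often a contact cue changes its decision. The 1 cm gap, the 3 cm tolerance `tol`, and the noise levels are hypothetical values chosen for illustration.

```python
import random


def support_contact(lower_top: float, upper_bottom: float, tol: float = 0.03) -> bool:
    """Contact cue on heights only: surfaces within tol metres of each other."""
    return abs(upper_bottom - lower_top) <= tol


def flip_rate(sigma: float, trials: int = 1000, seed: int = 0) -> float:
    """Fraction of noisy trials whose decision differs from the clean one."""
    rng = random.Random(seed)
    lower_top, upper_bottom = 0.75, 0.76   # a cup resting on a table, 1 cm gap
    clean = support_contact(lower_top, upper_bottom)
    flips = 0
    for _ in range(trials):
        # Independent Gaussian noise on each measured height (an assumed noise model).
        noisy_lower = lower_top + rng.gauss(0.0, sigma)
        noisy_upper = upper_bottom + rng.gauss(0.0, sigma)
        if support_contact(noisy_lower, noisy_upper) != clean:
            flips += 1
    return flips / trials
```

Sweeping `sigma` over the depth-noise range expected for a given sensor would show where a fixed contact tolerance stops being trustworthy; a real ablation would repeat this per cue and on actual 3RScan/ScanNet geometry.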

minor comments (2)

[Abstract] The term 'Simulated manuscript-planning experiments' is unclear and should be rephrased for precision, e.g., to indicate preliminary simulations rather than final empirical results.
[Abstract] Expand the description of witness record construction (RGB views, depth maps, 3D geometry, role-sensitive text, object-prior null views, multi-view consistency) with concrete implementation details and pseudocode in the main text for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below and describe the revisions we intend to make to strengthen the paper.


Referee: [Abstract] The abstract states that all numerical results are planning values that must be replaced by reproduced measurements before submission, with no full methods, code, or actual data available for verification. This directly prevents assessment of the claimed improvements in unseen-relation recognition and witness precision, which are central to validating the framework.

Authors: We fully agree that the placeholder numerical results in the current version hinder a complete evaluation of the proposed framework. As noted in the manuscript itself, these are planning values. In the revised submission, we will perform the complete experiments using the 3DSSG/3RScan and ScanNet-derived open-vocabulary splits, replace all planning values with actual reproduced measurements, and provide detailed methods along with code to enable verification of the reported improvements in unseen-relation recognition and witness precision.
revision: yes

Referee: [Method (visual-geometric witness verifier)] The central claim depends on the visual-geometric cues (contact+vertical ordering, enclosure, metric closeness, facing direction, cross-view persistence) accurately identifying true relations. No analysis or ablation is provided on cue robustness to depth noise, partial occlusions, and mesh reconstruction artifacts typical in 3RScan/ScanNet, which could inject systematic errors into the witness verifier and corrupt the positive-unlabeled objective by misclassifying true positives as uncertain or reliable negatives as false positives.

Authors: This is a valid concern regarding the reliability of the visual-geometric cues under real-world conditions. The current manuscript does not include such robustness analysis. We will incorporate a dedicated ablation study in the revised version that systematically evaluates the impact of depth noise, partial occlusions, and mesh reconstruction artifacts on each cue's performance. This will include metrics on how these factors influence the classification into verified missing positives, reliable negatives, and uncertain cases, as well as the effect on the positive-unlabeled objective. We expect this addition to provide stronger evidence for the central claims.
revision: yes

Circularity Check

0 steps flagged

No circularity; derivation uses explicit cue definitions to augment incomplete labels


The paper defines relation witnesses via concrete, observable rules (contact+vertical ordering for support, enclosure for containment, metric closeness for proximity, facing direction for orientation, cross-view persistence for stability) applied to input RGB-D and reconstructed geometry. These rules feed a verifier that produces verified positives, reliable negatives, and uncertain cases, which then drive a positive-unlabeled objective. No equations or steps reduce by construction to their own outputs; the chain is a standard data-augmentation pipeline whose validity rests on the external empirical accuracy of the cues rather than self-reference or self-citation. No self-citations, fitted parameters, or uniqueness theorems are invoked in the provided text.
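The abstract does not give the form of the positive-unlabeled objective, so the sketch below swaps in a well-known stand-in: the non-negative PU risk estimator with a sigmoid loss, extended by a plain supervised term for witness-verified negatives. The extension, the `neg_weight` parameter, and the assumed class `prior` are hypothetical choices made here, not the paper's formulation.

```python
import math


def sigmoid_loss(score: float, label: int) -> float:
    """Sigmoid loss l(z, y) = 1 / (1 + exp(y * z)) for y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def witness_pu_risk(pos_scores, unl_scores, neg_scores,
                    prior: float, neg_weight: float = 1.0) -> float:
    """Non-negative PU risk plus a term for witness-verified negatives.

    pos_scores: scores of annotated and verified-missing positives
    unl_scores: scores of uncertain unlabeled candidates
    neg_scores: scores of reliable negatives (hypothetical extra term)
    prior:      assumed fraction of true positives among unlabeled pairs
    """
    risk_pos = mean(sigmoid_loss(s, +1) for s in pos_scores)
    risk_pos_as_neg = mean(sigmoid_loss(s, -1) for s in pos_scores)
    risk_unl_as_neg = mean(sigmoid_loss(s, -1) for s in unl_scores)
    # Clamp at zero so the estimated negative risk cannot go negative.
    pu = prior * risk_pos + max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    risk_neg = mean(sigmoid_loss(s, -1) for s in neg_scores)
    return pu + neg_weight * risk_neg
```

The point of the construction is visible in the middle term: unlabeled candidates are treated as a mixture, and the positive share implied by `prior` is subtracted back out rather than every unlabeled pair being penalized as a negative.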

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Information is limited to the abstract; the central claim rests on the assumption that the listed visual-geometric cues are reliable indicators and that the verifier can correctly categorize unannotated pairs.

axioms (1)

domain assumption: Relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated.
Explicitly stated as the central difficulty in the abstract.

invented entities (1)

relation witness (no independent evidence)
purpose: A concrete visual-geometric cue that makes a relation observable in the captured scene.
New concept introduced to address incomplete supervision.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "witness-guided positive-unlabeled objective"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 1 internal anchor

[1]

Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese

Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and cam- era. InProceedings of the IEEE/CVF International Confer- ence on Computer Vision, 2019. 1, 2

work page 2019
[2]

Learning 3d semantic scene graphs from 3d indoor reconstructions

Johanna Wald, Helisa Dhamo, Nassir Navab, and Federico Tombari. Learning 3d semantic scene graphs from 3d indoor reconstructions. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2020. 2, 7, 8

work page 2020
[3]

Scenegraphfusion: Incremen- tal 3d scene graph prediction from rgb-d sequences

Shun-Cheng Wu, Johanna Wald, Keisuke Tateno, Nassir Navab, and Federico Tombari. Scenegraphfusion: Incremen- tal 3d scene graph prediction from rgb-d sequences. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 2, 7, 8

work page 2021
[4]

de Melo, Joshua B

Qiao Gu, Alyssa Kuwajerwala, Sacha Morin, Kr- ishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso M. de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, and Liam Paull. Concept- graphs: Open-vocabulary 3d scene graphs for perception and planning. InProceedings ...

work page 2024
[5]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, and Thomas Funkhouser. Openscene: 3d scene understanding with open vocabularies. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2

work page 2023
[6]

Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann

Ayc ¸a Takmaz, Elisabetta Fedele, Robert W. Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. Open- mask3d: Open-vocabulary 3d instance segmentation. InAd- vances in Neural Information Processing Systems, 2023. 2

work page 2023
[7]

Open3dsg: Open- vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships

Sebastian Koch, Narunas Vaskevicius, Mirco Colosi, Pe- dro Hermosilla, and Timo Ropinski. Open3dsg: Open- vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 3, 7, 8

work page 2024
[8]

Open-vocabulary functional 3d scene graphs for real-world indoor spaces.arXiv preprint arXiv:2503.19199, 2025

Zian Zhang, Cheng-Yu Tai, Yikuan Xie, Xuezhi Bao, Haritha Yerramilli, Luca Weihs, Alexander Weihs, Aniruddha Kem- bhavi, Roozbeh Mottaghi, Prune Truong, and Yiran Geng. Open-vocabulary functional 3d scene graphs for real-world indoor spaces.arXiv preprint arXiv:2503.19199, 2025. 2, 3, 8

work page arXiv 2025
[9]

Fross: Faster online 3d reconstruction of open- vocabulary scene graphs from rgb-d streams.arXiv preprint arXiv:2506.19146, 2025

Jingyi Hou, Kun Liu, Ning Lu, Raghudeep Gadde, and Qiang Qiu. Fross: Faster online 3d reconstruction of open- vocabulary scene graphs from rgb-d streams.arXiv preprint arXiv:2506.19146, 2025. 1, 2, 3, 8

work page arXiv 2025
[10]

Kimera: An open-source library for real-time metric- semantic localization and mapping

Antoni Rosinol, Marcus Abate, Yun Chang, and Luca Car- lone. Kimera: An open-source library for real-time metric- semantic localization and mapping. InProceedings of the IEEE International Conference on Robotics and Automation,

work page
[11]

Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2, 7

work page 2017
[12]

Qi, Hao Su, Kaichun Mo, and Leonidas J

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2

work page 2017
[13]

Qi, Li Yi, Hao Su, and Leonidas J

Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space. InAdvances in Neural Information Processing Systems, 2017

work page 2017
[14]

Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Franc ¸ois Goulette, and Leonidas J

Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Franc ¸ois Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2019

work page 2019
[15]

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H. S. Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision,

work page
[16]

Image retrieval using scene graphs

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,

work page
[17]

Visual relationship detection with language priors

Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei- Fei. Visual relationship detection with language priors. In Proceedings of the European Conference on Computer Vi- sion, 2016

work page 2016
[18]

Shamma, Michael Bernstein, and Li Fei-Fei

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. InInterna- tional Journal of Computer Vision, 2017

work page 2017
[19]

Choy, and Li Fei-Fei

Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2

work page 2017
[20]

Neural motifs: Scene graph parsing with global con- text

Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global con- text. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2

work page 2018
[21]

Graph r-cnn for scene graph generation

Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph r-cnn for scene graph generation. InProceed- ings of the European Conference on Computer Vision, 2018

work page 2018
[22]

Learning to compose dynamic tree structures for visual contexts

Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. Learning to compose dynamic tree structures for visual contexts. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2019

work page 2019
[23]

Unbiased scene graph generation from bi- ased training

Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from bi- ased training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

work page 2020
[24]

Bipartite graph network with adaptive message passing for unbiased scene graph generation

Rongjie Li, Songyang Zhang, Bo Wan, and Xuming He. Bipartite graph network with adaptive message passing for unbiased scene graph generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

work page 2021
[25]

Sgtr: End-to- end scene graph generation with transformer

Rongjie Li, Songyang Zhang, and Xuming He. Sgtr: End-to- end scene graph generation with transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2022

work page 2022
[26]

Predicate-aware embedding learning for scene graph generation

Chaofan Zheng, Xinyu Lyu, Lianli Gao, Bo Dai, and Jingkuan Song. Predicate-aware embedding learning for scene graph generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[27]

Knowledge-embedded routing network for scene graph gen- eration

Tianshui Chen, Weihao Yu, Riquan Chen, and Liang Lin. Knowledge-embedded routing network for scene graph gen- eration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

work page 2019
[28]

Fine-grained predicates learning for scene graph generation

Xinyu Lyu, Lianli Gao, Yudong Guo, Zhou Zhao, and Heng Tao Shen Huang. Fine-grained predicates learning for scene graph generation. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2022

work page 2022
[29]

T. He, L. Gao, J. Song, J. Cai, and Y .-F. Li. Learning from the scene and borrowing from the rich: Tackling the long tail in scene graph generation. InProceedings of the International Joint Conference on Artificial Intelligence, 2020

work page 2020
[30]

T. He, L. Gao, J. Song, J. Cai, and Y .-F. Li. Semantic compo- sitional learning for low-shot scene graph generation.arXiv preprint arXiv:2108.08600, 2021. 3

work page arXiv 2021
[31]

T. He, L. Gao, J. Song, and Y .-F. Li. State-aware composi- tional learning toward unbiased training for scene graph gen- eration.IEEE Transactions on Image Processing, 32:43–56, 2022

work page 2022
[32]

T. He, L. Gao, J. Song, and Y .-F. Li. Toward a unified transformer-based framework for scene graph generation and human-object interaction detection.IEEE Transactions on Image Processing, 32:6274–6288, 2023

work page 2023
[33]

T. He, T. Wu, D. Zhang, G. Duan, K. Qin, and Y .- F. Li. Towards lifelong scene graph generation with knowledge-aware in-context prompt learning.arXiv preprint arXiv:2401.14626, 2024. 2

work page arXiv 2024
[34]

Panoptic scene graph gener- ation

Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu. Panoptic scene graph gener- ation. InProceedings of the European Conference on Com- puter Vision, 2022. 2

work page 2022
[35]

Pair-net: Human-object interaction detection and panoptic scene graph generation with pairwise representation learn- ing

Yeliang Wang, Jialian Yu, Zhongang Zhang, and Ziwei Liu. Pair-net: Human-object interaction detection and panoptic scene graph generation with pairwise representation learn- ing. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2024

work page 2024
[36]

Openpsg: Open-set panoptic scene graph generation via large multimodal models

Ziqin Zhou, Yichao Zhang, Yifei Wang, Yu Li, and Ziwei Liu. Openpsg: Open-set panoptic scene graph generation via large multimodal models. InProceedings of the European Conference on Computer Vision, 2024

work page 2024
[37]

T. He, L. Gao, J. Song, and Y .-F. Li. Towards open- vocabulary scene graph generation with prompt-based fine- tuning. InEuropean Conference on Computer Vision, 2022

work page 2022
[38]

X. Hu, K. Qin, G. Duan, M. Li, Y .-F. Li, and T. He. Spade: Spatial-aware denoising network for open- vocabulary panoptic scene graph generation with long- and local-range context reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision,

work page
[39]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the International Conference on Machine Learning, 2021. 2, 3, 8

work page 2021
[40]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InProceed- ings of the International Conference on Machine Learning, 2022

work page 2022
[41]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InPro- ceedings of the International Conference on Machine Learn- ing, 2023. 3

work page 2023
[42]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023. 2

work page 2023
[43]

Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision,

work page
[44]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection. InarXiv preprint arXiv:2303.05499, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Schwing, Alexan- der Kirillov, and Rohit Girdhar

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

work page 2022
[46]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InProceedings of the IEEE International Conference on Computer Vision, 2017

work page 2017
[47]

Faster r-cnn: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. InAdvances in Neural Information Pro- cessing Systems, 2015

work page 2015
[48]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InProceedings of the European Conference on Computer Vision, 2020. 2

work page 2020
[49]

Scene graph prediction with limited labels

Apoorva Dornadula, Austin Narcomey, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Scene graph prediction with limited labels. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision Workshops, 2019. 3

work page 2019
[50]

Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Ré, and Li Fei-Fei. Learning to compose dynamic tree structures for visual contexts with limited labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

[51]

Meng-Jiun Chiou, Henghui Ding, Hanshu Yan, Changhu Wang, Roger Zimmermann, and Jiashi Feng. Recovering the unbiased scene graphs from the biased ones. In Proceedings of the ACM International Conference on Multimedia, 2021.

[52]

Vikash Goel, Nishant Chandak, and Dinesh Manocha. Not all relations are equal: Mining informative relationships for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

[53]

R. Dai, Y. Tan, L. Mo, T. He, K. Qin, and S. Liang. Muap: Multi-step adaptive prompt learning for vision-language model with missing modality. arXiv preprint arXiv:2409.04693, 2024.

[54]

R. Dai, Y. Tan, L. Mo, T. He, K. Qin, and S. Liang. Robustpt: Dynamic disentanglement prompt tuning in vision-language models with missing modalities. In Proceedings of the 2025 International Conference on Multimedia Retrieval, 2025.

[55]

R. Dai, C. Li, Y. Yan, L. Mo, K. Qin, and T. He. Unbiased missing-modality multimodal learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

[56]

R. Dai, Z. Cai, L. Mo, G. Duan, K. Shi, and T. He. Anchor drift no more: Hierarchical consistency-guided prompt distillation for incomplete multimodal learning. In Proceedings of the ACM Web Conference, pages 7330–7341, 2026.

[57]

S. Wei, K. Zhang, L. Chen, T. He, and G. Duan. Unbiased dynamic multimodal fusion. arXiv preprint arXiv:2603.19681, 2026.

[58]

Q. Dong, R. Dai, G. Duan, K. Qin, Y. Zhang, and T. He. Unbiased multimodal intent recognition with auxiliary rationale generation. Neurocomputing, page 131197, 2025.

[59]

W. Yin, S. Zhan, C. Liu, X. Hu, G. Duan, X. Xie, Y.-F. Li, and T. He. Tical: Typicality-based consistency-aware learning for multimodal emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17948–17956, 2026.

[60]

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[61]

T. He, L. Gao, J. Song, and Y.-F. Li. Exploiting scene graphs for human-object interaction detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15984–15993, 2021.

[62]

X. Hu, K. Qin, T. He, and G. Luo. Exploring hierarchical tuple-based contextual correlations for human-object interaction detection. Tsinghua Science and Technology, 2026.

[63]

Z. Yang, X. Liu, D. Ouyang, G. Duan, D. Zhang, T. He, and Y.-F. Li. Towards open-vocabulary HOI detection with calibrated vision-language models and locality-aware queries. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1495–1504, 2024.

[64]

J. W. Owusu, R. Y. Zakari, K. Qin, and T. He. Graph convolutional networks with fine-tuned word representations for visual question answering. In 2024 IEEE Smart World Congress, pages 1381–1387, 2024.

[65]

R. Y. Zakari, J. W. Owusu, K. Qin, H. Wang, Z. K. Lawal, and T. He. VQA and visual reasoning: An overview of approaches, datasets, and future direction. Neurocomputing, 622:129345, 2025.

[66]

J. Song, T. He, H. Fan, and L. Gao. Deep discrete hashing with self-supervised pairwise labels. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2017.

[67]

T. He, L. Gao, J. Song, X. Wang, K. Huang, and Y. Li. Sneq: Semi-supervised attributed network embedding with attention-based quantisation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4091–4098, 2020.

[68]

T. He, L. Gao, J. Song, and Y.-F. Li. Semisupervised network embedding with differentiable deep quantization. IEEE Transactions on Neural Networks and Learning Systems, 34(8):4791–4802, 2021.

[69]

D. Zhang, S. Liang, T. He, J. Shao, and K. Qin. Cviformer: Cross-view interactive transformer for efficient stereoscopic image super-resolution. IEEE Transactions on Emerging Topics in Computational Intelligence, 9(2), 2024.

[70]

W. Yin, Y. Wang, G. Duan, D. Zhang, X. Hu, Y.-F. Li, and T. He. Knowledge-aligned counterfactual-enhancement diffusion perception for unsupervised cross-domain visual emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3888–3898, 2025.
