RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses
Pith reviewed 2026-05-22 09:55 UTC · model grok-4.3
The pith
Relation witnesses built from visual and geometric cues let models learn 3D scene graphs even when many true relations lack annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. Witness-consistent decoding and an RGB-D missing-relation audit protocol further improve output quality.
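The three-way assignment described above can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the names `Verdict` and `triage`, the scalar `witness_score`, the thresholds, and the idea of comparing against a score from an object-prior null view are hypothetical and are not taken from the paper, which does not specify the verifier at this level of detail.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED_POSITIVE = "verified missing positive"
    RELIABLE_NEGATIVE = "reliable negative"
    UNCERTAIN = "uncertain unlabeled"


def triage(witness_score, null_score, pos_thresh=0.8, neg_thresh=0.2, margin=0.1):
    """Assign one unannotated relation candidate to a bucket.

    witness_score: hypothetical confidence in [0, 1] that the witness is present.
    null_score:    hypothetical confidence obtained from an object-prior null view,
                   i.e. what the object categories alone would suggest.
    All thresholds are illustrative placeholders.
    """
    # Accept only when the evidence is strong AND clearly beats the object prior.
    if witness_score >= pos_thresh and witness_score - null_score >= margin:
        return Verdict.VERIFIED_POSITIVE
    # Reject only when the witness is clearly absent.
    if witness_score <= neg_thresh:
        return Verdict.RELIABLE_NEGATIVE
    # Everything else stays unlabeled rather than being forced to a side.
    return Verdict.UNCERTAIN
```

Under these assumptions, a candidate whose score is high only because the object pair is a common one (high `null_score`) stays in the uncertain bucket instead of becoming a training positive.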
What carries the argument
A relation witness: a concrete visual-geometric cue (contact plus vertical ordering for support, enclosure for containment, metric closeness for proximity, facing direction for orientation, and persistence across views for stability) that makes a specific relation directly observable in the captured scene.
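To make three of these cues concrete, here is a minimal Python sketch over axis-aligned 3D bounding boxes. It is a simplification under stated assumptions, not the paper's method: the `Box` type, the function names, the z-up convention, and the tolerances in metres are all hypothetical, and real scenes would need oriented geometry rather than boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned 3D box in metres; z is assumed to point up."""
    min_xyz: tuple
    max_xyz: tuple


def _overlap_1d(a_min, a_max, b_min, b_max):
    # Positive value: length of the shared interval. Negative: size of the gap.
    return min(a_max, b_max) - max(a_min, b_min)


def support_witness(top, bottom, contact_tol=0.03):
    """Contact plus vertical ordering: `top` rests on `bottom`."""
    # Footprints must overlap in the horizontal plane (x and y axes).
    xy_overlap = all(
        _overlap_1d(top.min_xyz[i], top.max_xyz[i],
                    bottom.min_xyz[i], bottom.max_xyz[i]) > 0
        for i in (0, 1)
    )
    # Contact: underside of `top` lies close to the upper face of `bottom`.
    gap = abs(top.min_xyz[2] - bottom.max_xyz[2])
    # Vertical ordering: centre of `top` is higher than centre of `bottom`.
    above = (top.min_xyz[2] + top.max_xyz[2]) > (bottom.min_xyz[2] + bottom.max_xyz[2])
    return xy_overlap and gap <= contact_tol and above


def containment_witness(inner, outer):
    """Enclosure: `inner` lies entirely within `outer` on every axis."""
    return all(
        outer.min_xyz[i] <= inner.min_xyz[i] and inner.max_xyz[i] <= outer.max_xyz[i]
        for i in range(3)
    )


def proximity_witness(a, b, max_dist=0.5):
    """Metric closeness: box-to-box distance does not exceed `max_dist`."""
    sq = 0.0
    for i in range(3):
        gap = max(0.0, -_overlap_1d(a.min_xyz[i], a.max_xyz[i],
                                    b.min_xyz[i], b.max_xyz[i]))
        sq += gap * gap
    return sq ** 0.5 <= max_dist
```

Note that the support check is asymmetric by construction: swapping the two arguments fails the contact test, which mirrors the role sensitivity the summary attributes to the witness records.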
If this is right
- Unseen relation predicates are recognized more accurately because verified positives supply additional training signal.
- Hallucinated relations decrease because reliable negatives suppress spurious predictions.
- Redundant relation phrases are reduced by witness-consistent decoding that respects multi-view persistence.
- The audit protocol provides a reproducible way to quantify how many missing annotations are actually recoverable from geometry.
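One way to picture how multi-view persistence could reduce redundant phrases is the sketch below. It is a guess at the general shape of such a decoder, not the paper's procedure: the `persistence` and `decode` functions, the dictionary fields, the `min_persistence` threshold, and the rule of keeping one predicate per object pair and witness type are all assumptions.

```python
def persistence(view_observations):
    """Fraction of co-visible views in which the witness holds.

    view_observations: list of (both_visible, witness_holds) booleans, one per view.
    Returns None when no view shows both objects.
    """
    co_visible = [holds for visible, holds in view_observations if visible]
    if not co_visible:
        return None
    return sum(co_visible) / len(co_visible)


def decode(candidates, min_persistence=0.6):
    """Keep persistent relations and one phrase per (pair, witness type).

    candidates: list of dicts with keys subject, object, predicate,
                witness_type, score, persistence (hypothetical schema).
    """
    best = {}
    for c in candidates:
        p = c["persistence"]
        # Drop relations that are unobserved or unstable across views.
        if p is None or p < min_persistence:
            continue
        # Phrases backed by the same witness on the same pair are redundant.
        key = (c["subject"], c["object"], c["witness_type"])
        if key not in best or c["score"] > best[key]["score"]:
            best[key] = c
    return sorted(best.values(), key=lambda c: -c["score"])
```

With this grouping, phrases such as "on" and "on top of" for the same object pair would collapse to whichever scores higher, since both would rest on the same support witness.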
Where Pith is reading between the lines
- The same witness construction could be applied to other 3D perception tasks that suffer from selective labeling, such as affordance prediction or dynamic interaction modeling.
- If the cues prove stable across datasets, the method offers a route to scale open-vocabulary scene graphs without exhaustive re-annotation of every new capture.
Load-bearing premise
The chosen visual-geometric cues correctly flag true relations and avoid large numbers of false positives or false negatives on real captured scenes.
What would settle it
Human inspection of a held-out set of scenes to measure whether the witness verifier's verified positives and reliable negatives match independent ground-truth relation labels at high precision and recall.
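The check described here reduces to standard precision and recall over sets of relation triples. The sketch below shows that computation; the function name and the `(subject, predicate, object)` triple format are illustrative choices, and the example triples in the note that follows are invented rather than drawn from any dataset.

```python
def precision_recall(predicted, ground_truth):
    """Precision and recall of predicted triples against human labels.

    Both arguments are iterables of hashable (subject, predicate, object) triples.
    Empty inputs yield 0.0 for the corresponding metric instead of dividing by zero.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For instance, if the verifier marks three triples as verified positives and annotators confirm two of them out of four relations they consider true, the result is a precision of 2/3 and a recall of 1/2. The same function can be run separately on the reliable negatives against the relations annotators judged absent.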
Original abstract
Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission.
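The abstract does not give the form of the witness-guided positive-unlabeled objective. As a purely illustrative stand-in, the sketch below uses a weighted binary cross-entropy in which uncertain candidates receive zero weight, so a missing label is never pushed toward the negative class. The status names, the `TARGET_AND_WEIGHT` table, the 0.7 weight on verified positives, and the function `pu_loss` are all assumptions, not the authors' loss.

```python
import math

# Hypothetical (target, weight) per supervision status.
TARGET_AND_WEIGHT = {
    "annotated_positive": (1.0, 1.0),  # label present in the dataset
    "verified_positive": (1.0, 0.7),   # unannotated, but a witness was verified
    "reliable_negative": (0.0, 1.0),   # unannotated, witness clearly absent
    "uncertain": (0.0, 0.0),           # unannotated and unresolved: ignored
}


def pu_loss(probs, statuses, eps=1e-7):
    """Weighted binary cross-entropy over relation candidates.

    probs:    predicted probabilities that each candidate relation holds.
    statuses: one key of TARGET_AND_WEIGHT per candidate.
    """
    total, weight_sum = 0.0, 0.0
    for p, status in zip(probs, statuses):
        target, weight = TARGET_AND_WEIGHT[status]
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        bce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
        total += weight * bce
        weight_sum += weight
    # Normalise by total weight so ignored candidates do not dilute the loss.
    return total / weight_sum if weight_sum else 0.0
```

The design point this illustrates is the contrast with naive training: treating every unannotated pair as a negative would penalise a confident prediction on a true but unlabeled relation, whereas here that prediction contributes nothing until a witness resolves it.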
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. It defines relation witnesses via visual-geometric cues (contact and vertical ordering for support, enclosure for containment, metric closeness for proximity, facing direction for orientation, and cross-view persistence for stable relations). These cues are used to build witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier then classifies unannotated relation candidates as verified missing positives, reliable negatives, or uncertain unlabeled cases, enabling a witness-guided positive-unlabeled objective. The approach also includes witness-consistent decoding and an RGB-D missing-relation audit protocol. Planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits report the intended behavior (improved unseen-relation recognition, higher witness precision, lower hallucination, and fewer redundant phrases), though all numerical results are placeholders to be replaced by actual measurements.
Significance. If the visual-geometric cues reliably map to true relations despite reconstruction noise and the positive-unlabeled objective successfully exploits verified missing positives without introducing systematic label errors, the framework could advance open-vocabulary 3D scene graph generation by mitigating the effects of selective and incomplete annotations in existing datasets.
major comments (2)
- [Abstract] The abstract states that all numerical results are planning values that must be replaced by reproduced measurements before submission, with no full methods, code, or actual data available for verification. This directly prevents assessment of the claimed improvements in unseen-relation recognition and witness precision, which are central to validating the framework.
- [Method (visual-geometric witness verifier)] The central claim depends on the visual-geometric cues (contact+vertical ordering, enclosure, metric closeness, facing direction, cross-view persistence) accurately identifying true relations. No analysis or ablation is provided on cue robustness to depth noise, partial occlusions, and mesh reconstruction artifacts typical in 3RScan/ScanNet, which could inject systematic errors into the witness verifier and corrupt the positive-unlabeled objective, for example by demoting true relations to uncertain cases or reliable negatives, or by admitting false relations as verified positives.
minor comments (2)
- [Abstract] The term 'Simulated manuscript-planning experiments' is unclear and should be rephrased for precision, e.g., to indicate preliminary simulations rather than final empirical results.
- [Abstract] Expand the description of witness record construction (RGB views, depth maps, 3D geometry, role-sensitive text, object-prior null views, multi-view consistency) with concrete implementation details and pseudocode in the main text for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below and describe the revisions we intend to make to strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The abstract states that all numerical results are planning values that must be replaced by reproduced measurements before submission, with no full methods, code, or actual data available for verification. This directly prevents assessment of the claimed improvements in unseen-relation recognition and witness precision, which are central to validating the framework.
  Authors: We fully agree that the placeholder numerical results in the current version hinder a complete evaluation of the proposed framework. As noted in the manuscript itself, these are planning values. In the revised submission, we will perform the complete experiments using the 3DSSG/3RScan and ScanNet-derived open-vocabulary splits, replace all planning values with actual reproduced measurements, and provide detailed methods along with code to enable verification of the reported improvements in unseen-relation recognition and witness precision.
  Revision: yes
- Referee: [Method (visual-geometric witness verifier)] The central claim depends on the visual-geometric cues (contact+vertical ordering, enclosure, metric closeness, facing direction, cross-view persistence) accurately identifying true relations. No analysis or ablation is provided on cue robustness to depth noise, partial occlusions, and mesh reconstruction artifacts typical in 3RScan/ScanNet, which could inject systematic errors into the witness verifier and corrupt the positive-unlabeled objective, for example by demoting true relations to uncertain cases or reliable negatives, or by admitting false relations as verified positives.
  Authors: This is a valid concern regarding the reliability of the visual-geometric cues under real-world conditions. The current manuscript does not include such robustness analysis. We will incorporate a dedicated ablation study in the revised version that systematically evaluates the impact of depth noise, partial occlusions, and mesh reconstruction artifacts on each cue's performance. This will include metrics on how these factors influence the classification into verified missing positives, reliable negatives, and uncertain cases, as well as the effect on the positive-unlabeled objective. We expect this addition to provide stronger evidence for the central claims.
  Revision: yes
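A robustness ablation of the kind promised here could start from something as simple as the Monte Carlo sketch below, which perturbs a vertical gap with Gaussian noise and counts how often a thresholded contact test changes its answer. This is a toy under stated assumptions: the function names, the 3 cm tolerance, the zero-mean Gaussian noise model, and the trial count are hypothetical and do not describe sensor behaviour in 3RScan or ScanNet.

```python
import random


def contact_holds(gap_m, contact_tol=0.03):
    """Thresholded contact test on a vertical gap in metres (illustrative)."""
    return abs(gap_m) <= contact_tol


def flip_rate(true_gap_m, sigma_m, trials=2000, seed=0):
    """Fraction of noisy trials in which the contact decision flips.

    true_gap_m: noise-free gap between the two surfaces.
    sigma_m:    standard deviation of the assumed zero-mean Gaussian depth noise.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    clean_decision = contact_holds(true_gap_m)
    flips = 0
    for _ in range(trials):
        noisy_gap = true_gap_m + rng.gauss(0.0, sigma_m)
        if contact_holds(noisy_gap) != clean_decision:
            flips += 1
    return flips / trials
```

Sweeping `sigma_m` for a fixed gap gives a curve of decision instability against noise level; the same pattern could in principle be repeated per cue and per verifier bucket to show where verified positives start to leak into the uncertain or negative sets.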
Circularity Check
No circularity; derivation uses explicit cue definitions to augment incomplete labels
full rationale
The paper defines relation witnesses via concrete, observable rules (contact+vertical ordering for support, enclosure for containment, metric closeness for proximity, facing direction for orientation, cross-view persistence for stability) applied to input RGB-D and reconstructed geometry. These rules feed a verifier that produces verified positives, reliable negatives, and uncertain cases, which then drive a positive-unlabeled objective. No equations or steps reduce by construction to their own outputs; the chain is a standard data-augmentation pipeline whose validity rests on the external empirical accuracy of the cues rather than self-reference or self-citation. No self-citations, fitted parameters, or uniqueness theorems are invoked in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated.
invented entities (1)
- relation witness: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "witness-guided positive-unlabeled objective"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.