pith. machine review for the scientific record.

arxiv: 2604.10397 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation


Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords human-object interaction · video anticipation · joint detection and prediction · set prediction · residual transitions · temporal alignment · pair-centric modeling

The pith

Jointly learning detection and anticipation of human-object interactions as residual pair-state transitions improves both tasks, especially at longer horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that video human-object interaction understanding improves when anticipation is not treated as a separate forecasting task but instead learned jointly with detection. It introduces the HOI-DA framework, which performs subject-object localization, present interaction detection, and future anticipation together by representing future interactions as residual transitions from current pair states. To support reliable evaluation, the authors create the DETAnt-HOI benchmark with temporal corrections that align nominal future labels more closely with actual video dynamics. Experiments on this benchmark show consistent gains for both detection and anticipation, with the largest improvements appearing at longer prediction horizons. A sympathetic reader would care because the joint modeling acts as a structural constraint that enriches pair-level video representations without needing explicit future pair construction.

Core claim

The central claim is that anticipation of future human-object interactions is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. This is realized in the HOI-DA framework through set prediction over time, where future interactions are modeled as residual transitions from current pair states rather than through externally constructed future pairs. The approach yields consistent improvements on both detection and anticipation, with larger gains observed at longer time horizons, and is supported by the temporally corrected DETAnt-HOI benchmark derived from VidHOI and Action Genome.

What carries the argument

HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states.
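
The residual-transition idea can be illustrated with a minimal sketch. The abstract gives no equations, so everything below is an assumption made for illustration: the function name `anticipate`, the use of per-class logits as the "pair state", and the example class names and values are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def anticipate(current_logits: List[float],
               residuals: Dict[int, List[float]]) -> Dict[int, List[float]]:
    """Illustrative only: future logits = current pair-state logits + residual.

    `residuals` maps a prediction horizon to a per-class change that a model
    would predict; here the values are supplied by hand.
    """
    return {
        horizon: [c + r for c, r in zip(current_logits, delta)]
        for horizon, delta in residuals.items()
    }


# Hypothetical pair state over three interaction classes: [hold, look_at, push]
current = [2.0, 0.5, -1.0]
# Hypothetical residuals: no change at horizon 1, a switch to "push" at horizon 3
residuals = {1: [0.0, 0.0, 0.0], 3: [-1.5, 0.0, 2.5]}
future = anticipate(current, residuals)
```

A zero residual reproduces the present state, so under this reading the future is anchored to the detected pair and only the change has to be predicted.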

If this is right

  • Detection accuracy improves because anticipation acts as a structural regularizer on pair representations.
  • Anticipation accuracy increases, with the benefit growing as the prediction horizon lengthens.
  • Pair-level video representations become richer when future state changes are modeled as residuals from the present.
  • Separate pipelines that first build human-object pairs and then forecast on them are outperformed by the unified set-prediction approach.
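
Set prediction, as used in DETR-style detectors, trains a fixed set of predictions by matching them one-to-one with ground-truth targets at minimum total cost. The sketch below shows that matching step by brute force for very small sets. Whether HOI-DA uses this exact matching is an assumption; the function name `match_sets` and the cost values are hypothetical, and practical systems would use the Hungarian algorithm rather than enumeration.

```python
from itertools import permutations
from typing import List, Tuple


def match_sets(cost: List[List[float]]) -> Tuple[Tuple[int, ...], float]:
    """Return the minimum-cost one-to-one assignment of predictions to targets.

    cost[i][j] is the cost of matching prediction i to target j.
    Assumes at least as many predictions as targets; predictions left
    unmatched would be treated as "no pair".
    The result tuple holds, for each target index, the matched prediction index.
    """
    n_pred, n_tgt = len(cost), len(cost[0])
    best, best_cost = (), float("inf")
    for perm in permutations(range(n_pred), n_tgt):
        total = sum(cost[p][t] for t, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost


# Hypothetical costs: 3 predicted human-object pairs, 2 ground-truth pairs
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
assignment, total = match_sets(cost)  # target 0 -> prediction 1, target 1 -> prediction 0
```

Because pairs are predicted directly as a set, no separate stage has to build human-object pairs before interactions are classified.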

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual-transition idea could be tested on other video understanding tasks such as action anticipation or multi-agent interaction forecasting without changing the core architecture.
  • If the joint model generalizes, it suggests that many current video benchmarks underestimate performance because of temporal label misalignment rather than model shortcomings.
  • Extending the residual modeling to continuous time instead of discrete horizons might further reduce the need for dense future annotations.

Load-bearing premise

That modeling future interactions as residual transitions from current pair states captures the actual dynamics without requiring explicit future pair construction, and that the temporal corrections in DETAnt-HOI remove misalignment without introducing new annotation biases.
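
To make the misalignment concern concrete, the following sketch shows one way a nominal future label could be checked against sparse keyframe timestamps. This is not the DETAnt-HOI correction procedure, which the abstract does not describe; the function `aligned_keyframe`, the tolerance parameter, and the timestamps are all assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Optional


def aligned_keyframe(keyframes: List[float], t_now: float,
                     horizon: float, tol: float) -> Optional[float]:
    """Return the annotated keyframe nearest to t_now + horizon, or None
    if no keyframe lies within `tol` seconds of that target time.

    `keyframes` must be sorted in ascending order.
    """
    target = t_now + horizon
    i = bisect_left(keyframes, target)
    # Only the neighbours on either side of the target can be nearest.
    candidates = keyframes[max(i - 1, 0): i + 1]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda k: abs(k - target))
    return nearest if abs(nearest - target) <= tol else None


# Hypothetical annotation times (seconds) with a gap between 2.0 and 7.0
keyframes = [0.0, 1.0, 2.0, 7.0, 8.0]
ok = aligned_keyframe(keyframes, t_now=0.0, horizon=1.0, tol=0.5)   # 1.0
gap = aligned_keyframe(keyframes, t_now=2.0, horizon=1.0, tol=0.5)  # None
```

In the second call, taking "the next keyframe" by index would return the frame at 7.0 s as the 1-second-ahead label, which is the kind of mismatch between nominal and actual future that a temporal correction has to handle.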

What would settle it

A controlled comparison on DETAnt-HOI in which a model trained only on detection followed by a separate anticipation head matches or exceeds the joint HOI-DA model at long horizons would falsify the claim that joint residual-transition modeling is required for the observed gains.
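
The decision rule in that comparison can be written down explicitly. In the sketch below, the function name, the choice of long horizons, and all metric values are placeholders invented for illustration; none of the numbers come from the paper.

```python
from typing import Dict, Iterable


def joint_claim_survives(joint: Dict[int, float],
                         cascaded: Dict[int, float],
                         long_horizons: Iterable[int],
                         margin: float = 0.0) -> bool:
    """True if the joint model beats the cascaded baseline by more than
    `margin` at every long horizon; False means the baseline matched or
    exceeded it somewhere, which would count against the claim."""
    return all(joint[h] - cascaded[h] > margin for h in long_horizons)


# Placeholder scores per horizon (seconds); NOT results from the paper.
joint = {1: 0.40, 3: 0.35, 5: 0.30}
weak_baseline = {1: 0.39, 3: 0.31, 5: 0.24}
strong_baseline = {1: 0.39, 3: 0.31, 5: 0.30}

survives = joint_claim_survives(joint, weak_baseline, long_horizons=[3, 5])
falsified = not joint_claim_survives(joint, strong_baseline, long_horizons=[3, 5])
```

A nonzero `margin` could stand in for run-to-run variance, so that a tie within noise is not read as a win for either side.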

Figures

Figures reproduced from arXiv: 2604.10397 by Di Wen, Jiale Wei, Junwei Zheng, Kunyu Peng, Rainer Stiefelhagen, Ruiping Liu, Yuanhao Luo, Yufan Chen.

Figure 1: Comparison between conventional multi-stage video HOI pipelines and HOI-DA. Unlike prior methods that separate …
Figure 2: Overview of our model. Given an observed clip, HOI-DA builds a shared spatio-temporal visual memory and uses a …
Figure 3: Temporal Summary Module. Learnable horizon …
Figure 4: Temporal non-continuity in the VidHOI evalua…
Figure 5: Qualitative results of unified HOI detection and multi-horizon anticipation on VidHOI. Given an observed video …
Figure 6: Pair tracking under occlusion and camera motion on VidHOI. Numbered bounding boxes denote persistent human–…
Figure 7: Joint HOI detection and multi-horizon anticipation across diverse scenes on VidHOI. For each scenario, we visualize …
Figure 8: Cross-attention heatmaps at three decoder stages of HOI-DA on VidHOI.
Original abstract

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome, and HOI-DA, a pair-centric set-prediction framework that jointly performs subject-object localization, present HOI detection, and multi-horizon anticipation by modeling future interactions as residual transitions from current pair states. Experiments report consistent improvements in both detection and anticipation, with larger gains at longer horizons, leading to the claim that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning.

Significance. If the empirical results hold and the residual-transition modeling generalizes beyond stable pairs, the work could shift video HOI research toward unified detection-plus-anticipation architectures rather than cascaded pipelines. The corrected benchmark addresses a real evaluation issue (temporal misalignment of sparse keyframes) and the public release of code and data would be a concrete contribution to the community.

major comments (1)
  1. [HOI-DA framework (method)] The HOI-DA core design choice of representing future HOI labels as residuals from currently detected pairs (described in the method) avoids explicit future-pair construction but does not address pair birth, death, or re-assignment across horizons. Because real video dynamics routinely involve new subject-object pairings, any measured joint-learning benefit may be confined to clips with stable pair sets; longer-horizon gains could therefore reflect dataset bias rather than a general structural advantage. A concrete test on sequences containing entering/exiting entities or disengaging pairs is needed to support the central claim.
minor comments (2)
  1. [Abstract] The abstract states that experiments show consistent improvements and larger gains at longer horizons but supplies no quantitative metrics, ablation tables, or error analysis; this makes it impossible for a reader to assess the magnitude or robustness of the reported gains.
  2. [Benchmark construction] Clarification is needed on whether DETAnt-HOI's temporal corrections create any new pair-level supervision or merely realign existing labels; if the latter, the joint-learning experiments still operate under the same fixed-pair assumption.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of DETAnt-HOI and HOI-DA. We address the single major comment below.

Point-by-point responses
  1. Referee: The HOI-DA core design choice of representing future HOI labels as residuals from currently detected pairs (described in the method) avoids explicit future-pair construction but does not address pair birth, death, or re-assignment across horizons. Because real video dynamics routinely involve new subject-object pairings, any measured joint-learning benefit may be confined to clips with stable pair sets; longer-horizon gains could therefore reflect dataset bias rather than a general structural advantage. A concrete test on sequences containing entering/exiting entities or disengaging pairs is needed to support the central claim.

    Authors: We agree that the residual-transition formulation primarily evolves interactions from the current detected pair set and does not explicitly instantiate new pairs or terminate existing ones at future horizons. The set-prediction backbone does allow the cardinality of the output set to vary, but the residual mechanism indeed ties future predictions to the present pairs. To strengthen the central claim, we will add a targeted evaluation on dynamic subsets of DETAnt-HOI that contain entering/exiting entities and disengaging pairs, reporting separate metrics for these cases in the revised manuscript. This analysis will clarify whether the observed joint-learning gains generalize beyond stable-pair clips.

    Revision: yes
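
One way to carve out such dynamic subsets is to compare the set of annotated pairs at the present frame with the set at a future horizon. The sketch below is a hypothetical illustration, not the authors' protocol: the pair representation as `(subject_id, object_id)` track ids and the function names are assumptions.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # (subject_id, object_id); hypothetical track ids


def pair_dynamics(present: Set[Pair], future: Set[Pair]) -> Dict[str, Set[Pair]]:
    """Split pairs into those that appear, disappear, or persist."""
    return {
        "born": future - present,
        "died": present - future,
        "stable": present & future,
    }


def is_dynamic(present: Set[Pair], future: Set[Pair]) -> bool:
    """A clip counts as dynamic if any pair is born or dies by the horizon."""
    changes = pair_dynamics(present, future)
    return bool(changes["born"] or changes["died"])


# Hypothetical clip: person 1 drops object 11, person 2 picks it up
present = {(1, 10), (1, 11)}
future = {(1, 10), (2, 11)}
changes = pair_dynamics(present, future)
```

Reporting metrics separately for clips where `is_dynamic` is true and false would show whether gains at long horizons persist once pair birth and death are present.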

Circularity Check

0 steps flagged

No circularity: empirical results on derived benchmark with independent design choices

Full rationale

The paper introduces DETAnt-HOI as a temporally corrected benchmark derived from existing VidHOI and Action Genome datasets, and HOI-DA as a pair-centric framework that models future HOI as residual transitions from current pair states. The headline claim—that joint detection and anticipation yields gains, especially at longer horizons—is presented as an experimental outcome rather than a mathematical derivation. No equations, uniqueness theorems, or self-citations are invoked to force the result by construction; the residual-transition ansatz is an explicit modeling decision whose validity is tested via ablation and comparison on the benchmark. This is a standard empirical setup with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described in the provided text. The residual-transition modeling is treated as a modeling choice rather than a new postulated entity.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction

    cs.CV 2026-05 unverdicted novelty 5.0

    IMPACT-HOI introduces a supervisory control framework for constructing partial HOI event graphs in procedural videos via trust-calibrated automation and atomic rollback to reduce manual annotation effort while preserv...
