arxiv: 2604.10397 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

Yuanhao Luo, Di Wen, Kunyu Peng, Ruiping Liu, Junwei Zheng, Yufan Chen, Jiale Wei, Rainer Stiefelhagen

Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords human-object interaction · video anticipation · joint detection and prediction · set prediction · residual transitions · temporal alignment · pair-centric modeling

The pith

Jointly learning detection and anticipation of human-object interactions as residual pair-state transitions improves both tasks, especially at longer horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that video human-object interaction understanding improves when anticipation is not treated as a separate forecasting task but instead learned jointly with detection. It introduces the HOI-DA framework, which performs subject-object localization, present interaction detection, and future anticipation together by representing future interactions as residual transitions from current pair states. To support reliable evaluation, the authors create the DETAnt-HOI benchmark with temporal corrections that align nominal future labels more closely with actual video dynamics. Experiments on this benchmark show consistent gains for both detection and anticipation, with the largest improvements appearing at longer prediction horizons. A sympathetic reader would care because the joint modeling acts as a structural constraint that enriches pair-level video representations without needing explicit future pair construction.

Core claim

The central claim is that anticipation of future human-object interactions is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. This is realized in the HOI-DA framework through set prediction over time, where future interactions are modeled as residual transitions from current pair states rather than through externally constructed future pairs. The approach yields consistent improvements on both detection and anticipation, with larger gains observed at longer time horizons, and is supported by the temporally corrected DETAnt-HOI benchmark derived from VidHOI and Action Genome.
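To make the phrase "set prediction" concrete: detectors in the DETR family are trained by first finding a one-to-one assignment between an unordered set of predictions and the ground-truth items, then computing losses on the matched pairs. The Python sketch below illustrates that matching step for human-object pairs. It is not HOI-DA's implementation, and the text above does not say which matching cost the paper uses. The cost function `pair_cost`, the dictionary layout, and the brute-force search (a stand-in for the Hungarian algorithm, practical only for very small sets) are all hypothetical.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def pair_cost(pred, gt):
    """Hypothetical matching cost between one predicted and one ground-truth pair."""
    box_cost = l1(pred["human"], gt["human"]) + l1(pred["object"], gt["object"])
    # Low predicted probability for the ground-truth interaction raises the cost.
    cls_cost = 1.0 - pred["probs"].get(gt["label"], 0.0)
    return box_cost + cls_cost


def match(preds, gts):
    """Return the minimum-cost one-to-one assignment as (pred_idx, gt_idx) tuples.

    Brute force over permutations; assumes len(preds) >= len(gts).
    """
    best_cost, best = float("inf"), []
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, [(p, g) for g, p in enumerate(perm)]
    return best


# Three predicted pairs (the last one is a spurious query), two ground-truth pairs.
preds = [
    {"human": (0.1, 0.1, 0.4, 0.9), "object": (0.5, 0.5, 0.7, 0.7), "probs": {"hold": 0.8}},
    {"human": (0.6, 0.1, 0.9, 0.9), "object": (0.2, 0.6, 0.3, 0.8), "probs": {"watch": 0.6}},
    {"human": (0.0, 0.0, 0.1, 0.1), "object": (0.0, 0.0, 0.1, 0.1), "probs": {}},
]
gts = [
    {"human": (0.6, 0.1, 0.9, 0.9), "object": (0.2, 0.6, 0.3, 0.8), "label": "watch"},
    {"human": (0.1, 0.1, 0.4, 0.9), "object": (0.5, 0.5, 0.7, 0.7), "label": "hold"},
]
assignment = match(preds, gts)
```

In this toy case prediction 1 is matched to ground truth 0 and prediction 0 to ground truth 1, regardless of listing order. In DETR-style training the unmatched prediction (index 2) would be supervised as "no interaction".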

What carries the argument

HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states.
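One way to read "residual transitions from current pair states" is that each future prediction is the present prediction plus a horizon-specific correction, so a zero correction reproduces the present. The Python sketch below shows that reading on a single pair. It is an illustration under assumptions, not the paper's model: applying the residual in logit space, the class names, the horizon keys, and the hand-written residual values are all hypothetical. A trained model would predict the residuals from video features.

```python
import math


def sigmoid(x):
    """Map a logit to a probability (multi-label setting, one sigmoid per class)."""
    return 1.0 / (1.0 + math.exp(-x))


def anticipate(current_logits, residuals_by_horizon):
    """Future logits per horizon = current pair-state logits + per-horizon residual."""
    future = {}
    for horizon, delta in residuals_by_horizon.items():
        future[horizon] = [z + d for z, d in zip(current_logits, delta)]
    return future


# One human-object pair, three hypothetical interaction classes.
classes = ["hold", "lift", "put_down"]
current_logits = [2.0, -1.0, -3.0]  # present state: mostly "hold"

# Hypothetical residuals, keyed by an abstract horizon index.
residuals = {
    1: [0.0, 0.0, 0.0],    # zero residual: the future equals the present
    3: [-0.5, 1.5, 0.0],   # "lift" becomes more likely
    5: [-3.0, 0.5, 4.0],   # "hold" fades, "put_down" takes over
}

future_logits = anticipate(current_logits, residuals)
future_probs = {h: [sigmoid(z) for z in zs] for h, zs in future_logits.items()}
```

The design point this sketch makes is that the pair identity never changes across horizons; only its interaction scores move. That is also why the referee's concern about pairs appearing or disappearing applies to this formulation.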

If this is right

Detection accuracy improves because anticipation acts as a structural regularizer on pair representations.
Anticipation accuracy increases, with the benefit growing as the prediction horizon lengthens.
Pair-level video representations become richer when future state changes are modeled as residuals from the present.
Separate pipelines that first build human-object pairs and then forecast on them are outperformed by the unified set-prediction approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same residual-transition idea could be tested on other video understanding tasks such as action anticipation or multi-agent interaction forecasting without changing the core architecture.
If the joint model generalizes, it suggests that many current video benchmarks underestimate performance because of temporal label misalignment rather than model shortcomings.
Extending the residual modeling to continuous time instead of discrete horizons might further reduce the need for dense future annotations.

Load-bearing premise

That modeling future interactions as residual transitions from current pair states captures the actual dynamics without requiring explicit future pair construction, and that the temporal corrections in DETAnt-HOI remove misalignment without introducing new annotation biases.

What would settle it

A controlled comparison on DETAnt-HOI in which a model trained only on detection followed by a separate anticipation head matches or exceeds the joint HOI-DA model at long horizons would falsify the claim that joint residual-transition modeling is required for the observed gains.

Figures

Figures reproduced from arXiv: 2604.10397 by Di Wen, Jiale Wei, Junwei Zheng, Kunyu Peng, Rainer Stiefelhagen, Ruiping Liu, Yuanhao Luo, Yufan Chen.

**Figure 1.** Comparison between conventional multi-stage video HOI pipelines and HOI-DA. Unlike prior methods that separate ...

**Figure 2.** Overview of our model. Given an observed clip, HOI-DA builds a shared spatio-temporal visual memory and uses a ...

**Figure 3.** Temporal Summary Module. Learnable horizon ...

**Figure 4.** Temporal non-continuity in the VidHOI evalua...

**Figure 5.** Qualitative results of unified HOI detection and multi-horizon anticipation on VidHOI. Given an observed video ...

**Figure 6.** Pair tracking under occlusion and camera motion on VidHOI. Numbered bounding boxes denote persistent human–...

**Figure 7.** Joint HOI detection and multi-horizon anticipation across diverse scenes on VidHOI. For each scenario, we visualize ...

**Figure 8.** Cross-attention heatmaps at three decoder stages of HOI-DA on VidHOI.

Original abstract

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper unifies HOI detection and anticipation via residual transitions on fixed pairs and fixes some annotation timing issues, but the fixed-pair assumption looks like the main weak point.

The main contribution is HOI-DA, which does detection, present HOI classification, and future anticipation in one pass by treating future labels as residuals from the current detected subject-object pairs. They also put out DETAnt-HOI, a version of VidHOI and Action Genome with temporal corrections to reduce misalignment between keyframes and actual future frames. The reported pattern is that joint training helps, and the lift grows at longer horizons. That part is straightforward and addresses a real pipeline problem where anticipation was usually bolted on after separate pair construction.
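The misalignment problem mentioned here can be illustrated with a small audit: when keyframes are sparse or have gaps, "the label k keyframes ahead" is not always "the label k seconds ahead". The Python sketch below compares the two under stated assumptions. The text does not describe how DETAnt-HOI performs its corrections, so the function names, the tolerance rule, and the toy timestamps are all hypothetical.

```python
from bisect import bisect_left


def nearest_keyframe(timestamps, target):
    """Index of the annotated keyframe closest in time to `target` (sorted input)."""
    i = bisect_left(timestamps, target)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    return min(candidates, key=lambda j: abs(timestamps[j] - target))


def audit_future_labels(timestamps, steps_ahead, step_seconds, tolerance):
    """Compare index-based ("nominal") future targets with time-based ones.

    `drift` is how far the nominal target lies from the intended future time.
    `aligned` is the keyframe within `tolerance` of that time, or None.
    """
    report = []
    horizon = steps_ahead * step_seconds
    for i, t in enumerate(timestamps):
        nominal = i + steps_ahead
        if nominal >= len(timestamps):
            break
        target_time = t + horizon
        drift = timestamps[nominal] - target_time
        j = nearest_keyframe(timestamps, target_time)
        aligned = j if abs(timestamps[j] - target_time) <= tolerance else None
        report.append({"index": i, "nominal": nominal, "drift": drift, "aligned": aligned})
    return report


# Keyframes nominally 1 s apart, with an annotation gap between 2 s and 6 s.
timestamps = [0.0, 1.0, 2.0, 6.0, 7.0, 8.0]
report = audit_future_labels(timestamps, steps_ahead=1, step_seconds=1.0, tolerance=0.25)
misaligned = [row["index"] for row in report if row["aligned"] is None]
```

In this toy timeline only the keyframe at 2 s is affected: its nominal "1 s ahead" label actually comes from 4 s later, and no annotated keyframe lies near the intended time. A corrected benchmark could drop or re-target such samples; which of these the authors do is not stated here.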

Referee Report

1 major / 2 minor

Summary. The manuscript introduces DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome, and HOI-DA, a pair-centric set-prediction framework that jointly performs subject-object localization, present HOI detection, and multi-horizon anticipation by modeling future interactions as residual transitions from current pair states. Experiments report consistent improvements in both detection and anticipation, with larger gains at longer horizons, leading to the claim that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning.

Significance. If the empirical results hold and the residual-transition modeling generalizes beyond stable pairs, the work could shift video HOI research toward unified detection-plus-anticipation architectures rather than cascaded pipelines. The corrected benchmark addresses a real evaluation issue (temporal misalignment of sparse keyframes) and the public release of code and data would be a concrete contribution to the community.

major comments (1)

[HOI-DA framework (method)] The HOI-DA core design choice of representing future HOI labels as residuals from currently detected pairs (described in the method) avoids explicit future-pair construction but does not address pair birth, death, or re-assignment across horizons. Because real video dynamics routinely involve new subject-object pairings, any measured joint-learning benefit may be confined to clips with stable pair sets; longer-horizon gains could therefore reflect dataset bias rather than a general structural advantage. A concrete test on sequences containing entering/exiting entities or disengaging pairs is needed to support the central claim.

minor comments (2)

[Abstract] The abstract states that experiments show consistent improvements and larger gains at longer horizons but supplies no quantitative metrics, ablation tables, or error analysis; this makes it impossible for a reader to assess the magnitude or robustness of the reported gains.
[Benchmark construction] Clarification is needed on whether DETAnt-HOI's temporal corrections create any new pair-level supervision or merely realign existing labels; if the latter, the joint-learning experiments still operate under the same fixed-pair assumption.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of DETAnt-HOI and HOI-DA. We address the single major comment below.

Referee: The HOI-DA core design choice of representing future HOI labels as residuals from currently detected pairs (described in the method) avoids explicit future-pair construction but does not address pair birth, death, or re-assignment across horizons. Because real video dynamics routinely involve new subject-object pairings, any measured joint-learning benefit may be confined to clips with stable pair sets; longer-horizon gains could therefore reflect dataset bias rather than a general structural advantage. A concrete test on sequences containing entering/exiting entities or disengaging pairs is needed to support the central claim.

Authors: We agree that the residual-transition formulation primarily evolves interactions from the current detected pair set and does not explicitly instantiate new pairs or terminate existing ones at future horizons. The set-prediction backbone does allow the cardinality of the output set to vary, but the residual mechanism indeed ties future predictions to the present pairs. To strengthen the central claim, we will add a targeted evaluation on dynamic subsets of DETAnt-HOI that contain entering/exiting entities and disengaging pairs, reporting separate metrics for these cases in the revised manuscript. This analysis will clarify whether the observed joint-learning gains generalize beyond stable-pair clips. revision: yes
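The targeted evaluation promised in this response amounts to partitioning clips by whether the set of human-object pairs changes between the present and any future horizon, then reporting metrics per partition. The Python sketch below shows one such split. The clip layout, the pair identifiers, and the per-clip scores are hypothetical, and averaging a per-clip score is a simplification of whatever detection and anticipation metrics the revised manuscript reports.

```python
def split_by_pair_dynamics(clips):
    """Partition clip ids into 'stable' and 'dynamic'.

    A clip is dynamic if the pair set at any future horizon differs from the
    present pair set (a pair appears, disappears, or is re-assigned).
    """
    stable, dynamic = [], []
    for clip in clips:
        now = set(clip["pairs_now"])
        changed = any(set(future) != now for future in clip["pairs_future"].values())
        (dynamic if changed else stable).append(clip["id"])
    return stable, dynamic


def mean_score(scores, clip_ids):
    """Average a per-clip score over a subset; None if the subset is empty."""
    values = [scores[c] for c in clip_ids]
    return sum(values) / len(values) if values else None


clips = [
    # Same single pair at every horizon: stable.
    {"id": "a", "pairs_now": [("p1", "cup")],
     "pairs_future": {1: [("p1", "cup")], 3: [("p1", "cup")]}},
    # A second person engages the cup at horizon 3: pair birth.
    {"id": "b", "pairs_now": [("p1", "cup")],
     "pairs_future": {1: [("p1", "cup")], 3: [("p1", "cup"), ("p2", "cup")]}},
    # Pairs disengage over time: pair death.
    {"id": "c", "pairs_now": [("p1", "ball"), ("p2", "ball")],
     "pairs_future": {1: [("p2", "ball")], 3: []}},
]
scores = {"a": 0.75, "b": 0.25, "c": 0.5}  # hypothetical per-clip anticipation scores

stable, dynamic = split_by_pair_dynamics(clips)
stable_mean = mean_score(scores, stable)
dynamic_mean = mean_score(scores, dynamic)
```

If the joint model's advantage over a cascaded baseline persists on the dynamic subset, the referee's dataset-bias explanation becomes less plausible; if it vanishes there, the central claim would need to be narrowed to stable pairs.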

Circularity Check

0 steps flagged

No circularity: empirical results on derived benchmark with independent design choices

The paper introduces DETAnt-HOI as a temporally corrected benchmark derived from existing VidHOI and Action Genome datasets, and HOI-DA as a pair-centric framework that models future HOI as residual transitions from current pair states. The headline claim—that joint detection and anticipation yields gains, especially at longer horizons—is presented as an experimental outcome rather than a mathematical derivation. No equations, uniqueness theorems, or self-citations are invoked to force the result by construction; the residual-transition ansatz is an explicit modeling decision whose validity is tested via ablation and comparison on the benchmark. This is a standard empirical setup with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described in the provided text. The residual-transition modeling is treated as a modeling choice rather than a new postulated entity.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction
cs.CV · 2026-05 · unverdicted · novelty 5.0

IMPACT-HOI introduces a supervisory control framework for constructing partial HOI event graphs in procedural videos via trust-calibrated automation and atomic rollback to reduce manual annotation effort while preserv...
