pith. machine review for the scientific record.

arxiv: 2604.14062 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.MM

Recognition: unknown

OneHOI: Unifying Human-Object Interaction Generation and Editing

Chee Seng Chan, Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan


Pith reviewed 2026-05-10 14:18 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords human-object interaction · diffusion transformer · generation · editing · unified model · relational attention · computer vision

The pith

OneHOI unifies human-object interaction generation and editing in a single diffusion transformer using shared structured representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OneHOI as a single framework that both synthesises new human-object interaction scenes from triplets and layouts and modifies existing interactions through text or masks. It achieves this by routing every task through one conditional denoising process in a Relational Diffusion Transformer that uses role- and instance-aware tokens, layout-based action grounding, topology-enforcing attention, and specialised rotary embeddings to keep multiple interactions apart. Joint training with modality dropout on the new HOI-Edit-44K dataset, plus other sources, lets the model accept mixed inputs such as full HOI triplets, object-only cues, or arbitrary masks. A reader would care because this removes the previous split between separate generation and editing pipelines, allowing more fluid control in visual content creation.
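
Read mechanically, the unification amounts to one denoiser whose conditioning slots are all optional. A minimal sketch of such an interface, with hypothetical names and signatures (the summary above does not give the paper's actual module interfaces):

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class HOICondition:
    """One <person, action, object> triplet plus optional spatial grounding."""
    subject_text: str
    action_text: str
    object_text: str
    subject_box: Optional[torch.Tensor] = None  # (4,) xyxy box; None = layout-free
    object_box: Optional[torch.Tensor] = None

def denoise_step(model, x_t, t, prompt,
                 hois=None,          # list[HOICondition] or None (plain text-to-image)
                 edit_mask=None,     # arbitrary-shape mask; None for pure generation
                 source_image=None): # present only when editing an existing image
    """One conditional denoising step: every condition beyond the prompt is
    optional, so generation and editing go through the same call."""
    return model(x_t, t, prompt=prompt, hois=hois or [],
                 edit_mask=edit_mask, source_image=source_image)
```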

Core claim

We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on HOI-Edit-44K along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control.

What carries the argument

The Relational Diffusion Transformer (R-DiT), which processes verb-mediated relations via role- and instance-aware HOI tokens, layout-based Action Grounding, Structured HOI Attention for topology, and HOI RoPE for multi-interaction disentanglement within one conditional denoising process.
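
The text names these components without defining them, so here is one plausible reading of Structured HOI Attention as an explicit mask: tokens within a triplet may attend to each other, while cross-interaction pairs are blocked. A sketch under that assumption, not the authors' implementation:

```python
import torch

def structured_hoi_attention_mask(num_tokens: int,
                                  triplets: list[tuple[int, int, int]]) -> torch.Tensor:
    """Boolean attention mask over HOI tokens.

    triplets: (subject_idx, action_idx, object_idx) token indices, one per
    interaction. Returns mask[i, j] = True where token i may attend to j.
    Tokens always attend to themselves; each action token is restricted to
    its own subject and object, enforcing a verb-mediated topology.
    """
    mask = torch.eye(num_tokens, dtype=torch.bool)
    for subj, act, obj in triplets:
        # action <-> its own subject/object, and subject <-> object, within one HOI
        for i, j in [(act, subj), (act, obj), (subj, obj)]:
            mask[i, j] = True
            mask[j, i] = True
    return mask

# Two interactions over six tokens: (s0, a0, o0) and (s1, a1, o1).
# Cross-interaction pairs (e.g., a0 -> o1) stay masked out.
m = structured_hoi_attention_mask(6, [(0, 1, 2), (3, 4, 5)])
```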

If this is right

  • The model supports layout-guided, layout-free, arbitrary-mask, and mixed-condition inputs in one process.
  • Joint training on HOI-Edit-44K and related datasets yields state-of-the-art results on both generation and editing tasks.
  • Shared structured representations eliminate the need for separate models when switching between creation and modification.
  • Modality dropout enables the system to handle incomplete or mixed conditions, such as object-only entities alongside full interactions; a training-time sketch follows this list.
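
The last bullet is the one doing the unifying work. A minimal sketch of per-sample modality dropout, assuming (as in classifier-free-guidance-style training; the text above does not confirm the exact scheme) that each condition is dropped independently:

```python
import random

def apply_modality_dropout(cond: dict, p_drop: float = 0.1) -> dict:
    """Independently drop each optional conditioning modality with prob p_drop.

    cond maps modality names (e.g., 'hoi_triplets', 'layout', 'mask',
    'object_entities' — names are ours) to their values; dropped modalities
    become None, so the model sees every subset of conditions during joint
    training and learns layout-guided, layout-free, and mixed-condition modes.
    """
    return {k: (None if random.random() < p_drop else v) for k, v in cond.items()}

# per training sample:
# cond = apply_modality_dropout({'hoi_triplets': t, 'layout': boxes, 'mask': m})
```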

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The unification approach could reduce overall model count and training overhead when building broader scene-understanding systems that combine multiple interaction types.
  • Interactive applications might now allow seamless generation followed by immediate editing within the same trained network rather than model switching.
  • Similar token-and-attention designs could be tested on related tasks such as human-human interactions or scene-graph-conditioned synthesis.

Load-bearing premise

The assumption that the Relational Diffusion Transformer components together with modality dropout in joint training can support mixed conditions and deliver strong performance on both generation and editing without hidden trade-offs.

What would settle it

A direct comparison where the unified model produces measurably worse generation quality or editing accuracy than specialized models on standard benchmarks, or fails to correctly integrate mixed inputs such as an HOI triplet plus an independent object mask.

Figures

Figures reproduced from arXiv: 2604.14062 by Chee Seng Chan, Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan.

Figure 1: OneHOI unifies Human-Object Interaction (HOI) generation and editing in a single, versatile model. It excels at challenging HOI editing, from text-guided changes to novel layout-guided control and novel multi-HOI edits. For generation, OneHOI synthesises scenes from text, layouts, arbitrary shapes, or mixed conditions, offering unprecedented control over relational understanding in images.

Figure 2: Unified HOI generation and editing. OneHOI enables a single-model multi-step workflow. It begins with (i) Mixed-Condition Generation, synthesising a complex scene from layout-guided HOIs with arbitrary shape. Then, it performs (ii) Layout-free HOI Editing (e.g., change him to plant the flag), followed by (iii) Layout-guided HOI Editing (e.g., add another astronaut and driving a rover) and (iv) Attribute…

Figure 3: (a) OneHOI unifies HOI editing and generation tasks on a DiT backbone. The pipeline features an HOI Encoder to inject role and instance cues, and Structured HOI Attention to enforce verb-mediated topology and spatial grounding. (b, c) To separate instances, in contrast to the Original RoPE (b), HOI RoPE (c) provides unique positional indices for each interaction.

Figure 4: Action-token→image attention heatmaps from the baseline. The “Between” region proposed in InteractDiffusion [13] misses where the action actually attends, while our “Union” region (subject ∪ object) better matches the attention footprint. (The union-region computation is sketched after this figure list.)

Figure 5: (a) HOI attention mask. Colours match Fig. …

Figure 6: Qualitative comparison for layout-free HOI editing. Our method successfully renders the new interaction while preserving…

Figure 7: Qualitative comparison for HOI generation. While object-level methods correctly place entities, they fail to synthesise specified…

Figure 10: Progressively adding components improves the inter…

Figure 9: Versatile control in HOI generation. Our model supports…

Figure 12: Results of the Human Preference Study. Aggregated…

Figure 13: Attention footprint of Flux.1. “Union” better matches…

Figure 14: Versatile workflow for unified HOI generation and editing using OneHOI. OneHOI enables a seamless, multi-step workflow within a single model, showcasing diverse conditional control. Starting with: Top Row: Urban Park Scene. (1) Mixed-Condition Generation synthesises a complex scene from layout-guided HOIs (i.e., walking dog) and arbitrary shape-guided independent objects (i.e., lamp post, leash), alongside…

Figure 15: Examples from the HOI-Edit-44K dataset.

Figure 16: Treemap visualising the distribution of the interacting…

Figure 18: Distribution of the 54 object categories within the…

Figure 19: Distribution of source (pre-edit) actions in Multi…

Figure 21: Sankey diagram visualising the action transitions in the…

Figure 22: Additional qualitative results for HOI generation. These examples further highlight the limitations of baselines, which often…

Figure 23: Additional qualitative comparisons for layout-free HOI edits.

Figure 24: Examples from the MultiHOIEdit benchmark.
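
To make Figure 4's distinction concrete, here is a minimal sketch of the “Union” grounding region, assuming axis-aligned (x1, y1, x2, y2) boxes; whether the paper grounds with boxes or pixel masks is not stated in the excerpts above, and the function name is ours:

```python
def union_region(subject_box, object_box):
    """Tight bounding box covering subject ∪ object for an action token.

    The 'between' region of InteractDiffusion [13] covers only the gap
    separating the two boxes; the union covers both participants, which
    Figure 4 argues better matches where action tokens actually attend.
    """
    sx1, sy1, sx2, sy2 = subject_box
    ox1, oy1, ox2, oy2 = object_box
    return (min(sx1, ox1), min(sy1, oy1), max(sx2, ox2), max(sy2, oy2))

# e.g., person box (10, 20, 60, 120) holding an object at (50, 40, 90, 80)
# -> union (10, 20, 90, 120) spans both participants.
```
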
original abstract

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process. The core Relational Diffusion Transformer (R-DiT) uses role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. It is trained jointly with modality dropout on HOI-Edit-44K plus HOI and object-centric datasets to support layout-guided, layout-free, arbitrary-mask, and mixed-condition control, claiming state-of-the-art results on both tasks.

Significance. If the unification holds without performance trade-offs, this would be a meaningful advance in HOI modeling by replacing separate generation and editing pipelines with one model supporting flexible mixed conditions. The public code release at the provided URL is a clear strength that supports reproducibility and community validation.

major comments (2)
  1. [Abstract] The assertion of achieving 'state-of-the-art results across both HOI generation and editing' supplies no quantitative metrics, benchmark tables, or specific comparisons. This is load-bearing for the central unification claim, as the abstract provides no evidence that joint training maintains or exceeds specialized models on pure generation or editing benchmarks.
  2. [Experimental evaluation] The manuscript lacks any ablation isolating the effect of joint training with modality dropout versus task-specific training on the R-DiT components. Such an ablation would directly test the assumption that the shared structured representations support arbitrary mixed conditions without hidden trade-offs on generation (e.g., layout-conditioned synthesis) or editing benchmarks.
minor comments (1)
  1. [Method] The description of HOI RoPE would benefit from an explicit equation or diagram contrasting it with standard rotary position embeddings to clarify the disentanglement mechanism for multi-HOI scenes.
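
For readers who want the baseline the referee is pointing at: standard RoPE [36] rotates each two-dimensional feature pair by an angle proportional to the token's position index. One plausible reading of HOI RoPE, consistent with Figure 3's “unique positional indices for each interaction”, is an index offset per interaction; the exact formulation is the authors' to supply.

```latex
% Standard RoPE (Su et al. [36]): token at position m, feature pair i of a
% d-dimensional query/key vector x:
\mathrm{RoPE}(x, m)_{[2i,\,2i+1]}
  = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\
                    \sin m\theta_i & \cos m\theta_i \end{pmatrix}
    x_{[2i,\,2i+1]},
\qquad \theta_i = 10000^{-2i/d}.
% A hedged guess at HOI RoPE: assign interaction k the shifted index
% m' = m + k\,\Delta for some offset \Delta, so tokens of different
% interactions fall in disjoint positional ranges.
```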

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and constructive suggestions. Below, we provide detailed responses to the major comments, outlining the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] The assertion of achieving 'state-of-the-art results across both HOI generation and editing' supplies no quantitative metrics, benchmark tables, or specific comparisons. This is load-bearing for the central unification claim, as the abstract provides no evidence that joint training maintains or exceeds specialized models on pure generation or editing benchmarks.

    Authors: We agree that providing quantitative metrics in the abstract would better support the central unification claim. In the revised manuscript, we will update the abstract to include specific performance metrics and comparisons from our experimental results, such as key improvements on generation and editing benchmarks. This will offer immediate evidence while keeping the abstract concise, with full details remaining in the main text and tables. revision: yes

  2. Referee: [Experimental evaluation] The manuscript lacks any ablation isolating the effect of joint training with modality dropout versus task-specific training on the R-DiT components. Such an ablation would directly test the assumption that the shared structured representations support arbitrary mixed conditions without hidden trade-offs on generation (e.g., layout-conditioned synthesis) or editing benchmarks.

    Authors: We acknowledge the importance of this ablation to validate that joint training does not introduce trade-offs. We will add a dedicated ablation study in the revised manuscript that compares the jointly trained model with modality dropout against separately trained task-specific models on both HOI generation and editing benchmarks. This will provide direct evidence regarding the effectiveness of the shared representations for mixed conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; novel architecture stands on independent design and training

full rationale

The paper introduces OneHOI as a new diffusion transformer with R-DiT components (role/instance-aware tokens, Action Grounding, Structured HOI Attention, HOI RoPE) trained jointly via modality dropout on HOI-Edit-44K and other datasets. No equations, predictions, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The unification of generation and editing is claimed through architectural choices and empirical SOTA results rather than any renaming or redefinition of known quantities. Self-citations, if present for prior HOI work, are not load-bearing for the core contribution, which remains a self-contained model proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The abstract describes a new model relying on standard diffusion assumptions plus several invented architectural components whose effectiveness is asserted without supporting derivations or data in the provided text.

axioms (1)
  • standard math Diffusion process assumptions for conditional denoising
    Implicit in any diffusion transformer for generation/editing.
invented entities (3)
  • Relational Diffusion Transformer (R-DiT) no independent evidence
    purpose: Models verb-mediated relations via role- and instance-aware HOI tokens
    New transformer variant introduced to unify the tasks.
  • Structured HOI Attention no independent evidence
    purpose: Enforces interaction topology
    New attention mechanism described in abstract.
  • HOI RoPE no independent evidence
    purpose: Disentangles multi-HOI scenes
    New positional encoding variant.
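
Of the three, the HOI tokens are the easiest to illustrate. A sketch of role- and instance-aware tokens, assuming the common pattern of adding learned role and instance embeddings to text features; this illustrates the idea, not the authors' encoder:

```python
import torch
import torch.nn as nn

class HOITokenEncoder(nn.Module):
    """Tag each triplet token with its role (subject/action/object) and its
    interaction instance, so attention can tell 'person #1 riding' apart
    from 'person #2 holding'."""

    def __init__(self, dim: int, max_instances: int = 8):
        super().__init__()
        self.role_emb = nn.Embedding(3, dim)             # 0=subject, 1=action, 2=object
        self.inst_emb = nn.Embedding(max_instances, dim)

    def forward(self, text_feats: torch.Tensor, instance_ids: torch.Tensor) -> torch.Tensor:
        # text_feats: (num_hoi, 3, dim) — embeddings of <person, action, object>
        # instance_ids: (num_hoi,) — which interaction each triplet belongs to
        roles = torch.arange(3, device=text_feats.device)      # (3,)
        out = text_feats + self.role_emb(roles)                # broadcast over triplets
        out = out + self.inst_emb(instance_ids)[:, None, :]    # broadcast over roles
        return out
```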

pith-pipeline@v0.9.0 · 5536 in / 1414 out tokens · 69694 ms · 2026-05-10T14:18:42.154984+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, 2023.
  2. [2] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In ICCV, pages 22560–22570, 2023.
  3. [3] Yichao Cao, Qingfei Tang, Xiu Su, Song Chen, Shan You, Xiaobo Lu, and Chang Xu. Detecting any human-object interaction relationship: Universal HOI detector with spatial prompt learning on foundation models. NeurIPS, 36:739–751, 2023.
  4. [4] SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, and Dong-Jin Kim. VerbDiff: Text-only diffusion models with enhanced interaction awareness. In CVPR, pages 8041–8050, 2025.
  5. [5] Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, pages 381–389. IEEE, 2018.
  6. [6] Junwen Chen and Keiji Yanai. QAHOI: Query-based anchors for human-object interaction detection. In MVA, pages 1–5. IEEE, 2023.
  7. [7] Yingying Deng, Xiangyu He, Changwang Mei, Peisong Wang, and Fan Tang. FireFlow: Fast inversion of rectified flow for image semantic editing. In ICML, 2025.
  8. [8] Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, and Daniel Cohen-Or. TurboEdit: Text-based image editing using few-step diffusion models. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.
  9. [9] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
  10. [10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. In ICLR, 2023.
  11. [11] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  12. [12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
  13. [13] Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap-Peng Tan, and Weipeng Hu. InteractDiffusion: Interaction control in text-to-image diffusion models. In CVPR, pages 6180–6189, 2024.
  14. [14] Jiun Tian Hoe, Weipeng Hu, Wei Zhou, Chao Xie, Ziwei Wang, Chee Seng Chan, Xudong Jiang, and Yap-Peng Tan. InteractEdit: Zero-shot editing of human-object interactions in images. arXiv preprint arXiv:2503.09130, 2025.
  15. [15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  16. [16] Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer Michaeli. An edit friendly DDPM noise space: Inversion and manipulations. In CVPR, pages 12469–12478, 2024.
  17. [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  18. [18] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.
  19. [19] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 36:36652–36663, 2023.
  20. [20] Black Forest Labs. FLUX, 2024. GitHub repository.
  21. [21] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025.
  22. [22] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In CVPR, pages 22511–22521, 2023.
  23. [23] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
  24. [24] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
  25. [25] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In ECCV, 2024.
  26. [26] Jinguo Luo, Weihong Ren, Weibo Jiang, Xi'ai Chen, Qiang Wang, Zhi Han, and Honghai Liu. Discovering syntactic interaction clues for human-object interaction detection. In CVPR, pages 28212–28222, 2024.
  27. [27] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65:99–106, 2022.
  28. [28] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, pages 6038–6047, 2023.
  29. [29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  30. [30] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.
  31. [31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  32. [32] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, USA, 2nd edition, 1979.
  33. [33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  34. [34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, Part III, pages 234–241. Springer, 2015.
  35. [35] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
  36. [36] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  37. [37] Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. InstanceDiffusion: Instance-level control for image generation. In CVPR, pages 6232–6242, 2024.
  38. [38] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report.
  39. [39] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.1887….
  40. [40] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
  41. [41] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In CVPR, pages 13294–13304, 2025.
  42. [42] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 36:15903–15935, 2023.
  43. [43] Tang Xu, Wenbin Wang, and Alin Zhong. HOIEdit: Human-object interaction editing with text-to-image diffusion model. The Visual Computer, pages 1–13, 2025.
  44. [44] Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, and Stephen Gould. Exploring predicate visual context in detecting human-object interactions. In ICCV, pages 10411–10421, 2023.
  45. [45] Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, and Yu Zhang. EliGen: Entity-level controlled image generation with regional attention. arXiv preprint arXiv:2501.01097, 2025.
  46. [46] Dewei Zhou, You Li, Fan Ma, Zongxin Yang, and Yi Yang. MIGC++: Advanced multi-instance generation controller for image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  47. [47] Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, and Yi Yang. MIGC: Multi-instance generation controller for text-to-image synthesis. In CVPR, pages 6818–6828, 2024.
