pith. machine review for the scientific record.

arxiv: 2604.23249 · v2 · submitted 2026-04-25 · 💻 cs.RO

Recognition: unknown

BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances

Haoyu Zhang, Jianxiang Liu, Wenzhao Lian, Yifan Han, Yunhan Guo, Yuqi Gu


Pith reviewed 2026-05-08 07:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords affordance learning · human video imitation · robot manipulation · embodiment-agnostic representation · zero-shot transfer · task composition · 3D motion affordances

The pith

BridgeACT transfers human video demonstrations into executable robot actions by using affordances as an embodiment-agnostic bridge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that robotic manipulation skills can be acquired directly from human demonstration videos without any robot-specific demonstration data. It does so by treating affordances as a shared intermediate layer that splits manipulation into two problems: identifying grasp locations and predicting 3D motion trajectories. These elements are then converted into robot commands through a dedicated grasping module and a lightweight closed-loop controller. The approach further represents tasks as sequences of such affordance operations, enabling uniform handling of diverse interactions. A sympathetic reader would care because this removes the usual bottleneck of collecting expensive robot data while promising generalization across objects, scenes, and viewpoints.

Core claim

BridgeACT models affordance as an embodiment-agnostic intermediate representation that bridges human demonstrations and robot actions. It decomposes each manipulation into grounding task-relevant affordance regions in the current scene and predicting task-conditioned 3D motion affordances from human videos. The resulting affordances are executed on a robot via a grasping module and a lightweight closed-loop motion controller, supporting direct real-world deployment and composition of complex tasks from basic affordance operations.
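
To make the decomposition concrete, here is a minimal runnable sketch of how such a pipeline could be wired. Every name in it (ground_affordances, predict_motion_affordance, execute) is hypothetical, and both model calls are replaced by toy geometric stand-ins; in the paper these stages are learned from human videos, and only generic robot-side modules consume their outputs.

```python
"""Sketch of the affordance bridge: where to grasp, then how to move.

Toy stand-ins only -- not the paper's implementation. The point is the
interface: human-video-trained models emit (grasp region, 3D trajectory),
and the robot side consumes them without robot demonstration data.
"""
import numpy as np

def ground_affordances(scene_points: np.ndarray, task: str) -> np.ndarray:
    # Stand-in for the grounding agent: take the 32 points nearest the
    # scene centroid as the "task-relevant region". A real system would
    # use a learned, task-conditioned grounding model here.
    centroid = scene_points.mean(axis=0)
    dists = np.linalg.norm(scene_points - centroid, axis=1)
    return scene_points[np.argsort(dists)[:32]]            # (32, 3)

def predict_motion_affordance(region: np.ndarray, task: str,
                              horizon: int = 20) -> np.ndarray:
    # Stand-in for the human-video-trained motion model: a straight-line
    # 3D trajectory that lifts the region centroid by 10 cm.
    start = region.mean(axis=0)
    goal = start + np.array([0.0, 0.0, 0.10])
    ts = np.linspace(0.0, 1.0, horizon)[:, None]
    return start + ts * (goal - start)                     # (horizon, 3)

def execute(scene_points: np.ndarray, task: str):
    region = ground_affordances(scene_points, task)        # where to grasp
    traj = predict_motion_affordance(region, task)         # how to move
    grasp_point = region.mean(axis=0)                      # geometry-driven grasp
    return grasp_point, traj                               # handed to the controller

scene = np.random.default_rng(0).uniform(-0.3, 0.3, size=(2048, 3))
grasp, traj = execute(scene, task="pick up the cup")
print(grasp.shape, traj.shape)                             # (3,) (20, 3)
```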

What carries the argument

The unified tool-target affordances, which act as the embodiment-agnostic bridge by extracting grasp regions and task-conditioned 3D motion trajectories from human videos and mapping them to robot execution modules.

If this is right

  • Robots can perform real-world manipulation tasks using only models trained on human videos.
  • Diverse tasks and object-to-object interactions are handled uniformly by composing sequences of affordance operations (see the sketch after this list).
  • Generalization holds to unseen objects, scenes, and viewpoints without retraining.
  • Performance exceeds prior methods that require robot demonstration data or produce only perception-level outputs.
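
As a toy illustration of that compositional claim, the sketch below treats a task as a list of tool-target affordance operations, so one executor loop covers both single-object and object-to-object interactions. The AffordanceOp structure and the "pour water" decomposition are our illustrative guesses, not the paper's actual operation schema.

```python
"""Hypothetical task composition: a task is a sequence of affordance ops."""
from dataclasses import dataclass

@dataclass
class AffordanceOp:
    tool: str      # the acting entity (gripper, or an object already held)
    target: str    # the entity acted upon
    motion: str    # which task-conditioned motion affordance to predict

# "Pour water" as basic affordance operations. Each step runs the same
# primitive -- ground regions, predict motion, execute -- so diverse
# interactions share one code path.
pour_water = [
    AffordanceOp(tool="gripper", target="kettle handle", motion="grasp"),
    AffordanceOp(tool="kettle",  target="cup",           motion="pour"),
    AffordanceOp(tool="gripper", target="table",         motion="place"),
]

for op in pour_water:
    print(f"{op.tool} -> {op.target}: {op.motion}")
```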

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Large collections of internet human videos could become practical training sources for robots at scale.
  • The same bridging idea might transfer skills between different robot hardware without per-robot data collection.
  • Adding handling for dynamic obstacles or contact-rich interactions would test the limits of the closed-loop controller.

Load-bearing premise

Affordance regions and 3D motion affordances extracted from human videos can be mapped accurately to robot actions by a grasping module and closed-loop controller without robot-specific data or adaptation.

What would settle it

A real-robot trial in which affordance predictions from human videos are accurate yet the physical grasps or motions repeatedly fail to match the intended task would show the mapping step is insufficient.

Figures

Figures reproduced from arXiv: 2604.23249 by Haoyu Zhang, Jianxiang Liu, Wenzhao Lian, Yifan Han, Yunhan Guo, Yuqi Gu.

Figure 1: Overview of BridgeACT. BridgeACT learns role-conditioned tool-target affordances from human videos without robot demonstrations. It identifies task-relevant operable regions, assigns functional tool-target roles, and predicts executable 3D interaction dynamics for real-robot manipulation. The framework supports both single-object and object-to-object interactions across diverse manipulation scenarios. Abst…

Figure 2: Pipeline of BridgeACT. (a) From raw human videos, we automatically construct motion point-flow training data through task understanding, object-action annotation, 2D segmentation, and 3D trajectory generation. (b) Given a task and the current scene, a task-conditioned affordance grounding agent localizes the task-relevant tool and target regions and samples 3D query points. (c) The grounded scene, query p…

Figure 3: Partial visualization of motion affordances for representative tasks: open oven (left), pour water (middle), and cut fruit (right). The predicted 3D…

Figure 4: Trajectory visualization comparing our method with General Flow.

Figure 5: Trajectory visualization of our model under Cross-Object and Cross…
Original abstract

Learning robot manipulation from human videos is appealing due to the scale and diversity of human demonstrations, but transferring such demonstrations to executable robot behavior remains challenging. Prior work either relies on robot data for downstream adaptation or learns affordance representations that remain at the perception level and do not directly support real-world execution. We present BridgeACT, an affordance-driven framework that learns robotic manipulation directly from human videos without requiring any robot demonstration data. Our key idea is to model affordance as an embodiment-agnostic intermediate representation that bridges human demonstrations and robot actions. BridgeACT decomposes manipulation into two complementary problems: where to grasp and how to move. To this end, BridgeACT first grounds task-relevant affordance regions in the current scene, and then predicts task-conditioned 3D motion affordances from human demonstrations. The resulting affordances are mapped to robot actions through a grasping module and a lightweight closed-loop motion controller, enabling direct deployment on real robots. In addition, we represent complex manipulation tasks as compositions of affordance operations, which allows a unified treatment of diverse tasks and object-to-object interactions. Experiments on real-world manipulation tasks show that BridgeACT outperforms prior baselines and generalizes to unseen objects, scenes, and viewpoints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BridgeACT, an affordance-driven framework for learning robotic manipulation directly from human videos without any robot demonstration data. It models affordances as embodiment-agnostic intermediates by grounding task-relevant regions in the scene and predicting task-conditioned 3D motion affordances from human demonstrations. These are mapped to executable robot actions via a grasping module and lightweight closed-loop controller, with complex tasks represented as compositions of affordance operations for unified handling of diverse manipulations and object interactions. The abstract claims outperformance over baselines and generalization to unseen objects, scenes, and viewpoints on real-world tasks.

Significance. If the central claims hold with rigorous validation, this would represent a meaningful advance in scalable robot learning by eliminating the need for robot-specific demonstration data and leveraging abundant human videos. The embodiment-agnostic affordance decomposition and compositional task representation are conceptually strong for improving generalization. The paper explicitly credits the use of external human video data as grounding, avoiding self-referential definitions.

major comments (2)
  1. [Abstract] Abstract: The claim that 'Experiments on real-world manipulation tasks show that BridgeACT outperforms prior baselines and generalizes to unseen objects, scenes, and viewpoints' is made without any metrics, baselines, success rates, or experimental protocol details. This is load-bearing for the central claim of superiority and generalization, as the evaluation cannot be assessed from the provided text.
  2. [Method] Approach description: The mapping step from extracted affordances to robot actions relies on an unspecified 'grasping module' and 'lightweight closed-loop motion controller' with no details on whether these incorporate learned components trained on robot data, robot kinematics, per-robot calibration, or any form of robot-specific adaptation. This directly affects the key assertion of learning 'directly from human videos without requiring any robot demonstration data,' as any implicit robot data here would undermine the guarantee.
minor comments (1)
  1. [Abstract] The title references 'Unified Tool-Target Affordances' but the abstract does not explicitly define or distinguish tool versus target affordances; a short clarification in the introduction would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the presentation of our claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'Experiments on real-world manipulation tasks show that BridgeACT outperforms prior baselines and generalizes to unseen objects, scenes, and viewpoints' is made without any metrics, baselines, success rates, or experimental protocol details. This is load-bearing for the central claim of superiority and generalization, as the evaluation cannot be assessed from the provided text.

    Authors: We agree that the abstract, as a high-level summary, does not include specific quantitative metrics or protocol details, which can make the performance claims harder to evaluate at a glance. The full manuscript provides these in the Experiments section, including success rates, baseline comparisons, and evaluation protocols across real-world tasks. To address this directly, we will revise the abstract to concisely incorporate key results (such as average success rates and generalization metrics) while preserving its brevity. This revision will better substantiate the claims without changing the underlying findings. revision: yes

  2. Referee: [Method] Approach description: The mapping step from extracted affordances to robot actions relies on an unspecified 'grasping module' and 'lightweight closed-loop motion controller' with no details on whether these incorporate learned components trained on robot data, robot kinematics, per-robot calibration, or any form of robot-specific adaptation. This directly affects the key assertion of learning 'directly from human videos without requiring any robot demonstration data,' as any implicit robot data here would undermine the guarantee.

    Authors: This is a valid concern for ensuring the embodiment-agnostic nature of our approach. The grasping module is a non-learned, geometry-driven component that directly uses the predicted task-relevant 3D affordance regions and standard point-cloud processing to compute grasp poses, with no training on robot demonstration data. The lightweight closed-loop motion controller executes the task-conditioned 3D motion affordances via simple feedback control (leveraging robot forward kinematics and real-time visual feedback) without any learned robot-specific components, demonstration data, or extensive per-robot calibration beyond standard deployment setup. We will revise the Method section to include explicit descriptions, implementation details, and pseudocode for these modules to unambiguously confirm that no robot demonstration data is involved, thereby reinforcing the core claim. revision: yes
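
As a sanity check on what a "lightweight closed-loop motion controller" could look like under the rebuttal's description, the sketch below tracks a predicted 3D waypoint trajectory with plain proportional feedback. It is a toy: position commands act directly on a point plant, whereas a real deployment would route the commanded displacement through the robot's kinematics and take the observed pose from visual feedback.

```python
"""Toy closed-loop waypoint follower (our reading, not the authors' code)."""
import numpy as np

def follow_trajectory(waypoints, ee_pos, kp=0.5, tol=0.005, max_steps=500):
    """waypoints: (T, 3) predicted motion affordance; ee_pos: current (3,)."""
    ee = ee_pos.copy()
    for wp in waypoints:
        for _ in range(max_steps):
            err = wp - ee                  # feedback from the observed pose
            if np.linalg.norm(err) < tol:  # waypoint reached, move to next
                break
            ee = ee + kp * err             # toy plant applies the command
                                           # directly; a real robot would go
                                           # through inverse kinematics
    return ee

traj = np.linspace([0.4, 0.0, 0.1], [0.4, 0.0, 0.3], 20)   # a 20 cm lift
final = follow_trajectory(traj, ee_pos=np.array([0.35, 0.05, 0.08]))
print(np.round(final, 3))                  # converges near [0.4, 0.0, 0.3]
```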

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes a framework that extracts embodiment-agnostic affordance regions and task-conditioned 3D motion affordances from human videos, then maps them to robot actions via a grasping module and closed-loop controller. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claim rests on external human video data and standard affordance concepts rather than reducing to a tautology or self-citation chain. The mapping step is described at a high level without introducing fitted inputs called predictions or ansatzes smuggled via prior self-work. This is a normal non-circular case for a systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that affordance is a transferable, embodiment-independent representation that can be learned from human video and executed on robots without further tuning.

axioms (1)
  • domain assumption: Affordance can be modeled as an embodiment-agnostic intermediate representation bridging human demonstrations and robot actions.
    Invoked as the key idea that enables learning without robot data.

pith-pipeline@v0.9.0 · 5533 in / 1073 out tokens · 20397 ms · 2026-05-08T07:50:25.296799+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1] Hoi4d: A 4d egocentric dataset for category-level human-object interaction
     Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi, “Hoi4d: A 4d egocentric dataset for category-level human-object interaction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21013–21022.

  2. [2] Scaling egocentric vision: The epic-kitchens dataset
     D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., “Scaling egocentric vision: The epic-kitchens dataset,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 720–736.

  3. [3] Any-point trajectory modeling for policy learning
     C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel, “Any-point trajectory modeling for policy learning,” arXiv preprint arXiv:2401.00025, 2023.

  4. [4] General flow as foundation affordance for scalable robot learning
     C. Yuan, C. Wen, T. Zhang, and Y. Gao, “General flow as foundation affordance for scalable robot learning,” arXiv preprint arXiv:2401.11439, 2024.

  5. [5] RT-1: Robotics Transformer for Real-World Control at Scale
     A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.

  6. [6] π0.5: a Vision-Language-Action Model with Open-World Generalization
     Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai et al., “π0.5: A vision-language-action model with open-world generalization,” arXiv preprint arXiv:2504.16054, 2025.

  7. [7] OpenVLA: An Open-Source Vision-Language-Action Model
     M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi et al., “OpenVLA: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024.

  8. [8] Uad: Unsupervised affordance distillation for generalization in robotic manipulation
     Y. Tang, W. Huang, Y. Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei, “Uad: Unsupervised affordance distillation for generalization in robotic manipulation,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 3822–3831.

  9. [9] Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation
     H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani, “Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation,” in European Conference on Computer Vision. Springer, 2024, pp. 306–324.

  10. [10] Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation
     W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” arXiv preprint arXiv:2409.01652, 2024.

  11. [11] Vlm see, robot do: Human demo video to robot action plan via vision language model
     B. Wang, J. Zhang, S. Dong, I. Fang, and C. Feng, “Vlm see, robot do: Human demo video to robot action plan via vision language model,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 17215–17222.

  12. [12] Policy adaptation via language optimization: Decomposing tasks for few-shot imitation
     V. Myers, B. C. Zheng, O. Mees, S. Levine, and K. Fang, “Policy adaptation via language optimization: Decomposing tasks for few-shot imitation,” arXiv preprint arXiv:2408.16228, 2024.

  13. [13] Tracegen: World modeling in 3d trace space enables learning from cross-embodiment videos
     S. Lee, Y. Jung, I. Chun, Y.-C. Lee, Z. Cai, H. Huang, A. Talreja, T. D. Dao, Y. Liang, J.-B. Huang, and F. Huang, “Tracegen: World modeling in 3d trace space enables learning from cross-embodiment videos,” 2025.

  14. [14] Pre-training auto-regressive robotic models with 4d representations
     D. Niu, Y. Sharma, H. Xue, G. Biamby, J. Zhang, Z. Ji, T. Darrell, and R. Herzig, “Pre-training auto-regressive robotic models with 4d representations,” arXiv preprint arXiv:2502.13142, 2025.

  15. [15] Flow as the cross-domain manipulation interface
     M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, “Flow as the cross-domain manipulation interface,” arXiv preprint arXiv:2407.15208, 2024.

  16. [16] Correspondence-oriented imitation learning: Flexible visuomotor control with 3d conditioning
     Y. Cao, Z. Bhaumik, J. Jia, X. He, and K. Fang, “Correspondence-oriented imitation learning: Flexible visuomotor control with 3d conditioning,” 2025.

  17. [17] Pointworld: Scaling 3d world models for in-the-wild robotic manipulation
     W. Huang, Y.-W. Chao, A. Mousavian, M.-Y. Liu, D. Fox, K. Mo, and L. Fei-Fei, “Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,” 2026.

  18. [18] The epic-kitchens dataset: Collection, challenges and baselines
     D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., “The epic-kitchens dataset: Collection, challenges and baselines,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 4125–4141, 2020.

  19. [19] Qwen3-VL Technical Report
     S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., “Qwen3-vl technical report,” arXiv preprint arXiv:2511.21631, 2025.

  20. [20] SAM 3: Segment Anything with Concepts
     N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., “Sam 3: Segment anything with concepts,” arXiv preprint arXiv:2511.16719, 2025.

  21. [21] Cotracker3: Simpler and better point tracking by pseudo-labelling real videos
     N. Karaev, Y. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker3: Simpler and better point tracking by pseudo-labelling real videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 6013–6022.

  22. [22] Point transformer v3: Simpler faster stronger
     X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, and H. Zhao, “Point transformer v3: Simpler faster stronger,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4840–4851.

  23. [23] Pointnext: Revisiting pointnet++ with improved training and scaling strategies
     G. Qian, Y. Li, H. Peng, J. Mai, H. Hammoud, M. Elhoseiny, and B. Ghanem, “Pointnext: Revisiting pointnet++ with improved training and scaling strategies,” Advances in Neural Information Processing Systems, vol. 35, pp. 23192–23204, 2022.

  24. [24] Learning transferable visual models from natural language supervision
     A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.

  25. [25] Attention is all you need
     A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

  26. [26] Efficient diffusion training via min-snr weighting strategy
     T. Hang, S. Gu, C. Li, J. Bao, D. Chen, H. Hu, X. Geng, and B. Guo, “Efficient diffusion training via min-snr weighting strategy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7441–7451.

  27. [27] Roboengine: Plug-and-play robot data augmentation with semantic robot segmentation and background generation
     C. Yuan, S. Joshi, S. Zhu, H. Su, H. Zhao, and Y. Gao, “Roboengine: Plug-and-play robot data augmentation with semantic robot segmentation and background generation,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 7622–7629.

  28. [28] Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting
     W. Bao, L. Chen, L. Zeng, Z. Li, Y. Xu, J. Yuan, and Y. Kong, “Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13702–13711.

  29. [29] Joint hand motion and interaction hotspots prediction from egocentric videos
     S. Liu, S. Tripathi, S. Majumdar, and X. Wang, “Joint hand motion and interaction hotspots prediction from egocentric videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3282–3292.