pith. machine review for the scientific record.

arxiv: 2605.07943 · v1 · submitted 2026-05-08 · 💻 cs.RO · cs.AI · cs.CV · cs.LG


TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

Giacomo Spigler

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:59 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords active vision · imitation learning · anticipatory gaze · benchmark · egocentric vision · manipulation · distribution shift · humanoid robots

The pith

The TAVIS benchmark shows that active vision improves imitation learning in a task-dependent manner, while imitation alone produces anticipatory gaze that matches human timing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TAVIS as evaluation infrastructure for active-vision imitation learning on two humanoid embodiments, with TAVIS-Head tasks using pan/tilt necks for global search and TAVIS-Hands tasks using wrist cameras to see past local occlusions. It supplies a paired headcam-versus-fixedcam protocol on the same demonstrations, the GALT metric to measure how far in advance policies direct gaze before acting, and procedural ID/OOD splits. Baseline runs with Diffusion Policy and π0 establish three results: active vision yields performance gains that vary by task rather than appearing uniformly, multi-task policies decline sharply under controlled shifts on both suites, and policies trained purely by imitation develop anticipatory gaze whose median lead time approaches that of the human teleoperator reference. Together these elements allow systematic measurement of when, and by how much, controlling gaze contributes in egocentric manipulation.

Core claim

TAVIS establishes that active vision generally helps imitation learning for manipulation but that its benefits are task-conditional rather than uniform, that multi-task policies degrade sharply under controlled distribution shifts on both suites, and that imitation alone yields anticipatory gaze with median lead times comparable to the human teleoperator reference.

What carries the argument

TAVIS benchmark infrastructure, consisting of the paired headcam-vs-fixedcam protocol on identical demonstrations, the GALT (Gaze-Action Lead Time) metric, and procedural ID/OOD splits applied to the TAVIS-Head and TAVIS-Hands task suites.
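
To make the metric concrete: a minimal sketch of a GALT-style computation, assuming gaze-fixation onsets and the actions they precede have already been paired per subgoal. The paper's exact event-detection rules are not given in the text above, so the pairing and the function name are assumptions.

```python
import numpy as np

def gaze_action_lead_times(gaze_onsets, action_onsets):
    """Lead time between gaze arriving at a target and the paired action.

    gaze_onsets, action_onsets: event times in seconds, aligned so that
    index i pairs a gaze fixation with the action it precedes.
    Positive values mean gaze led the action (anticipatory gaze).
    """
    gaze = np.asarray(gaze_onsets, dtype=float)
    act = np.asarray(action_onsets, dtype=float)
    return act - gaze

# Example: gaze reached each subgoal 0.4-1.2 s before the hand acted.
leads = gaze_action_lead_times([1.0, 4.2, 7.5], [1.4, 5.4, 8.3])
print(np.median(leads))  # median GALT for this episode, in seconds
```

Aggregating such per-episode lead times across rollouts would yield the per-task distributions shown in Figure 3.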

If this is right

  • Active vision provides performance gains that depend on task type rather than applying uniformly across manipulation settings.
  • Multi-task imitation policies experience sharp degradation when encountering controlled distribution shifts in active-vision conditions.
  • Imitation training from demonstrations alone is sufficient to produce anticipatory gaze whose timing matches human teleoperator references.
  • The paired protocol and GALT metric together allow direct quantification of how much active vision contributes on each task (a minimal sketch follows this list).
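
A minimal sketch of that paired quantification. Task names and success rates below are hypothetical; the point is that the protocol holds demonstrations fixed and varies only the camera condition, so per-task deltas are attributable to gaze control.

```python
import numpy as np

# Hypothetical per-task success rates under the paired protocol: the same
# demonstrations train a headcam (active-vision) policy and a fixedcam
# policy, so the per-task delta isolates what controlling gaze contributes.
tasks = ["search", "clutter", "reach", "stack", "sort"]  # hypothetical names
sr_headcam = np.array([0.82, 0.64, 0.71, 0.55, 0.60])
sr_fixedcam = np.array([0.41, 0.58, 0.69, 0.52, 0.33])

for task, gain in zip(tasks, sr_headcam - sr_fixedcam):
    print(f"{task}: active-vision gain = {gain:+.2f}")

# A wide spread in these gains, rather than a uniform shift, is what the
# paper's task-conditional finding looks like in this form.
```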

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If gaze anticipation emerges from imitation, then policies may implicitly learn useful viewpoint prediction as part of action forecasting in embodied settings.
  • The task-conditional nature of benefits suggests that future policies could incorporate mechanisms to decide dynamically whether to move the camera.
  • Testing the same primitives on physical hardware rather than simulation would reveal whether latency and sensor noise alter the observed advantages.
  • Adding tasks that require longer-horizon planning could show whether anticipatory gaze scales beyond the current short-horizon manipulation suites.

Load-bearing premise

The selected tasks, embodiments, and distribution shifts in TAVIS-Head and TAVIS-Hands sufficiently represent the real challenges and benefits of active vision in imitation learning for manipulation.

What would settle it

Running the same baselines on a new set of manipulation tasks outside TAVIS and finding either uniform benefits or no benefits at all from active vision would falsify the task-conditional claim.

Figures

Figures reproduced from arXiv: 2605.07943 by Giacomo Spigler.

Figure 1. The TAVIS Benchmark. TAVIS comprises two task suites that isolate distinct roles of active vision in manipulation. TAVIS-Head targets global active vision – head reorientation for search and to handle clutter – while TAVIS-Hands targets local active vision via wrist cameras peering past occlusions. Demonstrations are collected via first-person Meta Quest 3 teleoperation with gaze control through head movem… view at source ↗
Figure 2. TAVIS results overview. Aggregated multi-task π0 success rates across the four main evaluation cuts of Section 5. Bars: suite-mean SR (per-task, averaged over robots); coloured dots: per-task points; thin lines: paired conditions per task. (A) Q1, active vision: head-vs-fixed on TAVIS-Head, and head + wrist SR on TAVIS-Hands (no fixed-cam variant by design). (B) Q2, multi-task scaling: single-task checkpoin… view at source ↗
Figure 3. GALT (Gaze-Action Lead Time) distributions per TAVIS-Head task: multi-task π0 policy vs human-teleoperator reference. Solid curves are the multi-task π0 headcam-policy GALT distribution per robot (GR1T2 blue, Reachy2 orange); dashed curves are the human teleoperation reference at the dataset's native 60 Hz. Light shaded histograms behind each curve use 20 equal-width bins on [−0.5, 3.5] s. Filled triangles … view at source ↗
Figure 4. Initial-state distributions: teleop dataset, id eval reset, and ood-init-pose perturbation. Rows: robot (GR1T2 top, Reachy2 bottom). Columns: end-effector position x, |y|, z (metres), and neck pitch, yaw (degrees). Histograms and KDEs compare the frame-0 distribution in the teleoperation dataset (green) with the ood-init-pose eval distribution (red, σ_pos = 0.1 m and σ_head = 0.175 rad ≈ 10°); the determini… view at source ↗
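
The Figure 4 caption fixes the ood-init-pose perturbation numerically, which admits a simple sampler. A minimal sketch under that reading: only the sigmas come from the caption; the reset interface, variable names, and example pose are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Sigmas from the Figure 4 caption; everything else here is assumed.
SIGMA_POS = 0.1     # metres, noise on the frame-0 end-effector position
SIGMA_HEAD = 0.175  # radians (~10 degrees), noise on neck pitch/yaw

def ood_init_pose(ee_pos_xyz, neck_pitch_yaw):
    """Perturb a deterministic reset pose into an ood-init-pose sample."""
    ee = np.asarray(ee_pos_xyz, dtype=float) + rng.normal(0.0, SIGMA_POS, size=3)
    neck = np.asarray(neck_pitch_yaw, dtype=float) + rng.normal(0.0, SIGMA_HEAD, size=2)
    return ee, neck

ee, neck = ood_init_pose([0.35, 0.0, 0.95], [0.2, 0.0])  # hypothetical pose
```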
read the original abstract

Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $\pi_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces TAVIS, a benchmark for egocentric active vision and anticipatory gaze in imitation learning. It features two task suites: TAVIS-Head (5 tasks with global search via pan/tilt necks) and TAVIS-Hands (3 tasks with local occlusion via wrist cameras) on GR1T2 and Reachy2 humanoids in IsaacLab. The benchmark includes a paired headcam-vs-fixedcam protocol on identical demonstrations, the novel GALT metric for quantifying anticipatory gaze lead times, and procedural ID/OOD splits. Baseline experiments using Diffusion Policy and π0 show that active vision provides task-conditional benefits, multi-task policies degrade under distribution shifts, and imitation learning produces anticipatory gaze with median lead times similar to human teleoperators. Code, data, and models are released.

Significance. If the results hold, TAVIS offers a much-needed standardized evaluation platform for active-vision approaches in imitation learning, filling a gap in the field. The open release of ~2200 episodes, evaluation scripts, and baselines promotes reproducibility and comparison. The GALT metric, grounded in cognitive science and HRI, provides a new way to measure anticipatory behavior. The findings on task-conditional benefits and multi-task degradation highlight important considerations for policy design. The limited task set means broader significance depends on how representative these scenarios are.

major comments (2)
  1. The claim that active-vision 'generally helps' is based on experiments with 8 tasks. This may overstate the generality given the specific embodiments and procedural splits; the task-conditional benefits are interesting but their broader implications require more qualification in the abstract.
  2. Limited details are provided on the number of runs, statistical tests, and data exclusion rules supporting the three findings. This weakens the strength of the empirical claims and should be expanded for reproducibility and confidence in the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comments below and have updated the manuscript to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: The claim that active-vision 'generally helps' is based on experiments with 8 tasks. This may overstate the generality given the specific embodiments and procedural splits; the task-conditional benefits are interesting but their broader implications require more qualification in the abstract.

    Authors: We agree that the abstract should more explicitly qualify the generality of the findings. Although the original text already notes that benefits are 'task-conditional rather than uniform', we have revised the abstract to state: 'active vision provides task-conditional benefits to imitation learning' and removed the 'generally helps' phrasing to avoid any overstatement. We have also added a qualification in the introduction and discussion sections emphasizing that these results are based on the specific 8 tasks, two embodiments, and procedural splits, and that broader implications would require further validation. The task-dependent nature remains the key insight supported by the data. revision: yes

  2. Referee: Limited details are provided on the number of runs, statistical tests, and data exclusion rules supporting the three findings. This weakens the strength of the empirical claims and should be expanded for reproducibility and confidence in the results.

    Authors: We thank the referee for pointing this out. The original manuscript did not include sufficient experimental details. We have now expanded the 'Experiments' section and added a dedicated 'Reproducibility' subsection detailing: (1) all results are averaged over 5 independent runs with different random seeds; (2) statistical comparisons between headcam and fixedcam use paired t-tests with p < 0.05 for significance; (3) no episodes were excluded from the analysis—all ~2200 demonstrations were utilized. These details are provided to support the three main findings and enhance confidence in the results. revision: yes
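
A minimal sketch of the analysis this response describes, for readers checking the setup: the per-seed success rates below are hypothetical, while the 5-seed averaging and the p < 0.05 threshold come from the rebuttal text.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed success rates for one task (5 seeds, as described).
headcam = np.array([0.78, 0.81, 0.75, 0.83, 0.79])
fixedcam = np.array([0.52, 0.60, 0.55, 0.58, 0.50])

t_stat, p_value = stats.ttest_rel(headcam, fixedcam)  # paired t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("headcam vs fixedcam difference significant at p < 0.05")
```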

Circularity Check

0 steps flagged

No circularity; empirical benchmark with direct comparisons

full rationale

This is an empirical benchmark paper introducing TAVIS task suites, paired headcam-vs-fixedcam protocols, the GALT metric, and procedural ID/OOD splits, followed by baseline experiments on Diffusion Policy and π0. No derivations, equations, fitted parameters, or predictions appear in the provided text or abstract; all claims rest on released code, data (~2200 episodes), and direct experimental measurements rather than any self-definitional, fitted-input, or self-citation reduction. The central observations (task-conditional benefits, multi-task degradation, and human-comparable lead times) are presented as outcomes of those comparisons, with no load-bearing step that reduces by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that the chosen tasks capture relevant active vision challenges and that GALT provides a meaningful measure of anticipation; no free parameters are fitted to support the main findings.

axioms (1)
  • domain assumption The five TAVIS-Head and three TAVIS-Hands tasks, along with the chosen distribution shifts, represent meaningful and generalizable challenges for egocentric active vision in manipulation.
    Invoked to interpret the baseline results as evidence of task-conditional benefits.
invented entities (1)
  • GALT (Gaze-Action Lead Time) metric no independent evidence
    purpose: Quantifies anticipatory gaze by measuring lead time between gaze movement and action in learned policies.
    Newly defined metric grounded in cognitive science and HRI; no independent evidence provided beyond the benchmark itself.

pith-pipeline@v0.9.0 · 5592 in / 1478 out tokens · 52607 ms · 2026-05-11T02:59:32.476813+00:00 · methodology

discussion (0)


