SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
Pith reviewed 2026-05-12 04:04 UTC · model grok-4.3
The pith
The SABER dataset, built from natural retail footage, more than doubles robot success rates on manipulation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SABER is a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple grocery environments, yielding 44.8K training samples in three streams: 25K latent action sequences, 18.6K dexterous hand-pose trajectories, and 1.2K whole-body motion sequences. When used to post-train GR00T N1.6 via a shared-backbone multi-task recipe, it produces a mean success rate of 29.3 percent on ten retail manipulation tasks, more than 2.19 times the 13.4 percent achieved by fine-tuning baselines. The paper states that this shows capable retail robots can be built from better data collected today at scale without a robot in the loop.
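As a quick sanity check on the headline ratio (a minimal sketch; only the 29.3 and 13.4 percent figures come from the abstract):

    # Reported mean success rates (as fractions) from the SABER abstract.
    saber_post_trained = 0.293   # GR00T N1.6 post-trained on SABER
    finetune_baseline = 0.134    # fine-tuning baseline without SABER

    improvement = saber_post_trained / finetune_baseline
    print(f"improvement factor: {improvement:.2f}x")  # ~2.19x, matching the quoted figure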
What carries the argument
The dual egocentric-exocentric camera setup that simultaneously records fine-grained hand activity at the point of interaction and full-scene dynamics, paired with retargeting of human actions into three robot-compatible representation streams.
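A minimal sketch of how one synchronized training sample could be laid out under that reading, assuming the three streams named in the abstract; every field name and shape below is hypothetical, not the paper's actual schema:

    from dataclasses import dataclass
    from typing import Optional
    import numpy as np

    @dataclass
    class RetailCaptureSample:
        """One synchronized ego/exo capture window (hypothetical layout)."""
        ego_frames: np.ndarray             # (T, H, W, 3) head-mounted egocentric video
        exo_frames: np.ndarray             # (T, H2, W2, 3) 360-degree exocentric scene video
        latent_actions: np.ndarray         # (T, d) LAPA-style latent action codes
        hand_joint_traj: np.ndarray        # (T, J) hand poses retargeted to robot joint space
        body_motion: Optional[np.ndarray]  # (T, B) whole-body motion retargeted to a humanoid
        instruction: str                   # natural-language task description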
If this is right
- Retail manipulation tasks become trainable from natural human behavior without teleoperation or robots present during data collection.
- A single shared-backbone model can adapt to multiple action representations from the same underlying captures.
- Whole-body and dexterous streams enable embodiment-specific retargeting while sharing the same source footage.
- Data collection for domain-specific robot adaptation can scale to new environments using existing camera hardware.
Where Pith is reading between the lines
- The egocentric-plus-exocentric capture method may generalize to other unstructured settings where context and fine motor details both matter.
- Combining multiple action streams in one dataset could support testing how well models transfer across different robot hardware without new recordings.
- If the natural capture approach works here, similar low-overhead methods might reduce reliance on scripted or simulated data in broader embodied AI work.
Load-bearing premise
The performance gains are caused by the SABER dataset and its action representations rather than the post-training recipe, model choice, or the specific selection of ten tasks.
What would settle it
Retraining the identical GR00T N1.6 model on an equal-sized set of teleoperated or simulated retail data using the same multi-task recipe and measuring whether success rates reach or exceed 29.3 percent.
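A hedged sketch of that control, written against generic training and evaluation callables; none of the names below come from the paper or the GR00T codebase:

    from typing import Callable, Dict, Sequence

    def controlled_comparison(
        train: Callable[[object], object],         # fixed post-training recipe on the same backbone
        evaluate: Callable[[object, str], float],  # returns a success rate for one task
        datasets: Dict[str, object],               # e.g. {"SABER": ..., "matched teleop/sim data": ...}
        tasks: Sequence[str],
    ) -> Dict[str, float]:
        """Hold the model and recipe fixed; vary only the post-training data."""
        mean_success = {}
        for name, data in datasets.items():
            policy = train(data)
            rates = [evaluate(policy, task) for task in tasks]
            mean_success[name] = sum(rates) / len(rates)  # mean over the ten retail tasks
        return mean_success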
Original abstract
Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SABER, a dataset of 44.8K action samples collected from over 100 hours of natural, unscripted in-store footage in real grocery environments using synchronized egocentric head-mounted cameras and exocentric 360° scene cameras. The corpus is organized into three streams (25K LAPA-style latent action sequences, 18.6K retargeted dexterous hand-pose trajectories, and 1.2K whole-body motion sequences) and is used for shared-backbone multi-task post-training of GR00T N1.6, yielding a reported mean success rate of 29.3% across ten retail manipulation tasks—more than 2.19× the 13.4% achieved by fine-tuning baselines. The work argues that scalable, non-teleoperated real-world action data is the key missing ingredient for deploying generalist VLAs in retail settings.
Significance. If the reported gains prove robust and causally attributable to the SABER data rather than training-recipe details, the contribution would be substantial: it supplies a concrete, publicly released pathway for adapting foundation models to a high-value but previously underrepresented domain (retail manipulation) without requiring robot-in-the-loop teleoperation. The dual egocentric/exocentric capture strategy and multi-stream action representations address a genuine data gap in current VLA pretraining distributions.
major comments (2)
- [Abstract] Abstract: The central quantitative claim (29.3% mean success rate, 2.19× improvement over 13.4% baselines) is stated without any information on evaluation protocol, number of trials per task, variance or standard deviation, task definitions, baseline training hyperparameters, or statistical tests. This omission renders the primary empirical result impossible to assess for reliability or reproducibility.
- [Results] Results / Methods: No ablation experiments are described that hold the shared-backbone multi-task post-training recipe fixed while varying only the addition of the 44.8K SABER samples (or that compare against matched non-retail data of similar volume). Without such controls, it is impossible to attribute the performance lift specifically to the SABER dataset and its action representations rather than to the post-training procedure, the GR00T N1.6 backbone, or the particular choice and definition of the ten retail tasks.
minor comments (2)
- [Abstract] The abstract would benefit from explicitly stating the total capture duration (over 100 hours) alongside the sample count to better convey scale.
- Dataset release link is given, but the manuscript should include a brief summary table of the three action streams with exact sample counts and retargeting details for quick reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below and outline the revisions we will make to strengthen the presentation of our empirical results.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central quantitative claim (29.3% mean success rate, 2.19× improvement over 13.4% baselines) is stated without any information on evaluation protocol, number of trials per task, variance or standard deviation, task definitions, baseline training hyperparameters, or statistical tests. This omission renders the primary empirical result impossible to assess for reliability or reproducibility.
Authors: We agree that the abstract, due to its brevity, omits these supporting details. The full manuscript describes the evaluation protocol, task definitions, and baseline setup in the Results and Methods sections. In the revised version, we will expand the abstract with a concise clause noting the evaluation scale (e.g., 'mean success rate over multiple trials per task with standard deviations; see Methods for protocol and hyperparameters') while respecting length constraints. This will improve assessability and reproducibility without changing the reported numbers; a minimal sketch of such per-task reporting follows these responses. revision: yes
-
Referee: [Results] Results / Methods: No ablation experiments are described that hold the shared-backbone multi-task post-training recipe fixed while varying only the addition of the 44.8K SABER samples (or that compare against matched non-retail data of similar volume). Without such controls, it is impossible to attribute the performance lift specifically to the SABER dataset and its action representations rather than to the post-training procedure, the GR00T N1.6 backbone, or the particular choice and definition of the ten retail tasks.
Authors: The 13.4% baseline reflects fine-tuning of the identical GR00T N1.6 model and shared-backbone multi-task post-training recipe without SABER data, which holds the recipe fixed while varying only the data. We acknowledge that additional controls would further isolate the contribution of the retail-specific action representations. In the revision, we will add an ablation subsection comparing SABER against an equivalent volume of non-retail data drawn from existing public sources, and we will explicitly clarify the baseline configuration in the text to make the attribution clearer. revision: yes
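For context on the evaluation-reporting point above, a minimal sketch of per-task statistics under binary success/failure trials; the trial counts and the normal-approximation interval are illustrative, not drawn from the paper:

    import math

    def summarize_task(successes: int, trials: int) -> dict:
        """Mean success rate with a normal-approximation 95% interval."""
        p = successes / trials
        se = math.sqrt(p * (1 - p) / trials)  # standard error of a binomial proportion
        return {"rate": p, "stderr": se, "ci95": (p - 1.96 * se, p + 1.96 * se)}

    # Example: a 29.3% aggregate can hide very different per-task spreads,
    # which is why per-task trial counts and variance matter.
    print(summarize_task(successes=15, trials=50))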
Circularity Check
No circularity: purely empirical dataset and evaluation
Full rationale
The paper introduces an empirical dataset (SABER) collected from real-world captures and reports measured success rates on ten retail tasks when used for post-training of GR00T N1.6. No mathematical derivations, equations, fitted parameters presented as predictions, or self-referential definitions appear in the abstract or described methods. Performance numbers (29.3% mean success, 2.19x over 13.4% baselines) are obtained via direct experimentation against external baselines rather than by construction from the dataset inputs themselves. The work contains no load-bearing steps that reduce to self-definition, self-citation chains, or renamed known results.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
AGIBOT World Challenge at ICRA 2026: Reasoning-to-action and world model tracks
AGIBOT Research. AGIBOT World Challenge at ICRA 2026: Reasoning-to-action and world model tracks. https://huggingface.co/datasets/agibot-world/AgiBotWorldChallenge-2026, 2026
work page 2026
-
[2]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
work page 2024
-
[3]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
work page 2023
-
[4]
Alia: System and method for capturing omni-stereo videos using multi-sensors
DreamVu Inc. Alia: System and method for capturing omni-stereo videos using multi-sensors. https://patents.google.com/patent/US11025888B2/en, 2021
work page 2021
-
[5]
Alia: System and method for capturing omni-stereo videos using multi-sensors
DreamVu Inc. Alia: System and method for capturing omni-stereo videos using multi-sensors. https://patents.google.com/patent/US11523101B2/en, 2022
work page 2022
-
[6]
Helix: A vision-language-action model for generalist humanoid control
Figure AI. Helix: A vision-language-action model for generalist humanoid control. Technical Report, 2025
work page 2025
-
[7]
Ego4D: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
work page 2022
-
[8]
Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Shyamal Bansal, Bryce Boote, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[9]
Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system
Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020
work page 2020
-
[10]
AdaFlow: Imitation learning with variance-adaptive flow-based policies
Xixi Hu, Bo Liu, Xingchao Liu, and Qiang Liu. AdaFlow: Imitation learning with variance-adaptive flow-based policies. Advances in Neural Information Processing Systems, 37:138836–138858, 2024
work page 2024
-
[11]
JALA: Joint-aligned latent action learning for cross-embodiment robot policy training
JALA Authors. JALA: Joint-aligned latent action learning for cross-embodiment robot policy training. arXiv preprint arXiv:2602.21736, 2026
-
[12]
EgoMimic: Scaling imitation learning through egocentric video
Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning through egocentric video. arXiv preprint arXiv:2410.24221, 2024
-
[13]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page 2024
-
[14]
LIBERO-PRO: Probing the robustness of vision-language-action models under distribution shift
LIBERO-PRO Authors. LIBERO-PRO: Probing the robustness of vision-language-action models under distribution shift. arXiv preprint, 2025
work page 2025
-
[15]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
work page 2022
-
[16]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022
work page 2022
-
[17]
SMPL: A skinned multi-person linear model
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015
work page 2015
-
[18]
Cosmos-Reason1: From physical world understanding to embodied reasoning
NVIDIA Research. Cosmos-Reason1: From physical world understanding to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025
-
[19]
DreamGen: Unlocking generalization in robot learning through neural trajectory generation
NVIDIA Research. DreamGen: Unlocking generalization in robot learning through neural trajectory generation. arXiv preprint, 2025
work page 2025
-
[20]
GR00T N1: A generalist foundation model for humanoid robots
NVIDIA Research. GR00T N1: A generalist foundation model for humanoid robots. arXiv preprint, 2025
work page 2025
-
[21]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023
work page 2023
-
[22]
Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system
Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. In Robotics: Science and Systems, 2023
work page 2023
-
[23]
RoboBenchMart: A benchmark for retail robot manipulation
RoboBenchMart Authors. RoboBenchMart: A benchmark for retail robot manipulation. arXiv preprint, 2025
work page 2025
-
[24]
RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation
RoboMIND Authors. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint, 2025
work page 2025
-
[25]
Prism: A multi-view multi-capability retail video dataset for embodied vision-language models
Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, and Sashi Reddi. Prism: A multi-view multi-capability retail video dataset for embodied vision-language models, 2026. https://arxiv.org/abs/2603.29281
-
[26]
Sari Sandbox / SariBench: A photorealistic retail simulation benchmark for embodied agents
SariBench Authors. Sari Sandbox / SariBench: A photorealistic retail simulation benchmark for embodied agents. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
work page 2025
-
[27]
Sereact Cortex: Vision-language-action platform for retail and grocery fulfillment
Sereact GmbH. Sereact Cortex: Vision-language-action platform for retail and grocery fulfillment. Industry Technical Report, 2026
work page 2026
-
[28]
Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube
Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. arXiv preprint arXiv:2202.10448, 2022
-
[29]
Unitree G1 humanoid robot
Unitree Robotics. Unitree G1 humanoid robot. https://www.unitree.com/g1, 2024
work page 2024
-
[30]
Latent Action Pretraining from Videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Lin, et al. LAPA: Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024
work page 2024