pith. machine review for the scientific record.

arxiv: 2605.09613 · v1 · submitted 2026-05-10 · 💻 cs.RO · cs.CV

Recognition: no theorem link

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

Amirreza Rouhi, Anirudh Govil, Anoop Namboodiri, Narsimha Menga, Parikshit Sakurikar, Rajat Aggarwal, Sashi Reddi, Satya Sai Reddy, Sri Harsha Chittajallu

Pith reviewed 2026-05-12 04:04 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords retail robotics · embodied dataset · action representations · VLA adaptation · human demonstration · manipulation tasks · natural capture

The pith

SABER dataset from natural retail footage more than doubles robot success on manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SABER, a dataset collected from over 100 hours of unscripted activity in real grocery stores using both head-mounted cameras for hand details and 360-degree cameras for full scenes. It supplies 44.8K samples in three formats: latent action sequences, hand poses retargeted to robot joints, and whole-body motions retargeted to humanoids. Applying this data to adapt the GR00T N1.6 model through shared-backbone multi-task training raises average success across ten retail tasks from 13.4 percent to 29.3 percent. A sympathetic reader would care because retail environments have been missing from robot pretraining data, and expensive teleoperation has blocked scaling. The work shows that high-fidelity human action data gathered without robots can directly improve real-world deployment.

Core claim

SABER is a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple grocery environments, yielding 44.8K training samples in three streams: 25K latent action sequences, 18.6K dexterous hand-pose trajectories, and 1.2K whole-body motion sequences. When used to post-train GR00T N1.6 via a shared-backbone multi-task recipe, it produces a mean success rate of 29.3 percent on ten retail manipulation tasks, more than 2.19 times the 13.4 percent achieved by fine-tuning baselines. The paper states that this shows capable retail robots can be built from better data collected today at scale without a robot in the loop.
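As a quick arithmetic check on the headline figures (this sketch uses only the two reported means; per-task success rates are not given in the review):

```python
# Sanity-check the reported improvement ratio: 29.3% mean success after
# SABER post-training vs. the 13.4% fine-tuning baseline.
baseline_mean = 0.134
saber_mean = 0.293

improvement = saber_mean / baseline_mean
print(f"improvement: {improvement:.2f}x")  # 2.19x, matching the reported ratio
```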

What carries the argument

The dual egocentric-exocentric camera setup that simultaneously records fine-grained hand activity at the point of interaction and full-scene dynamics, paired with retargeting of human actions into three robot-compatible representation streams.
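The retargeting step named here is not specified in the review. As a hedged illustration only, a minimal linear joint-range retargeting, a common simple baseline rather than the paper's method, could look like this; the one-to-one joint correspondence and the joint limits are assumptions:

```python
import numpy as np

# Illustrative sketch, NOT the paper's retargeting pipeline: map each tracked
# human joint angle into the corresponding robot joint's range by linear
# rescaling, assuming a one-to-one joint correspondence.

def retarget_linear(human_angles, human_limits, robot_limits):
    """Map human joint angles (radians) into robot joint space.

    human_limits, robot_limits: (N, 2) arrays of [min, max] per joint.
    """
    h = np.asarray(human_angles, dtype=float)
    h_lo, h_hi = human_limits[:, 0], human_limits[:, 1]
    r_lo, r_hi = robot_limits[:, 0], robot_limits[:, 1]
    # Normalize into [0, 1] within the human range, then rescale to the robot's.
    t = np.clip((h - h_lo) / (h_hi - h_lo), 0.0, 1.0)
    return r_lo + t * (r_hi - r_lo)

# Hypothetical 3-joint example with made-up limits.
human = [0.2, 1.0, 0.5]
h_lim = np.array([[0.0, 1.5], [0.0, 1.5], [0.0, 1.5]])
r_lim = np.array([[0.0, 1.2], [-0.3, 0.9], [0.1, 1.0]])
print(retarget_linear(human, h_lim, r_lim))
```

Real hand-pose retargeting (as in DexPilot or AnyTeleop, both cited below) typically adds fingertip-position optimization on top of such joint mappings.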

If this is right

  • Retail manipulation tasks become trainable from natural human behavior without teleoperation or robots present during data collection.
  • A single shared-backbone model can adapt to multiple action representations from the same underlying captures.
  • Whole-body and dexterous streams enable embodiment-specific retargeting while sharing the same source footage.
  • Data collection for domain-specific robot adaptation can scale to new environments using existing camera hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The egocentric-plus-exocentric capture method may generalize to other unstructured settings where context and fine motor details both matter.
  • Combining multiple action streams in one dataset could support testing how well models transfer across different robot hardware without new recordings.
  • If the natural capture approach works here, similar low-overhead methods might reduce reliance on scripted or simulated data in broader embodied AI work.

Load-bearing premise

The performance gains are caused by the SABER dataset and its action representations rather than the post-training recipe, model choice, or the specific selection of ten tasks.

What would settle it

Retraining the identical GR00T N1.6 model on an equal-sized set of teleoperated or simulated retail data using the same multi-task recipe and measuring whether success rates reach or exceed 29.3 percent.
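Such a head-to-head comparison would also need enough trials to separate the two rates, and the referee notes that trial counts are unreported. A sketch of the relevant check under a hypothetical trial budget (300 evaluations per condition is an assumption, not a figure from the paper):

```python
from math import erf, sqrt

# Two-proportion z-test: would 29.3% vs. 13.4% be distinguishable given a
# hypothetical budget of 10 tasks x 30 trials = 300 trials per condition?

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Z statistic and two-sided p-value for comparing two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
    return z, p_value

z, p = two_proportion_z(success_a=88, n_a=300,   # ~29.3%
                        success_b=40, n_b=300)   # ~13.3%
print(f"z = {z:.2f}, p = {p:.2g}")
```

At this assumed scale the gap would be highly significant; with far fewer trials per task it might not be, which is why the missing protocol details matter.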

Original abstract

Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SABER, a dataset of 44.8K action samples collected from over 100 hours of natural, unscripted in-store footage in real grocery environments using synchronized egocentric head-mounted cameras and exocentric 360° scene cameras. The corpus is organized into three streams (25K LAPA-style latent action sequences, 18.6K retargeted dexterous hand-pose trajectories, and 1.2K whole-body motion sequences) and is used for shared-backbone multi-task post-training of GR00T N1.6, yielding a reported mean success rate of 29.3% across ten retail manipulation tasks—more than 2.19× the 13.4% achieved by fine-tuning baselines. The work argues that scalable, non-teleoperated real-world action data is the key missing ingredient for deploying generalist VLAs in retail settings.

Significance. If the reported gains prove robust and causally attributable to the SABER data rather than training-recipe details, the contribution would be substantial: it supplies a concrete, publicly released pathway for adapting foundation models to a high-value but previously underrepresented domain (retail manipulation) without requiring robot-in-the-loop teleoperation. The dual egocentric/exocentric capture strategy and multi-stream action representations address a genuine data gap in current VLA pretraining distributions.

major comments (2)
  1. [Abstract] Abstract: The central quantitative claim (29.3% mean success rate, 2.19× improvement over 13.4% baselines) is stated without any information on evaluation protocol, number of trials per task, variance or standard deviation, task definitions, baseline training hyperparameters, or statistical tests. This omission renders the primary empirical result impossible to assess for reliability or reproducibility.
  2. [Results] Results / Methods: No ablation experiments are described that hold the shared-backbone multi-task post-training recipe fixed while varying only the addition of the 44.8K SABER samples (or that compare against matched non-retail data of similar volume). Without such controls, it is impossible to attribute the performance lift specifically to the SABER dataset and its action representations rather than to the post-training procedure, the GR00T N1.6 backbone, or the particular choice and definition of the ten retail tasks.
minor comments (2)
  1. [Abstract] The abstract would benefit from explicitly stating the total capture duration (over 100 hours) alongside the sample count to better convey scale.
  2. Dataset release link is given, but the manuscript should include a brief summary table of the three action streams with exact sample counts and retargeting details for quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below and outline the revisions we will make to strengthen the presentation of our empirical results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central quantitative claim (29.3% mean success rate, 2.19× improvement over 13.4% baselines) is stated without any information on evaluation protocol, number of trials per task, variance or standard deviation, task definitions, baseline training hyperparameters, or statistical tests. This omission renders the primary empirical result impossible to assess for reliability or reproducibility.

    Authors: We agree that the abstract, due to its brevity, omits these supporting details. The full manuscript describes the evaluation protocol, task definitions, and baseline setup in the Results and Methods sections. In the revised version, we will expand the abstract with a concise clause noting the evaluation scale (e.g., 'mean success rate over multiple trials per task with standard deviations; see Methods for protocol and hyperparameters') while preserving length constraints. This will improve assessability and reproducibility without changing the reported numbers. revision: yes

  2. Referee: [Results] Results / Methods: No ablation experiments are described that hold the shared-backbone multi-task post-training recipe fixed while varying only the addition of the 44.8K SABER samples (or that compare against matched non-retail data of similar volume). Without such controls, it is impossible to attribute the performance lift specifically to the SABER dataset and its action representations rather than to the post-training procedure, the GR00T N1.6 backbone, or the particular choice and definition of the ten retail tasks.

    Authors: The 13.4% baseline reflects fine-tuning of the identical GR00T N1.6 model and shared-backbone multi-task post-training recipe without SABER data, which holds the recipe fixed while varying only the data. We acknowledge that additional controls would further isolate the contribution of the retail-specific action representations. In the revision, we will add an ablation subsection comparing SABER against an equivalent volume of non-retail data drawn from existing public sources, and we will explicitly clarify the baseline configuration in the text to make the attribution clearer. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical dataset and evaluation

Full rationale

The paper introduces an empirical dataset (SABER) collected from real-world captures and reports measured success rates on ten retail tasks when used for post-training of GR00T N1.6. No mathematical derivations, equations, fitted parameters presented as predictions, or self-referential definitions appear in the abstract or described methods. Performance numbers (29.3% mean success, 2.19x over 13.4% baselines) are obtained via direct experimentation against external baselines rather than by construction from the dataset inputs themselves. The work contains no load-bearing steps that reduce to self-definition, self-citation chains, or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset paper with no mathematical derivations. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5699 in / 1236 out tokens · 39034 ms · 2026-05-12T04:04:30.343201+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 6 internal anchors

  1. [1]

    AGIBOT World Challenge at ICRA 2026: Reasoning-to-Action and World Model Tracks

    AGIBOT Research. AGIBOT World Challenge at ICRA 2026: Reasoning-to-action and world model tracks. https://huggingface.co/datasets/agibot-world/AgiBotWorldChallenge-2026, 2026

  2. [2]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  3. [3]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  4. [4]

    ALIA: System and Method for Capturing Omni-Stereo Videos Using Multi-Sensors

    DreamVu Inc. ALIA: System and method for capturing omni-stereo videos using multi-sensors. https://patents.google.com/patent/US11025888B2/en, 2021

  5. [5]

    ALIA: System and Method for Capturing Omni-Stereo Videos Using Multi-Sensors

    DreamVu Inc. ALIA: System and method for capturing omni-stereo videos using multi-sensors. https://patents.google.com/patent/US11523101B2/en, 2022

  6. [6]

    Helix: A Vision-Language-Action Model for Generalist Humanoid Control

    Figure AI. Helix: A vision-language-action model for generalist humanoid control. Technical Report, 2025

  7. [7]

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  8. [8]

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Shyamal Bansal, Bryce Boote, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  9. [9]

    DexPilot: Vision-Based Teleoperation of Dexterous Robotic Hand-Arm System

    Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020

  10. [10]

    AdaFlow: Imitation Learning with Variance-Adaptive Flow-Based Policies

    Xixi Hu, Bo Liu, Xingchao Liu, and Qiang Liu. AdaFlow: Imitation learning with variance-adaptive flow-based policies. Advances in Neural Information Processing Systems, 37:138836–138858, 2024

  11. [11]

    JALA: Joint-Aligned Latent Action Learning for Cross-Embodiment Robot Policy Training

    JALA Authors. JALA: Joint-aligned latent action learning for cross-embodiment robot policy training. arXiv preprint arXiv:2602.21736, 2026

  12. [12]

    EgoMimic: Scaling Imitation Learning Through Egocentric Video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning through egocentric video. arXiv preprint arXiv:2410.24221, 2024

  13. [13]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  14. [14]

    LIBERO-PRO: Probing the Robustness of Vision-Language-Action Models under Distribution Shift

    LIBERO-PRO Authors. LIBERO-PRO: Probing the robustness of vision-language-action models under distribution shift. arXiv preprint, 2025

  15. [15]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  16. [16]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  17. [17]

    SMPL: A Skinned Multi-Person Linear Model

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015

  18. [18]

    Cosmos-Reason1: From Physical World Understanding to Embodied Reasoning

    NVIDIA Research. Cosmos-Reason1: From physical world understanding to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  19. [19]

    DreamGen: Unlocking Generalization in Robot Learning Through Neural Trajectory Generation

    NVIDIA Research. DreamGen: Unlocking generalization in robot learning through neural trajectory generation. arXiv preprint, 2025

  20. [20]

    GR00T N1: A Generalist Foundation Model for Humanoid Robots

    NVIDIA Research. GR00T N1: A generalist foundation model for humanoid robots. arXiv preprint, 2025

  21. [21]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023

  22. [22]

    AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System

    Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. AnyTeleop: A general vision-based dexterous robot arm-hand teleoperation system. In Robotics: Science and Systems, 2023

  23. [23]

    RoboBenchMart: A Benchmark for Retail Robot Manipulation

    RoboBenchMart Authors. RoboBenchMart: A benchmark for retail robot manipulation. arXiv preprint, 2025

  24. [24]

    RoboMIND: Benchmark on Multi-Embodiment Intelligence Normative Data for Robot Manipulation

    RoboMIND Authors. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint, 2025

  25. [25]

    PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

    Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, and Sashi Reddi. PRISM: A multi-view multi-capability retail video dataset for embodied vision-language models, 2026. https://arxiv.org/abs/2603.29281

  26. [26]

    Sari Sandbox / SariBench: A Photorealistic Retail Simulation Benchmark for Embodied Agents

    SariBench Authors. Sari Sandbox / SariBench: A photorealistic retail simulation benchmark for embodied agents. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  27. [27]

    Sereact Cortex: Vision-Language-Action Platform for Retail and Grocery Fulfillment

    Sereact GmbH. Sereact Cortex: Vision-language-action platform for retail and grocery fulfillment. Industry Technical Report, 2026

  28. [28]

    Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on YouTube

    Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on YouTube. arXiv preprint arXiv:2202.10448, 2022

  29. [29]

    Unitree G1 Humanoid Robot

    Unitree Robotics. Unitree G1 humanoid robot. https://www.unitree.com/g1, 2024

  30. [30]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Lin, et al. LAPA: Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024