pith. sign in

arxiv: 2605.27284 · v1 · pith:GGC2N5KXnew · submitted 2026-05-26 · 💻 cs.RO · cs.AI

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Pith reviewed 2026-06-29 17:16 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords fine-grained supervisionvision-language-actionsteerable policiesrobot manipulationfinevla-datadual-arm tasksVLM annotationinstruction alignment
0
0 comments X

The pith

Fine-grained language instructions improve how precisely robot policies follow execution details without lowering overall task success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current robot datasets pair trajectories only with coarse goal descriptions, which leaves out details needed for precise control such as which arm to move or from which direction to approach. It introduces a framework that converts existing trajectories into a verified set of 47,159 fine-grained examples and trains policies on mixtures of these detailed instructions and the original goal-level ones. Experiments demonstrate that adding the fine-grained layer does not reduce goal success and instead raises it in most cases, while producing the largest gains on the very attributes the coarse instructions cannot specify. A sympathetic reader would care because this points to a practical way to make language-directed robots more controllable in real settings.

Core claim

Fine-grained supervision does not sacrifice goal-level success and improves steerable control. FG-only improves over Raw-only by 1.4 to 8.1 success-rate points. Fine-grained and raw instructions are complementary, following a consistent inverted-U trend that peaks at FG:Raw ratios of 1:2 to 1:1. The best mixed setting reaches 86.8 percent and 82.5 percent in RoboTwin simulation and 62.7 out of 100 in real-world dual-arm manipulation, compared with 49.9 for raw-only. The largest real-world gains occur on pose, color, and approach direction.

What carries the argument

A steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions.

If this is right

  • Fine-grained only training improves goal success over raw-only training across multiple settings.
  • The complementarity peaks at roughly equal or slightly more raw instructions than fine-grained ones.
  • Steerability gains concentrate on attributes such as pose, color, and approach direction that goal-level language leaves unspecified.
  • The same mixed-training pattern produces measurable lifts in both simulation and real dual-arm manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the complementarity pattern holds beyond the tested datasets, then scaling the proportion of fine-grained labels in new collections could further raise both success and controllability.
  • The held-out benchmark of atomic facts and VQA questions could be reused to audit whether other VLA models also benefit from explicit execution details.
  • Policies trained under this mixture may transfer more readily to instructions that combine high-level goals with low-level constraints not seen in training.

Load-bearing premise

The human-verified fine-grained annotations and the robotics-specialized VLM annotator accurately capture execution-critical details such as active arm, approach direction, and contact region.

What would settle it

Train two otherwise identical policies, one on raw goal instructions alone and one on the mixed fine-grained set, then measure success rates on a held-out set of tasks that explicitly require a particular approach direction or arm choice; equal performance on those tasks would falsify the steerability claim.

Figures

Figures reproduced from arXiv: 2605.27284 by Jinyu Zhang, Junhao Chen, Mingsheng Li, Qiuyue Wang, Shuai Bai, Sicheng Xie, Tao Yu, Xintong Hu, Xuhong Huang, Yingming Zheng, Yitao Liu, Yixuan Chen, Yuchong Sun, Yutong Yao.

Figure 1
Figure 1. Figure 1: Overview of FineVLA. FineVLA builds a closed loop for action-instruction alignment, connecting fine-grained data construction, robotic video understanding, scalable annotation, and steerable VLA policy learning. Left: FineVLA-Tool unifies heterogeneous robot trajectories from 10 open-source datasets, removes redundant demonstrations through clustering and sampling, and annotates representative trajectories… view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline of FineVLA-Tool. FineVLA-Tool converts large-scale heterogeneous robot demonstrations into action-aligned fine-grained instruction data through four stages. Stage 1: raw trajectories from 10 open-source robot datasets are converted into a unified LeRobot-style format and filtered to remove invalid videos. Stage 2: action and state representations are canonicalized across embodiments, and an action… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of RoboFine-Bench. RoboFine-Bench evaluates fine-grained robotic video understanding through complementary VQA and captioning tracks. Left: benchmark statistics, including the video-duration distribution, the word cloud of manipulation skills and objects, and the distribution of ground-truth atomic facts across the ten FineVLA dimensions for captioning and VQA. Top right: the captioning track deco… view at source ↗
Figure 4
Figure 4. Figure 4: Correlation between benchmark caption scores and human ranking. We recruit 10 human raters to rank the six models on the 500 benchmark videos, and average the resulting subjective scores. Human ranks are normalized from the 1–6 range to [0, 1], while benchmark caption Overall scores are normalized from 0–100 to [0, 1] [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Paired real-world evaluation. Each column shows one control factor under the same visual scene with two language variants. From left to right: Color (red/blue), Pose (standing/lying), Approach (above/side), Rotation (clockwise/counterclockwise), Arm (right/left) [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: RoboTwin mixing-ratio curves. Performance peaks around FG : Raw = 1 : 2 to 1 : 1 across all settings, yielding a consistent inverted-U trend. 5 Analysis This section analyzes why fine-grained supervision improves performance, how it should be mixed with raw goal-level instructions, and which control factors benefit most from action-aligned language. 5.1 Fine-Grained Supervision Does Not Sacrifice Goal-Leve… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative examples of DTW-based trajectory clustering in FineVLA-Tool. For each task, the left panel shows the pairwise DTW distance matrix and the right panel shows a 2D MDS embedding of the same distances. Clear cluster structure indicates that trajectories with similar manipulation dynamics are grouped together, while differences in gripper timing and end-effector motion patterns are separated into di… view at source ↗
Figure 8
Figure 8. Figure 8: Human ranking interface for caption evaluation. Annotators watch the benchmark video and rank the six candidate captions from best to worst according to fine-grained faithfulness and usefulness. The protocol is designed to validate whether benchmark-induced model ranking is aligned with direct human judgment. A.3.7 Caption Cost, Token, and Latency This subsection supports the efficiency discussion in Secti… view at source ↗
read the original abstract

Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 10,816 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18)--factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces FineVLA, a framework for fine-grained instruction alignment in Vision-Language-Action (VLA) policies. It unifies 972,247 trajectories from 10 open-source datasets into FineVLA-Data (47,159 human-verified fine-grained trajectories), provides a held-out benchmark (500 videos, 10,816 atomic facts, 1,030 VQA questions), deploys a robotics-specialized VLM annotator, and trains policies on controlled FG:Raw instruction mixtures. Experiments report that fine-grained supervision improves steerability on execution factors (pose +23, color +18, approach direction +18) without harming goal-level success, with peak performance at FG:Raw ratios of 1:2 to 1:1 (e.g., 86.8%/82.5% RoboTwin, 62.7/100 real-world dual-arm vs. 49.9 Raw-only).

Significance. If the empirical claims hold, the work supplies a scalable path to augment coarse goal-level robot datasets with execution-critical details, directly addressing a documented limitation in VLA training. The consistent inverted-U mixture trend and factor-specific gains on held-out simulation and real-world trials constitute a concrete, falsifiable contribution to steerable policy learning.

major comments (3)
  1. [Experiments] Experiments section (results on RoboTwin and real-world dual-arm): success rates are reported as point estimates (86.8%, 82.5%, 62.7/100) with no error bars, standard deviations across seeds, or statistical significance tests. This undermines assessment of whether the reported gains over Raw-only (e.g., +12.8 on real-world) are reliable, especially given the central claim that FG supervision improves steerability without sacrificing goal success.
  2. [Data Construction] Data construction and VLM annotator (FineVLA-Data and benchmark): the manuscript states that the robotics-specialized VLM annotator produces scalable fine-grained labels verified by humans, yet provides no quantitative accuracy metrics, inter-annotator agreement, or ablation on annotator error rates for the 10,816 atomic facts. Because the weakest assumption is precisely the fidelity of these execution-critical details (active arm, contact region, approach direction), this detail is load-bearing for the data-quality premise.
  3. [Benchmark] Benchmark construction (held-out 500-video set): the paper does not describe the sampling procedure, task distribution, or how the 1,030 VQA questions were generated and balanced across the 10,816 facts. Without this, it is impossible to evaluate whether the steerability gains generalize beyond the specific factors highlighted or whether the benchmark inadvertently favors the FG annotations.
minor comments (1)
  1. [Abstract / Experiments] The abstract and results tables would benefit from explicit statement of the number of evaluation episodes per condition and whether the same policy seeds were used across mixture ratios.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below and commit to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (results on RoboTwin and real-world dual-arm): success rates are reported as point estimates (86.8%, 82.5%, 62.7/100) with no error bars, standard deviations across seeds, or statistical significance tests. This undermines assessment of whether the reported gains over Raw-only (e.g., +12.8 on real-world) are reliable, especially given the central claim that FG supervision improves steerability without sacrificing goal success.

    Authors: We agree that reporting only point estimates limits evaluation of reliability. In the revised manuscript we will add standard deviations across multiple random seeds for all reported success rates and include statistical significance tests (e.g., paired t-tests) comparing FG-mixed conditions against the Raw-only baseline. These additions will appear in the Experiments section and associated tables. revision: yes

  2. Referee: [Data Construction] Data construction and VLM annotator (FineVLA-Data and benchmark): the manuscript states that the robotics-specialized VLM annotator produces scalable fine-grained labels verified by humans, yet provides no quantitative accuracy metrics, inter-annotator agreement, or ablation on annotator error rates for the 10,816 atomic facts. Because the weakest assumption is precisely the fidelity of these execution-critical details (active arm, contact region, approach direction), this detail is load-bearing for the data-quality premise.

    Authors: The concern about missing quantitative validation of the annotator is valid. Although the 47,159 trajectories were human-verified, the original submission omitted inter-annotator agreement and accuracy metrics. We will add these in revision by reporting agreement on a sampled subset of atomic facts and any available error-rate ablations; if full retrospective computation is infeasible we will instead detail the exact human verification protocol used. revision: yes

  3. Referee: [Benchmark] Benchmark construction (held-out 500-video set): the paper does not describe the sampling procedure, task distribution, or how the 1,030 VQA questions were generated and balanced across the 10,816 facts. Without this, it is impossible to evaluate whether the steerability gains generalize beyond the specific factors highlighted or whether the benchmark inadvertently favors the FG annotations.

    Authors: We agree that the benchmark construction details are insufficient. In the revised manuscript we will expand the relevant section to specify the sampling procedure for the 500-video held-out set, the task distribution across source datasets, and the generation and balancing process for the 1,030 VQA questions relative to the 10,816 atomic facts. This will improve transparency and allow readers to assess potential biases. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on held-out data

full rationale

The paper presents an empirical framework for constructing FineVLA-Data from existing trajectories, training VLA policies on instruction mixtures, and evaluating on a held-out benchmark (500 videos, 10,816 facts, 1,030 VQA questions) plus real-world trials. All reported gains (e.g., +1.4 to +8.1 success rate, inverted-U trend peaking at FG:Raw = 1:2 to 1:1, factor-specific improvements on pose/color/direction) are measured outcomes on separate test sets rather than quantities defined in terms of fitted parameters or self-referential equations. No derivation chain reduces a claimed result to its own inputs by construction; the central claims rest on controlled experiments and external verification steps.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields limited visibility into modeling choices; the main unstated premises concern annotation fidelity and VLM reliability.

free parameters (1)
  • FG:Raw mixture ratio
    Experimental mixtures (1:2 to 1:1) selected to maximize reported performance; treated as tuned hyperparameter.
axioms (2)
  • domain assumption Human verification of the 47,159 trajectories produces accurate execution-critical labels
    The dataset is described as human-verified, but verification protocol, inter-annotator agreement, and coverage of edge cases are not specified in the abstract.
  • domain assumption The robotics-specialized VLM annotator produces labels of sufficient quality to support policy training at scale
    The annotator is presented as a core component without reported accuracy metrics or failure modes in the abstract.

pith-pipeline@v0.9.1-grok · 5958 in / 1590 out tokens · 35018 ms · 2026-06-29T17:16:28.894198+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 27 canonical work pages · 13 internal anchors

  1. [1]

    Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine...

  2. [2]

    A Pragmatic VLA Foundation Model

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, and Kecheng Zheng. A pragmatic vla foundation model, 2026. URL https://arxiv.org/a...

  3. [3]

    Nvidia isaac gr00t

    NVIDIA. Nvidia isaac gr00t. https://github.com/NVIDIA/Isaac-GR00T, 2026. GitHub repository, accessed April 13, 2026

  4. [4]

    Generalist AI. Gen-1. https://generalistai.com/blog/apr-02-2026-GEN-1 , 2026. Blog post, accessed April 2, 2026

  5. [5]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models, 2023. URLhttps://arxiv.org/abs/2310.08864

  6. [6]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation, 2025. URLhttps://arxiv.org/abs/2410.07864

  7. [7]

    Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, and Yuan Cao

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrish- nan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J. Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, and Yuan Cao. RoboVQA: Multimodal long-horizon r...

  8. [8]

    Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain, 2025

    Yulin Luo, Chun-Kai Fan, Menghang Dong, Jiayu Shi, Mengdi Zhao, Bo-Wen Zhang, Cheng Chi, Jiaming Liu, Gaole Dai, Rongyu Zhang, Ruichuan An, Kun Wu, Zhengping Che, Shaoxuan Xie, Guocai Yao, Zhongxia Zhao, Pengwei Wang, Guang Liu, Zhongyuan Wang, Tiejun Huang, and Shanghang Zhang. Robobench: A comprehensive evaluation benchmark for multimodal large language...

  9. [9]

    HanDyVQA: A video QA benchmark for fine-grained hand-object interaction dynamics, 2025

    Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, and Takuma Yagi. HanDyVQA: A video QA benchmark for fine-grained hand-object interaction dynamics, 2025. URL https: //arxiv.org/abs/2512.00885

  10. [10]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  11. [11]

    BridgeData V2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, et al. BridgeData V2: A dataset for robot learning at scale. InProceedings of the Conference on Robot Learning, 2023. URL https://arxiv.org/abs/2308.12952. 14

  12. [12]

    Bc-z: Zero-shot task generalization with robotic imitation learning,

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning,

  13. [13]

    URLhttps://arxiv.org/abs/2202.02005

  14. [14]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. RT-1: Robotics transformer for real-world control at scale, 2022. URLhttps://arxiv.org/abs/2212.06817

  15. [15]

    Jiang, T

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, et al. Galaxea open-world dataset and G0 dual-system VLA model, 2025. URLhttps://arxiv.org/abs/2509.00576

  16. [16]

    RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, Shichao Fan, Xinhua Wang, Fei Liao, Zhen Zhao, Guangyu Li, Zhao Jin, Lecheng Wang, Jilei Mao, Ning Liu, Pei Ren, Qiang Zhang, Yaoxu Lyu, Mengzhen Liu, He Jingyang, Yulin Luo, Zeyu Gao, Chenxuan Li, Chenyang Gu, Yankai Fu, Di Wu, Xingyu W...

  17. [17]

    RoboMIND 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence, 2025

    Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, et al. RoboMIND 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence, 2025. URL https://arxiv.org/abs/2512.24653

  18. [18]

    RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

    Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, et al. RoboCOIN: An open- sourced bimanual robotic data collection for integrated manipulation, 2025. URL https: //arxiv.org/abs/2511.17441

  19. [19]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,

  20. [20]

    URLhttps://arxiv.org/abs/2307.00595

  21. [21]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset, 2024. URL https: //arxiv.org/abs/2403.12945

  22. [22]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    StarVLA Community. Starvla: A lego-like codebase for vision-language-action model develop- ing, 2026. URLhttps://arxiv.org/abs/2604.05014

  23. [23]

    RoboTwin: Dual-arm robot benchmark with generative digital twins, 2024

    Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. RoboTwin: Dual-arm robot benchmark with generative digital twins, 2024. URLhttps://arxiv.org/abs/2409.02920

  24. [24]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control, 2023. URL https://arxiv.org/abs/2307.15818

  25. [25]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model, 2024. URLhttps://arxiv.org/abs/2406.09246

  26. [26]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control, 2024. URLhttps://arxiv.org/abs/2410.24164

  27. [27]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213. 15

  28. [28]

    RoboInter: A holistic intermediate representation suite towards robotic manipulation, 2026

    Hao Li, Ziqin Wang, Zi-han Ding, Shuai Yang, et al. RoboInter: A holistic intermediate representation suite towards robotic manipulation, 2026. URL https://arxiv.org/abs/ 2602.09973

  29. [29]

    STEER: Flexible robotic manipulation via dense language ground- ing, 2024

    Laura Smith, Alex Irpan, Montserrat Gonzalez Arenas, Sean Kirmani, Dmitry Kalashnikov, Dhruv Shah, and Ted Xiao. STEER: Flexible robotic manipulation via dense language ground- ing, 2024. URLhttps://arxiv.org/abs/2411.03409

  30. [30]

    PartInstruct: Part-level instruction following for fine- grained robot manipulation, 2025

    Yifan Yin, Zhengtao Han, Shivam Aarya, Jianxin Wang, Shuhang Xu, Jiawei Peng, Angtian Wang, Alan Yuille, and Tianmin Shu. PartInstruct: Part-level instruction following for fine- grained robot manipulation, 2025. URLhttps://arxiv.org/abs/2505.21652

  31. [31]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  32. [32]

    Qwen3.5-Omni technical report, 2026

    Qwen Team. Qwen3.5-Omni technical report, 2026. URL https://arxiv.org/abs/2604. 15804

  33. [33]

    Wolf: Dense video captioning with a world summarization framework, 2025

    Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Linxi Fan, Yuke Zhu, Jan Kautz, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, and Marco Pavone. Wolf: Dense video captioning with a world summarization framework, 20...

  34. [34]

    Robotic skill acquisition via instruction augmentation with vision-language models

    Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, and Jonathan Tompson. Robotic skill acquisition via instruction augmentation with vision-language models. InRobotics: Science and Systems, 2023. URL https://arxiv. org/abs/2211.11736

  35. [35]

    RoboAnnotatorX: A comprehensive and universal annotation framework for accurate understanding of long-horizon robot demonstration

    Longxin Kou, Fei Ni, Yan Zheng, Peilong Han, Jinyi Liu, Haiqin Cui, Rui Liu, and Jianye Hao. RoboAnnotatorX: A comprehensive and universal annotation framework for accurate understanding of long-horizon robot demonstration. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10353–10363,

  36. [36]

    16 A Appendix Contents A.1 FineVLA-Tool Details

    URL https://openaccess.thecvf.com/content/ICCV2025/html/Kou_ RoboAnnotatorX_A_Comprehensive_and_Universal_Annotation_Framework_for_ Accurate_Understanding_ICCV_2025_paper.html. 16 A Appendix Contents A.1 FineVLA-Tool Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.1.1 Data Sources and Format Conversion . . . . . . . . . . . . ...

  37. [37]

    Canonicalization.All trajectories within a task are converted to their canonical action rep- resentation (joint-space or EEF-space with quaternion rotations) following the procedure in Appendix A.1.2

  38. [38]

    Pairwise DTW distance computation.For each pair of trajectories within a task, we compute the DTW distance using a representation-specific frame cost function (defined below)

  39. [39]

    Hierarchical clustering.Agglomerative clustering with average linkage is applied to the pairwise distance matrix, and the number of clusters is determined automatically via the largest relative gap in merge heights

  40. [40]

    Step1":

    Representative selection.Two to three high-quality trajectories are selected from each cluster based on proximity to the cluster medoid and trajectory quality metrics (video integrity, action smoothness). DTW formulation.Open robot datasets are highly redundant: many demonstrations differ only in speed, minor spatial offsets, or camera viewpoint, while ex...

  41. [41]

    action_primitive –- fundamental action type (grasp, push, rotate, etc.)

  42. [42]

    actor_identity –- which arm/hand/gripper performs the action

  43. [43]

    object_recognition –- object category, color, material, shape, size

  44. [44]

    object_disambiguation –- distinguishing similar objects via spatial/attribute cues

  45. [45]

    contact_region –- specific part where gripper contacts the object

  46. [46]

    source_state_or_location –- initial state/position before manipulation

  47. [47]

    trajectory_and_orientation –- direction, path, or rotation during motion

  48. [48]

    placement_specification –- final target location or spatial relation

  49. [49]

    interaction_with_other_objects –- contact/disturbance of non-target objects

  50. [50]

    success_failure_retry –- whether the action succeeds, fails, or retries

  51. [51]

    gripper_state –- open/close/release state at a specific moment

  52. [52]

    temporal_order_and_step_boundary –- ordering of steps and boundaries

  53. [53]

    all/none of the above

    body_motion –- robot base/torso/camera movement Dimension Balancing: For Mode B, randomly select dimensions per sample. Do NOT ask two questions on the same dimension within one sample. Across the batch, aim for roughly equal coverage. Answer Types: –- multiple_choice: 4–8 mutually exclusive options, no “all/none of the above” –- yes_no: answer exactly “y...

  54. [54]

    Pre-extracted GT atomic facts (structured, grouped by capability dimension)

  55. [55]

    A raw AI-generated caption (a list of step descriptions, NOT pre-extracted into atomic facts). Your task is to evaluate each GT atomic fact against the raw caption text and determine: – For each GT fact: is it match, partial, contradiction, or omission? – Additionally, identify any hallucinated action events in the caption that do NOT appear in the GT act...