AttenA+: Rectifying Action Inequality in Robotic Foundation Models

Andrew F. Luo; Boyu Zhou; Daojie Peng; Fulong Ma; Jiahang Cao; Jian Guo; Jun Ma; Ping Luo; Qiang Zhang; Xupeng Xie

arxiv: 2605.13548 · v3 · pith:LLGNM4DInew · submitted 2026-05-13 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-30 21:39 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords robotic foundation models · action attention · velocity reweighting · VLA models · manipulation trajectories · physics-aware training · trajectory heterogeneity

The pith

Reweighting robot action losses by inverse velocity aligns training with physical task demands and raises benchmark performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current robotic foundation models optimize all actions in a trajectory with equal weight, an assumption carried over from language modeling that overlooks the physical reality of manipulation tasks. In actual robot motion, low-velocity segments require precision and determine success, while high-velocity segments are more forgiving transitions. The paper proposes AttenA+ as a plug-in method that computes an inverse-velocity attention weight and multiplies it into the training loss, directing model capacity toward the critical segments. This change requires no architecture edits or extra parameters yet produces measurable gains when added to existing Vision-Language-Action and World-Action models. The approach therefore treats the intrinsic structure of action sequences as an efficient, physics-informed prior that can complement further scaling.

Core claim

The central claim is that uniform loss weighting in robotic foundation models creates a misalignment with the heterogeneous kinematics of manipulation trajectories; low-velocity intervals carry the decisive precision interactions while high-velocity intervals are error-tolerant. AttenA+ rectifies this by introducing velocity-driven action attention that reweights the objective by the inverse velocity field, thereby aligning the model's learning capacity with physical criticality. The resulting framework is architecture-agnostic and integrates into existing backbones without structural modifications or added parameters, producing higher success rates on long-horizon benchmarks.

What carries the argument

velocity-driven action attention that multiplies the training objective by the inverse velocity field
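A minimal sketch of how such a reweighting could look, assuming per-timestep velocities are available as an array and taking the form w_t = 1/(||v_t|| + ε) quoted in the simulated rebuttal on this page. The function names, the mean-normalization of the weights, and the squared-error loss are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def inverse_velocity_weights(velocities, eps=1e-6, normalize=True):
    """Per-timestep weights w_t = 1 / (||v_t|| + eps).

    velocities: array of shape (T, D), one velocity vector per timestep.
    normalize:  ASSUMPTION (not stated on this page): rescale so the mean
                weight is 1, keeping the loss scale comparable to uniform.
    """
    speed = np.linalg.norm(velocities, axis=-1)  # shape (T,)
    weights = 1.0 / (speed + eps)
    if normalize:
        weights = weights / weights.mean()
    return weights


def weighted_action_loss(pred, target, weights):
    """Mean over timesteps of weight * per-step squared error (hypothetical loss)."""
    per_step = np.mean((pred - target) ** 2, axis=-1)  # shape (T,)
    return float(np.mean(weights * per_step))


# Toy trajectory that slows down over four steps.
velocities = np.array([[1.0, 0.0], [0.5, 0.0], [0.1, 0.0], [0.05, 0.0]])
w = inverse_velocity_weights(velocities)
```

In this toy case the slowest step receives the largest weight. Note that with a small ε a fully stationary step would receive a weight near 1/ε before normalization, which is presumably why the figure captions refer to a clipmax setting.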

If this is right

OpenVLA-OFT reaches 98.6 percent success on the Libero benchmark, an absolute gain of 1.5 percentage points.
FastWAM reaches 92.4 percent success on RoboTwin 2.0, an absolute gain of 0.6 percentage points.
The method integrates into existing model backbones with no architectural changes and no additional parameters.
Real-world tests on a Franka manipulator confirm robustness and cross-task generalization.
Mining intrinsic structural priors of action sequences supplies a physics-aware complement to standard scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inverse-velocity weighting could be tested on other sequential decision tasks where action speed varies in importance.
Applying the reweighting only during later training stages might further reduce any risk of over-emphasizing early noisy data.
Combining velocity attention with existing spatial or temporal attention layers inside VLAs is a direct next experiment.

Load-bearing premise

Low-velocity segments in manipulation trajectories are the decisive precision points that determine task success, so reweighting the loss toward them will improve model performance.

What would settle it

Running the same training schedules on OpenVLA-OFT or FastWAM with and without the inverse-velocity reweighting and finding no difference or a drop in Libero or RoboTwin 2.0 success rates would falsify the central claim.
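One way to judge whether a with/without difference is distinguishable from sampling noise is a two-proportion z-test on success counts. The sketch below uses only the standard library; the trial counts in the usage lines are hypothetical, since this page does not state how many evaluation episodes were run, and the baseline rate of 97.1 percent is simply 98.6 minus the reported 1.5-point gain.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Uses the pooled-variance normal approximation,
    which is only reasonable when both arms have many trials.
    """
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        # Both arms all-success or all-failure: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# HYPOTHETICAL counts: 1000 episodes per arm, 98.6% vs 97.1% success.
z, p_value = two_proportion_z(986, 1000, 971, 1000)
```

The same rate difference measured over fewer episodes gives a smaller |z|, so the episode count and the number of seeds matter as much as the headline delta.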

Figures

Figures reproduced from arXiv: 2605.13548 by Andrew F. Luo, Boyu Zhou, Daojie Peng, Fulong Ma, Jiahang Cao, Jian Guo, Jun Ma, Ping Luo, Qiang Zhang, Xupeng Xie.

**Figure 1.** Overview of AttenA+. AttenA+ is a paradigm-agnostic enhancement framework for action robotic foundation models, introducing velocity-field-based action attention to prioritize slow, critical manipulation steps. It seamlessly plugs into mainstream discriminative (e.g., OpenVLA-OFT) and generative (π0, π0.5, Diffusion Policy) architectures, as well as emerging World-Action Models (WAM). Without modifying cor…

**Figure 2.** Analysis of velocity fields revealing the inherent action inequality. We observe that the informational density of the robot dataset is non-uniformly distributed: rapid motions are often redundant transitions, while slow-motion phases dominate task success or failure. The discovery of this kinematic hierarchy motivates the development of AttenA+, a plug-and-play mechanism designed to rectify the uniform we…

**Figure 3.** Overview of AttenA+. Given visual and language observations from datasets, we derive a velocity field. With attention weighting function FA, this field assigns higher attention weights to slow, critical manipulation steps and lower weights to fast transitional motions, prioritizing learning on error-sensitive actions while training the models. 3.3 Velocity-Field Attention (AttenA+) To rectify the uniform w…

**Figure 4.** Qualitative comparison of task execution with/without AttenA+. (a) The original baseline fails due to accumulated errors in slow, critical manipulation steps (clip, align, release), which receive equal loss weight to fast transitional motions. (b) AttenA+ prioritizes these high-precision segments with larger attention weights, leading to stable task completion.

**Figure 5.** Overview of experimental tasks. I. Simulation: (a) Four LIBERO benchmark tasks; (b) 50 diverse RoboTwin tasks, including clean and randomized environments. II. Real-world experiments on Franka Panda: (a)–(d) Four representative tasks (drawer opening, pick-and-place, multi-objects, and sequential manipulation), showing AttenA+ enhanced policy execution. Model Close Draw Put Cube Multi-object Long OpenVLA-OF…

**Figure 6.** Real robot experimental results on Franka (Each task is tested over 50 trials): (a) Quantitative success rates (%); (b) Qualitative performance visualization. demonstration. Notably, during demonstration data collection, we use different speed for different phase: at the beginning, we use the baseline speed for approaching the object for grasping, then we change the speed to be 1/3 of the baseline to fine …

**Figure 7.** Visualization of Action Speed in LIBERO-OBJECT Task with Different clipmax = 1.0

**Figure 8.** Visualization of Action Speed in LIBERO-OBJECT Task with Different clipmax = 2.0 actions and exert mild, localized suppression on fast actions. Within this group, the intensity of low-speed amplification follows a clear hierarchy: inverse squared (Equation 13) yields the strongest enhancement, followed by logarithmic weighting (Equation 15), and then inverse weighting (Equation 12). This consistent trend i…

**Figure 9.** Visualization of Action Speed in LIBERO-OBJECT Task with Different clipmax = 5.0

**Figure 10.** Visualization of Action Speed in LIBERO-OBJECT Task with Different clipmax = 10.0

**Figure 11.** Third-person views of example LIBERO manipulation tasks. Frames labeled ‘critical’ highlight slow, high-precision actions (e.g., grasping, alignment) where AttenA+ applies increased attention weights to improve task success.

**Figure 12.** Third-person views of example RoboTwin tasks in both clean and randomized environ…

**Figure 13.** Third-person views of four representative real-world Franka tasks. The ‘critical’ labels…
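The Figure 8 caption names three weighting families (inverse, inverse squared, logarithmic) and Figures 7 to 10 vary a clipmax setting, but the equations themselves (12, 13, 15) are not reproduced on this page. The sketch below therefore uses placeholder functional forms and assumes clipmax is an upper bound on the weight; both are assumptions made for illustration, and the ordering of amplification strength described in the caption depends on the paper's actual definitions and normalization.

```python
import numpy as np


def attention_weights(speed, kind="inverse", clip_max=5.0, eps=1e-6):
    """Map per-step speed to a loss weight, capped at clip_max.

    The three forms below are PLACEHOLDERS, not the paper's Equations 12-15.
    clip_max is ASSUMED to bound the weight (it could instead bound speed).
    """
    speed = np.asarray(speed, dtype=float)
    if kind == "inverse":
        raw = 1.0 / (speed + eps)
    elif kind == "inverse_squared":
        raw = 1.0 / (speed + eps) ** 2
    elif kind == "log":
        raw = np.log1p(1.0 / (speed + eps))
    else:
        raise ValueError(f"unknown weighting kind: {kind!r}")
    # Without a cap, near-zero speeds would dominate the loss entirely.
    return np.minimum(raw, clip_max)


speeds = np.array([0.0, 0.1, 1.0, 10.0])
weights_by_kind = {
    kind: attention_weights(speeds, kind=kind)
    for kind in ("inverse", "inverse_squared", "log")
}
```

Under these assumptions every variant is non-increasing in speed and saturates at clip_max for stationary steps, so a larger clip_max lets slow steps pull further ahead of fast ones.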

Original abstract

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AttenA+ adds a velocity-based loss reweighting to VLA models that yields small reported gains on two benchmarks, but the supporting analysis for why velocity should drive the weights is missing.

The paper's main contribution is a simple, architecture-agnostic reweighting of the action loss in robotic foundation models. It multiplies the per-timestep loss by the inverse of the velocity at that step, on the idea that slow segments require more precision and therefore deserve higher weight during training. The authors apply it to OpenVLA-OFT and FastWAM and report gains of 1.5 percentage points on Libero and 0.6 on RoboTwin 2.0, plus some real-robot checks on a Franka.

The approach is genuinely lightweight (no extra parameters, no architecture changes), so it is easy to test on existing training runs. That is the part that could be useful to groups already running these models.

The weakness is that the central premise is not checked. The abstract states that low-velocity parts drive success through precision interactions while high-velocity parts are forgiving, yet there is no reported correlation between velocity and actual failure modes on the benchmarks, and no ablation against uniform weighting, random weighting, or other simple schedules under the same compute. The gains are small enough that they could come from any non-uniform modulation or from run-to-run variance; without error bars or seed counts it is difficult to judge stability.

The method itself is a reasonable engineering adjustment, but the physics-aware framing is not yet grounded in the data shown. This is the sort of incremental tweak that might interest labs working on VLA or WAM training who want cheap ways to adjust the objective. It does not open new directions on its own.

I would send it to peer review. The experiments use standard benchmarks and the change is cheap to implement, so referees can ask for the missing ablations and statistics without much extra work from the authors.

Referee Report

4 major / 0 minor

Summary. The manuscript introduces AttenA+, an architecture-agnostic, parameter-free framework that reweights the training objective of robotic foundation models (VLAs and WAMs) using the inverse velocity field. It claims this prioritizes kinematically critical low-velocity segments over error-tolerant high-velocity transitions, yielding improvements such as OpenVLA-OFT reaching 98.6% (+1.5%) on Libero and FastWAM reaching 92.4% (+0.6%) on RoboTwin 2.0, plus real-world Franka validation.

Significance. If the core assumption and reported gains can be substantiated with derivations, ablations, and statistical validation, the method would offer an efficient, physics-informed complement to scaling that requires no architectural changes. The current text, however, supplies insufficient detail to evaluate whether the gains arise from the claimed mechanism or from generic non-uniform weighting.

major comments (4)

[Abstract] Abstract, paragraph 2: The premise that low-velocity segments dictate task success via precision interactions (while high-velocity segments are error-tolerant) is stated without any reported correlation analysis between velocity and failure modes on Libero or RoboTwin, nor any justification that this hierarchy is the dominant factor limiting current models.
[Abstract] Abstract: No equation, derivation, or pseudocode is supplied for computing the inverse velocity field from trajectories or for its exact integration into the loss function, preventing assessment of whether the reweighting is well-defined or reduces to a fitted hyperparameter.
[Abstract] Abstract: The numerical claims (+1.5% and +0.6%) are presented without error bars, number of random seeds, statistical significance tests, or full baseline tables, so it is impossible to determine whether the improvements exceed run-to-run variance under identical training budgets.
[Abstract] Abstract: The description states that AttenA+ is a 'plug-and-play enhancement' with 'no additional parameters,' yet supplies no implementation details on how the reweighting is applied during training of OpenVLA-OFT or FastWAM, nor any ablation against random, position-based, or uniform weighting alternatives.
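To make the second comment concrete, here is one plausible way a per-step speed could be derived from a recorded trajectory: finite differences over end-effector positions. This is a sketch under assumptions; the page does not say whether the paper differentiates positions, uses commanded delta actions directly, or includes rotation and gripper dimensions, and the function name and sampling interval are hypothetical.

```python
import numpy as np


def speed_from_positions(positions, dt):
    """Estimate per-timestep speed from sampled positions.

    positions: array of shape (T, D) with T >= 2, e.g. end-effector xyz.
    dt:        sampling interval in seconds (ASSUMED constant).
    Uses central differences in the interior and one-sided differences
    at the two endpoints (the behaviour of numpy.gradient).
    """
    positions = np.asarray(positions, dtype=float)
    velocity = np.gradient(positions, dt, axis=0)  # shape (T, D)
    return np.linalg.norm(velocity, axis=-1)       # shape (T,)


# Straight-line motion along x at 1 unit per second, sampled at 10 Hz.
t = np.arange(4) * 0.1
positions = np.stack([t, np.zeros(4), np.zeros(4)], axis=-1)
speed = speed_from_positions(positions, dt=0.1)
```

If a dataset already stores actions as per-step deltas, the norm of the action itself may serve the same purpose and no differentiation is needed; which convention applies is something the manuscript would have to state.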

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying details from the full paper where applicable and committing to revisions that strengthen the presentation without altering the core claims.

Referee: [Abstract] Abstract, paragraph 2: The premise that low-velocity segments dictate task success via precision interactions (while high-velocity segments are error-tolerant) is stated without any reported correlation analysis between velocity and failure modes on Libero or RoboTwin, nor any justification that this hierarchy is the dominant factor limiting current models.

Authors: We agree that an explicit correlation analysis would provide stronger empirical grounding for the premise. The full manuscript motivates the hierarchy from the physics of manipulation and supports it indirectly via the observed performance gains, but does not include a dedicated quantitative correlation study. In the revised version we will add a new analysis subsection reporting Pearson correlations between per-segment velocity and failure rates on both benchmarks, along with failure-mode visualizations.

revision: yes

Referee: [Abstract] Abstract: No equation, derivation, or pseudocode is supplied for computing the inverse velocity field from trajectories or for its exact integration into the loss function, preventing assessment of whether the reweighting is well-defined or reduces to a fitted hyperparameter.

Authors: The methods section (Section 3) defines the inverse velocity weight as w_t = 1/(||v_t|| + ε) with ε = 1e-6 for stability and integrates it directly as a per-timestep multiplier on the action prediction loss; no learned parameters are introduced. To make this immediately verifiable from the abstract, we will add the defining equations and a short pseudocode block to the revised abstract or a new methods figure.

revision: yes

Referee: [Abstract] Abstract: The numerical claims (+1.5% and +0.6%) are presented without error bars, number of random seeds, statistical significance tests, or full baseline tables, so it is impossible to determine whether the improvements exceed run-to-run variance under identical training budgets.

Authors: The reported deltas reflect the primary experimental configuration. We will revise the results section to include means and standard deviations over three random seeds, paired statistical significance tests (t-tests), and expanded baseline tables that keep total training compute constant across conditions.

revision: yes

Referee: [Abstract] Abstract: The description states that AttenA+ is a 'plug-and-play enhancement' with 'no additional parameters,' yet supplies no implementation details on how the reweighting is applied during training of OpenVLA-OFT or FastWAM, nor any ablation against random, position-based, or uniform weighting alternatives.

Authors: Section 4.1 already describes the training-loop integration (pre-computed velocity weights applied to the action head loss for both models) and confirms zero added parameters. To directly address the concern about generic non-uniform weighting, the revised manuscript will add an ablation table comparing velocity-based reweighting against random, position-based, and uniform baselines under matched compute budgets.

revision: yes
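The ablation promised in the last response could be set up so that every scheme shares the same mean weight, which keeps the overall loss scale fixed across conditions. The sketch below is one hypothetical construction: the scheme names, the linear position ramp, and the choice to build the random control by shuffling the velocity weights (same weight distribution, no link to speed) are assumptions, not details from the manuscript.

```python
import numpy as np


def ablation_weights(speed, scheme, rng=None, eps=1e-6):
    """Per-step loss weights for four ablation arms, each with mean 1.

    speed:  array of shape (T,) with per-timestep speeds.
    scheme: "uniform", "position", "velocity", or "random".
    """
    speed = np.asarray(speed, dtype=float)
    steps = speed.shape[0]
    velocity_w = 1.0 / (speed + eps)
    if scheme == "uniform":
        raw = np.ones(steps)
    elif scheme == "position":
        # ASSUMED ramp: later steps weighted more, independent of speed.
        raw = np.linspace(0.5, 1.5, steps)
    elif scheme == "velocity":
        raw = velocity_w
    elif scheme == "random":
        # Same weight values as "velocity", assigned to shuffled timesteps.
        rng = np.random.default_rng(0) if rng is None else rng
        raw = rng.permutation(velocity_w)
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return raw / raw.mean()


speed = np.array([1.0, 0.8, 0.2, 0.05, 0.5])
arms = {s: ablation_weights(speed, s)
        for s in ("uniform", "position", "velocity", "random")}
```

If the velocity arm does not beat the shuffled arm, the gain would be attributable to non-uniform weighting in general rather than to velocity specifically, which is the distinction the referee is asking about.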

Circularity Check

0 steps flagged

No circularity; method is an independent reweighting scheme validated on external benchmarks

The paper defines AttenA+ directly as reweighting the training objective by the inverse velocity field to prioritize low-velocity segments, with the alignment to physical criticality presented as a motivating assumption rather than a derived result. No equations, self-citations, or fitted parameters are shown that reduce the claimed performance gains to the inputs by construction. Improvements are reported on independent benchmarks (Libero, RoboTwin), satisfying the criterion for self-contained external validation. No load-bearing self-citation chains or self-definitional steps are identifiable from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on a single domain assumption about velocity and task criticality; no free parameters, invented entities, or additional axioms are stated.

axioms (1)

domain assumption: Robot trajectories are fundamentally heterogeneous, with low-velocity segments dictating task success through precision-demanding interactions.
Invoked to justify the misalignment between uniform loss weighting and physical demands (abstract paragraph 1).

pith-pipeline@v0.9.1-grok · 5862 in / 1175 out tokens · 34128 ms · 2026-06-30T21:39:26.553904+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GeoSem-WAM: Geometry- and Semantic-Aware World Action Models
cs.RO 2026-06 unverdicted novelty 5.0

GeoSem-WAM adds geometric and semantic auxiliary prediction tasks to World Action Models during training to improve latent representations and action prediction accuracy while keeping inference efficient by avoiding e...

Reference graph

Works this paper leans on

42 extracted references · 32 canonical work pages · cited by 1 Pith paper · 22 internal anchors

[1]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Structured observation language for efficient and generalizable vision-language navigation.arXiv preprint arXiv:2603.27577, 2026

Daojie Peng, Fulong Ma, and Jun Ma. Structured observation language for efficient and generalizable vision-language navigation.arXiv preprint arXiv:2603.27577, 2026

work page arXiv 2026
[4]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[6]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

WorldVLA: Towards Autoregressive Action World Model

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Lovon: Legged open-vocabulary object navigator

Daojie Peng, Jiahang Cao, Qiang Zhang, and Jun Ma. Lovon: Legged open-vocabulary object navigator. arXiv preprint arXiv:2507.06747, 2025

work page arXiv 2025
[10]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi05: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Compose your policies! improving diffusion-based or flow-based robot policies via test-time distribution-level composition

Jiahang Cao, Yize Huang, Hanzhong Guo, Qiang Zhang, Rui Zhang, Weijian Mai, Mu Nan, Jiaxu Wang, Hao Cheng, Jingkai SUN, Gang Han, Wen Zhao, Yijie Guo, Qihao Zheng, Xiao Li, Chunfeng Song, Ping Luo, and Andrew Luo. Compose your policies! improving diffusion-based or flow-based robot policies via test-time distribution-level composition. InThe Fourteenth In...

2026
[12]

Hejna, S

Joey Hejna, Suvir Mirchandani, Ashwin Balakrishna, Annie Xie, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, Dhruv Shah, Coline Devin, and Dorsa Sadigh. Robot data curation with mutual information estimators.arXiv preprint arXiv:2502.08623, 2025

work page arXiv 2025
[13]

Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Action- aware dynamic pruning for efficient vision-language-action manipulation.arXiv preprint arXiv:2509.22093, 2025

Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, and Chang Xu. Action-aware dynamic pruning for efficient vision-language-action manipulation.arXiv preprint arXiv:2509.22093, 2025

work page arXiv 2025
[20]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language- action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Sp-vla:Ajointmodelschedulingandtoken pruning approach for vla model acceleration.arXiv preprint arXiv:2506.12723, 2025

Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, and Wenwu Zhu. Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723, 2025

work page arXiv 2025
[24]

Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models.arXiv preprint arXiv:2505.21200, 2025

Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, and Tao Chen. Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models.arXiv preprint arXiv:2505.21200, 2025

work page arXiv 2025
[25]

Vla-cache: Efficient vision- language-action manipulation via adaptive token caching.Advances in Neural Information Processing Systems, 38:164448–164473, 2026

Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. Vla-cache: Efficient vision- language-action manipulation via adaptive token caching.Advances in Neural Information Processing Systems, 38:164448–164473, 2026

2026
[26]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

2024
[27]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference.arXiv preprint arXiv:2410.04417, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Cross-self kv cache pruning for efficient vision- language inference.arXiv preprint arXiv:2412.04652, 2024

Xiaohuan Pei, Tao Huang, and Chang Xu. Cross-self kv cache pruning for efficient vision-language inference.arXiv preprint arXiv:2412.04652, 2024

work page arXiv 2024
[29]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipula- tion with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[32]

Coarse-to-fine imitation learning: Robot manipulation from a single demonstration

Edward Johns. Coarse-to-fine imitation learning: Robot manipulation from a single demonstration. In 2021 IEEE international conference on robotics and automation (ICRA), pages 4613–4619. IEEE, 2021

[33] Voot Tangkaratt, Nontawat Charoenphakdee, and Masashi Sugiyama. Robust imitation learning from noisy demonstrations. arXiv preprint arXiv:2010.10181, 2020.

[34] Ajay Mandlekar, Fabio Ramos, Byron Boots, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Dieter Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321, 2019.

[35] Qing Wang, Jiechao Xiong, Lei Han, Han Liu, Tong Zhang, et al. Exponentially weighted imitation learning for batched historical data. Advances in Neural Information Processing Systems, 31, 2018.

[36] Yiqi Huang, Travis Davies, Jiahuan Yan, Jiankai Sun, Xiang Chen, and Luhui Hu. Spatial robograsp: Generalized robotic grasping control policy. arXiv preprint arXiv:2505.20814, 2025.

[37] Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.

[38] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. arXiv preprint arXiv:2403.03954, 2024.

[39] Jiahang Cao, Qiang Zhang, Jingkai Sun, Jiaxu Wang, Hao Cheng, Yulin Li, Jun Ma, Kun Wu, Zhiyuan Xu, Yecheng Shao, et al. Mamba policy: Towards efficient 3D diffusion policy with hybrid selective state models. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11359–11366. IEEE, 2025.

[40] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[41] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.

[42] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

