RLDX-1 Technical Report
Pith reviewed 2026-05-08 18:56 UTC · model grok-4.3
The pith
RLDX-1 uses a multi-stream transformer to reach 86.8% success on ALLEX humanoid tasks while other VLAs stay near 40%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLDX-1 is a robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer that integrates heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. Combined with data synthesis for rare scenarios, specialized learning procedures for human-like manipulation, and inference optimizations, the policy achieves higher success rates than frontier VLAs across simulation and real-world settings. On ALLEX humanoid tasks it reaches 86.8% success while π0.5 and GR00T N1.6 reach around 40%, showing effective control of high-DoF robots under diverse functional demands.
What carries the argument
The Multi-Stream Action Transformer (MSAT), which unifies modalities by running each in its own stream and then applying joint self-attention across streams to combine scene understanding with action generation.
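As a concrete (and deliberately simplified) illustration of that pattern, here is a minimal PyTorch sketch: per-modality encoder streams followed by one joint self-attention step over the concatenated tokens. Every name, dimension, and layer choice is an assumption for illustration; the abstract does not specify MSAT's actual internals.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """One block in the multi-stream pattern: refine each modality in its own
    stream, then mix all tokens with cross-modal joint self-attention.
    Hypothetical sketch; not the paper's reported configuration."""

    def __init__(self, dim: int = 256, heads: int = 8, num_streams: int = 3):
        super().__init__()
        # One encoder layer per modality (e.g., vision, language, action/proprio).
        self.streams = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(num_streams)
        )
        # Joint self-attention over the concatenated token sequence.
        self.joint = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, streams: list[torch.Tensor]) -> list[torch.Tensor]:
        refined = [layer(x) for layer, x in zip(self.streams, streams)]
        mixed = self.joint(torch.cat(refined, dim=1))  # (B, sum_i T_i, dim)
        # Split back so downstream blocks keep per-modality structure.
        return list(mixed.split([x.shape[1] for x in refined], dim=1))

# Toy usage: batch of 2, three token streams of different lengths.
block = MultiStreamBlock()
vision = torch.randn(2, 64, 256)
language = torch.randn(2, 16, 256)
action = torch.randn(2, 32, 256)
vision, language, action = block([vision, language, action])
```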
If this is right
- RLDX-1 can control high-DoF humanoid robots reliably across contact-rich and dynamic tasks.
- Data synthesis for rare manipulation scenarios improves coverage of edge cases that general pre-training misses.
- Specialized learning procedures and real-time inference optimizations make the policy practical for physical robot deployment.
- Vision-language-action models can be extended to support broader functional capabilities without losing general versatility.
Where Pith is reading between the lines
- A multi-stream design might reduce reliance on ever-larger unified pre-training by letting each modality develop its own representations first.
- The same architecture could be tested on non-humanoid platforms to check whether the performance lift depends on the specific robot morphology.
- Future comparisons could isolate whether cross-modal joint self-attention or the separate streams contribute most to the observed gains.
Load-bearing premise
The performance differences arise from the MSAT architecture and listed system-level choices rather than uncontrolled differences in training data volume, evaluation protocols, robot hardware calibration, or task definitions.
What would settle it
A side-by-side retraining and evaluation of RLDX-1, π0.5, and GR00T N1.6 on identical data volumes, identical task definitions, and identical hardware calibration to test whether the success-rate gap remains.
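A minimal sketch of such a matched protocol, assuming hypothetical policy and task objects (nothing here comes from the paper): each policy sees the same tasks, the same number of trials, and the same per-episode seeds, so any remaining gap cannot be blamed on the evaluation side.

```python
import random

def evaluate(policy, tasks, trials_per_task: int, seed: int) -> dict:
    """Success rate per task under a fixed, shared protocol.

    `policy` and the `task.run(policy, seed=...) -> bool` interface are
    hypothetical placeholders for whatever harness is actually used.
    """
    rng = random.Random(seed)  # same seed => same episode seeds for every policy
    rates = {}
    for task in tasks:
        episode_seeds = [rng.randrange(2**31) for _ in range(trials_per_task)]
        successes = sum(task.run(policy, seed=s) for s in episode_seeds)
        rates[task.name] = successes / trials_per_task
    return rates

# Identical data volumes and training would be handled upstream; here only the
# evaluation protocol is pinned down:
# for name, policy in [("RLDX-1", rldx1), ("pi-0.5", pi05), ("GR00T N1.6", groot)]:
#     print(name, evaluate(policy, tasks, trials_per_task=50, seed=0))
```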
Original abstract
While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g. motion awareness, long-term memory, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including data synthesis for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g. $\pi_{0.5}$ and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority in ALLEX humanoid tasks by achieving success rates of 86.8% while $\pi_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT) architecture, which integrates heterogeneous modalities via modality-specific streams and cross-modal attention. It further incorporates system-level choices including data synthesis for rare scenarios, specialized learning procedures, and inference optimizations. The central claim is that RLDX-1 consistently outperforms recent frontier VLAs (e.g., π0.5 and GR00T N1.6) on simulation benchmarks and real-world tasks, with a highlighted result of 86.8% success rate on ALLEX humanoid tasks versus approximately 40% for the baselines.
Significance. If the performance differences can be rigorously attributed to the MSAT architecture and listed design choices rather than confounds, this would constitute a notable contribution to vision-language-action models for complex, high-DoF humanoid manipulation requiring functional capabilities beyond general versatility. The architecture's unification of modalities addresses a recognized limitation in current VLAs, but the absence of supporting evidence in what is presented prevents assessing whether the result holds.
Major comments (3)
- Abstract: The claim of consistent outperformance, including the specific 86.8% versus ~40% success rates on ALLEX humanoid tasks, is presented without any description of the experimental setup, baseline implementations, training data volumes or compositions, task definitions, success criteria, or hardware calibration details. This prevents determining whether the gap is attributable to MSAT rather than uncontrolled differences in training regime or evaluation protocols.
- Abstract: No ablation studies, controlled experiments, or comparisons isolating the MSAT architecture from the system-level choices (data synthesis, specialized learning procedures, inference optimizations) are reported, leaving the central performance attribution unsupported.
- Abstract: The manuscript supplies no information on the number of evaluation trials, error bars, variance, or statistical tests for the reported success rates, which is required to substantiate claims of consistent superiority across benchmarks and real-world tasks.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional clarity and evidence are needed to support our claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: Abstract: The claim of consistent outperformance, including the specific 86.8% versus ~40% success rates on ALLEX humanoid tasks, is presented without any description of the experimental setup, baseline implementations, training data volumes or compositions, task definitions, success criteria, or hardware calibration details. This prevents determining whether the gap is attributable to MSAT rather than uncontrolled differences in training regime or evaluation protocols.
Authors: We agree that the abstract is too concise and omits critical context on the experimental setup, baselines, data composition, task definitions, success criteria, and hardware details. We will revise the abstract to include a brief high-level summary of the evaluation protocol and benchmarks. We will also expand the main text to provide complete descriptions of training data volumes, baseline implementations, task specifications, success criteria, and hardware calibration, enabling readers to assess potential confounds and attribute performance differences more clearly. Revision: yes.
Referee: Abstract: No ablation studies, controlled experiments, or comparisons isolating the MSAT architecture from the system-level choices (data synthesis, specialized learning procedures, inference optimizations) are reported, leaving the central performance attribution unsupported.
Authors: The referee correctly notes the absence of ablations or controlled experiments that isolate the MSAT architecture from the system-level choices. While the full RLDX-1 system shows strong results against frontier VLAs, this does not rigorously separate the architectural contribution. We will add a dedicated ablation subsection in the Experiments section, including controlled comparisons (e.g., MSAT versus a standard transformer backbone with other components held fixed) to better support the attribution of gains to the multi-stream design. Revision: yes.
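One way to organize such an ablation is a one-factor-at-a-time grid over the system's components. The sketch below is purely illustrative; the config fields and names are assumptions, not the authors' actual training setup.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SystemConfig:
    backbone: str               # "msat" or "standard" transformer
    data_synthesis: bool        # rare-scenario data synthesis on/off
    specialized_learning: bool  # human-like manipulation procedures on/off
    inference_opts: bool        # real-time inference optimizations on/off

full = SystemConfig("msat", True, True, True)

# Flip exactly one component per run, holding everything else fixed, so each
# success-rate delta is attributable to a single design choice.
ablations = {
    "architecture": replace(full, backbone="standard"),
    "data_synthesis": replace(full, data_synthesis=False),
    "learning_procedure": replace(full, specialized_learning=False),
    "inference_opts": replace(full, inference_opts=False),
}
```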
Referee: Abstract: The manuscript supplies no information on the number of evaluation trials, error bars, variance, or statistical tests for the reported success rates, which is required to substantiate claims of consistent superiority across benchmarks and real-world tasks.
Authors: We acknowledge that the reported success rates are presented without details on the number of trials, error bars, variance, or statistical tests. In the revised manuscript, we will update all results tables and figures to specify the number of evaluation trials per task, include standard deviations or confidence intervals, and report appropriate statistical tests (such as paired t-tests) to substantiate the consistency of the observed outperformance. Revision: yes.
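For reference, one standard way to attach uncertainty to a binomial success rate is a Wilson score interval. The sketch below uses a hypothetical trial count (53, chosen only so that 46/53 ≈ 86.8%); per-task intervals like this could then feed the paired tests the authors mention.

```python
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

print(wilson_ci(46, 53))  # ~ (0.75, 0.93): still a wide interval at this trial count
```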
Circularity Check
No circularity: empirical performance comparison with no derivations or self-defined reductions
Rationale
The paper is an empirical technical report introducing the RLDX-1 policy and MSAT architecture, then reporting success rates on benchmarks and real-world tasks (e.g., 86.8% on ALLEX vs. ~40% for baselines). No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described content. Claims rest on direct experimental outcomes rather than any reduction to self-citations, ansatzes, or renamed inputs. The central attribution to architecture and system choices is presented as an empirical finding, not a mathematical necessity, making the work self-contained against external benchmarks with no load-bearing circular steps.
Reference graph
Works this paper leans on
- [1] Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini Robotics: Bringing AI into the Physical World. arXiv preprint arXiv:2503.20020, 2025.
- [2] Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al. π0.7: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities. arXiv preprint arXiv:2604.15483, 2026.
- [3] Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World Simulation with Video Foundation Models for Physical AI. arXiv preprint arXiv:2511.00062, 2025.
- [4] Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π*0.6: A VLA That Learns From Experience. arXiv preprint arXiv:2511.14759, 2025.
- [5] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
- [6] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv preprint arXiv:2506.09985, 2025.
- [7] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... arXiv, 2025.
- [8] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. In Advances in Neural Information Processing Systems, 2022.
- [9] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A Versatile 3B VLM for Transfer. arXiv preprint arXiv:2407.07726, 2024.
- [10] Raunaq Bhirangi, Venkatesh Pattabiraman, Enes Erciyes, Yifeng Cao, Tess Hellebrekers, and Lerrel Pinto. AnySkin: Plug-and-Play Skin Sensing for Robotic Touch. In IEEE International Conference on Robotics and Automation, 2025.
- [11] Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, and Harold Soh. VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback. arXiv preprint arXiv:2507.17294, 2025.
- [12] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv preprint arXiv:2503.14734, 2025.
- [13] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054, 2025.
- [14] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. In Robotics: Science and Systems, 2025.
- [15] Kevin Black, Manuel Y. Galliker, and Sergey Levine. Real-Time Execution of Action Chunking Flow Policies. In Advances in Neural Information Processing Systems, 2025.
- [16] Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.
- [17] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. In Robotics: Science and Systems, 2023.
- [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 2020.
- [19] Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025.
- [20] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to Act Anywhere with Task-Centric Latent Actions. In Robotics: Science and Systems, 2025.
- [21] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv preprint arXiv:2410.06158, 2024.
- [22] Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models. arXiv preprint arXiv:2504.15271, 2025.
- [23] Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling. In International Conference on Learning Representations, 2023.
- [24] William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training Strategies for Efficient Embodied Reasoning. In Conference on Robot Learning, 2025.
- [25] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models. arXiv preprint arXiv:2507.23682, 2025.
- [26] Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, and Hengdi Zhang. OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing. arXiv preprint arXiv:2508.08706, 2025.
- [27] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Robotics: Science and Systems, 2023.
- [28] StarVLA Community. StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing. arXiv preprint arXiv:2604.05014, 2026.
- [29] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An Embodied Multimodal Language Model. In International Conference on Machine Learning, 2023.
- [30] Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better. In Advances in Neural Information Processing Systems, 2025.
- [31] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In International Conference on Machine Learning, 2024.
- [32] Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. SAM2Act: Integrating Visual Foundation Model with a Memory Architecture for Robotic Manipulation. arXiv preprint arXiv:2501.18564, 2025.
- [33] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models. arXiv preprint arXiv:2510.13626, 2025.
- [34] Yao Mu, Fourier ActionNet Team. Fourier ActionNet Dataset. https://action-net.org, 2025.
- [35] NVIDIA GEAR. GR00T N1.5: An Improved Open Foundation Model for Generalist Humanoid Robots. https://research.nvidia.com/labs/gear/gr00t-n1_5/, June 2025.
- [36] NVIDIA GEAR. GR00T N1.6: An Improved Open Foundation Model for Generalist Humanoid Robots. https://research.nvidia.com/labs/gear/gr00t-n1_6/, December 2025.
- [37] Gemma Team. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786, 2025.
- [38] Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An Open-Source Generalist Robot Policy. In Robotics: Science and Systems, 2024.
- [39] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. GigaWorld-0: World Models as Data Engine to Empower Embodied AI. arXiv preprint arXiv:2511.19861, 2025.
- [40] Google. Gemini API | Google AI for Developers. https://ai.google.dev/api, 2026. Accessed 03-05-2026.
- [41] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022.
- [42] Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, and Pieter Abbeel. Otter: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction. arXiv preprint arXiv:2503.03734, 2025.
- [43] Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, and Yang Gao. Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization. arXiv preprint arXiv:2507.09160, 2025.
- [44] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. In Conference on Robot Learning, 2023.
- [45] Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, and Jinwoo Shin. ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context. arXiv preprint arXiv:2510.04246, 2025.
- [46] Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. In Conference on Robot Learning, 2025.
- [47] Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, and Jinwoo Shin. SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning. arXiv preprint arXiv:2603.22057, 2026.
- [48] Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea Open-World Dataset and G0 Dual-System VLA Model. arXiv preprint arXiv:2509.00576, 2025.
- [49] Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning. In IEEE International Conference on Robotics and Automation, 2025.
- [50] Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel, Shahab Bakhtiari, Blake Richards, and Mike Rabbat. Interpreting Physics in Video World Models. arXiv preprint arXiv:2602.07050, 2026.
- [51] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 1998.
- [52] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. In Robotics: Science and Systems, 2024.
- [53] Changyeon Kim, Haeone Lee, Younggyo Seo, Kimin Lee, and Yuke Zhu. DEAS: Detached Value Learning with Action Sequence for Scalable Offline RL. In International Conference on Learning Representations, 2026.
- [54] Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, and Younggyo Seo. Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics. In Advances in Neural Information Processing Systems, 2025.
- [55] Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, and Younggyo Seo. RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models. arXiv preprint arXiv:2603.21341, 2026.
- [56] Manjin Kim, Heeseung Kwon, Karteek Alahari, and Minsu Cho. Exploring High-Order Self-Similarity for Video Understanding. arXiv preprint arXiv:2604.20760, 2026.
- [57] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An Open-Source Vision-Language-Action Model. In Conference on Robot Learning, 2024.
- [58] Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. arXiv preprint arXiv:2601.16163, 2026.
- [59] Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won, and Jinwoo Shin. RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning. arXiv preprint arXiv:2602.18742, 2026.
- [60] Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. Contrastive Representation Regularization for Vision-Language-Action Models. In International Conference on Machine Learning, 2026.
- [61] Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. Hamlet: Switch Your Vision-Language-Action Model into a History-Aware Policy. In International Conference on Learning Representations, 2026.
- [62] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. In International Conference on Learning Representations, 2022.
- [63] Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition. In IEEE International Conference on Computer Vision, 2021.
- [64] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
- [65] Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement Learning with Augmented Data. In Advances in Neural Information Processing Systems, 2020.
- [66] Jimin Lee, Huiwon Jang, Myungkyu Koo, Jungwoo Park, and Jinwoo Shin. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models. arXiv preprint arXiv:2604.23272, 2026.
- [67] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. In Conference on Robot Learning, 2023.
- [68] Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, et al. CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation. arXiv e-prints, 2025.
- [69] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. arXiv preprint arXiv:2411.19650, 2024.
- [70] Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement Learning with Action Chunking. In Advances in Neural Information Processing Systems, 2025.
- [71] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating Real-World Robot Manipulation Policies in Simulation. In Conference on Robot Learning, 2024.
- [72] Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, et al. RoboMeter: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons. arXiv preprint arXiv:2603.02115, 2026.
- [73] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as Policies: Language Model Programs for Embodied Control. In IEEE International Conference on Robotics and Automation, 2023.
- [74] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In International Conference on Learning Representations, 2023.
- [75] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437, 2024.
- [76] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. In Advances in Neural Information Processing Systems, 2023.
- [77] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
- [78] Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. MimicGen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations. In Conference on Robot Learning, 2023.
- [79] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance. In Conference on Robot Learning, 2024.
- [80] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. In Robotics: Science and Systems, 2024.