VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

Fangxiang Feng; Fengfa Li; Guoyang Xia; Hongjin Ji; Kun Zhan; Lei Ren; Yan Xie

arXiv: 2607.01586 · v1 · pith:VL3QBDP6new · submitted 2026-07-02 · cs.CV · cs.AI · cs.RO

Guoyang Xia, Fengfa Li, Hongjin Ji, Lei Ren, Fangxiang Feng, Kun Zhan, Yan Xie

Pith reviewed 2026-07-03 17:06 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.RO

keywords vision-language-action models · flow matching · co-training · future latent alignment · robotic manipulation · heterogeneous data · transfer performance · pre-training paradigms


The pith

Combining language-supervised co-training and future latent alignment produces the most stable transfer performance for vision-language-action models on heterogeneous robot data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a single flow-matching framework called VLAFlow to compare four training objectives while holding architecture, backbone, action space, and a 5,000-hour mixed robot dataset fixed. It shows that pure action modeling struggles with data variety, language co-training keeps vision-language skills intact, future latent alignment strengthens prediction of state changes and outcomes, and the two signals together yield the steadiest results on LIBERO, LIBERO-Plus, and SimplerEnv. A sympathetic reader would care because most robot data in practice comes from many sources and robots, so any method that makes pre-training more robust could reduce the need for heavy per-task retraining. The work frames language and future latent representations as complementary intermediate constraints that smooth heterogeneous action supervision.

Core claim

VLAFlow fixes the pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space while training four variants on the OXEMix corpus: action-only modeling, language-supervised co-training, future latent alignment, and their combination. Action-only pre-training is sensitive to data heterogeneity; language supervision preserves generalization; future latent alignment improves state-transition and action-outcome modeling; and the combined model achieves the most stable overall transfer performance across the three benchmarks. The results support a meta-action space view in which language and future latent representations supply complementary intermediate constraints.
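One way to read the four variants is as a single objective with two optional auxiliary terms. The form below is only a sketch under that assumption: the abstract does not state the loss, and the weighted-sum structure, the weights \lambda_{\text{lang}} and \lambda_{\text{lat}}, and the term names are hypothetical.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{action}}
  + \lambda_{\text{lang}}\,\mathcal{L}_{\text{lang}}
  + \lambda_{\text{lat}}\,\mathcal{L}_{\text{latent}},
\qquad \lambda_{\text{lang}} \ge 0,\quad \lambda_{\text{lat}} \ge 0
```

Under this reading, action-only modeling would set both weights to zero, language-supervised co-training would enable only the language term, future latent alignment would enable only the latent term, and the combined variant would enable both.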

What carries the argument

VLAFlow, a unified flow-matching framework that isolates the effects of four training objectives by using identical architecture, backbone, action space, and the OXEMix heterogeneous robot corpus.
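For readers unfamiliar with flow matching, the action term can be sketched roughly as follows. This is a minimal illustration that assumes a linear noise-to-action path, which is one common convention; the paper's exact time convention, conditioning, and action chunking are not given in the abstract, and every name in the block is hypothetical.

```python
import random

ACTION_DIM = 14  # action dimensionality stated in the abstract


def flow_matching_pair(action, noise, t):
    """Return (x_t, target_velocity) for a linear noise-to-action path.

    Assumed convention: x_t = (1 - t) * noise + t * action, so the
    target velocity d x_t / d t equals (action - noise).
    """
    x_t = [(1.0 - t) * n + t * a for a, n in zip(action, noise)]
    target = [a - n for a, n in zip(action, noise)]
    return x_t, target


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    diffs = [(p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
action = [rng.uniform(-1.0, 1.0) for _ in range(ACTION_DIM)]  # placeholder action
noise = [rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
t = rng.random()

x_t, target = flow_matching_pair(action, noise, t)
# A perfect velocity predictor gives zero loss; an all-zero predictor does not.
perfect_loss = flow_matching_loss(target, target)
zero_loss = flow_matching_loss([0.0] * ACTION_DIM, target)
```

In a real model the predicted velocity would come from the action expert, conditioned on the VLM backbone's features; here it is replaced by fixed vectors so the sketch stays self-contained.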

If this is right

Language-supervised co-training helps preserve vision-language generalization.
Future latent alignment improves state-transition and action-outcome modeling.
The combination of both signals produces the most stable transfer performance across benchmarks.
Language and future latent representations act as complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same controlled-comparison approach could be used to test whether the stability gains persist when the action space or backbone changes.
Real-world robot fleets that draw data from multiple manufacturers might adopt the combined paradigm to lower the cost of adapting to new tasks.
The meta-action space perspective could be explored in other sequential prediction settings such as video generation or autonomous driving.
Scaling the OXEMix corpus size while keeping the same four-objective comparison would test whether the stability advantage grows with data volume.

Load-bearing premise

Performance differences across the four training paradigms can be attributed primarily to the objectives themselves rather than to unmeasured interactions with the specific composition of the OXEMix data or the fixed architecture.

What would settle it

Re-running the four paradigms on LIBERO, LIBERO-Plus, and SimplerEnv and finding that the combined model no longer shows the most stable transfer performance would falsify the central claim.
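Such a test needs a working definition of "most stable", which the abstract does not supply. The sketch below assumes one possible definition, the highest worst-case score across benchmarks with ties broken by smaller spread; the definition, the function names, and the numbers are all illustrative assumptions, not the paper's metric or results.

```python
import statistics


def stability_summary(scores_by_benchmark):
    """Summarize one variant's scores across benchmarks.

    Returns the mean, the worst-case score, and the population standard
    deviation across benchmarks (a smaller spread means steadier transfer).
    """
    values = list(scores_by_benchmark.values())
    return {
        "mean": statistics.mean(values),
        "worst": min(values),
        "spread": statistics.pstdev(values),
    }


def most_stable(results):
    """Pick the variant with the highest worst-case score (ties: smaller spread)."""
    return max(
        results,
        key=lambda name: (
            stability_summary(results[name])["worst"],
            -stability_summary(results[name])["spread"],
        ),
    )


# Placeholder numbers for illustration only; not results from the paper.
results = {
    "variant_a": {"LIBERO": 0.90, "LIBERO-Plus": 0.60, "SimplerEnv": 0.70},
    "variant_b": {"LIBERO": 0.85, "LIBERO-Plus": 0.75, "SimplerEnv": 0.78},
}
winner = most_stable(results)
```

Other definitions, such as average rank across benchmarks, could give a different ordering, so a replication would need to fix the criterion before comparing variants.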

Original abstract

Vision-language-action models (VLAs) have recently advanced robotic manipulation, yet the effects of different robot-data pre-training paradigms remain difficult to compare because existing models often differ in architecture, data, action space, and evaluation protocol. We present VLAFlow (Vision-Language-Action Flow), a unified flow-matching framework for controlled comparison of VLA training objectives. Using a heterogeneous robot corpus, OXEMix, containing approximately 5,000 hours of data from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, we evaluate four paradigms under the same pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show that action-only pre-training is sensitive to heterogeneous data. In contrast, language supervision helps preserve vision-language generalization, while future latent alignment improves state-transition and action-outcome modeling. By combining both signals, MindLWPI achieves the most stable overall transfer performance across benchmarks. These results suggest a meta-action space view: language and future latent representations provide complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VLAFlow sets up a controlled comparison of four VLA training paradigms on a new mixed dataset, finding the combined language-plus-latent version most stable, but data heterogeneity likely confounds clean attribution to the objectives.


VLAFlow gives a controlled way to test different training objectives for vision-language-action models on mixed robot data. The key finding is that combining language-supervised co-training with future latent alignment produces the most stable transfer on the tested benchmarks.

The paper does a solid job of building one framework and one dataset so that the four paradigms (action-only, language co-training, future latent alignment, and both) run on identical architecture and data. OXEMix aggregates approximately 5,000 hours from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, and the authors keep the VLM backbone, action expert, and 14-dimensional action space the same. That removes many of the usual confounding variables when comparing VLA methods.

The results suggest language helps with generalization and latent alignment helps with transitions, and together they smooth out the heterogeneous supervision. This is useful for anyone trying to pre-train on diverse robot corpora.

The soft spots are in the evidence. The abstract reports benchmark wins for the combined version but gives no error bars, no statistical tests, and no breakdown by data source. The confounding concern stands: the heterogeneous sources have different distributions, and the objectives could interact with those differences in ways the four-way comparison does not isolate. Without per-source results or data-mix ablations, it is difficult to attribute the stability purely to the training signals.
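The kind of reporting that would address the first gap can be sketched as follows: per-variant mean and standard deviation across seeds, plus a paired t statistic between two variants evaluated on the same seeds. The function names are hypothetical and the scores are placeholders, not numbers from the paper.

```python
import math
import statistics


def mean_and_std(scores):
    """Sample mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-seed score differences (a minus b).

    Returns (t, degrees_of_freedom). A p-value would then be read from
    the Student t distribution with that many degrees of freedom.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    if std_diff == 0.0:
        raise ValueError("all paired differences are identical")
    t = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder success rates for illustration only; not results from the paper.
variant_a = [0.80, 0.82, 0.78, 0.81]
variant_b = [0.75, 0.79, 0.74, 0.76]

mean_a, std_a = mean_and_std(variant_a)
t_stat, dof = paired_t_statistic(variant_a, variant_b)
```

Pairing by seed only makes sense if both variants share seeds and evaluation episodes; otherwise an unpaired test would be the more appropriate choice.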

This paper is for people working on VLA pre-training who want a shared testbed. It is worth sending to peer review because the controlled comparison setup is a practical advance, even though the current results need more verification on the data side.

Referee Report

2 major / 0 minor

Summary. The paper introduces VLAFlow, a unified flow-matching framework for controlled comparison of VLA training paradigms. Using the OXEMix corpus (~5k hours from DROID, OpenX-Embodiment, OpenX-Augmented, RoboCOIN), it evaluates four paradigms under identical pi0-style architecture, shared VLM backbone, action expert, and 14-dim action space: action-only (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show action-only pre-training is sensitive to heterogeneous data, language supervision preserves vision-language generalization, future latent alignment improves state-transition modeling, and MindLWPI yields the most stable transfer, supporting a meta-action space view with complementary intermediate constraints.

Significance. If the results hold under rigorous validation, the work provides a valuable standardized framework for isolating effects of VLA pre-training objectives on heterogeneous robot data, addressing a key limitation where prior models confound architecture, data, and objectives. The controlled setup and large corpus strengthen the ability to attribute benefits to language co-training and latent alignment, with potential to guide more transferable robotic policies.

major comments (2)

[Abstract] The reported benchmark results favoring MindLWPI provide no error bars, statistical tests, or details on data splits and exclusion rules, preventing verification of the claimed stability and performance differences across paradigms.
[Abstract] The attribution of performance differences primarily to the four training objectives (rather than unmeasured interactions with heterogeneous data sources) is not isolated, as no per-source breakdowns, data-mix ablations, or sampling controls are described despite distinct distributions across the DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN sources.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical reporting and experimental controls. We address each major comment below and will revise the manuscript accordingly.


Referee: [Abstract] The reported benchmark results favoring MindLWPI provide no error bars, statistical tests, or details on data splits and exclusion rules, preventing verification of the claimed stability and performance differences across paradigms.

Authors: We agree that error bars, statistical tests, and explicit details on data splits and exclusion rules are necessary for verification. The full experimental sections report results over multiple random seeds with standard deviations in the tables, but these were not referenced in the abstract and statistical significance tests were omitted. In the revision we will (i) add error bars and significance tests (e.g., paired t-tests) to all benchmark tables, (ii) document the exact train/validation splits, seed counts, and any episode exclusion criteria in the experimental setup, and (iii) update the abstract to note that all comparisons use multiple seeds.

revision: yes

Referee: [Abstract] The attribution of performance differences primarily to the four training objectives (rather than unmeasured interactions with heterogeneous data sources) is not isolated, as no per-source breakdowns, data-mix ablations, or sampling controls are described despite distinct distributions across the DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN sources.

Authors: The controlled experimental design (identical pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space across all four paradigms) already isolates the training objectives from architectural and action-space confounds. Nevertheless, we acknowledge that source-specific interactions could still contribute. In the revision we will add (i) per-source performance breakdowns on LIBERO and SimplerEnv, (ii) a controlled data-mix ablation that varies sampling ratios while keeping total hours fixed, and (iii) explicit sampling controls. These additions will strengthen the claim that the observed stability gains are attributable to the complementary language and future-latent constraints.

revision: yes
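A data-mix ablation of the kind described, varying sampling ratios under a fixed hour budget, could be set up roughly as follows. The source names and the approximate total come from the abstract; the mixes, the ratios, and the helper name are hypothetical, since the actual OXEMix proportions are not stated there.

```python
def allocate_hours(ratios, total_hours):
    """Split a fixed hour budget across sources according to sampling ratios."""
    if any(r < 0 for r in ratios.values()):
        raise ValueError("ratios must be non-negative")
    total_ratio = sum(ratios.values())
    if total_ratio <= 0:
        raise ValueError("at least one ratio must be positive")
    return {name: total_hours * r / total_ratio for name, r in ratios.items()}


TOTAL_HOURS = 5000.0  # approximate corpus size stated in the abstract

# Hypothetical mixes; the real OXEMix proportions are not given in the abstract.
mixes = {
    "uniform": {"DROID": 1, "OpenX-Embodiment": 1, "OpenX-Augmented": 1, "RoboCOIN": 1},
    "droid_heavy": {"DROID": 3, "OpenX-Embodiment": 1, "OpenX-Augmented": 1, "RoboCOIN": 1},
}

plans = {name: allocate_hours(ratios, TOTAL_HOURS) for name, ratios in mixes.items()}
```

Each plan sums to the same total, so any difference between runs trained on different plans would reflect the mix rather than the data volume.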

Circularity Check

0 steps flagged

No circularity: empirical comparison is self-contained


The paper conducts a controlled empirical evaluation of four VLA training paradigms (action-only, language co-training, future latent alignment, and their combination) on shared pi0-style architecture, VLM backbone, action expert, and 14-dim action space using the OXEMix corpus. Results are reported as observed transfer performance on external benchmarks (LIBERO, LIBERO-Plus, SimplerEnv). No equations, fitted parameters renamed as predictions, self-citation chains, or ansatzes are present that reduce any claimed result to its inputs by construction. The meta-action-space interpretation is presented as a post-hoc suggestion from the empirical outcomes rather than a deductive step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that flow matching is a suitable generative model for robot actions and that the shared architecture isolates the effect of training objectives. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Flow matching provides a valid generative model for 14-dimensional robot actions under a shared VLM backbone.
The framework is built on flow matching for all four paradigms.



[1] [1]

Self-supervised learning from im- ages with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from im- ages with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 15619–15629, 2023

2023

[2] [2]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Liang- hao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann Le- Cun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.

[6] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[7] Kevin Black, Noah Brown, Danny Driess, et al. π0.5: A vision-language-action model with open-world generalization, 2025.

[8] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

[9] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.

[10] Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, Steven Palma, Pepijn Kooijmans, Michel Aractingi, Mustafa Shukor, Dana Aubakirova, et al. LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch, 2024.

[11] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.

[12] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR, 2018.

[13] Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023.

[14] Joey Hejna, Chethan Bhateja, Yichen Jiang, Karl Pertsch, and Dorsa Sadigh. Re-Mix: Optimizing data mixtures for large scale imitation learning. arXiv preprint arXiv:2408.14037, 2024.

[15] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.

[16] Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. NORA: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854, 2025.

[17] Guanhua Ji, Harsha Polavaram, Lawrence Yunliang Chen, Sandeep Bajamahal, Zehan Ma, Simeon Adebola, Chenfeng Xu, and Ken Goldberg. OXE-AugE: A large-scale robot augmentation of OXE for scaling cross-embodiment policy learning. arXiv preprint arXiv:2512.13100, 2025.

[18] Ran Jiang et al. LDA-1B: Scaling latent dynamics action model via universal embodied data ingestion. arXiv preprint arXiv:2602.12215, 2026.

[19] JoyAI-RA Team. JoyAI-RA 0.1: A foundation model for robotic autonomy. arXiv preprint arXiv:2604.20100, 2026.

[20] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Proceedings of Robotics: Science and Systems (RSS), 2024.

[21] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[22] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.

[23] Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, et al. Cosmos Policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026.

[24] Haizhou Li et al. LAP: Language-action pre-training enables zero-shot cross-embodiment transfer. arXiv preprint arXiv:2602.10556, 2026.

[25] Sylvia Li et al. LIBERO-Plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025.

[26] Xuanlin Li et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.

[27] Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete Diffusion VLA: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025.

[28] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.

[29] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023.

[30] Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, et al. Being-H0.7: A latent world-action model from egocentric videos. arXiv preprint arXiv:2605.00078, 2026.

[31] Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, and Mingsheng Long. JEPA-VLA: Video predictive embedding is needed for VLA models. arXiv preprint arXiv:2602.11832, 2026.

[32] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models: Open X-Embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

[33] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[34] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.

[35] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.

[36] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

[37] Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025.

[38] Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, et al. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026.

[39] Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025.

[40] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.

[41] Wenhui Wang et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026.

[42] Youpeng Wen, Junfan Lin, Yi Zhu, Jianhua Han, Hang Xu, Shen Zhao, and Xiaodan Liang. VidMan: Exploiting implicit dynamics from video diffusion model for effective robot manipulation. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[43] Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, et al. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024.

[44] Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, et al. RoboCOIN: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025.

[45] Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, et al. A pragmatic VLA foundation model. arXiv preprint arXiv:2601.18692, 2026.

[46] Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, et al. ABot-M0: VLA foundation model for robotic manipulation with action manifold learning. arXiv preprint arXiv:2602.11236, 2026.

[47] Yifan Yang, Zhixiang Duan, Tianshi Xie, Fuyu Cao, Pinxi Shen, Peili Song, Chenyang Zhao, Piaopiao Jin, Guokang Sun, Shaoqing Xu, et al. FPC-VLA: A vision-language-action framework with a supervisor for failure prediction and correction. Expert Systems with Applications, page 131742, 2026.

[48] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

[49] Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang, Yuxin Chen, Pengguang Chen, Yilun Chen, Shu Liu, and Jiaya Jia. StarVLA-α: Reducing complexity in vision-language-action systems. arXiv preprint arXiv:2604.11757, 2026.

A Implementation Details

This appendix supplements the implementation details omitted from Section 3.2. The main text reta...