Recognition: 2 theorem links
· Lean TheoremFAST: Efficient Action Tokenization for Vision-Language-Action Models
Pith reviewed 2026-05-11 08:46 UTC · model grok-4.3
The pith
Frequency-space tokenization allows autoregressive VLAs to succeed on dexterous high-frequency robot tasks where standard binning fails.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transforming sequences of robot actions into the frequency domain with the discrete cosine transform, then quantizing the resulting coefficients, produces tokens that preserve the information required for stable closed-loop control. This discretization supports autoregressive sequence modeling of dexterous, high-frequency behaviors that standard timestep-wise binning cannot represent without loss of precision or stability.
What carries the argument
Frequency-space Action Sequence Tokenization (FAST), a compression scheme that converts continuous robot action trajectories into discrete tokens by first applying the discrete cosine transform across the sequence and then quantizing the frequency coefficients.
If this is right
- Autoregressive VLAs become viable for dexterous manipulation and other high-speed control problems that previously required diffusion-based methods.
- FAST+ provides a single pretrained tokenizer usable across robots with different action dimensions and sampling rates.
- Training runs on ten thousand hours of robot data become practical with up to fivefold reduction in compute time while matching diffusion VLA performance.
- The same frequency-domain tokenization can be applied to any continuous control dataset without task-specific retuning.
Where Pith is reading between the lines
- If the frequency compression generalizes, it could reduce the need for separate policy architectures in robotics by letting efficient autoregressive models handle the precision previously reserved for slower generative approaches.
- Similar frequency-domain discretization might improve tokenization efficiency in other continuous domains such as audio synthesis or video prediction where temporal structure matters.
- Applying FAST to even longer-horizon or multi-robot datasets would test whether the compression scales without losing fine-grained coordination signals.
Load-bearing premise
The discrete cosine transform compression of action sequences retains every detail needed for precise, stable closed-loop control at high frequencies without introducing artifacts that would destabilize the policy.
What would settle it
On a high-frequency dexterous task where standard binning produces no usable policy, a FAST-trained autoregressive VLA would also fail to achieve stable, accurate control over repeated rollouts.
read the original abstract
Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that per-dimension per-timestep binning for action tokenization in autoregressive vision-language-action (VLA) policies fails on dexterous high-frequency robot tasks, and introduces Frequency-space Action Sequence Tokenization (FAST) based on the discrete cosine transform (DCT) as a compression scheme that enables successful training on such tasks. It further releases FAST+, a universal tokenizer pretrained on 1M real-robot trajectories, and reports that combining FAST with the pi0 VLA allows scaling to 10k hours of data while matching diffusion VLA performance at up to 5x reduced training time.
Significance. If the empirical results and reconstruction guarantees hold, the work would be significant for robotics: it offers a concrete path to make autoregressive VLAs viable for high-frequency dexterous control, where current discretization approaches reportedly collapse, and provides a reusable tokenizer plus training-time gains over diffusion baselines.
major comments (3)
- [Abstract] Abstract: the claim that 'standard discretization methods fail completely' on dexterous high-frequency tasks is presented without any quantitative metrics, baseline comparisons, success rates, or error analysis; the central empirical assertion therefore cannot be evaluated from the given text.
- [Method] Method (FAST description): the DCT-based compression is introduced as an empirical engineering choice without an analytic bound on reconstruction error or an empirical metric (e.g., per-timestep L2 or frequency-domain power loss) showing that high-frequency transients required for stable closed-loop dexterous control are preserved; this directly addresses the stress-test concern that lossy frequency-ordered compression may attenuate contact forces or rapid gripper motions below controller stability thresholds.
- [Experiments] Experiments (scaling and comparison claims): the statements that FAST+ with pi0 matches diffusion VLAs on 10k-hour data and yields up to 5x training speedup lack reported tables, ablations, or statistical details in the provided abstract; without these the scaling benefit cannot be verified as general rather than task-specific.
minor comments (1)
- Notation for the DCT tokenization pipeline (forward transform, quantization, inverse) should be formalized with explicit equations to allow readers to reproduce the exact compression ratio and reconstruction procedure.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our work. We address each of the major comments below and propose revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'standard discretization methods fail completely' on dexterous high-frequency tasks is presented without any quantitative metrics, baseline comparisons, success rates, or error analysis; the central empirical assertion therefore cannot be evaluated from the given text.
Authors: We agree that the abstract lacks specific quantitative support for this claim due to space limitations. The full manuscript provides these details in Section 4, including success rates of 0% for standard discretization versus over 80% for FAST on high-frequency dexterous tasks, along with baseline comparisons. We will revise the abstract to include a concise quantitative statement, such as 'where standard discretization methods fail completely (0% success) on these tasks.' revision: yes
-
Referee: [Method] Method (FAST description): the DCT-based compression is introduced as an empirical engineering choice without an analytic bound on reconstruction error or an empirical metric (e.g., per-timestep L2 or frequency-domain power loss) showing that high-frequency transients required for stable closed-loop dexterous control are preserved; this directly addresses the stress-test concern that lossy frequency-ordered compression may attenuate contact forces or rapid gripper motions below controller stability thresholds.
Authors: We agree that the manuscript would benefit from explicit metrics on reconstruction quality. We will add empirical analysis of per-timestep L2 reconstruction error and frequency-domain power loss in the revised Method section to show that high-frequency transients are preserved. This will address concerns about potential attenuation of contact forces or rapid motions. An analytic bound on error is challenging due to the data-dependent nature but we will discuss DCT truncation properties. revision: yes
-
Referee: [Experiments] Experiments (scaling and comparison claims): the statements that FAST+ with pi0 matches diffusion VLAs on 10k-hour data and yields up to 5x training speedup lack reported tables, ablations, or statistical details in the provided abstract; without these the scaling benefit cannot be verified as general rather than task-specific.
Authors: The abstract is a summary; the full manuscript includes tables, ablations, and statistical details (multiple runs with error bars) in the Experiments section showing performance matching and up to 5x speedup on the 10k-hour dataset. We will revise the abstract to include a brief reference to these results, e.g., 'matching diffusion VLA performance with up to 5x faster training on 10k hours of data.' revision: yes
Circularity Check
No circularity: FAST is an empirical engineering proposal using standard DCT
full rationale
The paper presents FAST as a compression-based tokenization scheme relying on the discrete cosine transform applied to action sequences, introduced to overcome failures of per-dimension binning on high-frequency dexterous tasks. This is framed as a practical design choice validated through training and evaluation on real robot data, including the release of FAST+ trained on 1M trajectories and scaling experiments with pi0. No derivation chain, equations, or first-principles results are shown that reduce to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The approach draws on the well-known properties of DCT without invoking author-specific uniqueness theorems or smuggling ansatzes via prior work. Claims of enabling autoregressive VLAs are supported by empirical performance rather than logical equivalence to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The discrete cosine transform can compress high-frequency robot action sequences while retaining sufficient information for dexterous control.
Lean theorems connected to this paper
-
Foundation/EightTick.leaneight_tick_forces_D3 unclearWe propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely.
Forward citations
Cited by 60 Pith papers
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
-
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
-
CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
-
Failing Forward: Adaptive Failure-Informed Learning for Vision-Language-Action Models
AFIL improves VLA policy robustness by jointly training success and failure generators on online-generated failure trajectories and using adaptive guidance to avoid failure modes during action sampling.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
-
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
-
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
-
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
-
Bimanual Robot Manipulation via Multi-Agent In-Context Learning
BiCICLe frames bimanual robot control as a multi-agent leader-follower problem with Arms' Debate and an LLM judge, achieving up to 71.1% success on 13 TWIN benchmark tasks without fine-tuning.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
-
A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
A two-level hierarchical vector quantization tokenizer that clusters actions spatially and temporally achieves new state-of-the-art results in in-context imitation learning for robotics.
-
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
-
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.
-
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly
AssemLM uses a specialized point cloud encoder inside a multimodal LLM to reach state-of-the-art 6D pose prediction for assembly tasks, backed by a new 900K-sample benchmark called AssemBench.
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success
VLA-InfoEntropy accelerates Vision-Language-Action model inference by using visual entropy, attention entropy, and timestep cues to prune redundant tokens while preserving task-critical content.
-
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
-
Hierarchical Planning with Latent World Models
Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
-
The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Discrete action tokenization in VLA models creates an information bottleneck that prevents vision encoder scaling from improving performance, unlike continuous policies, as validated on the LIBERO benchmark.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
-
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
-
Failing Forward: Adaptive Failure-Informed Learning for Vision-Language-Action Models
AFIL trains dual action generators on success and failure rollouts from a pretrained VLA to steer diffusion policies away from failure modes during inference.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
Reference graph
Works this paper leans on
-
[1]
Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Dis- crete cosine transform. IEEE transactions on Computers, 100(1):90–93, 1974
work page 1974
-
[2]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, K...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Minivla: A better vla with a smaller footprint, 2024
Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://github.com/ Stanford-ILIAD/openvla-mini
work page 2024
-
[4]
Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, De- bidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hier- archies using language, 2024. URL https://arxiv.org/abs/ 2403.01823
-
[5]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024
work page internal anchor Pith review arXiv 2024
-
[6]
Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Ab- hinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manip- ulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024
work page 2024
-
[7]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi 0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Flo- rence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alex Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Y...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr- 2: A generative video-language-action model with web- scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024
work page internal anchor Pith review arXiv 2024
-
[12]
Beats: Audio pre-training with acoustic tokenizers.arXiv preprint arXiv:2212.09058, 2022
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei. Beats: Au- dio pre-training with acoustic tokenizers. arXiv preprint arXiv:2212.09058, 2022
-
[13]
Navila: Legged robot vision-language- action model for navigation,
An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision- Language-Action Model for Navigation. arXiv preprint arXiv:2412.04453, 2024
-
[14]
Open-television: Teleoperation with immersive active visual feedback,
Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024
-
[15]
Dif- fusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023
work page 2023
-
[16]
Universal manipulation interface: In- the-wild robot teaching without in-the-wild robots
Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In- the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems (RSS) , 2024
work page 2024
-
[18]
An algorithm for the machine calculation of complex fourier series
James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation , 19(90):297–301, 1965
work page 1965
-
[19]
Keypoint action tokens enable in-context imitation learning in robotics
Norman Di Palo and Edward Johns. Keypoint action tokens enable in-context imitation learning in robotics. In Proceedings of Robotics: Science and Systems (RSS) , 2024
work page 2024
-
[20]
Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation
Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Conference on Robot Learning , 2024
work page 2024
-
[21]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm- e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Tam- ing transformers for high-resolution image synthesis, 2020
Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Tam- ing transformers for high-resolution image synthesis, 2020
work page 2020
-
[23]
Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aur’elien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The waymo open motion data...
work page 2021
-
[24]
Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot
Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 653–660. IEEE, 2024
work page 2024
-
[25]
Moka: Open-world robotic manipulation through mark-based visual prompting
Kuan Fang, Fangchen Liu, Pieter Abbeel, and Sergey Levine. Moka: Open-world robotic manipulation through mark-based visual prompting. Robotics: Science and Systems (RSS), 2024
work page 2024
-
[26]
Humanplus: Humanoid shadowing and imitation from humans
Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. In Conference on Robot Learning (CoRL), 2024
work page 2024
-
[27]
A new algorithm for data compression
Philip Gage. A new algorithm for data compression. The C Users Journal , 12(2):23–38, 1994
work page 1994
-
[28]
Multilingual language processing from bytes, 2016
Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. Multilingual language processing from bytes, 2016. URL https://arxiv.org/abs/1512.00103
-
[29]
Yuan Gong, Yu-An Chung, and James Glass. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021, pages 571–575, 2021. doi: 10.21437/Interspeech. 2021-698
-
[30]
Bridging the human to robot dex- terity gap through object-oriented rewards, 2024
Irmak Guzey, Yinlong Dai, Georgy Savva, Raunaq Bhi- rangi, and Lerrel Pinto. Bridging the human to robot dex- terity gap through object-oriented rewards, 2024. URL https://arxiv.org/abs/2410.23289
-
[31]
UMI on legs: Making manipulation policies mo- bile with manipulation-centric whole-body controllers
Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. UMI on legs: Making manipulation policies mo- bile with manipulation-centric whole-body controllers. In Proceedings of the 2024 Conference on Robot Learning , 2024
work page 2024
- [32]
-
[33]
David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE , 40(9):1098–1101, 1952. doi: 10.1109/JRPROC.1952. 273898
-
[34]
Efficient long video tokenization via coordinated-based patch reconstruction
Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, and Younggyo Seo. Efficient long video tokenization via coordinated-based patch reconstruction. arXiv preprint arXiv:2411.14762, 2024
-
[35]
Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. arXiv preprint arXiv:2410.24185, 2024
-
[36]
Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Sta- chowicz, Pieter Abbeel, and Sergey Levine. Beyond sight: Finetuning generalist robot policies with hetero- geneous sensors via language grounding. arXiv preprint arXiv:2501.04693, 2025
-
[37]
Pris- matic vlms: Investigating the design space of visually- conditioned language models
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually- conditioned language models. In International Confer- ence on Machine Learning (ICML) , 2024
work page 2024
-
[38]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abra- ham Le...
work page 2024
-
[39]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Action chunking as conditional policy compression
Lucy Lai, Ann ZX Huang, and Samuel J Gershman. Action chunking as conditional policy compression
-
[41]
Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024
Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024
-
[42]
Learning visuotactile skills with two multifingered hands.arXiv:2404.16823, 2024
Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, and Jitendra Malik. Learning visuotactile skills with two multifingered hands. arXiv:2404.16823, 2024
-
[43]
Libero: Benchmarking knowledge transfer for lifelong robot learning
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems , 36, 2024
work page 2024
-
[44]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS) , 2023
work page 2023
-
[45]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[46]
Serl: A software suite for sample-efficient robotic reinforcement learning, 2024
Jianlan Luo, Zheyuan Hu, Charles Xu, You Liang Tan, Jacob Berg, Archit Sharma, Stefan Schaal, Chelsea Finn, Abhishek Gupta, and Sergey Levine. Serl: A software suite for sample-efficient robotic reinforcement learning, 2024
work page 2024
-
[47]
Roboturk: A crowdsourcing platform for robotic skill learning through imitation
Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–
-
[48]
Finite scalar quantization: Vq- vae made simple, 2023
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq- vae made simple, 2023. URL https://arxiv.org/abs/2309. 15505
work page 2023
-
[49]
Quest: Self-supervised skill abstractions for learning continuous control, 2024.URL https://arxiv
Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control, 2024. URL https://arxiv.org/abs/2407.15840
-
[50]
Pivot: Iterative visual prompting elicits actionable knowledge for vlms
Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. In Forty-first International Conference on Machine Learning , 2024
work page 2024
-
[51]
Octo: An open-source generalist robot policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Nether...
work page 2024
-
[52]
Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Her- zog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bern- hard Sch ¨olkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang H...
work page internal anchor Pith review arXiv 2023
-
[53]
Byte latent transformer: Patches scale better than tokens
Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†, and Srinivasan Iyer. Byte latent transformer: Patches scale better than tokens. 2024. URL https://github.com/facebookresearch/blt
work page 2024
-
[54]
In-Hand Object Rotation via Rapid Motor Adaptation, October 2022
Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-hand object rotation via rapid motor adaptation, 2022. URL https://arxiv.org/abs/2210.04887
-
[55]
Language models are unsupervised multitask learners
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019
work page 2019
-
[56]
Scott Reed, Konrad Zolna, Emilio Parisotto, Ser- gio G ´omez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Gim ´enez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research , 2022
work page 2022
-
[57]
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 , 2015
work page internal anchor Pith review arXiv 2015
-
[58]
Hand-object interaction pretraining from videos, 2024
Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, and Ji- tendra Malik. Hand-object interaction pretraining from videos, 2024. URL https://arxiv.org/abs/2409.08273
-
[59]
Neural discrete representation learning,
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning,
-
[60]
URL https://arxiv.org/abs/1711.00937
-
[61]
BridgeData v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData v2: A dataset for robot learning at scale. In Conference on Robot Learning , pages 1723–
-
[62]
The jpeg still picture compression standard
Gregory K Wallace. The jpeg still picture compression standard. IEEE transactions on consumer electronics, 38 (1):xviii–xxxiv, 1992
work page 1992
-
[63]
Scaling proprioceptive-visual learning with hetero- geneous pre-trained transformers
Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with hetero- geneous pre-trained transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[64]
Tinyvla: To- wards fast, data-efficient vision-language-action models for robotic manipulation, 2024
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024
-
[65]
Elastictok: Adaptive tokenization for image and video
Wilson Yan, Matei Zaharia, V olodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. Elastictok: Adaptive tokenization for image and video. arXiv preprint arXiv:2410.08368, 2024
-
[66]
Latent Action Pretraining from Videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. La- tent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024
work page Pith review arXiv 2024
-
[67]
Hauptmann, Ming- Hsuan Yang, Yuan Hao, Irfan Essa, and Lu Jiang
Lijun Yu, Yong Cheng, Kihyuk Sohn, Jos ´e Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, and Lu Jiang. Magvit: Masked generative video transformer, 2023. URL https://arxiv.org/abs/2212.05199
-
[68]
Robotic control via embodied chain-of-thought reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Conference on Robot Learning , 2024
work page 2024
-
[69]
Soundstream: An end-to-end neural audio codec, 2021
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec, 2021. URL https://arxiv. org/abs/2107.03312
-
[70]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[71]
Aloha unleashed: A simple recipe for robot dexterity.arXiv preprint arXiv:2410.13126,
Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126 , 2024
-
[73]
3D-VLA: A 3D Vision-Language-Action Generative World Model
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d- vla: 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631 , 2024
work page internal anchor Pith review arXiv 2024
-
[74]
arXiv preprint arXiv:2412.10345 (2024)
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting en- hances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345 , 2024
-
[75]
Autonomous im- provement of instruction following skills via foundation models
Zhiyuan Zhou, Pranav Atreya, Abraham Lee, Homer Walke, Oier Mees, and Sergey Levine. Autonomous im- provement of instruction following skills via foundation models. In Conference on Robot Learning , 2024
work page 2024
-
[76]
Compression of individ- ual sequences via variable-rate coding
Jacob Ziv and Abraham Lempel. Compression of individ- ual sequences via variable-rate coding. IEEE transactions on Information Theory , 24(5):530–536, 1978. APPENDIX A. Data Mixture for Training Universal Tokenizer The training mixture for the universal tokenizer mainly consists of the π0 [7] datasets described in Section VI-F. For many datasets, we inclu...
work page 1978
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.