pith. machine review for the scientific record.

arxiv: 2605.11678 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords memory swapping · vision-language-action models · GPU offloading · autonomous driving · inference optimization · layer residency · performance prediction

The pith

CPU-GPU swapping lets 21.5 GB vision-language-action models run on 16 GB GPUs with 3.55x speedup over standard offloading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end vision-language-action models for autonomous driving combine perception, reasoning, and control but typically require 20-60 GB of GPU memory. The paper develops a framework that moves model layers between CPU and GPU on demand at layer granularity without altering the model. Sequential layering shrinks the memory footprint, pipelined layering overlaps transfers with computation, and a resident-layer policy keeps high-benefit modules in VRAM. A prediction model selects the best configuration after one profiling pass with low error. Applied to a 21.52 GB model on a 16 GB GPU, the system delivers full BF16 precision at substantially higher speed than conventional offloading.
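The overlap idea behind pipelined demand layering can be illustrated with a simple cost model. This is a hypothetical sketch, not the paper's algorithm: it assumes the next layer's weight transfer runs concurrently with the current layer's compute, so a transfer only adds latency when it outlasts the compute it hides behind.

```python
def pipelined_latency(compute_ms, transfer_ms):
    """Toy cost model of pipelined demand layering.

    compute_ms[i]  -- execution time of layer i on the GPU
    transfer_ms[i] -- time to copy layer i's weights CPU -> GPU

    Layer i+1's transfer is prefetched while layer i computes, so each
    step costs max(compute, next transfer). Only the first transfer is
    fully exposed, since nothing runs before it.
    """
    total = transfer_ms[0]  # first layer's copy cannot be overlapped
    for i in range(len(compute_ms)):
        step = compute_ms[i]
        if i + 1 < len(compute_ms):
            step = max(step, transfer_ms[i + 1])  # prefetch hides under compute
        total += step
    return total


def sequential_latency(compute_ms, transfer_ms):
    """Naive demand layering: every transfer is fully exposed."""
    return sum(compute_ms) + sum(transfer_ms)
```

Under this model, compute-bound layers hide their transfers entirely, while transfer-bound layers leave residual overhead, which is exactly the gap the paper's resident-layer policy targets.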

Core claim

The framework proceeds through sequential demand layering to reduce VRAM usage to layer level, pipelined demand layering to hide transfer latency via overlap, and a GPU-resident layer decision policy based on per-module residency benefit analysis to remove residual overhead. A performance prediction model determines the optimal number and placement of resident layers from a single profiling run, achieving less than 1.3 percent prediction error across configurations. On NVIDIA's Alpamayo-R1-10B model requiring 21.52 GB, the approach runs on an RTX 5070 Ti with 16 GB VRAM at up to 3.55 times the speed of Accelerate offloading while preserving full BF16 precision.
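A predictor of this kind could, in principle, score any candidate set of resident layers from one profiling pass. The sketch below is an assumed cost model, not the paper's formulation: resident layers cost only compute, and a swapped layer exposes whatever part of its transfer the preceding layer's compute cannot hide.

```python
def predict_latency(compute_ms, transfer_ms, resident):
    """Hypothetical single-profile latency predictor.

    compute_ms[i]  -- profiled execution time of layer i
    transfer_ms[i] -- profiled CPU->GPU copy time of layer i
    resident       -- set of layer indices pinned in VRAM

    A resident layer contributes only its compute time. A swapped
    layer additionally contributes the portion of its transfer that
    exceeds the previous layer's compute (i.e. the unhidden part).
    """
    total = 0.0
    for i, c in enumerate(compute_ms):
        total += c
        if i not in resident:
            hidden = compute_ms[i - 1] if i > 0 else 0.0
            total += max(0.0, transfer_ms[i] - hidden)
    return total
```

With such a model, the optimal number and placement of resident layers reduces to minimizing `predict_latency` over sets that fit the VRAM budget.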

What carries the argument

The GPU-resident layer decision policy, which uses per-module residency benefit analysis to select layers that remain in VRAM and thereby eliminate transfer overhead that pipelining cannot hide.
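One plausible reading of "per-module residency benefit analysis" is a knapsack-style selection: rank layers by the exposed transfer time they would save per gigabyte of VRAM they occupy, and pin the best until the budget is spent. The greedy sketch below is an editorial illustration under that assumption, not the paper's stated policy.

```python
def choose_resident(sizes_gb, exposed_ms, budget_gb):
    """Greedy sketch of a GPU-resident layer policy.

    sizes_gb[i]   -- VRAM footprint of layer i in GB
    exposed_ms[i] -- transfer time of layer i that pipelining cannot hide
    budget_gb     -- spare VRAM available for pinned layers

    Returns (sorted resident indices, VRAM used). Layers are taken in
    order of benefit density: exposed time saved per GB pinned.
    """
    order = sorted(range(len(sizes_gb)),
                   key=lambda i: exposed_ms[i] / sizes_gb[i],
                   reverse=True)
    resident, used = [], 0.0
    for i in order:
        if used + sizes_gb[i] <= budget_gb:
            resident.append(i)
            used += sizes_gb[i]
    return sorted(resident), used
```

Greedy selection by density is not guaranteed optimal for a knapsack, which is presumably why the paper pairs the policy with a prediction model that scores whole configurations.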

If this is right

  • Large VLA models become runnable on commodity GPUs with 12-16 GB VRAM without quantization, pruning, or other model changes.
  • Inference speed improves by up to 3.55x compared with basic offloading while full BF16 precision is retained.
  • Optimal layer residency settings can be identified from a single profiling run with under 1.3 percent prediction error.
  • The same layering stages should apply to any VLA model in the paper's 20-60 GB range whose memory demand exceeds the 12-16 GB of commodity VRAM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The swapping approach may extend to other large multimodal models that face similar VRAM limits outside driving applications.
  • Further gains could appear if the resident-layer policy is combined with existing quantization or sparsity methods.
  • Real-time performance on vehicle hardware would depend on whether the single-run predictor generalizes across varying road and sensor conditions.

Load-bearing premise

The GPU-resident layer policy and single-run performance prediction model will produce the claimed speedups without introducing inference errors or latency spikes on real driving workloads.

What would settle it

Execute the optimized configuration on a standard autonomous driving benchmark and measure whether end-to-end latency reaches the predicted 3.55x improvement over offloading while accuracy and prediction error remain within the stated bounds.
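Such a settling experiment amounts to a latency-ratio measurement. The harness below is a generic sketch (the callables, warmup counts, and iteration counts are placeholders, not the paper's protocol): run both configurations, take median wall-clock latency, and report the speedup.

```python
import time
import statistics


def measure_speedup(baseline_fn, optimized_fn, warmup=3, iters=20):
    """Median end-to-end latency ratio: baseline / optimized.

    baseline_fn  -- e.g. one inference under standard offloading
    optimized_fn -- e.g. one inference under the swapping pipeline

    Warmup iterations absorb one-time costs (allocation, caching)
    before timed samples are collected.
    """
    def median_ms(fn):
        for _ in range(warmup):
            fn()
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - t0) * 1e3)
        return statistics.median(samples)

    return median_ms(baseline_fn) / median_ms(optimized_fn)
```

Confirming the paper's claim would mean this ratio approaching 3.55 on a driving benchmark while task accuracy and the predictor's error stay within the stated bounds.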

read the original abstract

End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory, far exceeding the 12-16GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer-compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration (both the number and placement of resident layers) from a single profiling run with less than 1.3% prediction error across all configurations. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a framework for OOM-free inference of large Vision-Language-Action models on VRAM-limited GPUs using CPU-GPU memory swapping. It details Sequential Demand Layering to reduce granularity to layers, Pipelined Demand Layering for transfer-compute overlap, a GPU-Resident Layer Decision Policy, and a single-run performance prediction model with <1.3% error. Applied to Alpamayo-R1-10B on an RTX 5070Ti, it reports up to 3.55x speedup over Accelerate offloading at full BF16 precision.

Significance. Should the experimental results and methods validate the claims, this contribution is significant as it provides a practical solution for running 20-60GB VLA models on 12-16GB GPUs without model changes, which is crucial for autonomous driving applications where end-to-end models are increasingly used but hardware constraints limit deployment.

major comments (1)
  1. [Abstract] No specific algorithms, equations, or data are provided to support the 3.55x speedup and 1.3% error claims. The GPU-Resident Layer Decision Policy and performance prediction model are described at a high level; detailed analysis is needed to confirm they do not introduce errors or fail to hide transfer overhead on driving workloads as assumed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for recognizing the practical significance of our framework for deploying large VLA models on commodity GPUs in autonomous driving. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] No specific algorithms, equations, or data are provided to support the 3.55x speedup and 1.3% error claims. The GPU-Resident Layer Decision Policy and performance prediction model are described at a high level; detailed analysis is needed to confirm they do not introduce errors or fail to hide transfer overhead on driving workloads as assumed.

    Authors: The abstract is intentionally high-level and concise, as is standard, and therefore omits specific algorithms, equations, and raw data. The full manuscript provides these details: it specifies the algorithms for Sequential Demand Layering (reducing granularity to layers) and Pipelined Demand Layering (for transfer-compute overlap), the GPU-Resident Layer Decision Policy (including the per-module residency benefit analysis used to select resident layers), and the performance prediction model (its formulation, single-run profiling procedure, and validation with <1.3% error across configurations). The experimental evaluation on Alpamayo-R1-10B (21.52 GB) with the RTX 5070 Ti (16 GB) reports the measured 3.55x speedup over Accelerate offloading at full BF16 precision and confirms that the policy and model hide residual transfer overhead on the evaluated driving workloads without introducing errors. revision: no

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

Only the abstract is available for analysis. It describes three stages of system-level optimization (Sequential Demand Layering, Pipelined Demand Layering, GPU-Resident Layer Decision Policy) plus a profiling-based performance prediction model that achieves <1.3% error, but provides no equations, derivations, fitted parameters, or self-citations. No load-bearing step reduces by construction to its inputs, and the claimed 3.55x speedup on Alpamayo-R1-10B is presented as an empirical outcome of the optimizations rather than a self-referential prediction. This is the expected outcome when concrete methods and data are absent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the performance prediction model is stated to achieve <1.3% error but its internal parameters are unspecified.

pith-pipeline@v0.9.0 · 5508 in / 1035 out tokens · 39662 ms · 2026-05-13T01:04:39.723331+00:00 · methodology

discussion (0)
