A 35B Hybrid-Attention Mixture-of-Experts Model on a 6GB 2011 GPU: Hand-Written 4-bit CUDA Inference for Fermi
Pith reviewed 2026-06-25 22:04 UTC · model grok-4.3
The pith
Hand-written 4-bit CUDA kernels enable a 35B MoE model to run on a 6GB 2011 Fermi GPU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
End-to-end inference of Qwen3.6-35B-A3B succeeds on the 6 GB Fermi device by running batched prefill on the GPU with streamed weights and decode on the CPU via a hand-written SSSE3 pmaddubsw integer-SIMD GEMV, all compiled under CUDA 8.0 for compute capability 2.0. On a 947-token prompt this yields a 37.5 s prefill latency (down 34 % from 57.2 s) and 8.6 tokens-per-second decode (up roughly 3× from 2.8), while the snapshot cache restores prefix reuse, cutting a repeated prefill from 78 s to 0.5 s. Several attempted accelerations, including GPU offload of the language-model head, produce no benefit.
What carries the argument
Hybrid execution strategy that pairs GPU prefill with layer-by-layer weight streaming from host RAM and CPU-based integer-SIMD decode, augmented by the position-indexed snapshot cache for recurrent gated-delta-net state.
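One way to picture the position-indexed snapshot cache is the sketch below. It is illustrative only: the class name `SnapshotCache`, its methods, and the opaque `state` objects are assumptions made for this example, not details taken from the paper. The idea it shows is that a recurrent state cannot be truncated to an arbitrary prefix the way a KV cache can, so whole-state snapshots are kept at selected token positions and a new prompt resumes from the deepest snapshot whose prefix still matches.

```python
from typing import Any, Dict, List, Optional, Tuple


class SnapshotCache:
    """Toy cache of recurrent-state snapshots keyed by token position.

    Hypothetical design for illustration; the paper's data layout is not known.
    """

    def __init__(self) -> None:
        self.tokens: List[int] = []          # token ids of the cached prompt
        self.snapshots: Dict[int, Any] = {}  # position -> state after that many tokens

    def reset(self, tokens: List[int]) -> None:
        """Start caching a new prompt and discard old snapshots."""
        self.tokens = list(tokens)
        self.snapshots = {}

    def store(self, position: int, state: Any) -> None:
        """Record the state reached after consuming tokens[:position]."""
        self.snapshots[position] = state

    def lookup(self, tokens: List[int]) -> Tuple[int, Optional[Any]]:
        """Return (position, state) of the deepest snapshot inside the shared prefix."""
        common = 0
        for cached, new in zip(self.tokens, tokens):
            if cached != new:
                break
            common += 1
        usable = [pos for pos in self.snapshots if pos <= common]
        if not usable:
            return 0, None  # nothing reusable: prefill must start from scratch
        best = max(usable)
        return best, self.snapshots[best]
```

With a snapshot at the end of a repeated prompt, `lookup` returns the final position and prefill has nothing left to recompute, which is the situation the 78 s to 0.5 s figure describes.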
If this is right
- Expert pinning, single-pass prefill, and NUMA interleaving cut prefill latency from 57.2 s to 37.5 s.
- The integer-SIMD kernel raises decode throughput from 2.8 to 8.6 tokens per second.
- The position-indexed snapshot cache restores prefix reuse, reducing a repeated prefill from 78 s to 0.5 s.
- Offloading the language-model head to the GPU, hyper-threading, and three kernel rewrites yield no improvement.
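The headline percentages follow directly from the before and after figures quoted above. The short check below recomputes them; the helper names are made up for this example.

```python
def percent_change(before: float, after: float) -> float:
    """Signed change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


def ratio(before: float, after: float) -> float:
    """Multiplicative factor going from `before` to `after`."""
    return after / before


prefill_change = percent_change(57.2, 37.5)  # about -34.4 % prefill latency
decode_gain = ratio(2.8, 8.6)                # about 3.07x decode throughput
prefix_gain = ratio(0.5, 78.0)               # 156x faster repeated prefill

print(f"prefill: {prefill_change:.1f} %")
print(f"decode:  {decode_gain:.2f}x")
print(f"prefix reuse: {prefix_gain:.0f}x")
```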
Where Pith is reading between the lines
- The same streaming-plus-CPU-offload pattern could be applied to other memory-constrained legacy GPUs to test whether the 34 % and 3× gains generalize beyond this specific Fermi card.
- The negative results isolate memory-bandwidth and integer-throughput ceilings that future hand-written kernels on similar hardware would still need to respect.
- Recurrent architectures with gated-delta-net states may become more practical on old devices once snapshot caches are routinely paired with hybrid execution.
Load-bearing premise
The hand-written 4-bit kernels and hybrid execution strategy correctly reproduce the original model's outputs without introducing numerical errors or functional bugs that would invalidate the reported latency and throughput numbers.
What would settle it
Execute the same 947-token prompt through the described kernels and compare the resulting token sequence or logit vectors against a reference implementation on modern hardware; any systematic divergence would falsify the claim that the kernels are functionally equivalent.
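A comparison of this kind could be scripted roughly as follows. This is a minimal sketch with hypothetical helper names; it assumes both implementations can dump token ids and per-step logit vectors as plain Python lists. Because the decode path quantizes activations to 8 bits, bit-exact logits against a floating-point reference are not to be expected, so the sketch reports the size of the divergence rather than demanding equality.

```python
from typing import List, Optional, Sequence


def first_divergence(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))  # one sequence stopped early
    return None


def max_abs_logit_diff(ref_logits: Sequence[float], cand_logits: Sequence[float]) -> float:
    """Largest absolute difference between two logit vectors of equal length."""
    return max(abs(r - c) for r, c in zip(ref_logits, cand_logits))


def argmax(values: Sequence[float]) -> int:
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


def top1_agreement(ref_steps: List[Sequence[float]], cand_steps: List[Sequence[float]]) -> float:
    """Fraction of decode steps where both implementations pick the same token."""
    matches = sum(argmax(r) == argmax(c) for r, c in zip(ref_steps, cand_steps))
    return matches / len(ref_steps)
```

A systematic divergence would show up as an early `first_divergence` index or a low `top1_agreement`; what threshold counts as acceptable is a judgment the authors would need to state.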
Original abstract
We report end-to-end inference of \textbf{Qwen3.6-35B-A3B} -- a 35-billion-parameter, $\sim$3B-active Mixture-of-Experts (MoE) model with a hybrid gated-delta-net / full-attention backbone -- on a \textbf{2011 NVIDIA Tesla C2075} (Fermi, compute capability 2.0, 6\,GB), a GPU that predates tensor cores, native FP16 arithmetic, the \texttt{DP4A} integer dot-product instruction, and support in every modern CUDA toolchain. Because the 4-bit model ($\approx$10.5\,GB) is roughly twice the device memory, we adopt a \emph{hybrid} execution strategy: the GPU performs batched prompt \emph{prefill} with expert weights streamed layer-by-layer from host RAM, while \emph{decode} runs on the host CPU using a hand-written W4A8 integer GEMV built on the SSSE3 \texttt{pmaddubsw} instruction. The entire engine -- GEMM, hybrid-attention recurrence, MoE routing, and a from-scratch vision tower -- is written by hand for compute capability 2.0 and compiled with the legacy CUDA 8.0 toolchain. On a 947-token prompt we reduce prefill latency from 57.2\,s to 37.5\,s ($-34\%$) through expert pinning, single-pass prefill, and NUMA interleaving, and we raise decode throughput from 2.8 to 8.6\,tokens/s ($\approx 3\times$) with the integer-SIMD kernel. A position-indexed snapshot cache for the recurrent (gated-delta-net) state restores prefix reuse on a recurrent architecture, cutting a repeated 78\,s prefill to 0.5\,s. We also report a set of \emph{negative} results -- offloading the language-model head to the idle GPU, hyper-threading, and three GPU-kernel rewrites all fail to help -- which together pin down the practical floor of this hardware. Our aim is not a speed record but a careful account of what it takes, and where the walls are, to run a contemporary frontier-class MoE on fourteen-year-old silicon.
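To make the W4A8 decode kernel concrete, here is a scalar Python emulation of the arithmetic that a pmaddubsw-based dot product performs. It is a sketch under stated assumptions, not the paper's kernel: the nibble order, the single per-row scale and zero point, and all function names are invented for illustration, and a real implementation would use SSE registers and most likely per-group scales. What the sketch does reflect is the instruction's documented behavior: it multiplies unsigned bytes by signed bytes and adds adjacent products into saturated 16-bit lanes.

```python
from typing import List, Sequence


def sat16(x: int) -> int:
    """Clamp to the signed 16-bit range, as pmaddubsw does for each lane."""
    return max(-32768, min(32767, x))


def pmaddubsw(u8: Sequence[int], s8: Sequence[int]) -> List[int]:
    """Emulate one 128-bit pmaddubsw: 16 unsigned x 16 signed bytes -> 8 int16 sums."""
    assert len(u8) == len(s8) == 16
    return [sat16(u8[i] * s8[i] + u8[i + 1] * s8[i + 1]) for i in range(0, 16, 2)]


def unpack_nibbles(packed: Sequence[int]) -> List[int]:
    """Split each byte into two 4-bit weights (low nibble first; layout assumed)."""
    out: List[int] = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def w4a8_dot(packed_w: Sequence[int], act_i8: Sequence[int],
             w_scale: float, w_zero: int, a_scale: float) -> float:
    """Dot product of one packed 4-bit weight row with int8 activations."""
    weights = unpack_nibbles(packed_w)
    assert len(weights) == len(act_i8) and len(weights) % 16 == 0
    acc = 0
    for k in range(0, len(weights), 16):
        # Weights (0..15) take the unsigned operand, activations the signed one.
        acc += sum(pmaddubsw(weights[k:k + 16], act_i8[k:k + 16]))
    # Zero-point correction: sum((w - z) * a) == sum(w * a) - z * sum(a).
    acc -= w_zero * sum(act_i8)
    return acc * w_scale * a_scale
```

With 4-bit weights the largest pair sum is 2 × 15 × 128 = 3840 in magnitude, far below the 16-bit saturation limit, so the saturating add never clips in this configuration; with full 8-bit unsigned inputs it can.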
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports end-to-end inference of the 35B-parameter Qwen3.6-35B-A3B hybrid-attention MoE model on a 2011 Fermi Tesla C2075 (6 GB) using a hybrid GPU-CPU strategy: layer-by-layer weight streaming for prefill on the GPU and a hand-written SSSE3 pmaddubsw W4A8 GEMV for decode on the host CPU. It claims a reduction in prefill latency from 57.2 s to 37.5 s (-34%) via expert pinning, single-pass prefill and NUMA interleaving, and an increase in decode throughput from 2.8 to 8.6 tps (~3x) on a 947-token prompt, plus a position-indexed recurrent cache that reduces repeated prefill from 78 s to 0.5 s. Negative results on GPU offload of the LM head, hyper-threading and kernel rewrites are also presented.
Significance. If the hand-written kernels are numerically faithful, the work supplies a concrete, reproducible empirical map of the practical limits of running a contemporary frontier MoE on fourteen-year-old silicon, with the negative-result section serving as a useful bound on what optimizations are viable. The position-indexed snapshot cache for gated-delta-net state is a targeted engineering contribution for recurrent architectures.
major comments (2)
- [Abstract] Abstract and implementation description: the headline claims (37.5 s prefill, 8.6 tps decode) are only interpretable if the custom 4-bit Fermi kernels and hybrid offload strategy compute the identical function as the reference model. No token-level agreement, perplexity delta, or even a statement that outputs were cross-checked against a reference run is supplied, despite the use of integer arithmetic absent from the original training. This verification is load-bearing for the validity of all reported speedups.
- [Results] Results (latency/throughput measurements): no error bars, multiple-run statistics, or hardware-counter validation are reported for the 57.2 s to 37.5 s and 2.8 to 8.6 tps figures, leaving open the possibility that measurement variability or unaccounted host-GPU transfer overheads affect the quoted percentages.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing numerical fidelity and statistical rigor. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and data.
-
Referee: [Abstract] Abstract and implementation description: the headline claims (37.5 s prefill, 8.6 tps decode) are only interpretable if the custom 4-bit Fermi kernels and hybrid offload strategy compute the identical function as the reference model. No token-level agreement, perplexity delta, or even a statement that outputs were cross-checked against a reference run is supplied, despite the use of integer arithmetic absent from the original training. This verification is load-bearing for the validity of all reported speedups.
Authors: We agree that explicit verification of numerical equivalence is essential for the validity of the reported speedups, particularly given the use of custom integer kernels. The kernels implement the precise forward-pass operations of the 4-bit quantized model (W4A8 GEMV via pmaddubsw and layer-wise streaming), but the original submission omitted a direct cross-check statement. In the revised manuscript we will add a dedicated verification paragraph reporting token-level agreement on sample generations and a small perplexity delta against a reference FP16 run on modern hardware, confirming functional identity. revision: yes
-
Referee: [Results] Results (latency/throughput measurements): no error bars, multiple-run statistics, or hardware-counter validation are reported for the 57.2 s to 37.5 s and 2.8 to 8.6 tps figures, leaving open the possibility that measurement variability or unaccounted host-GPU transfer overheads affect the quoted percentages.
Authors: The reported numbers derive from single timed executions on the target hardware. We acknowledge that the absence of error bars or repeated-run statistics weakens confidence in the exact percentages. In revision we will re-measure the key prefill and decode metrics over at least three independent runs, report means with standard deviations, and add a short methods note on timing methodology and transfer overhead accounting. revision: yes
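The repeated-run reporting promised here could look like the following sketch. The function names are hypothetical and the workload is a placeholder callable; on the real system it would wrap one full prefill or decode pass, including any host-to-GPU transfers that should count toward the reported time.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_runs(workload: Callable[[], None], runs: int = 3) -> List[float]:
    """Wall-clock seconds for each of `runs` independent executions."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (needs at least two samples)."""
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Placeholder workload standing in for a prefill or decode pass.
    samples = time_runs(lambda: sum(range(100_000)), runs=3)
    mean, std = summarize(samples)
    print(f"{mean:.4f} s +/- {std:.4f} s over {len(samples)} runs")
```

Three runs give only a rough standard deviation; more runs, or reporting the minimum and maximum alongside the mean, would make the variability easier to judge.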
Circularity Check
No circularity: purely empirical runtime measurements
The manuscript reports measured prefill and decode times on Fermi hardware using hand-written kernels. No equations, fitted parameters, predictions, or first-principles derivations appear; the central claims are direct experimental observations of latency and throughput. The skeptic's concern about output verification is a correctness issue, not a circularity issue. No self-citations, ansatzes, or reductions to inputs by construction are present.