A 35B Hybrid-Attention Mixture-of-Experts Model on a 6GB 2011 GPU: Hand-Written 4-bit CUDA Inference for Fermi

A. C. Opus; J. Q. Lu

arxiv: 2606.24031 · v1 · pith:LGDTXFMXnew · submitted 2026-06-23 · ❄️ cond-mat.other

Pith reviewed 2026-06-25 22:04 UTC · model grok-4.3

classification ❄️ cond-mat.other

keywords mixture of experts · legacy GPU · 4-bit inference · hybrid CPU-GPU execution · hand-written CUDA · Fermi architecture · prefill optimization · recurrent state cache

The pith

Hand-written 4-bit CUDA kernels enable a 35B MoE model to run on a 6GB 2011 Fermi GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a 35-billion-parameter hybrid-attention MoE model can execute end-to-end on a 2011 Tesla C2075 GPU even though the 4-bit model (about 10.5 GB) is roughly twice the 6 GB of device memory. It does so through a hybrid strategy: the GPU handles batched prefill with expert weights streamed layer-by-layer from host RAM, while decode runs on the host CPU with a hand-written W4A8 integer GEMV built on SSSE3. Expert pinning, single-pass prefill, and NUMA interleaving cut prefill latency, the integer-SIMD kernel raises decode throughput, and a position-indexed snapshot cache for the gated-delta-net state restores prefix reuse. A sympathetic reader cares because the work maps the concrete engineering steps and hard limits required to adapt current-scale models to fourteen-year-old silicon.

Core claim

End-to-end inference of Qwen3.6-35B-A3B succeeds on the 6 GB Fermi device by running GPU batched prefill with streamed weights and CPU decode via a hand-written SSSE3 pmaddubsw integer-SIMD GEMV, with the engine compiled under CUDA 8.0 for compute capability 2.0. On a 947-token prompt this yields 37.5 s prefill latency (down 34 %) and 8.6 tokens per second decode (up roughly 3×), while the snapshot cache restores prefix reuse, cutting a repeated prefill from 78 s to 0.5 s. Several attempted accelerations, including GPU offload of the language-model head, produce no benefit.

What carries the argument

Hybrid execution strategy that pairs GPU prefill with layer-by-layer weight streaming from host RAM and CPU-based integer-SIMD decode, augmented by the position-indexed snapshot cache for recurrent gated-delta-net state.
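To make the W4A8 arithmetic concrete, here is a minimal scalar sketch in Python of a dot product built from pmaddubsw-style steps: unsigned weight bytes times signed activation bytes, with adjacent pairs summed into saturated 16-bit lanes. The packing order (low nibble first), the fixed zero point of 8, the per-row scales, and every function name are assumptions made for illustration. The paper's actual data layout and kernel are not described on this page.

```python
# Scalar emulation of a W4A8 dot product built on pmaddubsw-style arithmetic.
# Assumptions (not from the paper): two 4-bit weights per byte, low nibble
# first; a fixed weight zero point of 8; one float scale per weight row and
# one float scale for the activation vector.

def sat16(x):
    """Clamp to the signed 16-bit range, as pmaddubsw does on each pair sum."""
    return max(-32768, min(32767, x))

def unpack_nibbles(packed):
    """Expand packed bytes into unsigned 4-bit values (0..15), low nibble first."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out

def maddubs(u8, s8):
    """Emulate pmaddubsw: unsigned*signed byte products, adjacent pairs summed."""
    return [sat16(u8[i] * s8[i] + u8[i + 1] * s8[i + 1])
            for i in range(0, len(u8), 2)]

def w4a8_dot(packed_w, act_s8, w_scale, a_scale, zero_point=8):
    """Dequantized dot product of one 4-bit weight row with int8 activations."""
    w_u4 = unpack_nibbles(packed_w)
    acc = sum(maddubs(w_u4, act_s8))      # wide integer accumulation
    acc -= zero_point * sum(act_s8)       # remove the unsigned weight offset
    return acc * w_scale * a_scale

def w4a8_gemv(rows, act_s8, w_scales, a_scale):
    """Apply w4a8_dot to every packed weight row."""
    return [w4a8_dot(r, act_s8, s, a_scale) for r, s in zip(rows, w_scales)]

# Signed weights [-8, 7, 0, 3] stored as unsigned [0, 15, 8, 11].
packed = bytes([0xF0, 0xB8])
print(w4a8_dot(packed, [10, -20, 30, -40], 1.0, 1.0))  # -340.0
```

Under these assumptions the 16-bit saturation never triggers: with weights in 0..15 and activations in -128..127, each pair sum is bounded in magnitude by 2 × 15 × 128 = 3840.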

If this is right

Expert pinning, single-pass prefill, and NUMA interleaving cut prefill latency from 57.2 s to 37.5 s.
The integer-SIMD kernel raises decode throughput from 2.8 to 8.6 tokens per second.
The position-indexed snapshot cache restores prefix reuse, reducing a repeated prefill from 78 s to 0.5 s.
Offloading the language-model head to the GPU, hyper-threading, and three kernel rewrites yield no improvement.
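The headline ratios in the list above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on this page; the helper name is illustrative.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

prefill_change = percent_change(57.2, 37.5)  # prefill latency, seconds
decode_gain = 8.6 / 2.8                      # decode throughput, tokens per second
reuse_gain = 78.0 / 0.5                      # repeated prefill, seconds

print(f"prefill latency change: {prefill_change:.1f} %")  # -34.4 %
print(f"decode throughput gain: {decode_gain:.2f}x")      # 3.07x
print(f"prefix-reuse speedup:   {reuse_gain:.0f}x")       # 156x
```

The results agree with the rounded "34 %" and "roughly 3×" figures quoted in the summary.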

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same streaming-plus-CPU-offload pattern could be applied to other memory-constrained legacy GPUs to test whether the 34 % and 3× gains generalize beyond this specific Fermi card.
The negative results isolate memory-bandwidth and integer-throughput ceilings that future hand-written kernels on similar hardware would still need to respect.
Recurrent architectures with gated-delta-net states may become more practical on old devices once snapshot caches are routinely paired with hybrid execution.
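A recurrent state summarizes every token seen so far, so it cannot simply be truncated back to a shared prefix the way a per-token attention cache can. One way to picture a position-indexed snapshot cache is the toy sketch below. It saves a copy of the state every few positions and, for a new prompt, resumes from the deepest snapshot whose prefix still matches. The stride, the class and function names, and the stand-in update rule are all assumptions for illustration, not the paper's design.

```python
import copy

class SnapshotCache:
    """Keeps recurrent-state snapshots indexed by token position (toy sketch)."""

    def __init__(self, stride=4):
        self.stride = stride      # hypothetical snapshot interval
        self.snapshots = {}       # position -> (prefix tokens, state copy)

    def maybe_store(self, tokens, pos, state):
        """Save a snapshot when `pos` falls on the stride."""
        if pos % self.stride == 0:
            self.snapshots[pos] = (tuple(tokens[:pos]), copy.deepcopy(state))

    def longest_match(self, tokens):
        """Return (position, state) of the deepest snapshot matching a prefix."""
        best_pos, best_state = 0, None
        for pos, (prefix, state) in self.snapshots.items():
            if best_pos < pos <= len(tokens) and tuple(tokens[:pos]) == prefix:
                best_pos, best_state = pos, copy.deepcopy(state)
        return best_pos, best_state

def step(state, token):
    """Toy stand-in for a recurrent update: decay the state, add the token."""
    return [0.5 * s + token for s in state]

def prefill(tokens, cache, init_state=(0.0, 0.0)):
    """Process a prompt, resuming from the deepest usable snapshot."""
    pos, state = cache.longest_match(tokens)
    if state is None:
        state = list(init_state)
    steps = 0
    for i in range(pos, len(tokens)):
        state = step(state, tokens[i])
        steps += 1
        cache.maybe_store(tokens, i + 1, state)
    return state, steps

cache = SnapshotCache(stride=4)
prompt = [1, 2, 3, 4, 5, 6, 7, 8, 9]
_, cold_steps = prefill(prompt, cache)
_, warm_steps = prefill(prompt, cache)
print(cold_steps, warm_steps)  # 9 1
```

In this sketch a snapshot at a given position is overwritten when a different prompt reaches the same position, which keeps memory bounded at the cost of serving only the most recent prefix.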

Load-bearing premise

The hand-written 4-bit kernels and hybrid execution strategy correctly reproduce the original model's outputs without introducing numerical errors or functional bugs that would invalidate the reported latency and throughput numbers.

What would settle it

Execute the same 947-token prompt through the described kernels and compare the resulting token sequence or logit vectors against a reference implementation on modern hardware; any systematic divergence would falsify the claim that the kernels are functionally equivalent.
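The comparison described above could be scripted along the following lines. This is a hedged sketch: the function names are hypothetical, and the inputs are assumed to be plain lists of token IDs and logits already exported from the Fermi engine and from a reference run.

```python
def first_divergence(candidate, reference):
    """Index of the first differing token, or None if the sequences agree."""
    for i, (a, b) in enumerate(zip(candidate, reference)):
        if a != b:
            return i
    if len(candidate) != len(reference):
        return min(len(candidate), len(reference))
    return None

def token_agreement(candidate, reference):
    """Fraction of positions where both sequences hold the same token."""
    n = max(len(candidate), len(reference))
    if n == 0:
        return 1.0
    same = sum(1 for a, b in zip(candidate, reference) if a == b)
    return same / n

def max_abs_logit_diff(cand_logits, ref_logits):
    """Largest absolute per-entry gap between two logit vectors."""
    return max(abs(a - b) for a, b in zip(cand_logits, ref_logits))

# Placeholder values, not measurements from the paper.
print(first_divergence([1, 2, 4, 5], [1, 2, 3, 5]))     # 2
print(token_agreement([1, 2, 4, 5], [1, 2, 3, 5]))      # 0.75
print(max_abs_logit_diff([0.5, -1.0], [0.25, -1.5]))    # 0.5
```

Because a quantized engine is not expected to match a higher-precision reference bit for bit, the logit gap and the position of the first divergence are likely more informative than a single pass/fail verdict.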

Figures

Figures reproduced from arXiv: 2606.24031 by A. C. Opus, J. Q. Lu.

**Figure 2.** Where prefill time goes (single-pass, 947-token prompt; all three are GPU kernels).

Original abstract

We report end-to-end inference of Qwen3.6-35B-A3B, a 35-billion-parameter, ~3B-active Mixture-of-Experts (MoE) model with a hybrid gated-delta-net / full-attention backbone, on a 2011 NVIDIA Tesla C2075 (Fermi, compute capability 2.0, 6 GB), a GPU that predates tensor cores, native FP16 arithmetic, the DP4A integer dot-product instruction, and support in every modern CUDA toolchain. Because the 4-bit model (≈10.5 GB) is roughly twice the device memory, we adopt a hybrid execution strategy: the GPU performs batched prompt prefill with expert weights streamed layer-by-layer from host RAM, while decode runs on the host CPU using a hand-written W4A8 integer GEMV built on the SSSE3 pmaddubsw instruction. The entire engine (GEMM, hybrid-attention recurrence, MoE routing, and a from-scratch vision tower) is written by hand for compute capability 2.0 and compiled with the legacy CUDA 8.0 toolchain. On a 947-token prompt we reduce prefill latency from 57.2 s to 37.5 s (-34%) through expert pinning, single-pass prefill, and NUMA interleaving, and we raise decode throughput from 2.8 to 8.6 tokens per second (≈3×) with the integer-SIMD kernel. A position-indexed snapshot cache for the recurrent (gated-delta-net) state restores prefix reuse on a recurrent architecture, cutting a repeated 78 s prefill to 0.5 s. We also report a set of negative results: offloading the language-model head to the idle GPU, hyper-threading, and three GPU-kernel rewrites all fail to help, which together pin down the practical floor of this hardware. Our aim is not a speed record but a careful account of what it takes, and where the walls are, to run a contemporary frontier-class MoE on fourteen-year-old silicon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a detailed engineering report on hand-written 4-bit kernels for a 35B MoE on 2011 Fermi hardware, but the speed claims rest on unverified outputs.

The paper demonstrates that a 35B hybrid-attention MoE can be made to run on a 6 GB 2011 Tesla C2075 through layer-by-layer weight streaming, expert pinning, NUMA interleaving, and a custom SSSE3 integer GEMV kernel for decode. The authors also add a position-indexed cache for the gated-delta-net state that drops repeated prefill from 78 s to 0.5 s. The negative results on offloading the LM head, hyper-threading, and several kernel rewrites are the most useful part; they actually narrow down what works on this platform.

The implementation choices are concrete and the numbers are given with specific prompts and baselines. That level of reporting is rare in this kind of systems work.

The main gap is the absence of any check that the hand-written kernels match the original model outputs. The abstract gives no token-level agreement, perplexity delta, or even a statement that outputs were compared to a reference run. With integer arithmetic on compute capability 2.0 and no DP4A, numerical drift or functional bugs are possible and would make the 34% prefill and 3x decode claims meaningless. The paper also supplies no error bars or accuracy verification.

This is an engineering note, not a research contribution. It belongs to people doing low-level inference on legacy hardware or extreme memory-constrained setups. It does not change how we build or train MoEs. I would not cite it. It does not need journal peer review; an arXiv tech report or systems workshop is the right place.

Referee Report

2 major / 0 minor

Summary. The manuscript reports end-to-end inference of the 35B-parameter Qwen3.6-35B-A3B hybrid-attention MoE model on a 2011 Fermi Tesla C2075 (6 GB) using a hybrid GPU-CPU strategy: layer-by-layer weight streaming for prefill on the GPU and a hand-written SSSE3 pmaddubsw W4A8 GEMV for decode on the host CPU. It claims a reduction in prefill latency from 57.2 s to 37.5 s (-34%) via expert pinning, single-pass prefill and NUMA interleaving, and an increase in decode throughput from 2.8 to 8.6 tps (~3x) on a 947-token prompt, plus a position-indexed recurrent cache that reduces repeated prefill from 78 s to 0.5 s. Negative results on GPU offload of the LM head, hyper-threading and kernel rewrites are also presented.

Significance. If the hand-written kernels are numerically faithful, the work supplies a concrete, reproducible empirical map of the practical limits of running a contemporary frontier MoE on fourteen-year-old silicon, with the negative-result section serving as a useful bound on what optimizations are viable. The position-indexed snapshot cache for gated-delta-net state is a targeted engineering contribution for recurrent architectures.

major comments (2)

[Abstract] Abstract and implementation description: the headline claims (37.5 s prefill, 8.6 tps decode) are only interpretable if the custom 4-bit Fermi kernels and hybrid offload strategy compute the identical function as the reference model. No token-level agreement, perplexity delta, or even a statement that outputs were cross-checked against a reference run is supplied, despite the use of integer arithmetic absent from the original training. This verification is load-bearing for the validity of all reported speedups.
[Results] Results (latency/throughput measurements): no error bars, multiple-run statistics, or hardware-counter validation are reported for the 57.2 s → 37.5 s and 2.8 → 8.6 tps figures, leaving open the possibility that measurement variability or unaccounted host-GPU transfer overheads affect the quoted percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing numerical fidelity and statistical rigor. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and data.

Referee: [Abstract] Abstract and implementation description: the headline claims (37.5 s prefill, 8.6 tps decode) are only interpretable if the custom 4-bit Fermi kernels and hybrid offload strategy compute the identical function as the reference model. No token-level agreement, perplexity delta, or even a statement that outputs were cross-checked against a reference run is supplied, despite the use of integer arithmetic absent from the original training. This verification is load-bearing for the validity of all reported speedups.

Authors: We agree that explicit verification of numerical equivalence is essential for the validity of the reported speedups, particularly given the use of custom integer kernels. The kernels implement the precise forward-pass operations of the 4-bit quantized model (W4A8 GEMV via pmaddubsw and layer-wise streaming), but the original submission omitted a direct cross-check statement. In the revised manuscript we will add a dedicated verification paragraph reporting token-level agreement on sample generations and a small perplexity delta against a reference FP16 run on modern hardware, confirming functional identity. revision: yes
Referee: [Results] Results (latency/throughput measurements): no error bars, multiple-run statistics, or hardware-counter validation are reported for the 57.2 s → 37.5 s and 2.8 → 8.6 tps figures, leaving open the possibility that measurement variability or unaccounted host-GPU transfer overheads affect the quoted percentages.

Authors: The reported numbers derive from single timed executions on the target hardware. We acknowledge that the absence of error bars or repeated-run statistics weakens confidence in the exact percentages. In revision we will re-measure the key prefill and decode metrics over at least three independent runs, report means with standard deviations, and add a short methods note on timing methodology and transfer overhead accounting. revision: yes
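The repeated-run protocol promised in this response might look like the sketch below. It is a generic harness under stated assumptions: `time_runs` and `summarize` are hypothetical helpers, the workload is a stand-in callable, and no actual measurements from the paper are reproduced.

```python
import statistics
import time

def time_runs(fn, runs=3):
    """Call `fn` `runs` times and return wall-clock durations in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Return (mean, sample standard deviation) of the timings."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, std

# Stand-in workload; a real harness would invoke the prefill or decode path.
samples = time_runs(lambda: sum(range(100_000)), runs=3)
mean, std = summarize(samples)
print(f"{mean:.4f} s ± {std:.4f} s over {len(samples)} runs")
```

With only three runs the standard deviation is a coarse estimate, so reporting the individual timings alongside the mean would let readers judge the spread for themselves.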

Circularity Check

0 steps flagged

No circularity: purely empirical runtime measurements

The manuscript reports measured prefill and decode times on Fermi hardware using hand-written kernels. No equations, fitted parameters, predictions, or first-principles derivations appear; the central claims are direct experimental observations of latency and throughput. The skeptic concern about output verification is a correctness issue, not a circularity issue. No self-citations, ansatzes, or reductions to inputs by construction are present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical engineering demonstration report. No free parameters, mathematical axioms, or invented scientific entities are introduced or required by the central claims.

pith-pipeline@v0.9.1-grok · 5989 in / 1150 out tokens · 30036 ms · 2026-06-25T22:04:23.555341+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 8 canonical work pages · 8 internal anchors

[1] A. Yang, A. Li, et al. (Qwen Team, Alibaba). Qwen3 Technical Report. arXiv:2505.09388, 2025.

[2] S. Yang, J. Kautz, and A. Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. arXiv:2412.06464, 2024. (ICLR 2025.)

[3] A. Gu and T. Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2023.

[4] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538, 2017.

[5] M. Lasby, I. Lazarevich, N. Sinnadurai, S. Lie, Y. Ioannou, and V. Thangarasa. REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression. arXiv:2510.13999, 2025. (ICLR 2026.)

[6] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323, 2022. (ICLR 2023.)

[7] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, C. Gan, and S. Han. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978, 2023. (MLSys 2024.)

[8] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135, 2022.

[9] A. Karpathy. llama2.c: Inference Llama 2 in one file of pure C. https://github.com/karpathy/llama2.c

[10] G. Gerganov and the llama.cpp contributors. llama.cpp: LLM inference in C/C++. https://github.com/ggml-org/llama.cpp

[11] R. Allen. llama2.cu: A CUDA port of llama2.c. https://github.com/rogerallen/llama2.cu
