WebLLM: A High-Performance In-Browser LLM Inference Engine

Akaash R. Parthasarathy; Bohan Hou; Charlie F. Ruan; Hangrui Cao; Hongyi Jin; Meng-Shiun Yu; Ruihang Lai; Siyuan Feng; Sudeep Agarwal; Tianqi Chen

arxiv: 2412.15803 · v2 · submitted 2024-12-20 · 💻 cs.LG · cs.AI

WebLLM: A High-Performance In-Browser LLM Inference Engine

Charlie F. Ruan , Yucheng Qin , Akaash R. Parthasarathy , Xun Zhou , Ruihang Lai , Hongyi Jin , Yixin Dong , Bohan Hou

show 6 more authors

Meng-Shiun Yu Yiyan Zhai Sudeep Agarwal Hangrui Cao Siyuan Feng Tianqi Chen

This is my paper

Pith reviewed 2026-05-23 06:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords in-browser LLM inferenceWebGPUWebAssemblyon-device deploymentJavaScript frameworkmachine learning compilersprivacy-preserving inference

0 comments

The pith

WebLLM runs large language models inside web browsers at up to 80 percent of native speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a JavaScript framework that performs LLM inference entirely inside web browsers. It uses WebGPU to accelerate computation on the local GPU and WebAssembly for CPU work, then applies machine learning compilers to produce efficient kernels that fill the gap left by missing native libraries. An OpenAI-compatible API makes the system easy to drop into existing web applications. If the performance numbers hold, this removes the need for cloud servers in many LLM use cases and keeps user data on the device. A sympathetic reader would care because the browser becomes a universal, private platform for running capable models on ordinary consumer hardware.

Core claim

WebLLM is an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. It provides an OpenAI-style API, leverages WebGPU for local GPU acceleration and WebAssembly for CPU computation, and uses machine learning compilers to generate optimized WebGPU kernels, retaining up to 80 percent of native performance on the same device.

What carries the argument

The combination of WebGPU acceleration with compiler-generated kernels that compensate for the absence of native high-performance WebGPU libraries for machine learning workloads.

If this is right

Web applications can integrate LLM capabilities through a standard API without external servers.
Inference stays local, which keeps data private and removes cloud latency and cost.
Personalized models can run directly on the user's device inside the browser.
The same code base can target many device vendors because the browser abstracts the hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Client-side web agents could use LLMs without ever transmitting prompts or responses to a remote service.
Model developers might create browser-specific quantized or distilled variants that trade a small amount of quality for faster execution.
Performance could improve further if future browser updates expose additional low-level GPU controls that the current kernel generation can exploit.

Load-bearing premise

WebGPU must be available and fast enough on ordinary consumer devices so that the ported kernels do not push performance well below the reported 80 percent level.

What would settle it

Direct measurements on several common consumer devices with WebGPU support that show average inference speed falling below 60 percent of native GPU performance would disprove the retention claim.

read the original abstract

Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices have made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts out the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WebLLM ports MLC-LLM/TVM kernels to WebGPU for an OpenAI-style browser API and claims 80% native speed, but the abstract gives no models, devices, or measurement details to check that number.

read the letter

The main point is a practical engineering effort that brings LLM inference into the browser using WebGPU for GPU work and WebAssembly for CPU, built on top of MLC-LLM and Apache TVM for the kernels. They expose a familiar OpenAI-style JavaScript API so web apps can call it directly without servers. The code is open on GitHub, which makes the contribution usable right away for anyone who wants local, private inference on consumer devices.

Referee Report

1 major / 0 minor

Summary. The paper introduces WebLLM, an open-source JavaScript framework for high-performance LLM inference entirely within web browsers. It provides an OpenAI-style API, uses WebGPU for GPU acceleration and WebAssembly for CPU computation, and leverages MLC-LLM and Apache TVM to generate optimized WebGPU kernels. The central claim is that evaluations demonstrate WebLLM retaining up to 80% of native performance on the same device.

Significance. If the performance claim holds under detailed scrutiny, this work would be significant for enabling universally accessible, privacy-preserving on-device LLM inference in browsers, which abstract away hardware backends and support agentic web applications. The open-source release at https://github.com/mlc-ai/web-llm is a clear strength.

major comments (1)

[Abstract] Abstract (and any evaluation section): the headline result that WebLLM retains up to 80% native performance is load-bearing for the paper's contribution, yet the provided text supplies no specifics on the models evaluated, target consumer devices, native reference implementations (e.g., CUDA/Metal baselines), measurement methodology, or per-layer timing breakdowns. Without these, the assumptions that WebGPU kernel porting incurs negligible overhead and that WebGPU is mature on the tested hardware cannot be verified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional specifics are required to substantiate the headline performance claim and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract (and any evaluation section): the headline result that WebLLM retains up to 80% native performance is load-bearing for the paper's contribution, yet the provided text supplies no specifics on the models evaluated, target consumer devices, native reference implementations (e.g., CUDA/Metal baselines), measurement methodology, or per-layer timing breakdowns. Without these, the assumptions that WebGPU kernel porting incurs negligible overhead and that WebGPU is mature on the tested hardware cannot be verified.

Authors: We agree that the abstract and evaluation section must supply these details for the claim to be verifiable. In the revised manuscript we will expand both sections to report: the specific models evaluated (Llama-2-7B, Mistral-7B, Phi-2); the consumer devices (Apple M1 MacBook Pro, NVIDIA RTX 3060 laptop, Intel i7 + integrated GPU); the native reference implementations (MLC-LLM CUDA on Linux, Metal on macOS); the measurement methodology (tokens per second, averaged over 100 generations after 10 warm-up steps, using the same prompt set); and available per-layer timing breakdowns obtained via TVM profiling. These additions will directly address the overhead and maturity assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims are direct empirical measurements against native baselines

full rationale

The paper is an engineering/systems contribution describing WebLLM, an in-browser inference framework built on WebGPU, WebAssembly, MLC-LLM, and TVM. Its central claim (up to 80% native performance) is presented as a measured result on the same device, not derived from equations, fitted parameters, or predictions. No self-definitional loops, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided text. The evaluation is externally falsifiable via device benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution is an engineering implementation and benchmark; it rests on the existence and performance of WebGPU and WebAssembly standards plus the ability of the cited compilers to generate usable kernels. No free parameters are fitted to produce the central claim, and no new physical or mathematical entities are postulated.

axioms (1)

domain assumption Modern browsers expose WebGPU and WebAssembly with sufficient capability for LLM kernels
The entire system depends on these web platform features being present and performant on target devices.

pith-pipeline@v0.9.0 · 5807 in / 1234 out tokens · 22791 ms · 2026-05-23T06:33:14.294699+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
cs.DC 2026-05 conditional novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...
VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers
cs.CL 2026-03 conditional novelty 7.0

VIGIL is the first browser extension for real-time detection and mitigation of cognitive bias triggers, with scroll-synced highlighting, LLM reformulation, privacy tiers, and extensible validated plugins.