BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs

Andrey Vakunov; Karthik Raveendran; Matthias Grundmann; Valentin Bazarevsky; Yury Kartynnik

arxiv: 1907.05047 · v2 · pith:3Y4QGHODnew · submitted 2019-07-11 · 💻 cs.CV

BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs

Valentin Bazarevsky , Yury Kartynnik , Andrey Vakunov , Karthik Raveendran , Matthias Grundmann This is my paper

Pith reviewed 2026-05-24 23:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords face detectionmobile GPUreal-time inferenceaugmented realitylightweight neural networkSSD anchor schemetie resolution

0 comments

The pith

BlazeFace detects faces at 200-1000+ FPS on mobile GPUs using a custom lightweight network.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BlazeFace as a face detector built for mobile GPU hardware that delivers super-realtime performance. This speed makes it practical as the initial stage in augmented reality systems that need a facial region of interest to feed into later models for keypoints, expression analysis, or segmentation. The authors reach these rates through three changes: a feature extraction network lighter than MobileNet variants, an anchor scheme adjusted from SSD for GPU execution, and a tie-breaking method that replaces non-maximum suppression. A reader would care because such performance on phones could allow continuous facial processing without noticeable delay or high power draw.

Core claim

BlazeFace is a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200-1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial features or expression classification, and face region segmentation. The contributions include a lightweight feature extraction network inspired by but distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector, and an improved tie resolution strategy.

What carries the argument

Lightweight feature extraction network combined with GPU-friendly SSD-style anchors and non-NMS tie resolution.

If this is right

The detector can supply accurate facial regions of interest to downstream AR models for keypoint estimation.
It supports real-time facial expression classification and feature analysis on phones.
Face region segmentation becomes feasible within live AR pipelines.
The approach works across flagship mobile devices without custom hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchor and tie-resolution adjustments might apply to other single-shot detectors on mobile GPUs.
Sustained high frame rates could lower average power use in always-on camera applications.
Design patterns here could guide speed optimizations for related tasks like hand or body detection.

Load-bearing premise

The described changes to the feature extractor, anchor scheme, and tie resolution produce the stated speed and accuracy on mobile GPUs.

What would settle it

Benchmark measurements on a flagship mobile device showing inference slower than 200 FPS or detection accuracy substantially below standard mobile face detectors.

read the original abstract

We present BlazeFace, a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200-1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial features or expression classification, and face region segmentation. Our contributions include a lightweight feature extraction network inspired by, but distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy alternative to non-maximum suppression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BlazeFace delivers a practical, very fast mobile face detector but the paper does not show that its three listed changes are what produce the speed.

read the letter

The paper's core contribution is a small face detector that hits 200-1000+ FPS on current flagship phones and is meant to feed AR pipelines. That speed number is the main thing a reader will take away. The architecture uses a custom lightweight backbone (distinct from MobileNetV1/V2), a modified SSD-style anchor layout tuned for GPU, and a tie-breaking rule that replaces standard NMS. These are presented as the reasons for the performance. The work is clearly aimed at production use rather than theory, and the authors supply concrete FPS figures on real devices plus example downstream tasks. That is useful engineering detail. The soft spot is the missing link between those three changes and the reported speed. The abstract and stress-test note give aggregate FPS but no ablation tables, no per-layer timing, and no head-to-head run against an unmodified MobileNet-SSD baseline on the same hardware. Without those measurements it is possible the gains come mostly from overall model size rather than the specific modifications. The citation pattern looks standard for this area and does not appear to hide prior work. This paper is for people building mobile AR or real-time vision stacks who need a drop-in face ROI. A practitioner might cite the speed claim or the released model if it ships. It is coherent on its own terms and shows clear engineering thinking, so it deserves a serious referee even if the attribution question needs tightening in revision.

Referee Report

1 major / 1 minor

Summary. The manuscript presents BlazeFace, a lightweight neural face detector optimized for mobile GPU inference. It claims to run at 200-1000+ FPS on flagship devices through three contributions: a custom feature extraction network distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from SSD, and an improved tie resolution strategy as an alternative to NMS. The detector is intended to supply accurate facial regions of interest as input to downstream AR models for tasks such as 2D/3D keypoint estimation, expression classification, and segmentation.

Significance. If the reported sub-millisecond performance holds and the three architectural modifications can be shown to be responsible for the gains, the work would provide a useful engineering contribution for real-time face detection in mobile augmented reality pipelines. The emphasis on GPU-friendly design choices directly addresses deployment constraints on mobile hardware. The paper is presented as an applied artifact rather than a parameter-free derivation or theoretical result.

major comments (1)

[Results] Results section: the paper reports aggregate FPS on flagship devices and qualitative AR use-cases, but contains no ablation tables, no per-component timing breakdowns, and no direct comparison against an unmodified SSD-MobileNet baseline on the same hardware. This leaves open the possibility that model size alone, rather than the three cited modifications, accounts for the performance, undermining attribution of the central speed claim.

minor comments (1)

[Abstract] Abstract: the claim of 'accurate facial regions of interest' is stated without accompanying quantitative accuracy metrics (e.g., mAP or precision-recall) to accompany the FPS figures.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. We address the major comment below.

read point-by-point responses

Referee: [Results] Results section: the paper reports aggregate FPS on flagship devices and qualitative AR use-cases, but contains no ablation tables, no per-component timing breakdowns, and no direct comparison against an unmodified SSD-MobileNet baseline on the same hardware. This leaves open the possibility that model size alone, rather than the three cited modifications, accounts for the performance, undermining attribution of the central speed claim.

Authors: We agree that the manuscript would be strengthened by explicit ablation studies, per-component timing breakdowns, and a direct comparison to an unmodified SSD-MobileNet baseline on the same hardware. The current version focuses on the end-to-end performance of the integrated BlazeFace system on mobile GPUs. In the revised manuscript we will add ablation tables isolating the contributions of the custom backbone, modified anchor scheme, and tie-resolution method, along with the requested baseline comparison and timing details where available on the target devices. revision: yes

Circularity Check

0 steps flagged

No significant circularity; engineering artifact without load-bearing derivations or self-referential reductions

full rationale

The paper presents an empirical engineering result: a lightweight face detector with three listed modifications (feature extractor distinct from MobileNet, modified SSD anchors, alternative tie resolution). No equations, fitted parameters, predictions, or uniqueness theorems appear. Claims rest on reported FPS measurements and qualitative AR use-cases rather than any derivation chain that reduces to its own inputs by construction. Self-citations (if present) are not load-bearing for a central premise, and the work is self-contained against external benchmarks without tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the paper is an applied neural architecture description rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5659 in / 931 out tokens · 17962 ms · 2026-05-24T23:19:49.874342+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean period8 definition and 8-tick periodicity in reality_from_one_distinction echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we have adopted an alternative anchor scheme that stops at the 8 ×8 feature map dimensions without further downsampling... replaced 2 anchors per pixel in each of the 8 ×8, 4×4 and 2×2 resolutions by 6 anchors at 8×8
IndisputableMonolith/Cost/FunctionalEquation.lean Jcost uniqueness and washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

lightweight feature extraction network inspired by, but distinct from MobileNetV1/V2... GPU-friendly anchor scheme modified from SSD... improved tie resolution strategy alternative to non-maximum suppression

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BIDO: A Biometric Identity Online Authentication Framework
cs.ET 2026-05 unverdicted novelty 5.0

BIDO derives transient ECDSA keys from live facial biometrics salted with a memorized secret to produce non-resident WebAuthn credentials, achieving 99.51% verification accuracy on LFW without storing templates or PII.
UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks
cs.CR 2026-04 unverdicted novelty 5.0

UNSEEN combines AR access control, LLM unlearning to suppress profiles, and agent guardrails to defend against AR-LLM social engineering attacks, tested in a 60-person user study with 360 conversations.