BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
Pith reviewed 2026-05-24 23:19 UTC · model grok-4.3
The pith
BlazeFace detects faces at 200-1000+ FPS on mobile GPUs using a custom lightweight network.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BlazeFace is a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200-1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial features or expression classification, and face region segmentation. The contributions include a lightweight feature extraction network inspired by but distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy offered as an alternative to non-maximum suppression (NMS).
What carries the argument
Lightweight feature extraction network combined with GPU-friendly SSD-style anchors and non-NMS tie resolution.
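To make the anchor change concrete, the arithmetic below counts anchors for the layers named in the paper passage quoted further down this page (2 anchors per pixel at 8×8, 4×4 and 2×2, replaced by 6 anchors at 8×8). It is a sketch covering only those layers: the helper name `anchor_count` is illustrative, and any other feature maps in the full model are not counted.

```python
def anchor_count(layers):
    """Total anchors for a list of (grid_side, anchors_per_cell) pairs."""
    return sum(side * side * per_cell for side, per_cell in layers)

# Layers exactly as described in the quoted passage; nothing else is assumed.
ssd_style = [(8, 2), (4, 2), (2, 2)]   # 2 anchors per pixel at three resolutions
blazeface_style = [(8, 6)]             # 6 anchors per pixel at 8x8 only

ssd_total = anchor_count(ssd_style)          # 2 * (64 + 16 + 4) = 168
blaze_total = anchor_count(blazeface_style)  # 6 * 64 = 384
print(ssd_total, blaze_total)
```

Over these layers the anchor count goes up rather than down, so any speed benefit would have to come from running fewer low-resolution stages, not from fewer anchors. That reading is an inference, not a statement from the paper.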
If this is right
- The detector can supply accurate facial regions of interest to downstream AR models for keypoint estimation.
- It supports real-time facial expression classification and feature analysis on phones.
- Face region segmentation becomes feasible within live AR pipelines.
- The approach works across flagship mobile devices without custom hardware.
Where Pith is reading between the lines
- The same anchor and tie-resolution adjustments might apply to other single-shot detectors on mobile GPUs.
- Sustained high frame rates could lower average power use in always-on camera applications.
- Design patterns here could guide speed optimizations for related tasks like hand or body detection.
Load-bearing premise
The described changes to the feature extractor, anchor scheme, and tie resolution produce the stated speed and accuracy on mobile GPUs.
What would settle it
Benchmark measurements on a flagship mobile device showing inference slower than 200 FPS or detection accuracy substantially below standard mobile face detectors.
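The FPS range in the claim maps directly onto per-frame latency, which is what a benchmark would actually record. The sketch below shows the conversion and a pass/fail check against the 200 FPS floor; the function names and the sample latencies are hypothetical, not measurements from the paper.

```python
def fps_to_latency_ms(fps):
    """Per-frame latency in milliseconds for a given frame rate."""
    return 1000.0 / fps

def latency_ms_to_fps(latency_ms):
    """Frame rate implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms

def meets_claim(latency_ms, min_fps=200.0):
    """True if a measured latency is at least as fast as the claimed floor."""
    return latency_ms_to_fps(latency_ms) >= min_fps

print(fps_to_latency_ms(200))    # 5.0 ms per frame at the low end of the range
print(fps_to_latency_ms(1000))   # 1.0 ms per frame; "sub-millisecond" means faster than this
print(meets_claim(6.2))          # hypothetical measurement, about 161 FPS: False
print(meets_claim(0.8))          # hypothetical measurement, 1250 FPS: True
```

On a real device the latency fed into `meets_claim` would need to be a GPU-synchronized measurement averaged over many frames; that harness is outside the scope of this sketch.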
Original abstract
We present BlazeFace, a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200-1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial features or expression classification, and face region segmentation. Our contributions include a lightweight feature extraction network inspired by, but distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy alternative to non-maximum suppression.
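The abstract names the tie resolution strategy without describing it. One way to replace suppression is to blend overlapping boxes into a score-weighted average instead of discarding all but the best one. The sketch below illustrates that idea only: whether it matches the paper's exact procedure is an assumption, and the function names, the 0.3 IoU threshold, and the (x1, y1, x2, y2) box format are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def blend_detections(boxes, scores, iou_threshold=0.3):
    """Merge overlapping boxes into score-weighted averages.

    Assumes all scores are positive. Returns (box, score) pairs, where the
    score is that of the highest-scoring member of each cluster.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    used = set()
    results = []
    for i in order:
        if i in used:
            continue
        # The seed box always belongs to its own cluster.
        cluster = [i] + [
            j for j in order
            if j != i and j not in used and iou(boxes[i], boxes[j]) >= iou_threshold
        ]
        used.update(cluster)
        total = sum(scores[j] for j in cluster)
        blended = tuple(
            sum(scores[j] * boxes[j][k] for j in cluster) / total for k in range(4)
        )
        results.append((blended, scores[i]))
    return results

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.6, 0.8]
merged = blend_detections(boxes, scores)
print(merged)  # two results: the first two boxes blended, the third kept as is
```

Plain NMS would return the first box unchanged and drop the second. Blending lets the second box pull the estimate toward it in proportion to its score, which can reduce frame-to-frame jitter when the top-scoring anchor changes between frames.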
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BlazeFace, a lightweight neural face detector optimized for mobile GPU inference. It claims to run at 200-1000+ FPS on flagship devices through three contributions: a custom feature extraction network distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from SSD, and an improved tie resolution strategy as an alternative to NMS. The detector is intended to supply accurate facial regions of interest as input to downstream AR models for tasks such as 2D/3D keypoint estimation, expression classification, and segmentation.
Significance. If the reported sub-millisecond performance holds and the three architectural modifications can be shown to be responsible for the gains, the work would provide a useful engineering contribution for real-time face detection in mobile augmented reality pipelines. The emphasis on GPU-friendly design choices directly addresses deployment constraints on mobile hardware. The paper is presented as an applied artifact rather than a parameter-free derivation or theoretical result.
major comments (1)
- [Results] Results section: the paper reports aggregate FPS on flagship devices and qualitative AR use-cases, but contains no ablation tables, no per-component timing breakdowns, and no direct comparison against an unmodified SSD-MobileNet baseline on the same hardware. This leaves open the possibility that model size alone, rather than the three cited modifications, accounts for the performance, undermining attribution of the central speed claim.
minor comments (1)
- [Abstract] Abstract: the claim of 'accurate facial regions of interest' is made without quantitative accuracy metrics (e.g., mAP or precision-recall) alongside the FPS figures.
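For reference, the kind of accuracy figure the referee asks for can be computed from predicted and ground-truth boxes with an IoU match criterion. The sketch below gives precision and recall at a single operating point, not a full mAP; the function names, the 0.5 IoU threshold, and the sample boxes are illustrative assumptions, and predictions are matched in the order given (a standard evaluation would sort them by descending confidence first).

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predicted boxes to ground-truth boxes."""
    matched = set()
    true_pos = 0
    for pred in predictions:
        best_j, best_iou = None, iou_threshold
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            overlap = iou(pred, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Hypothetical boxes: one good match, one false positive, one missed face.
preds = [(0, 0, 10, 10), (100, 100, 110, 110)]
truth = [(1, 1, 11, 11), (50, 50, 60, 60)]
precision, recall = precision_recall(preds, truth)
print(precision, recall)  # 0.5 0.5
```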
Simulated Author's Rebuttal
We appreciate the referee's constructive feedback on our manuscript. We address the major comment below.
Point-by-point responses
- Referee: [Results] Results section: the paper reports aggregate FPS on flagship devices and qualitative AR use-cases, but contains no ablation tables, no per-component timing breakdowns, and no direct comparison against an unmodified SSD-MobileNet baseline on the same hardware. This leaves open the possibility that model size alone, rather than the three cited modifications, accounts for the performance, undermining attribution of the central speed claim.
  Authors: We agree that the manuscript would be strengthened by explicit ablation studies, per-component timing breakdowns, and a direct comparison to an unmodified SSD-MobileNet baseline on the same hardware. The current version focuses on the end-to-end performance of the integrated BlazeFace system on mobile GPUs. In the revised manuscript we will add ablation tables isolating the contributions of the custom backbone, modified anchor scheme, and tie-resolution method, along with the requested baseline comparison and timing details where available on the target devices.
  Revision: yes
Circularity Check
No significant circularity; engineering artifact without load-bearing derivations or self-referential reductions
Full rationale
The paper presents an empirical engineering result: a lightweight face detector with three listed modifications (feature extractor distinct from MobileNet, modified SSD anchors, alternative tie resolution). No equations, fitted parameters, predictions, or uniqueness theorems appear. Claims rest on reported FPS measurements and qualitative AR use-cases rather than any derivation chain that reduces to its own inputs by construction. Self-citations (if present) are not load-bearing for a central premise, and the work is self-contained against external benchmarks without tautological reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: period8 definition and 8-tick periodicity in reality_from_one_distinction (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "we have adopted an alternative anchor scheme that stops at the 8×8 feature map dimensions without further downsampling... replaced 2 anchors per pixel in each of the 8×8, 4×4 and 2×2 resolutions by 6 anchors at 8×8"
- IndisputableMonolith/Cost/FunctionalEquation.lean: Jcost uniqueness and washburn_uniqueness_aczel (tag: unclear)
  UNCLEAR: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "lightweight feature extraction network inspired by, but distinct from MobileNetV1/V2... GPU-friendly anchor scheme modified from SSD... improved tie resolution strategy alternative to non-maximum suppression"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- BIDO: A Biometric Identity Online Authentication Framework
  BIDO derives transient ECDSA keys from live facial biometrics salted with a memorized secret to produce non-resident WebAuthn credentials, achieving 99.51% verification accuracy on LFW without storing templates or PII.
- UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks
  UNSEEN combines AR access control, LLM unlearning to suppress profiles, and agent guardrails to defend against AR-LLM social engineering attacks, tested in a 60-person user study with 360 conversations.