Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
Pith reviewed 2026-05-17 22:37 UTC · model grok-4.3
The pith
Distilling SAM's heavy encoder into a lightweight one creates MobileSAM, over 60 times smaller with matching zero-shot segmentation performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By distilling knowledge from the frozen ViT-H image encoder to a lightweight image encoder, the new model remains fully compatible with the original SAM mask decoder. This decoupled approach avoids the issues of joint optimization and produces MobileSAM, which is more than 60 times smaller than the original while achieving on-par performance across vision applications.
What carries the argument
Decoupled distillation of the image encoder, which trains the lightweight encoder independently to replicate the outputs of the original heavy encoder while keeping the mask decoder fixed.
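A minimal sketch of the decoupled idea, under loudly stated assumptions: the "teacher" below is a toy fixed linear map standing in for the frozen heavy encoder, the "student" is a toy trainable linear map standing in for the lightweight encoder, and the loss is a plain mean squared error on the embeddings. The function names, the toy dimensions, the learning rate, and the choice of MSE are all illustrative and are not taken from the paper's code. The point it shows is structural: only the student is updated, the teacher supplies targets, and the decoder is never touched.

```python
import random

def teacher_encoder(x):
    """Toy stand-in for the frozen heavy encoder: a fixed map, never updated."""
    return [2.0 * x[0] - 1.0 * x[1], 0.5 * x[0] + 3.0 * x[1]]

def student_encoder(w, x):
    """Toy stand-in for the lightweight encoder: a trainable 2x2 linear map."""
    return [w[0][0] * x[0] + w[0][1] * x[1],
            w[1][0] * x[0] + w[1][1] * x[1]]

def frozen_decoder(embedding):
    """Toy stand-in for the unchanged mask decoder: consumes an embedding as-is."""
    return 1 if embedding[0] + embedding[1] > 0 else 0

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def distill(steps=2000, lr=0.1, seed=0):
    """Train the student to reproduce teacher embeddings (assumed MSE objective)."""
    rng = random.Random(seed)
    w = [[0.0, 0.0], [0.0, 0.0]]
    losses = []
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        target = teacher_encoder(x)      # teacher output is a fixed target
        pred = student_encoder(w, x)
        losses.append(mse(pred, target))
        # d(mse)/d(w[i][j]) = (2 / n) * (pred[i] - target[i]) * x[j], with n = 2
        for i in range(2):
            err = pred[i] - target[i]
            for j in range(2):
                w[i][j] -= lr * err * x[j]
    return w, losses
```

Calling `distill()` returns the learned student weights and the per-step losses. Once the student's embeddings match the teacher's, passing either embedding to `frozen_decoder` gives the same result, which is the compatibility property the review describes.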
If this is right
- MobileSAM runs at around 10ms per image on a single GPU, reported as 8ms for the image encoder and 4ms for the mask decoder.
- The model is around 5 times faster and 7 times smaller than the concurrent FastSAM method.
- MobileSAM can run relatively smoothly on CPU, enabling use in mobile applications.
- Training completes on a single GPU in less than one day.
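The per-stage latencies quoted above can be turned into rough throughput figures. The sketch below uses only the two stage timings reported in the abstract (8ms encoder, 4ms decoder). It assumes, as in SAM's usual usage pattern, that the image encoder runs once per image while the mask decoder runs once per prompt; the function names are hypothetical. The two stage figures sum to 12ms while the abstract's headline is "around 10ms", so the stages are kept separate here rather than using a single total.

```python
ENCODER_MS = 8.0  # reported image-encoder latency per image (single GPU)
DECODER_MS = 4.0  # reported mask-decoder latency per pass (single GPU)

def latency_ms(num_prompts=1):
    """Latency for one image, assuming one decoder pass per prompt."""
    return ENCODER_MS + DECODER_MS * num_prompts

def images_per_second(num_prompts=1):
    """Throughput implied by the latency above, ignoring batching and I/O."""
    return 1000.0 / latency_ms(num_prompts)
```

For example, `latency_ms(1)` is 12.0 and `latency_ms(5)` is 28.0 under these assumptions, so the encoder cost is amortized as more prompts are issued against the same image.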
Where Pith is reading between the lines
- Similar distillation techniques could be applied to other large foundation models in vision to create mobile-friendly versions without full retraining.
- Further hardware-specific optimizations like model quantization or pruning might yield even smaller and faster variants suitable for specific devices.
- Testing on real mobile hardware with diverse image types would validate the practical speed and accuracy gains.
Load-bearing premise
That the lightweight encoder, trained only by mimicking the frozen original encoder, will work seamlessly with the unchanged mask decoder on a wide variety of downstream tasks without any extra fine-tuning.
What would settle it
Observing that MobileSAM underperforms the original SAM by a large margin on standard zero-shot segmentation benchmarks like COCO or LVIS without any additional training would indicate the assumption of compatibility is false.
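One way such a comparison could be scored is with mean intersection-over-union (mIoU) computed for each model against the same reference masks. The sketch below is a generic IoU helper on toy binary masks; the mask values are made up for illustration, and what counts as a "large margin" is not defined in the source and is not set here.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union

def mean_iou(preds, refs):
    """Average IoU over paired predicted and reference masks."""
    return sum(iou(p, r) for p, r in zip(preds, refs)) / len(preds)

# Toy data (hypothetical): one exact match and one half-overlapping prediction.
reference = [[1, 1], [0, 0]]
exact_prediction = [[1, 1], [0, 0]]
partial_prediction = [[1, 0], [0, 0]]
```

Running `mean_iou` once per model over the same benchmark images would give two numbers whose difference is the performance gap in question.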
Original abstract
Segment Anything Model (SAM) has attracted significant attention due to its impressive zero-shot transfer performance and high versatility for numerous vision applications (like image editing with fine-grained control). Many of such applications need to be run on resource-constraint edge devices, like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. A naive way to train such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training sources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, motivated by which we propose decoupled distillation. Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be automatically compatible with the mask decoder in the original SAM. The training can be completed on a single GPU within less than one day, and the resulting lightweight SAM is termed MobileSAM which is more than 60 times smaller yet performs on par with the original SAM. For inference speed, With a single GPU, MobileSAM runs around 10ms per image: 8ms on the image encoder and 4ms on the mask decoder. With superior performance, our MobileSAM is around 5 times faster than the concurrent FastSAM and 7 times smaller, making it more suitable for mobile applications. Moreover, we show that MobileSAM can run relatively smoothly on CPU. The code for our project is provided at \href{https://github.com/ChaoningZhang/MobileSAM}{\textcolor{red}{MobileSAM}}), with a demo showing that MobileSAM can run relatively smoothly on CPU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MobileSAM, a lightweight adaptation of the Segment Anything Model (SAM) for mobile and edge devices. It replaces the original ViT-H image encoder with a lightweight encoder trained via decoupled knowledge distillation from the frozen original encoder, keeping the mask decoder unchanged to maintain compatibility. The authors claim this avoids the failures of naive joint training, yielding a model over 60 times smaller than the original SAM that performs on par, runs at ~10ms per image on GPU (8ms encoder + 4ms decoder), is 5x faster and 7x smaller than concurrent FastSAM, and runs smoothly on CPU, with code and a demo provided.
Significance. If the empirical claims hold, this provides a practical path to deploy zero-shot segmentation on resource-constrained devices, addressing a key barrier for real-world use of SAM. The efficient single-GPU training (<1 day) and public code release are clear strengths that enhance reproducibility and impact.
major comments (2)
- Abstract: the central claim that MobileSAM 'performs on par with the original SAM' and is 'around 5 times faster than the concurrent FastSAM' is load-bearing but unsupported by any quantitative tables, metrics, error bars, or task-specific breakdowns in the abstract; without these, the magnitude of any performance gap on zero-shot or downstream tasks cannot be assessed.
- Decoupled distillation section: the assumption that distilling only the lightweight encoder from the frozen ViT-H (while leaving the mask decoder untouched) produces features sufficiently aligned for the original decoder to retain zero-shot performance across diverse tasks is not directly evidenced by ablations or comparisons to joint fine-tuning; this is critical because the original SAM was jointly optimized and the paper notes naive joint training fails.
minor comments (1)
- Abstract: the GitHub link contains raw LaTeX commands (e.g., \href and \textcolor{red}) that should be rendered or removed in the final version.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below and indicate the planned revisions.
- Referee: Abstract: the central claim that MobileSAM 'performs on par with the original SAM' and is 'around 5 times faster than the concurrent FastSAM' is load-bearing but unsupported by any quantitative tables, metrics, error bars, or task-specific breakdowns in the abstract; without these, the magnitude of any performance gap on zero-shot or downstream tasks cannot be assessed.
  Authors: We agree that the abstract would be strengthened by including key quantitative metrics. In the revised manuscript we will update the abstract to report the >60x size reduction, on-par zero-shot performance, ~10 ms inference time, and 5x speed / 7x size advantage over FastSAM, with explicit pointers to the corresponding tables and figures in the main text. Space constraints preclude full tables or error bars in the abstract itself, but the added numbers will allow readers to assess the claims directly.
  revision: yes
- Referee: Decoupled distillation section: the assumption that distilling only the lightweight encoder from the frozen ViT-H (while leaving the mask decoder untouched) produces features sufficiently aligned for the original decoder to retain zero-shot performance across diverse tasks is not directly evidenced by ablations or comparisons to joint fine-tuning; this is critical because the original SAM was jointly optimized and the paper notes naive joint training fails.
  Authors: We acknowledge that a direct head-to-head ablation would provide stronger evidence. The current manuscript already states that naive joint training yields unsatisfactory results, which motivated the decoupled design. In the revision we will add a new ablation table that compares (i) the proposed decoupled distillation against (ii) joint fine-tuning of a lightweight encoder plus the original decoder, reporting zero-shot mIoU / IoU metrics on the standard SAM evaluation benchmarks to quantify the alignment benefit.
  revision: yes
Circularity Check
No circularity: standard external distillation against frozen teacher
The paper describes training a lightweight image encoder via knowledge distillation to match the outputs of the original frozen ViT-H encoder from SAM, while leaving the mask decoder unchanged. This is a conventional empirical procedure using an external teacher model and limited training data; no equations, fitted parameters, or predictions are shown to reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim of decoder compatibility is presented as an observed outcome of the decoupled training rather than a self-referential definition. The evaluation is anchored to an external reference (the original SAM's performance), so the argument does not depend on its own outputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- lightweight encoder architecture
- distillation hyperparameters
axioms (1)
- domain assumption: The original SAM mask decoder remains fixed and optimal when paired with a distilled encoder
Forward citations
Cited by 18 Pith papers
- Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries
  ProCompNav disambiguates ambiguous instance navigation queries via candidate-pool construction followed by attribute-based comparative binary questions that prune distractors, yielding higher success rates and shorter...
- Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries
  ProCompNav improves success rate and shortens user responses in ambiguous instance navigation by using comparative binary questions that prune a candidate pool rather than requesting detailed descriptions.
- PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
  PR-MaGIC refines prompts in in-context segmentation via test-time gradient flow from the mask decoder plus top-1 selection, yielding better masks across benchmarks without training.
- Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks
  Boxes2Pixels distills noisy SAM pseudo-masks into a compact DINOv2-based student with auxiliary localization and one-sided self-correction, delivering +6.97 anomaly mIoU and +9.71 binary IoU gains over baselines on wi...
- OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
  OmniOVCD uses SAM 3's decoupled outputs and an SFID strategy to achieve state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 on four OVCD benchmarks, surpassing prior methods.
- Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
  GLMap combines explicit 3D Gaussians with multi-scale language semantics in a dual-modality structure and uses an analytical Gaussian Estimator for incremental map building, improving zero-shot performance on navigati...
- HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
  HTNav combines imitation and reinforcement learning in a staged, tiered structure with map learning to reach state-of-the-art performance on the CityNav benchmark for urban aerial navigation.
- Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
  Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.
- Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
  Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
- The Midas Touch for Metric Depth
  MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
- Deep Reprogramming Distillation for Medical Foundation Models
  DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...
- TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
  TrajRAG uses a topological-polar trajectory representation and hierarchical retrieval to accumulate and reuse geometric-semantic navigation experiences, improving zero-shot ObjectNav on MP3D and HM3D benchmarks.
- Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics
  Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.
- Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
  Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.
- SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
  SocialMirror reconstructs 3D meshes of closely interacting humans from monocular videos using semantic guidance from vision-language models and geometric constraints in a diffusion model to handle occlusions and maint...
- IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments
  IGV-RRT improves object goal navigation in dynamic indoor environments by combining uncertainty-aware priors from 3D scene graphs with online VLM observations in a real-time tree planner.
- Towards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement
  GleSAM++ improves SAM robustness on degraded images by using generative enhancement, feature alignment, and adaptive degradation prediction while adding few parameters.
- A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration
  A modular VLN architecture builds a cognitive memory graph, decomposes it for VLM reasoning, and solves a weighted traveling repairman problem for context-aware exploration to achieve real-time performance and higher ...