arxiv: 2306.14289 · v2 · pith:NO6CY7DAnew · submitted 2023-06-25 · 💻 cs.CV

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong

Pith reviewed 2026-05-17 22:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords SAM · MobileSAM · lightweight model · knowledge distillation · image segmentation · mobile applications · zero-shot learning · ViT-H

The pith

Distilling SAM's heavy encoder into a lightweight one creates MobileSAM, over 60 times smaller with matching zero-shot segmentation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to adapt the Segment Anything Model for mobile devices by swapping its large image encoder for a much smaller version. A naive retraining approach fails because it tries to optimize the encoder and decoder together, leading to poor results with limited data. Instead, they use decoupled distillation to train only the new encoder to match the original heavy encoder's behavior while leaving the mask decoder untouched. This allows the lightweight model to inherit the original's capabilities without retraining everything. If successful, this makes high-quality zero-shot segmentation practical on resource-limited phones and edge devices.

Core claim

By distilling knowledge from the frozen ViT-H image encoder to a lightweight image encoder, the new model remains fully compatible with the original SAM mask decoder. This decoupled approach avoids the issues of joint optimization and produces MobileSAM, which is more than 60 times smaller than the original while achieving on-par performance across vision applications.

What carries the argument

Decoupled distillation of the image encoder, which trains the lightweight encoder independently to replicate the outputs of the original heavy encoder while keeping the mask decoder fixed.
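The idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the tiny linear "encoders", the threshold "decoder", the function names, and the choice of a mean-squared-error feature-matching loss, which the text here does not specify. It is not the paper's architecture or training code; it only shows the shape of the procedure, in which the teacher and decoder stay frozen and only the student is updated.

```python
import random

def teacher_encoder(x):
    """Frozen 'heavy' encoder (toy stand-in): a fixed linear map, never updated."""
    return [2.0 * x[0] - 1.0 * x[1], 0.5 * x[0] + 1.5 * x[1]]

def frozen_decoder(z):
    """Frozen 'mask decoder' (toy stand-in): thresholds the summed features."""
    return 1.0 if z[0] + z[1] > 0.0 else 0.0

def student_encoder(w, x):
    """Lightweight student (toy stand-in): a trainable 2x2 linear map."""
    return [w[0][0] * x[0] + w[0][1] * x[1],
            w[1][0] * x[0] + w[1][1] * x[1]]

def distill(steps=2000, lr=0.05, seed=0):
    """Train only the student so its features match the teacher's (MSE loss)."""
    rng = random.Random(seed)
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        target = teacher_encoder(x)        # teacher output is a fixed target
        pred = student_encoder(w, x)
        for i in range(2):
            err = pred[i] - target[i]      # gradient of the squared error, up to a constant
            for j in range(2):
                w[i][j] -= lr * err * x[j] # SGD step on the student weights only
    return w

weights = distill()
sample = [0.5, 0.5]
# The decoder was never touched, yet it can consume the student's features.
print(frozen_decoder(student_encoder(weights, sample)),
      frozen_decoder(teacher_encoder(sample)))
```

The decoder never appears in the training loop, which is the point of the decoupling: compatibility with it is expected to follow from feature matching alone, and that is the premise the rest of this page examines.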

If this is right

MobileSAM reaches inference speeds of around 10ms per image on a single GPU, as stated in the abstract, with about 8ms spent in the encoder and 4ms in the decoder.
The model is around 5 times faster and 7 times smaller than the concurrent FastSAM method.
MobileSAM can run relatively smoothly on CPU, enabling use in mobile applications.
Training completes on a single GPU in less than one day.
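The timing figures quoted in the abstract can be combined in a few lines. The sketch below uses only the 8 ms encoder and 4 ms decoder numbers from the abstract; the function names are illustrative. Note that the two components sum to 12 ms, slightly above the abstract's rounded headline of around 10 ms per image.

```python
def per_image_latency_ms(encoder_ms, decoder_ms):
    """Total latency for one image when the encoder and decoder run back to back."""
    return encoder_ms + decoder_ms

def images_per_second(latency_ms):
    """Throughput implied by a per-image latency given in milliseconds."""
    return 1000.0 / latency_ms

total = per_image_latency_ms(8.0, 4.0)   # figures quoted in the abstract
fps = images_per_second(total)
print(f"{total:.0f} ms per image, about {fps:.1f} images/s")
# prints: 12 ms per image, about 83.3 images/s
```

In interactive use the encoder typically runs once per image while the decoder runs once per prompt, so the 4 ms decoder cost is the part that repeats as a user adds prompts.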

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar distillation techniques could be applied to other large foundation models in vision to create mobile-friendly versions without full retraining.
Further hardware-specific optimizations like model quantization or pruning might yield even smaller and faster variants suitable for specific devices.
Testing on real mobile hardware with diverse image types would validate the practical speed and accuracy gains.

Load-bearing premise

That the lightweight encoder, trained only by mimicking the frozen original encoder, will work seamlessly with the unchanged mask decoder on a wide variety of downstream tasks without any extra fine-tuning.

What would settle it

Observing that MobileSAM underperforms the original SAM by a large margin on standard zero-shot segmentation benchmarks like COCO or LVIS without any additional training would indicate the assumption of compatibility is false.
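A check of this kind usually comes down to mask overlap. The sketch below is a minimal, assumed implementation of intersection-over-union (IoU) and its mean over a set of mask pairs, using nested lists of 0/1 as a stand-in mask format; the helper names and the tiny example masks are hypothetical and are not taken from the paper's evaluation code.

```python
def mask_iou(pred, ref):
    """IoU between two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return 1.0 if union == 0 else inter / union

def mean_iou(pairs):
    """Average IoU over (predicted_mask, reference_mask) pairs."""
    return sum(mask_iou(p, r) for p, r in pairs) / len(pairs)

# Hypothetical masks: the reference could come from the original SAM or from ground truth.
reference = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
candidate = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(mask_iou(candidate, reference))                                      # 0.75
print(mean_iou([(candidate, reference), (reference, reference)]))          # 0.875
```

A large, consistent drop in this score for MobileSAM relative to the original SAM under the same prompts would be the kind of evidence described above.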

Original abstract

Segment Anything Model (SAM) has attracted significant attention due to its impressive zero-shot transfer performance and high versatility for numerous vision applications (like image editing with fine-grained control). Many of such applications need to be run on resource-constraint edge devices, like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. A naive way to train such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training sources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, motivated by which we propose decoupled distillation. Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be automatically compatible with the mask decoder in the original SAM. The training can be completed on a single GPU within less than one day, and the resulting lightweight SAM is termed MobileSAM which is more than 60 times smaller yet performs on par with the original SAM. For inference speed, With a single GPU, MobileSAM runs around 10ms per image: 8ms on the image encoder and 4ms on the mask decoder. With superior performance, our MobileSAM is around 5 times faster than the concurrent FastSAM and 7 times smaller, making it more suitable for mobile applications. Moreover, we show that MobileSAM can run relatively smoothly on CPU. The code for our project is provided at \href{https://github.com/ChaoningZhang/MobileSAM}{\textcolor{red}{MobileSAM}}), with a demo showing that MobileSAM can run relatively smoothly on CPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MobileSAM gets a 60x smaller SAM via decoupled encoder distillation with usable speed, but the compatibility evidence stays thin without tables or ablations.

The main thing to know is that this paper produces MobileSAM, a distilled lightweight version of SAM that is over 60 times smaller than the original while claiming on-par zero-shot performance and inference fast enough for mobile use. The authors achieve this by replacing the heavy ViT-H encoder with a small one trained via decoupled distillation from the frozen original encoder, then pairing it with the unchanged mask decoder. Training finishes on a single GPU in under a day. They release code and a CPU demo, and they report roughly 10 ms per image on GPU along with reasonable CPU performance. MobileSAM also comes out ahead of the concurrent FastSAM in both speed and size. That practical outcome is the clearest value here.

The decoupled schedule is the concrete step they add. Standard distillation is not new, but applying it this way to keep decoder compatibility without joint fine-tuning is the specific recipe that lets the work succeed where naive retraining fails. They earn credit for identifying the coupled optimization problem with limited data and for shipping a usable model quickly.

The soft spots sit mostly in the evidence. The abstract states the performance and speed claims but gives no quantitative tables, error bars, or direct ablations comparing decoupled versus coupled training. The central assumption, that distilling the encoder alone produces features close enough to the original ViT-H outputs for the frozen decoder to retain zero-shot quality across tasks, remains plausible yet untested in the provided text for fine boundaries or out-of-distribution cases. The stress-test note correctly flags this alignment step as the least secure part, and nothing in the abstract closes that gap with targeted checks.

This paper is for engineers and researchers who need a ready-to-run compressed SAM for edge devices in editing, AR, or robotics. A reader who wants a concrete model and code to try will get immediate use from it. I would send it for peer review. The empirical result is grounded enough to deserve referee time even if the method is incremental and the experiments need more detail.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MobileSAM, a lightweight adaptation of the Segment Anything Model (SAM) for mobile and edge devices. It replaces the original ViT-H image encoder with a lightweight encoder trained via decoupled knowledge distillation from the frozen original encoder, keeping the mask decoder unchanged to maintain compatibility. The authors claim this avoids the failures of naive joint training, yielding a model over 60 times smaller than the original SAM that performs on par, runs at ~10ms per image on GPU (8ms encoder + 4ms decoder), is 5x faster and 7x smaller than concurrent FastSAM, and runs smoothly on CPU, with code and a demo provided.

Significance. If the empirical claims hold, this provides a practical path to deploy zero-shot segmentation on resource-constrained devices, addressing a key barrier for real-world use of SAM. The efficient single-GPU training (<1 day) and public code release are clear strengths that enhance reproducibility and impact.

major comments (2)

Abstract: the central claim that MobileSAM 'performs on par with the original SAM' and is 'around 5 times faster than the concurrent FastSAM' is load-bearing but unsupported by any quantitative tables, metrics, error bars, or task-specific breakdowns in the abstract; without these, the magnitude of any performance gap on zero-shot or downstream tasks cannot be assessed.
Decoupled distillation section: the assumption that distilling only the lightweight encoder from the frozen ViT-H (while leaving the mask decoder untouched) produces features sufficiently aligned for the original decoder to retain zero-shot performance across diverse tasks is not directly evidenced by ablations or comparisons to joint fine-tuning; this is critical because the original SAM was jointly optimized and the paper notes naive joint training fails.
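The alignment question raised in the second comment can be made measurable. The sketch below shows two common ways to compare a student embedding against a teacher embedding, mean squared error and cosine similarity. It is an assumed illustration: the flat-list embedding format, the function names, and the example vectors are hypothetical and do not come from the manuscript.

```python
import math

def embedding_mse(teacher, student):
    """Mean squared error between two equal-length feature vectors."""
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)

def cosine_similarity(teacher, student):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(t * s for t, s in zip(teacher, student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    if norm_t == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_t * norm_s)

# Hypothetical embeddings for one image, flattened to short vectors.
teacher_features = [1.0, 2.0, 3.0]
student_features = [1.0, 2.0, 2.0]
print(embedding_mse(teacher_features, student_features))
print(cosine_similarity(teacher_features, student_features))
```

Low feature error alone does not prove the frozen decoder behaves the same, so a check like this would complement the mask-level ablation the comment asks for and would not replace it.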

minor comments (1)

Abstract: the GitHub link contains raw LaTeX commands (e.g., href and textcolor{red}) that should be rendered or removed in the final version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below and indicate the planned revisions.

Referee: Abstract: the central claim that MobileSAM 'performs on par with the original SAM' and is 'around 5 times faster than the concurrent FastSAM' is load-bearing but unsupported by any quantitative tables, metrics, error bars, or task-specific breakdowns in the abstract; without these, the magnitude of any performance gap on zero-shot or downstream tasks cannot be assessed.

Authors: We agree that the abstract would be strengthened by including key quantitative metrics. In the revised manuscript we will update the abstract to report the >60x size reduction, on-par zero-shot performance, ~10 ms inference time, and 5x speed / 7x size advantage over FastSAM, with explicit pointers to the corresponding tables and figures in the main text. Space constraints preclude full tables or error bars in the abstract itself, but the added numbers will allow readers to assess the claims directly.
revision: yes
Referee: Decoupled distillation section: the assumption that distilling only the lightweight encoder from the frozen ViT-H (while leaving the mask decoder untouched) produces features sufficiently aligned for the original decoder to retain zero-shot performance across diverse tasks is not directly evidenced by ablations or comparisons to joint fine-tuning; this is critical because the original SAM was jointly optimized and the paper notes naive joint training fails.

Authors: We acknowledge that a direct head-to-head ablation would provide stronger evidence. The current manuscript already states that naive joint training yields unsatisfactory results, which motivated the decoupled design. In the revision we will add a new ablation table that compares (i) the proposed decoupled distillation against (ii) joint fine-tuning of a lightweight encoder plus the original decoder, reporting zero-shot mIoU / IoU metrics on the standard SAM evaluation benchmarks to quantify the alignment benefit.
revision: yes

Circularity Check

0 steps flagged

No circularity: standard external distillation against frozen teacher

The paper describes training a lightweight image encoder via knowledge distillation to match the outputs of the original frozen ViT-H encoder from SAM, while leaving the mask decoder unchanged. This is a conventional empirical procedure using an external teacher model and limited training data; no equations, fitted parameters, or predictions are shown to reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim of decoder compatibility is presented as an observed outcome of the decoupled training rather than a self-referential definition. The derivation chain relies on external benchmarks (original SAM performance) and is therefore self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that the original SAM mask decoder is already optimal and that distillation from ViT-H features is sufficient to recover its behavior with a smaller encoder. It introduces no new physical or mathematical axioms; the main free parameters are the choice of lightweight backbone architecture and the distillation temperature or loss weighting.

free parameters (2)

lightweight encoder architecture: Choice of specific tiny ViT or CNN variant used as student; selected to balance speed and accuracy.
distillation hyperparameters: Loss weights and training schedule for the decoupled distillation; tuned on limited data.

axioms (1)

domain assumption: The original SAM mask decoder remains fixed and optimal when paired with a distilled encoder.
Invoked when the authors state the distilled encoder is automatically compatible with the original decoder.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries
cs.AI 2026-05 unverdicted novelty 7.0

ProCompNav disambiguates ambiguous instance navigation queries via candidate-pool construction followed by attribute-based comparative binary questions that prune distractors, yielding higher success rates and shorter...
Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries
cs.AI 2026-05 unverdicted novelty 7.0

ProCompNav improves success rate and shortens user responses in ambiguous instance navigation by using comparative binary questions that prune a candidate pool rather than requesting detailed descriptions.
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

PR-MaGIC refines prompts in in-context segmentation via test-time gradient flow from the mask decoder plus top-1 selection, yielding better masks across benchmarks without training.
Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks
cs.CV 2026-04 accept novelty 7.0

Boxes2Pixels distills noisy SAM pseudo-masks into a compact DINOv2-based student with auxiliary localization and one-sided self-correction, delivering +6.97 anomaly mIoU and +9.71 binary IoU gains over baselines on wi...
OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
cs.CV 2026-01 conditional novelty 7.0

OmniOVCD uses SAM 3's decoupled outputs and an SFID strategy to achieve state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 on four OVCD benchmarks, surpassing prior methods.
Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

GLMap combines explicit 3D Gaussians with multi-scale language semantics in a dual-modality structure and uses an analytical Gaussian Estimator for incremental map building, improving zero-shot performance on navigati...
HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
cs.RO 2026-04 unverdicted novelty 6.0

HTNav combines imitation and reinforcement learning in a staged, tiered structure with map learning to reach state-of-the-art performance on the CityNav benchmark for urban aerial navigation.
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
cs.CV 2025-11 unverdicted novelty 6.0

Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
cs.CV 2024-01 unverdicted novelty 6.0

Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
The Midas Touch for Metric Depth
cs.CV 2026-05 unverdicted novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
Deep Reprogramming Distillation for Medical Foundation Models
cs.CV 2026-05 unverdicted novelty 5.0

DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...
TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
cs.CV 2026-05 unverdicted novelty 5.0

TrajRAG uses a topological-polar trajectory representation and hierarchical retrieval to accumulate and reuse geometric-semantic navigation experiences, improving zero-shot ObjectNav on MP3D and HM3D benchmarks.
Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics
cs.CV 2026-04 unverdicted novelty 5.0

Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
cs.CV 2026-04 unverdicted novelty 5.0

Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.
SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
cs.CV 2026-04 unverdicted novelty 5.0

SocialMirror reconstructs 3D meshes of closely interacting humans from monocular videos using semantic guidance from vision-language models and geometric constraints in a diffusion model to handle occlusions and maint...
IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments
cs.RO 2026-03 unverdicted novelty 5.0

IGV-RRT improves object goal navigation in dynamic indoor environments by combining uncertainty-aware priors from 3D scene graphs with online VLM observations in a real-time tree planner.
Towards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement
cs.CV 2026-01 unverdicted novelty 5.0

GleSAM++ improves SAM robustness on degraded images by using generative enhancement, feature alignment, and adaptive degradation prediction while adding few parameters.
A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration
cs.RO 2026-04 unverdicted novelty 4.0

A modular VLN architecture builds a cognitive memory graph, decomposes it for VLM reasoning, and solves a weighted traveling repairman problem for context-aware exploration to achieve real-time performance and higher ...
