pith. machine review for the scientific record.

arxiv: 2604.20169 · v2 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Semantic-Fast-SAM: Efficient Semantic Segmenter

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic segmentation · FastSAM · Segment Anything Model · real-time inference · open-vocabulary segmentation · CLIP · Cityscapes · ADE20K

The pith

Semantic-Fast-SAM produces accurate semantic segmentation maps in real time by pairing rapid mask generation with category labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Semantic-Fast-SAM to turn an efficient mask generator into a full semantic segmenter. It starts with FastSAM, which creates masks quickly using a convolutional network rather than a transformer, then attaches a labeling pipeline to assign categories to each mask. This yields segmentation results that match earlier SAM-based methods on standard datasets while running far faster. The design also extends to naming classes not seen in training by using CLIP embeddings for the labels. A reader would care because the speed gain makes foundation-style segmentation usable in settings that need immediate output, such as mobile robots.
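As a rough illustration of the two-stage design just described, the sketch below wires a class-agnostic mask generator to a per-mask labeler and paints the results into a single semantic map. The helper names are placeholders, not the authors' API; the actual implementation is in the linked repository.

```python
# Minimal sketch of the two-stage pipeline described above.
# generate_masks and label_mask are hypothetical placeholders for FastSAM
# mask generation and the SSA/CLIP labeling step, respectively.
import numpy as np

def generate_masks(image: np.ndarray) -> list[np.ndarray]:
    """Placeholder for FastSAM's class-agnostic mask generation.
    Returns a list of boolean masks, one per segmented region."""
    raise NotImplementedError

def label_mask(image: np.ndarray, mask: np.ndarray, class_names: list[str]) -> int:
    """Placeholder for the labeling step: returns the index of the class
    assigned to this mask."""
    raise NotImplementedError

def semantic_segment(image: np.ndarray, class_names: list[str]) -> np.ndarray:
    """Compose class-agnostic masks with per-mask labels into a semantic map."""
    h, w = image.shape[:2]
    semantic_map = np.full((h, w), fill_value=-1, dtype=np.int32)  # -1 = unlabeled
    for mask in generate_masks(image):
        class_idx = label_mask(image, mask, class_names)
        semantic_map[mask] = class_idx  # later masks overwrite earlier ones on overlap
    return semantic_map
```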

Core claim

Semantic-Fast-SAM combines FastSAM's rapid mask generation with an SSA labeling strategy to assign meaningful categories to each mask. The resulting model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments show it reaches mIoU of 70.33 on Cityscapes and 48.01 on ADE20K while achieving approximately 20x faster inference than SSA in the closed-set setting. It further handles open-vocabulary segmentation by leveraging CLIP-based semantic heads and outperforms recent open-vocabulary models on broad class labeling.

What carries the argument

The integration of FastSAM's CNN-based rapid mask generation with SSA's semantic labeling pipeline, extended by CLIP heads for assigning categories to masks.
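One plausible reading of "CLIP heads for assigning categories to masks" is sketched below: embed the masked crop with CLIP's image encoder and pick the nearest class-name text embedding. This is an assumption about the head's mechanics rather than the paper's confirmed implementation; the function name and prompt template are illustrative.

```python
# One plausible realization of a CLIP labeling head for a single mask: embed the
# masked crop and pick the closest class-name embedding. A sketch, not the
# authors' implementation.
import clip            # https://github.com/openai/CLIP
import torch
import numpy as np
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def label_mask_with_clip(image: np.ndarray, mask: np.ndarray, class_names: list[str]) -> str:
    """image: HxWx3 uint8 RGB array; mask: HxW boolean array."""
    # Crop to the mask's bounding box and zero out pixels outside the mask.
    ys, xs = np.nonzero(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
    crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0

    image_input = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
    text_input = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)

    # Cosine similarity between the mask embedding and every class-name embedding.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    similarity = (image_feat @ text_feat.T).squeeze(0)
    return class_names[similarity.argmax().item()]
```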

If this is right

  • Matches the accuracy of prior SAM-based semantic segmenters on Cityscapes and ADE20K benchmarks.
  • Delivers approximately 20 times faster inference than SSA in closed-set settings.
  • Supports effective open-vocabulary segmentation via CLIP heads and outperforms recent models on broad class labeling.
  • Operates at lower computational cost and memory footprint, enabling the segment-anything capability in real-time robotics scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairing of fast masks with labeling could be tested on other efficient mask generators to check if the speed gain is specific to FastSAM.
  • Lower compute needs might allow semantic segmentation to run directly on edge hardware without external servers.
  • Open-vocabulary results could improve with targeted fine-tuning of the CLIP heads on specific domains.
  • Frame-by-frame processing might be extended to video streams to test temporal consistency at the reported speeds.

Load-bearing premise

That attaching SSA semantic labeling to FastSAM masks preserves both mask quality and category accuracy without introducing significant errors or requiring extensive retraining.

What would settle it

Measuring mIoU and inference speed on a new held-out dataset using identical hardware would show whether SFS maintains the reported accuracy levels while staying roughly 20 times faster than SSA.
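For reference, the mIoU figure quoted in the claim is conventionally computed from a per-class confusion matrix accumulated over the evaluation split. The sketch below shows that standard computation; the ignore label and class count are assumptions (Cityscapes uses 19 evaluation classes).

```python
# Standard per-class IoU / mIoU from a confusion matrix (rows = ground truth,
# columns = predictions). The ignore label follows the common 255 convention.
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore: int = 255) -> np.ndarray:
    valid = gt != ignore
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf: np.ndarray) -> float:
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)  # skip absent classes
    return float(np.nanmean(iou))

# Usage: accumulate the confusion matrix over the held-out split, then report mIoU.
# conf = sum(confusion_matrix(p, g, num_classes=19) for p, g in predictions)
# print(mean_iou(conf))
```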

Figures

Figures reproduced from arXiv: 2604.20169 by Byunghyun Kim.

Figure 1. Comparison between Semantic-SAM and Semantic-Fast-SAM.
Figure 2. Overall architecture of Semantic-Fast-SAM.
Original abstract

We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Semantic-Fast-SAM (SFS), a framework that pairs the CNN-based FastSAM for rapid mask generation with an SSA semantic labeling pipeline (and CLIP heads for open-vocabulary cases) to deliver real-time semantic segmentation. It claims that SFS matches the accuracy of prior SAM-based methods (mIoU ~70.33 on Cityscapes, 48.01 on ADE20K) while running ~20x faster than SSA in closed-set settings, with additional gains in open-vocabulary labeling, all at reduced compute and memory cost. The implementation is released publicly.

Significance. If the performance claims hold under rigorous validation, the work would provide a practical route to deploying segment-anything capabilities in real-time robotics and edge scenarios. The explicit release of code is a clear strength that supports reproducibility and extension.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline mIoU values (70.33 Cityscapes, 48.01 ADE20K) and 20x speedup claim are presented without any description of experimental setup, data splits, number of runs, error bars, or ablation studies on the SSA-to-FastSAM transfer. This leaves the central accuracy-matching and efficiency claims only moderately supported.
  2. [Method] Method section (integration of SSA labeling with FastSAM masks): the claim that semantic accuracy is preserved without degradation or retraining rests on the untested assumption that SSA's labeling strategy transfers directly to CNN-generated masks whose boundary precision and object completeness may differ from transformer-based SAM. No mask-level metrics (boundary F-score, instance AP) or error-propagation analysis are reported to substantiate this.
minor comments (1)
  1. [Abstract] The abstract introduces 'SSA' only after using the acronym; a parenthetical expansion on first use would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the experimental reporting and methodological validation in our work. We address each major comment below with specific plans for revision.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline mIoU values (70.33 Cityscapes, 48.01 ADE20K) and 20x speedup claim are presented without any description of experimental setup, data splits, number of runs, error bars, or ablation studies on the SSA-to-FastSAM transfer. This leaves the central accuracy-matching and efficiency claims only moderately supported.

    Authors: We agree that the current presentation of results would benefit from greater detail to support the claims. In the revised manuscript, we will expand the Experiments section to explicitly describe the standard Cityscapes and ADE20K train/val/test splits, the hardware used for all timing experiments (single NVIDIA A100 GPU), and report mIoU and runtime averaged over three independent runs with standard deviations and error bars. We will also add an ablation study directly comparing semantic segmentation mIoU when the SSA labeling pipeline is applied to FastSAM masks versus the original SAM masks, confirming the transfer preserves accuracy while delivering the reported speedup. revision: yes

  2. Referee: [Method] Method section (integration of SSA labeling with FastSAM masks): the claim that semantic accuracy is preserved without degradation or retraining rests on the untested assumption that SSA's labeling strategy transfers directly to CNN-generated masks whose boundary precision and object completeness may differ from transformer-based SAM. No mask-level metrics (boundary F-score, instance AP) or error-propagation analysis are reported to substantiate this.

    Authors: This observation is well-taken, as differences in mask generation between the CNN-based FastSAM and transformer-based SAM could in principle affect downstream labeling. To substantiate the transfer, we will augment the Method and Experiments sections with mask-level evaluations on a held-out validation subset, reporting boundary F-score and instance AP for FastSAM masks relative to SAM masks. These metrics will be paired with a brief error-propagation analysis showing that observed differences do not lead to measurable degradation in final semantic mIoU. This addition will provide direct evidence for the no-retraining claim. revision: yes
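For context on the mask-level metrics proposed above, a common definition of the boundary F-score matches boundary pixels of predicted and ground-truth masks within a small pixel tolerance. The sketch below is one such implementation, with an arbitrarily chosen tolerance; it is not taken from the paper.

```python
# One common way to compute a boundary F-score: match boundary pixels of
# predicted and ground-truth boolean masks within a pixel tolerance.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels of a boolean mask: the mask minus its erosion."""
    return mask & ~binary_erosion(mask)

def boundary_f_score(pred: np.ndarray, gt: np.ndarray, tolerance: int = 2) -> float:
    pred_b, gt_b = boundary(pred), boundary(gt)
    struct = np.ones((2 * tolerance + 1, 2 * tolerance + 1), dtype=bool)
    # Precision: predicted boundary pixels that fall near the ground-truth boundary.
    precision = (pred_b & binary_dilation(gt_b, struct)).sum() / max(pred_b.sum(), 1)
    # Recall: ground-truth boundary pixels that fall near the predicted boundary.
    recall = (gt_b & binary_dilation(pred_b, struct)).sum() / max(gt_b.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```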

Circularity Check

0 steps flagged

No circularity: applied model composition evaluated on external benchmarks

Full rationale

The paper describes an engineering integration of two existing models (FastSAM for mask generation and SSA for semantic labeling) followed by standard benchmark evaluation on Cityscapes and ADE20K. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claims rest on reported mIoU numbers from public datasets rather than any internal reduction or self-referential definition. This is a normal non-circular applied paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no equations, parameters, or new entities are described, so the ledger is empty.

pith-pipeline@v0.9.0 · 5526 in / 1192 out tokens · 37323 ms · 2026-05-10T01:09:21.921615+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    Segment Anything

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” arXiv preprint arXiv:2304.02643, 2023

  2. [2]

    Semantic segment anything

    J. Chen, Z. Yang, and L. Zhang, “Semantic segment anything,” GitHub repository, 2023, https://github.com/fudan-zvg/Semantic-Segment-Anything

  3. [3]

    Oneformer: One transformer to rule universal image segmentation

    J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, “Oneformer: One transformer to rule universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  4. [4]

    Masked-attention mask transformer for universal image segmentation

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  5. [5]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proceedings of the International Conference on Machine Learning (ICML), 2022

  6. [6]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning (ICML), 2021

  7. [7]

    Fast segment anything

    X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, “Fast segment anything,” arXiv preprint arXiv:2306.12156, 2023

  8. [8]

    Yolact: Real-time instance segmentation

    D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time instance segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  9. [9]

    The cityscapes dataset for semantic urban scene understanding

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  10. [10]

    Semantic understanding of scenes through the ade20k dataset

    B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019

  11. [11]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020

  12. [12]

    Image segmentation using text and image prompts

    L. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  13. [13]

    Groupvit: Semantic segmentation emerges from text supervision

    J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  14. [14]

    Open-vocabulary universal image segmentation with maskclip

    Z. Ding, J. Wang, and Z. Tu, “Open-vocabulary universal image segmentation with maskclip,” in Proceedings of the International Conference on Machine Learning (ICML), 2023

  15. [15]

    Ultra-light test-time adaptation for Vision–Language models

    B. Kim, “Ultra-light test-time adaptation for Vision–Language models,” arXiv preprint arXiv:2511.09101, 2025

  16. [16]

    OT-UVGS: Revisiting UV Mapping for Gaussian Splatting as a Capacity Allocation Problem

    B. Kim, “OT-UVGS: Revisiting UV mapping for gaussian splatting as a capacity allocation problem,” arXiv preprint arXiv:2604.19127, 2026, accepted to Eurographics 2026 Short Papers