Semantic-Fast-SAM: Efficient Semantic Segmenter
Pith reviewed 2026-05-10 01:09 UTC · model grok-4.3
The pith
Semantic-Fast-SAM produces accurate semantic segmentation maps in real time by pairing rapid mask generation with category labeling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Semantic-Fast-SAM combines FastSAM's rapid mask generation with an SSA labeling strategy to assign meaningful categories to each mask. The resulting model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments show it reaches mIoU of 70.33 on Cityscapes and 48.01 on ADE20K while achieving approximately 20x faster inference than SSA in the closed-set setting. It further handles open-vocabulary segmentation by leveraging CLIP-based semantic heads and outperforms recent open-vocabulary models on broad class labeling.
What carries the argument
The integration of FastSAM's CNN-based rapid mask generation with SSA's semantic labeling pipeline, extended by CLIP heads for assigning categories to masks.
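The paper's text does not spell out how the CLIP heads assign a category to each mask, but a CLIP-style assignment can be sketched as matching each mask crop's image embedding against text embeddings of candidate class prompts. The sketch below is a hypothetical illustration under that assumption (the `label_masks` helper and the toy embeddings are not the authors' implementation; a real pipeline would obtain both embedding sets from a CLIP dual encoder):

```python
import numpy as np

def label_masks(mask_embeddings, class_embeddings, class_names):
    """Assign each mask the class whose text embedding is most similar.

    mask_embeddings:  (M, D) array, one image embedding per mask crop
    class_embeddings: (C, D) array, one text embedding per class prompt
    Both are assumed to come from a CLIP-style dual encoder.
    """
    # L2-normalise so the dot product equals cosine similarity
    m = mask_embeddings / np.linalg.norm(mask_embeddings, axis=1, keepdims=True)
    c = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = m @ c.T  # (M, C) cosine similarities
    return [class_names[i] for i in sims.argmax(axis=1)]
```

With two orthogonal toy embeddings, `label_masks(np.eye(2), np.eye(2), ["road", "car"])` returns `["road", "car"]` — each mask picks the class it aligns with.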
If this is right
- Matches the accuracy of prior SAM-based semantic segmenters on Cityscapes and ADE20K benchmarks.
- Delivers approximately 20 times faster inference than SSA in closed-set settings.
- Supports effective open-vocabulary segmentation via CLIP heads and outperforms recent models on broad class labeling.
- Operates at lower computational cost and memory footprint, enabling the segment-anything capability in real-time robotics scenarios.
Where Pith is reading between the lines
- The same pairing of fast masks with labeling could be tested on other efficient mask generators to check if the speed gain is specific to FastSAM.
- Lower compute needs might allow semantic segmentation to run directly on edge hardware without external servers.
- Open-vocabulary results could improve with targeted fine-tuning of the CLIP heads on specific domains.
- Frame-by-frame processing might be extended to video streams to test temporal consistency at the reported speeds.
Load-bearing premise
That attaching SSA semantic labeling to FastSAM masks preserves both mask quality and category accuracy without introducing significant errors or requiring extensive retraining.
What would settle it
Measuring mIoU and inference speed on a new held-out dataset using identical hardware would show whether SFS maintains the reported accuracy levels while staying roughly 20 times faster than SSA.
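For concreteness, mIoU on a held-out set is computed per class as the intersection over union of predicted and ground-truth label maps, averaged over the classes that actually appear. A minimal NumPy sketch (not tied to the paper's evaluation code):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union over classes present in pred or gt.

    pred, gt: integer label maps of identical shape.
    Classes absent from both maps are skipped rather than counted as 0.
    """
    ious = []
    for c in range(num_classes):
        p = pred == c
        g = gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent everywhere; excluded from the mean
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```

For example, with `pred = [[0,1],[1,1]]` and `gt = [[0,1],[0,1]]`, class 0 scores 1/2 and class 1 scores 2/3, giving mIoU 7/12.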
Original abstract
We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Semantic-Fast-SAM (SFS), a framework that pairs the CNN-based FastSAM for rapid mask generation with an SSA semantic labeling pipeline (and CLIP heads for open-vocabulary cases) to deliver real-time semantic segmentation. It claims that SFS matches the accuracy of prior SAM-based methods (mIoU ~70.33 on Cityscapes, 48.01 on ADE20K) while running ~20x faster than SSA in closed-set settings, with additional gains in open-vocabulary labeling, all at reduced compute and memory cost. The implementation is released publicly.
Significance. If the performance claims hold under rigorous validation, the work would provide a practical route to deploying segment-anything capabilities in real-time robotics and edge scenarios. The explicit release of code is a clear strength that supports reproducibility and extension.
Major comments (2)
- [Abstract / Experiments] The headline mIoU values (70.33 on Cityscapes, 48.01 on ADE20K) and the 20x speedup claim are presented without any description of experimental setup, data splits, number of runs, error bars, or ablation studies on the SSA-to-FastSAM transfer. This leaves the central accuracy-matching and efficiency claims only moderately supported.
- [Method] The claim that semantic accuracy is preserved without degradation or retraining when SSA labeling is integrated with FastSAM masks rests on the untested assumption that SSA's labeling strategy transfers directly to CNN-generated masks, whose boundary precision and object completeness may differ from those of the transformer-based SAM. No mask-level metrics (boundary F-score, instance AP) or error-propagation analysis are reported to substantiate this.
Minor comments (1)
- [Abstract] The abstract introduces 'SSA' only after using the acronym; a parenthetical expansion on first use would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights opportunities to strengthen the experimental reporting and methodological validation in our work. We address each major comment below with specific plans for revision.
Point-by-point responses
Referee: [Abstract / Experiments] The headline mIoU values (70.33 on Cityscapes, 48.01 on ADE20K) and the 20x speedup claim are presented without any description of experimental setup, data splits, number of runs, error bars, or ablation studies on the SSA-to-FastSAM transfer. This leaves the central accuracy-matching and efficiency claims only moderately supported.
Authors: We agree that the current presentation of results would benefit from greater detail to support the claims. In the revised manuscript, we will expand the Experiments section to explicitly describe the standard Cityscapes and ADE20K train/val/test splits, the hardware used for all timing experiments (single NVIDIA A100 GPU), and report mIoU and runtime averaged over three independent runs with standard deviations and error bars. We will also add an ablation study directly comparing semantic segmentation mIoU when the SSA labeling pipeline is applied to FastSAM masks versus the original SAM masks, confirming that the transfer preserves accuracy while delivering the reported speedup. Revision: yes.
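Reporting runtime averaged over independent runs with standard deviations, as the rebuttal proposes, amounts to a small benchmarking harness. The sketch below is an illustrative assumption about such a protocol (the `benchmark` helper and its warm-up policy are not taken from the paper):

```python
import statistics
import time

def benchmark(fn, runs=3, warmup=1):
    """Return (mean, stdev) latency in seconds over `runs` calls of `fn`.

    Requires runs >= 2 so a sample standard deviation is defined.
    Warm-up calls are discarded to exclude one-off costs (caching, JIT).
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)
```

On a GPU pipeline one would additionally synchronize the device before reading the clock, so that asynchronous kernel launches are not mistaken for completed work.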
Referee: [Method] The claim that semantic accuracy is preserved without degradation or retraining when SSA labeling is integrated with FastSAM masks rests on the untested assumption that SSA's labeling strategy transfers directly to CNN-generated masks, whose boundary precision and object completeness may differ from those of the transformer-based SAM. No mask-level metrics (boundary F-score, instance AP) or error-propagation analysis are reported to substantiate this.
Authors: This observation is well taken: differences in mask generation between the CNN-based FastSAM and the transformer-based SAM could in principle affect downstream labeling. To substantiate the transfer, we will augment the Method and Experiments sections with mask-level evaluations on a held-out validation subset, reporting boundary F-score and instance AP for FastSAM masks relative to SAM masks. These metrics will be paired with a brief error-propagation analysis showing that the observed differences do not lead to measurable degradation in final semantic mIoU. This addition will provide direct evidence for the no-retraining claim. Revision: yes.
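The boundary F-score mentioned above measures precision and recall of predicted boundary pixels against ground-truth boundaries, conventionally within a small distance tolerance. The NumPy sketch below uses a zero-pixel tolerance, a deliberate simplification of the standard metric, and treats the image border as non-boundary:

```python
import numpy as np

def boundary_f_score(pred, gt):
    """Boundary F-score between two binary masks, zero-pixel tolerance.

    Simplification of the standard BF score, which matches boundary
    pixels within a small distance tolerance rather than exactly.
    """
    def boundary(mask):
        # A foreground pixel is on the boundary if any 4-neighbour differs;
        # edge padding replicates border values, so the image frame itself
        # does not count as a boundary.
        p = np.pad(mask, 1, mode="edge")
        center = p[1:-1, 1:-1]
        diff = (
            (center != p[:-2, 1:-1]) | (center != p[2:, 1:-1])
            | (center != p[1:-1, :-2]) | (center != p[1:-1, 2:])
        )
        return diff & mask.astype(bool)

    bp, bg = boundary(pred), boundary(gt)
    matched = (bp & bg).sum()
    precision = matched / max(bp.sum(), 1)
    recall = matched / max(bg.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Identical masks score 1.0; masks with no boundary pixels at all score 0.0 by convention here.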
Circularity Check
No circularity: applied model composition evaluated on external benchmarks
Full rationale
The paper describes an engineering integration of two existing models (FastSAM for mask generation and SSA for semantic labeling) followed by standard benchmark evaluation on Cityscapes and ADE20K. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claims rest on reported mIoU numbers from public datasets rather than any internal reduction or self-referential definition. This is a normal non-circular applied paper.
Reference graph
Works this paper leans on
- [1] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, "Segment anything," arXiv preprint arXiv:2304.02643, 2023.
- [2] J. Chen, Z. Yang, and L. Zhang, "Semantic segment anything," GitHub repository, 2023, https://github.com/fudan-zvg/Semantic-Segment-Anything.
- [3] J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, "OneFormer: One transformer to rule universal image segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [4] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [5] J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in Proceedings of the International Conference on Machine Learning (ICML), 2022.
- [6] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proceedings of the International Conference on Machine Learning (ICML), 2021.
- [7] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, "Fast segment anything," arXiv preprint arXiv:2306.12156, 2023.
- [8] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- [9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [10] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, "Semantic understanding of scenes through the ADE20K dataset," International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019.
- [11] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
- [12] T. Lüddecke and A. Ecker, "Image segmentation using text and image prompts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [13] J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, "GroupViT: Semantic segmentation emerges from text supervision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [14] Z. Ding, J. Wang, and Z. Tu, "Open-vocabulary universal image segmentation with MaskCLIP," in Proceedings of the International Conference on Machine Learning (ICML), 2023.
- [15] B. Kim, "Ultra-light test-time adaptation for vision–language models," arXiv preprint arXiv:2511.09101, 2025.
- [16] B. Kim, "OT-UVGS: Revisiting UV mapping for Gaussian splatting as a capacity allocation problem," arXiv preprint arXiv:2604.19127, 2026, accepted to Eurographics 2026 Short Papers.