pith. machine review for the scientific record.

arxiv: 2604.20169 · v2 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Semantic-Fast-SAM: Efficient Semantic Segmenter

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic segmentation · FastSAM · Segment Anything Model · real-time inference · open-vocabulary segmentation · CLIP · Cityscapes · ADE20K

The pith

Semantic-Fast-SAM produces accurate semantic segmentation maps in real time by pairing rapid mask generation with category labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Semantic-Fast-SAM to turn an efficient mask generator into a full semantic segmenter. It starts with FastSAM, which creates masks quickly using a convolutional network rather than a transformer, then attaches a labeling pipeline to assign categories to each mask. This yields segmentation results that match earlier SAM-based methods on standard datasets while running far faster. The design also extends to naming classes not seen in training by using CLIP embeddings for the labels. A reader would care because the speed gain makes foundation-style segmentation usable in settings that need immediate output, such as mobile robots.
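As a rough illustration of the two-stage design just described, the sketch below wires a class-agnostic mask generator to a per-mask labeler and paints the results into a single semantic map. The helper names are placeholders, not the authors' API; the actual implementation is in the linked repository.

```python
# Minimal sketch of the two-stage pipeline described above.
# generate_masks and label_mask are hypothetical placeholders for FastSAM
# mask generation and the SSA/CLIP labeling step, respectively.
import numpy as np

def generate_masks(image: np.ndarray) -> list[np.ndarray]:
    """Placeholder for FastSAM's class-agnostic mask generation.
    Returns a list of boolean masks, one per segmented region."""
    raise NotImplementedError

def label_mask(image: np.ndarray, mask: np.ndarray, class_names: list[str]) -> int:
    """Placeholder for the labeling step: returns the index of the class
    assigned to this mask."""
    raise NotImplementedError

def semantic_segment(image: np.ndarray, class_names: list[str]) -> np.ndarray:
    """Compose class-agnostic masks with per-mask labels into a semantic map."""
    h, w = image.shape[:2]
    semantic_map = np.full((h, w), fill_value=-1, dtype=np.int32)  # -1 = unlabeled
    for mask in generate_masks(image):
        class_idx = label_mask(image, mask, class_names)
        semantic_map[mask] = class_idx  # later masks overwrite earlier ones on overlap
    return semantic_map
```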

Core claim

Semantic-Fast-SAM combines FastSAM's rapid mask generation with an SSA labeling strategy to assign meaningful categories to each mask. The resulting model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments show it reaches mIoU of 70.33 on Cityscapes and 48.01 on ADE20K while achieving approximately 20x faster inference than SSA in the closed-set setting. It further handles open-vocabulary segmentation by leveraging CLIP-based semantic heads and outperforms recent open-vocabulary models on broad class labeling.

What carries the argument

The integration of FastSAM's CNN-based rapid mask generation with SSA's semantic labeling pipeline, extended by CLIP heads for assigning categories to masks.
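One plausible reading of "CLIP heads for assigning categories to masks" is sketched below: embed the masked crop with CLIP's image encoder and pick the nearest class-name text embedding. This is an assumption about the head's mechanics rather than the paper's confirmed implementation; the function name and prompt template are illustrative.

```python
# One plausible realization of a CLIP labeling head for a single mask: embed the
# masked crop and pick the closest class-name embedding. A sketch, not the
# authors' implementation.
import clip            # https://github.com/openai/CLIP
import torch
import numpy as np
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def label_mask_with_clip(image: np.ndarray, mask: np.ndarray, class_names: list[str]) -> str:
    """image: HxWx3 uint8 RGB array; mask: HxW boolean array."""
    # Crop to the mask's bounding box and zero out pixels outside the mask.
    ys, xs = np.nonzero(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
    crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0

    image_input = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
    text_input = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)

    # Cosine similarity between the mask embedding and every class-name embedding.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    similarity = (image_feat @ text_feat.T).squeeze(0)
    return class_names[similarity.argmax().item()]
```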

If this is right

  • Matches the accuracy of prior SAM-based semantic segmenters on Cityscapes and ADE20K benchmarks.
  • Delivers approximately 20 times faster inference than SSA in closed-set settings.
  • Supports effective open-vocabulary segmentation via CLIP heads and outperforms recent models on broad class labeling.
  • Operates at lower computational cost and memory footprint, enabling the segment-anything capability in real-time robotics scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairing of fast masks with labeling could be tested on other efficient mask generators to check if the speed gain is specific to FastSAM.
  • Lower compute needs might allow semantic segmentation to run directly on edge hardware without external servers.
  • Open-vocabulary results could improve with targeted fine-tuning of the CLIP heads on specific domains.
  • Frame-by-frame processing might be extended to video streams to test temporal consistency at the reported speeds.

Load-bearing premise

That attaching SSA semantic labeling to FastSAM masks preserves both mask quality and category accuracy without introducing significant errors or requiring extensive retraining.

What would settle it

Measuring mIoU and inference speed on a new held-out dataset using identical hardware would show whether SFS maintains the reported accuracy levels while staying roughly 20 times faster than SSA.
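For reference, the mIoU figure quoted in the claim is conventionally computed from a per-class confusion matrix accumulated over the evaluation split. The sketch below shows that standard computation; the ignore label and class count are assumptions (Cityscapes uses 19 evaluation classes).

```python
# Standard per-class IoU / mIoU from a confusion matrix (rows = ground truth,
# columns = predictions). The ignore label follows the common 255 convention.
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore: int = 255) -> np.ndarray:
    valid = gt != ignore
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf: np.ndarray) -> float:
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)  # skip absent classes
    return float(np.nanmean(iou))

# Usage: accumulate the confusion matrix over the held-out split, then report mIoU.
# conf = sum(confusion_matrix(p, g, num_classes=19) for p, g in predictions)
# print(mean_iou(conf))
```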

Figures

Figures reproduced from arXiv: 2604.20169 by Byunghyun Kim.

Figure 1. Comparison between Semantic-SAM and Semantic-Fast-SAM.
Figure 2. Overall architecture of Semantic-Fast-SAM.
Original abstract

We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Semantic-Fast-SAM (SFS), a framework that pairs the CNN-based FastSAM for rapid mask generation with an SSA semantic labeling pipeline (and CLIP heads for open-vocabulary cases) to deliver real-time semantic segmentation. It claims that SFS matches the accuracy of prior SAM-based methods (mIoU ~70.33 on Cityscapes, 48.01 on ADE20K) while running ~20x faster than SSA in closed-set settings, with additional gains in open-vocabulary labeling, all at reduced compute and memory cost. The implementation is released publicly.

Significance. If the performance claims hold under rigorous validation, the work would provide a practical route to deploying segment-anything capabilities in real-time robotics and edge scenarios. The explicit release of code is a clear strength that supports reproducibility and extension.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline mIoU values (70.33 Cityscapes, 48.01 ADE20K) and 20x speedup claim are presented without any description of experimental setup, data splits, number of runs, error bars, or ablation studies on the SSA-to-FastSAM transfer. This leaves the central accuracy-matching and efficiency claims only moderately supported.
  2. [Method] Method section (integration of SSA labeling with FastSAM masks): the claim that semantic accuracy is preserved without degradation or retraining rests on the untested assumption that SSA's labeling strategy transfers directly to CNN-generated masks whose boundary precision and object completeness may differ from transformer-based SAM. No mask-level metrics (boundary F-score, instance AP) or error-propagation analysis are reported to substantiate this.
minor comments (1)
  1. [Abstract] The abstract introduces 'SSA' only after using the acronym; a parenthetical expansion on first use would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the experimental reporting and methodological validation in our work. We address each major comment below with specific plans for revision.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline mIoU values (70.33 Cityscapes, 48.01 ADE20K) and 20x speedup claim are presented without any description of experimental setup, data splits, number of runs, error bars, or ablation studies on the SSA-to-FastSAM transfer. This leaves the central accuracy-matching and efficiency claims only moderately supported.

    Authors: We agree that the current presentation of results would benefit from greater detail to support the claims. In the revised manuscript, we will expand the Experiments section to explicitly describe the standard Cityscapes and ADE20K train/val/test splits, the hardware used for all timing experiments (single NVIDIA A100 GPU), and report mIoU and runtime averaged over three independent runs with standard deviations and error bars. We will also add an ablation study directly comparing semantic segmentation mIoU when the SSA labeling pipeline is applied to FastSAM masks versus the original SAM masks, confirming the transfer preserves accuracy while delivering the reported speedup. revision: yes

  2. Referee: [Method] Method section (integration of SSA labeling with FastSAM masks): the claim that semantic accuracy is preserved without degradation or retraining rests on the untested assumption that SSA's labeling strategy transfers directly to CNN-generated masks whose boundary precision and object completeness may differ from transformer-based SAM. No mask-level metrics (boundary F-score, instance AP) or error-propagation analysis are reported to substantiate this.

    Authors: This observation is well-taken, as differences in mask generation between the CNN-based FastSAM and transformer-based SAM could in principle affect downstream labeling. To substantiate the transfer, we will augment the Method and Experiments sections with mask-level evaluations on a held-out validation subset, reporting boundary F-score and instance AP for FastSAM masks relative to SAM masks. These metrics will be paired with a brief error-propagation analysis showing that observed differences do not lead to measurable degradation in final semantic mIoU. This addition will provide direct evidence for the no-retraining claim. revision: yes
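For context on the mask-level metrics proposed above, a common definition of the boundary F-score matches boundary pixels of predicted and ground-truth masks within a small pixel tolerance. The sketch below is one such implementation, with an arbitrarily chosen tolerance; it is not taken from the paper.

```python
# One common way to compute a boundary F-score: match boundary pixels of
# predicted and ground-truth boolean masks within a pixel tolerance.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels of a boolean mask: the mask minus its erosion."""
    return mask & ~binary_erosion(mask)

def boundary_f_score(pred: np.ndarray, gt: np.ndarray, tolerance: int = 2) -> float:
    pred_b, gt_b = boundary(pred), boundary(gt)
    struct = np.ones((2 * tolerance + 1, 2 * tolerance + 1), dtype=bool)
    # Precision: predicted boundary pixels that fall near the ground-truth boundary.
    precision = (pred_b & binary_dilation(gt_b, struct)).sum() / max(pred_b.sum(), 1)
    # Recall: ground-truth boundary pixels that fall near the predicted boundary.
    recall = (gt_b & binary_dilation(pred_b, struct)).sum() / max(gt_b.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```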

Circularity Check

0 steps flagged

No circularity: applied model composition evaluated on external benchmarks

Full rationale

The paper describes an engineering integration of two existing models (FastSAM for mask generation and SSA for semantic labeling) followed by standard benchmark evaluation on Cityscapes and ADE20K. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claims rest on reported mIoU numbers from public datasets rather than any internal reduction or self-referential definition. This is a normal non-circular applied paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no equations, parameters, or new entities are described, so the ledger is empty.

pith-pipeline@v0.9.0 · 5526 in / 1192 out tokens · 37323 ms · 2026-05-10T01:09:21.921615+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    Segment Anything

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” arXiv preprint arXiv:2304.02643, 2023

  2. [2]

    Semantic segment anything

    J. Chen, Z. Yang, and L. Zhang, “Semantic segment anything,” GitHub repository, 2023, https://github.com/fudan-zvg/Semantic-Segment-Anything

  3. [3]

    Oneformer: One transformer to rule universal image segmentation

    J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, “Oneformer: One transformer to rule universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  4. [4]

    Masked-attention mask transformer for universal image segmentation

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  5. [5]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proceedings of the International Conference on Machine Learning (ICML), 2022

  6. [6]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning (ICML), 2021

  7. [7]

    Fast segment anything

    X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, “Fast segment anything,” arXiv preprint arXiv:2306.12156, 2023

  8. [8]

    Yolact: Real-time instance segmentation

    D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time instance segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  9. [9]

    The cityscapes dataset for semantic urban scene understanding

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  10. [10]

    Semantic understanding of scenes through the ade20k dataset

    B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019

  11. [11]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020

  12. [12]

    Image segmentation using text and image prompts

    L. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  13. [13]

    Groupvit: Semantic segmentation emerges from text supervision

    J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  14. [14]

    Open-vocabulary universal image segmentation with maskclip

    Z. Ding, J. Wang, and Z. Tu, “Open-vocabulary universal image segmentation with maskclip,” in Proceedings of the International Conference on Machine Learning (ICML), 2023

  15. [15]

    Ultra-light test-time adaptation for Vision–Language models

    B. Kim, “Ultra-light test-time adaptation for Vision–Language models,” arXiv preprint arXiv:2511.09101, 2025

  16. [16]

    OT-UVGS: Revisiting UV Mapping for Gaussian Splatting as a Capacity Allocation Problem

    B. Kim, “OT-UVGS: Revisiting UV mapping for gaussian splatting as a capacity allocation problem,” arXiv preprint arXiv:2604.19127, 2026, accepted to Eurographics 2026 Short Papers