Don't waste SAM
Pith reviewed 2026-06-27 13:40 UTC · model grok-4.3
The pith
Fine-tuning the Segment Anything Model improves waste image segmentation accuracy on real-world datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The fine-tuned SAM-ViT-H model outperforms the state-of-the-art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.
What carries the argument
The fine-tuned SAM-ViT-H model, which adapts the pre-trained Vision Transformer Huge backbone of the Segment Anything Model to waste segmentation data.
Load-bearing premise
The reported performance gains assume that the fine-tuning experiments used the same test data splits and evaluation protocols as the prior state-of-the-art methods being compared.
What would settle it
Re-running the fine-tuning and evaluation on the exact same test splits of Zerowaste and TACO using the published code and hyperparameters of the previous SOTA methods to check if the +30 IoU advantage disappears.
Figures
read the original abstract
Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline process, particularly for tasks that require extensive and senior skills annotations. This study aims to evaluate the generalization of SAM and fine-tuning SAM models using three waste segmentation datasets. Although they are captured from real scenes as SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state-ofthe-art Zerowaste, and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates the Segment Anything Model (SAM) on three waste segmentation datasets (Zerowaste, TACO, TrashCan 1.0) that feature real-world challenges such as occlusions, deformable objects, transparency, and background confusion. It reports that fine-tuning the SAM-ViT-H variant produces large gains, outperforming prior SOTA methods on Zerowaste and TACO by +30 IoU while approaching TrashCan 1.0 performance within -1.44 IoU, and concludes that fine-tuning SAM is essential for better generalization on downstream waste tasks rather than discarding the model.
Significance. If the reported IoU deltas are shown to arise from identical test splits and evaluation protocols, the result would indicate that a general-purpose foundation model can be adapted to deliver competitive performance on a domain with substantial visual variability, supporting the broader utility of fine-tuning SAM-style models for specialized segmentation without designing new architectures from scratch.
major comments (2)
- [Abstract] Abstract: the headline claim of a +30 IoU improvement on Zerowaste and TACO (and -1.44 on TrashCan 1.0) is presented without any description of the train/test splits, confirmation that the cited SOTA baselines were re-evaluated on the identical held-out images, or the precise IoU computation (per-image vs. dataset-wide, threshold, etc.). This information is load-bearing for the central generalization argument.
- [Abstract] Abstract: no training protocol is supplied (learning rate, number of epochs, data augmentations, optimizer, or whether the SAM encoder was frozen), nor are any ablations or error bars across runs. Without these, it is impossible to determine whether the reported gains are attributable to fine-tuning SAM or to other uncontrolled factors.
minor comments (1)
- [Abstract] Abstract contains the typo "state-ofthe-art" (missing space).
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the need for greater transparency in the abstract. We agree that the headline results require supporting details on splits, baselines, metrics, and training protocol to be fully convincing. We will revise the abstract (and cross-reference the methods) to address both points.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of a +30 IoU improvement on Zerowaste and TACO (and -1.44 on TrashCan 1.0) is presented without any description of the train/test splits, confirmation that the cited SOTA baselines were re-evaluated on the identical held-out images, or the precise IoU computation (per-image vs. dataset-wide, threshold, etc.). This information is load-bearing for the central generalization argument.
Authors: We accept the criticism. The abstract will be rewritten to state that we adopt the official train/test splits released with each dataset, that the reported SOTA numbers are taken from the original papers (or re-implemented on the identical test images where code was available), and that IoU is the standard per-image mean IoU averaged over the test set using the conventional 0.5 overlap threshold. These clarifications will be added while preserving the abstract's brevity. revision: yes
-
Referee: [Abstract] Abstract: no training protocol is supplied (learning rate, number of epochs, data augmentations, optimizer, or whether the SAM encoder was frozen), nor are any ablations or error bars across runs. Without these, it is impossible to determine whether the reported gains are attributable to fine-tuning SAM or to other uncontrolled factors.
Authors: We agree that the abstract must at least sketch the fine-tuning protocol. The revised abstract will include a concise statement of the optimizer, learning rate, number of epochs, and that the image encoder was fine-tuned (not frozen). The full hyper-parameter list, augmentations, and implementation details already appear in Section 3; we will add an explicit pointer from the abstract. The manuscript does not currently contain ablations or multi-run error bars; we will note this limitation and, if space permits, add a brief statement that single-run results are reported. revision: partial
Circularity Check
No circularity: purely empirical reporting of fine-tuning results
full rationale
The manuscript contains no derivation chain, equations, or predictions. Performance numbers are presented as direct outputs of standard fine-tuning and held-out evaluation on waste datasets. No fitted parameters are renamed as predictions, no self-citations bear load-bearing uniqueness claims, and no ansatz or renaming of known results occurs. The work is self-contained against external benchmarks via reported IoU values.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Fine-tuning a foundation model on task-specific data improves downstream metrics without catastrophic forgetting or domain-shift artifacts
- domain assumption The three waste datasets are representative of real-world waste scenes and share the same distribution as SAM's pre-training data
Reference graph
Works this paper leans on
-
[1]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
2020
-
[2]
Chatcpt, November 2022
OpenAI. Chatcpt, November 2022
2022
-
[3]
Zero-shot text-to-image generation, 2021
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021
2021
-
[4]
Learning transferable visual models from natural language supervi- sion, 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervi- sion, 2021
2021
-
[5]
Le, Yunhsuan Sung, Zhen Li, and Tom Duerig
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision, 2021
2021
-
[6]
Metasa1b, April 2023
MetaSA1B. Metasa1b, April 2023
2023
-
[7]
Berg, Wan-Yen Lo, Piotr Doll˜A¡r, and Ross Girshick
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll˜A¡r, and Ross Girshick. Segment anything, 2023
2023
-
[8]
Inpaint anything: Segment anything meets image inpainting, 2023
Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting, 2023
2023
-
[9]
Anything-3d: Towards single-view anything reconstruction in the wild, 2023
Qiuhong Shen, Xingyi Yang, and Xinchao Wang. Anything-3d: Towards single-view anything reconstruction in the wild, 2023
2023
-
[10]
Any-to-any style transfer: Making picasso and da vinci collaborate, 2023
Songhua Liu, Jingwen Ye, and Xinchao Wang. Any-to-any style transfer: Making picasso and da vinci collaborate, 2023
2023
-
[11]
Track anything: Segment anything meets videos, 2023
Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos, 2023
2023
-
[12]
Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models, 2023
Jielu Zhang, Zhongliang Zhou, Gengchen Mai, Lan Mu, Mengxuan Hu, and Sheng Li. Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models, 2023
2023
-
[13]
Learning to ”segment anything” in thermal infrared images through knowledge distillation with a large scale dataset satir, 2023
Junzhang Chen and Xiangzhi Bai. Learning to ”segment anything” in thermal infrared images through knowledge distillation with a large scale dataset satir, 2023
2023
-
[14]
Can sam segment anything? when sam meets camou- flaged object detection, 2023
Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camou- flaged object detection, 2023
2023
-
[15]
Can sam segment polyps?, 2023
Tao Zhou, Yizhe Zhang, Yi Zhou, Ye Wu, and Chen Gong. Can sam segment polyps?, 2023
2023
-
[16]
Ellen Grant, and Yangming Ou
Sheng He, Rina Bao, Jingpeng Li, P. Ellen Grant, and Yangming Ou. Accuracy of segment-anything model (sam) in medical image segmentation tasks, 2023
2023
-
[17]
Sam struggles in concealed scenes – empirical study on ”segment anything”, 2023
Ge-Peng Ji, Deng-Ping Fan, Peng Xu, Ming-Ming Cheng, Bowen Zhou, and Luc Van Gool. Sam struggles in concealed scenes – empirical study on ”segment anything”, 2023
2023
-
[18]
Sam fails to segment anything? – sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, medical image segmentation, and more, 2023
Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam fails to segment anything? – sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, medical image segmentation, and more, 2023
2023
-
[19]
Zerowaste dataset: Towards deformable object segmentation in cluttered scenes, 2022
Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. Zerowaste dataset: Towards deformable object segmentation in cluttered scenes, 2022
2022
-
[20]
Trashcan: A semantically-segmented dataset towards visual detection of marine debris, 2020
Jungseok Hong, Michael Fulton, and Junaed Sattar. Trashcan: A semantically-segmented dataset towards visual detection of marine debris, 2020
2020
-
[21]
Taco: Trash annotations in context for litter detec- tion, 2020
Pedro F Proenca and Pedro Simoes. Taco: Trash annotations in context for litter detec- tion, 2020
2020
-
[22]
lightning-sam, April 2023
Luca Medeiros. lightning-sam, April 2023
2023
-
[23]
Petrov, and Anton Konushin
Konstantin Sofiiuk, Ilia A. Petrov, and Anton Konushin. Reviving iterative training with mask guidance for interactive segmentation, 2021
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.