SAM3-I: Segment Anything with Instructions
Pith reviewed 2026-05-17 02:02 UTC · model grok-4.3
The pith
SAM3 can be extended to interpret rich natural-language instructions for segmentation by aligning them to its existing vision-language representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAM3-I builds on SAM3 by adding an instruction-aware cascaded adaptation mechanism with dedicated alignment losses. This mechanism progressively aligns expressive instruction semantics with SAM3's vision-language representations. The result is a single framework that interprets natural-language instructions directly for segmentation while preserving strong concept recall. Training is enabled by the HMPL-Instruct dataset, which covers hierarchical instruction semantics and diverse target granularities. Experiments show appealing performance on referring and reasoning-based segmentation tasks.
What carries the argument
The instruction-aware cascaded adaptation mechanism, which applies dedicated alignment losses to connect detailed instruction semantics to SAM3's vision-language features.
If this is right
- Users can supply full sentences describing what to segment instead of relying on external systems to reduce instructions to noun phrases.
- Segmentation becomes more specific to the details given in the instruction, such as particular attributes or relations between objects.
- Performance on basic concept-driven prompts remains high because the adaptation is designed to avoid degrading original recall ability.
- A single model now covers both simple concept grounding and more complex reasoning-based segmentation tasks.
Where Pith is reading between the lines
- This kind of internal alignment could reduce the number of separate modules needed in larger vision-language systems that perform segmentation.
- The same progressive alignment idea might transfer to other prompt-based models that currently depend on external conversion for complex inputs.
- Further tests on instructions involving actions or implicit states not heavily represented in the training data would show where the limits lie.
Load-bearing premise
The cascaded adaptation with alignment losses can match complex instructions to the model's existing representations without creating new errors on unfamiliar instructions or reducing accuracy on simple concept prompts.
What would settle it
A held-out test set of instructions that combine attributes, relations, and reasoning, where the quality and specificity of masks from SAM3-I are compared against masks from the original SAM3 plus an external agent that converts instructions to noun phrases.
Figures
read the original abstract
Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SAM3-I, an extension of SAM3 for direct interpretation of complex natural-language instructions in segmentation. It proposes an instruction-aware cascaded adaptation mechanism with dedicated alignment losses to align expressive instruction semantics to SAM3's vision-language representations. A new HMPL-Instruct dataset is introduced to support instruction-centric training covering hierarchical semantics and target granularities. Experiments are reported to show appealing performance on referring and reasoning-based segmentation tasks while preserving SAM3's original concept-driven strengths.
Significance. If the no-sacrifice claim on concept recall is substantiated, the work would meaningfully advance open-vocabulary segmentation by unifying concept-level grounding and instruction-level reasoning in a single model, reducing reliance on external multi-modal agents. The public release of code and the HMPL-Instruct dataset at the GitHub link is a clear strength supporting reproducibility.
major comments (1)
- [Experiments] The central claim that instruction following is achieved without sacrificing concept recall (abstract and §1) is load-bearing for the contribution yet lacks direct support. No before/after quantitative comparisons on standard concept-segmentation benchmarks (e.g., COCO or LVIS with pure noun-phrase prompts) are provided to demonstrate that noun-phrase concept recall remains intact after the cascaded adaptation and alignment losses.
minor comments (2)
- [Abstract] The phrase 'appealing performance' in the abstract is vague; specific metrics, baselines, and error bars should be summarized early.
- [Method] A diagram or pseudocode for the cascaded adaptation mechanism would clarify the progressive alignment process described in the method.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The point raised is important for strengthening the central claim, and we address it directly below.
read point-by-point responses
-
Referee: [Experiments] The central claim that instruction following is achieved without sacrificing concept recall (abstract and §1) is load-bearing for the contribution yet lacks direct support. No before/after quantitative comparisons on standard concept-segmentation benchmarks (e.g., COCO or LVIS with pure noun-phrase prompts) are provided to demonstrate that noun-phrase concept recall remains intact after the cascaded adaptation and alignment losses.
Authors: We agree that explicit before-and-after quantitative comparisons on standard concept-segmentation benchmarks would provide stronger, direct support for the no-sacrifice claim. The cascaded adaptation and alignment losses were specifically designed to progressively align instruction semantics while preserving SAM3's original vision-language representations, and we have observed that concept-level performance is retained in our instruction-following experiments. However, we acknowledge that the current manuscript does not include the requested direct comparisons using pure noun-phrase prompts on COCO or LVIS. In the revised manuscript, we will add these evaluations, reporting relevant metrics (e.g., mIoU or mask AP) to quantify that noun-phrase concept recall remains comparable to the original SAM3. This addition will be placed in the experiments section alongside the existing results. revision: yes
Circularity Check
No circularity: empirical model extension on external benchmarks
full rationale
The paper describes an empirical extension of SAM3 via cascaded adaptation, alignment losses, and a newly introduced HMPL-Instruct dataset. Claims rest on experimental results across referring/reasoning segmentation tasks and external benchmarks rather than any mathematical derivation, fitted-parameter prediction, or self-citation chain. No equations appear in the provided text, and no load-bearing step reduces a result to its own inputs by construction. This is the standard case of a self-contained empirical CV paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Neural network representations can be progressively aligned across modalities using dedicated losses without catastrophic forgetting of prior capabilities.
Forward citations
Cited by 2 Pith papers
-
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
-
LumiVideo: An Intelligent Agentic System for Video Color Grading
LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the net- work.Neural Information Processing Systems, 2025. 3
work page 2025
-
[3]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman R¨adle, Triantafyllos Afouras, Effrosyni Mavroudi, Kather- ine Xu, Tsung-Han Wu, Yu Zhou, Lil...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Adamv-moe: Adaptive multi-task vision mixture-of- experts
Tianlong Chen, Xuxi Chen, Xianzhi Du, Abdullah Rashwan, Fan Yang, Huizhong Chen, Zhangyang Wang, and Yeqing Li. Adamv-moe: Adaptive multi-task vision mixture-of- experts. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 17346–17357, 2023. 8
work page 2023
-
[5]
Di Feng, Christian Haase-Sch ¨utz, Lars Rosenbaum, Heinz Hertlein, Claudius Glaeser, Fabian Timm, Werner Wies- beck, and Klaus Dietmayer. Deep multi-modal object de- tection and semantic segmentation for autonomous driv- ing:datasets,methods,and challenges.IEEE Transactions on Intelligent Transportation Systems, 22:1341–1360, 2020. 2
work page 2020
-
[6]
Seg- mentation from natural language expressions
Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Seg- mentation from natural language expressions. InEuropean Conference on Computer Vision, pages 108–124, 2016. 6
work page 2016
-
[7]
Wei Ji, Jingjing Li, Qi Bi, Tingwei Liu, Wenbo Li, and Li Cheng. Segment anything is not always perfect: An investi- gation of sam on different real-world applications.Machine Intelligence Research, 21(4):617–630, 2024. 1
work page 2024
-
[8]
Wei Ji, Jingjing Li, Cheng Bian, Zhicheng Zhang, and Li Cheng. Semanticrt: A large-scale dataset and method for robust semantic segmentation in multispectral images. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3307–3316, 2023. 1
work page 2023
-
[9]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4015–4026, 2023. 1, 2, 8
work page 2023
-
[10]
Novel method of seman- tic segmentation applicable to augmented reality.Sensors, 20(6):1737, 2020
Tae-young Ko and Seung-ho Lee. Novel method of seman- tic segmentation applicable to augmented reality.Sensors, 20(6):1737, 2020. 2
work page 2020
-
[11]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024. 6, 8
work page 2024
-
[12]
Yong Li, Zhiqiang Guo, Feng Shuang, Man Zhang, and Xi- uhua Li. Key technologies of machine vision for weeding robots: A review and benchmark.Computers and Electron- ics in Agriculture, 196:106880, 2022. 2
work page 2022
-
[13]
Jianhua Lin. Divergence measures based on the shannon en- tropy.IEEE Transactions on Information Theory, 37(1):145– 151, 1991. 4
work page 1991
-
[14]
Fully convolutional networks for semantic segmentation
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015. 1
work page 2015
-
[15]
Mixture of ex- perts: a literature survey.Artificial Intelligence Review, 42(2):275–293, 2014
Saeed Masoudnia and Reza Ebrahimpour. Mixture of ex- perts: a literature survey.Artificial Intelligence Review, 42(2):275–293, 2014. 8
work page 2014
-
[16]
Yun Peng, Xiao Lin, Nachuan Ma, Jiayuan Du, Chuang- wei Liu, Chengju Liu, and Qijun Chen. Sam-lad: Seg- ment anything model meets zero-shot logic anomaly detec- tion.Knowledge-Based Systems, 314:113176, 2025. 2
work page 2025
-
[17]
Verifiably following complex robot instructions with foundation models
Benedict Quartey, Eric Rosen, Stefanie Tellex, and George Konidaris. Verifiably following complex robot instructions with foundation models. InIEEE International Conference on Robotics and Automation, pages 1–8. IEEE, 2025. 2
work page 2025
-
[18]
Paco: Parts and attributes of common objects
Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Mar- quez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023. 6
work page 2023
-
[19]
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Seg- ment anything in images and videos.International Confer- ence on Learning Representations, 2025. 1, 2, 8
work page 2025
-
[20]
Augmented reality model in supporting instruction process: a critical review
Azhar Wahid, Miftachul Huda, Moh Abdul Rohim, Ab- dul Halim Ali, Khairul Ghufran Kaspin, Maskanatul Fiqiyah, and Muhammad Talhah Ajmain Jima’ain. Augmented reality model in supporting instruction process: a critical review. In International Congress on Information and Communication Technology, pages 69–83. Springer, 2024. 2
work page 2024
-
[21]
Junde Wu, Ziyue Wang, Mingxuan Hong, Wei Ji, Huazhu Fu, Yanwu Xu, Min Xu, and Yueming Jin. Medical sam adapter: Adapting segment anything model for medical im- age segmentation.Medical Image Analysis, 102:103547,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.