SAM3-I: Segment Anything with Instructions

Jingjing Li; Yue Feng; Yuchen Guo; Jincai Huang; Wei Ji; Qi Bi; Yongri Piao; Miao Zhang; Xiaoqi Zhao; Qiang Chen; Shihao Zou; Huchuan Lu; Li Cheng

arxiv: 2512.04585 · v4 · submitted 2025-12-04 · 💻 cs.CV

Pith reviewed 2026-05-17 02:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords: segment anything model · instruction following · image segmentation · vision language · referring segmentation · reasoning segmentation · natural language instructions

The pith

SAM3 can be extended to interpret rich natural-language instructions for segmentation by aligning them to its existing vision-language representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAM3-I as an extension of the Segment Anything Model that adds direct support for complex natural language instructions. These instructions can include attributes, relations, actions, states, and implicit reasoning rather than just short noun phrases. This matters because current methods convert such instructions into simple prompts using external agents, which often produces coarse results and loses specificity. The work shows that an internal adaptation process can handle instructions while keeping the model's original strength at segmenting by basic concepts. Training uses a new dataset built around hierarchical instruction semantics and different levels of detail in the targets.

Core claim

SAM3-I builds on SAM3 by adding an instruction-aware cascaded adaptation mechanism with dedicated alignment losses. This mechanism progressively aligns expressive instruction semantics with SAM3's vision-language representations. The result is a single framework that interprets natural-language instructions directly for segmentation while preserving strong concept recall. Training is enabled by the HMPL-Instruct dataset, which covers hierarchical instruction semantics and diverse target granularities. Experiments show appealing performance on referring and reasoning-based segmentation tasks.

What carries the argument

The instruction-aware cascaded adaptation mechanism, which applies dedicated alignment losses to connect detailed instruction semantics to SAM3's vision-language features.
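This page does not spell out the form of the alignment losses. As a rough illustration only, the sketch below measures how far an instruction representation sits from a concept representation using a Jensen-Shannon divergence over softmax-normalized feature scores. The choice of Jensen-Shannon is an assumption (the reference list includes Lin's divergence paper [13], but nothing here says how it is used), and every name, including `js_alignment_loss`, is hypothetical rather than taken from the SAM3-I code.

```python
import math

def softmax(scores):
    """Turn raw feature scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)

def js_alignment_loss(instruction_feat, concept_feat):
    """Hypothetical alignment loss between two feature vectors (assumption:
    both are plain lists of floats of equal length)."""
    return js_divergence(softmax(instruction_feat), softmax(concept_feat))

if __name__ == "__main__":
    concept = [2.0, 0.5, -1.0]  # stand-in for a noun-phrase embedding
    # Identical features give zero loss; differing features give a positive one.
    print(js_alignment_loss(concept, concept))
    print(js_alignment_loss([0.1, 1.8, 0.3], concept))
```

A symmetric, bounded divergence is a convenient choice for this kind of objective because neither representation is treated as the fixed reference and the loss cannot blow up when the two distributions barely overlap.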

If this is right

Users can supply full sentences describing what to segment instead of relying on external systems to reduce instructions to noun phrases.
Segmentation becomes more specific to the details given in the instruction, such as particular attributes or relations between objects.
Performance on basic concept-driven prompts remains high because the adaptation is designed to avoid degrading original recall ability.
A single model now covers both simple concept grounding and more complex reasoning-based segmentation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This kind of internal alignment could reduce the number of separate modules needed in larger vision-language systems that perform segmentation.
The same progressive alignment idea might transfer to other prompt-based models that currently depend on external conversion for complex inputs.
Further tests on instructions involving actions or implicit states not heavily represented in the training data would show where the limits lie.

Load-bearing premise

The cascaded adaptation with alignment losses can match complex instructions to the model's existing representations without creating new errors on unfamiliar instructions or reducing accuracy on simple concept prompts.

What would settle it

A held-out test set of instructions that combine attributes, relations, and reasoning, where the quality and specificity of masks from SAM3-I are compared against masks from the original SAM3 plus an external agent that converts instructions to noun phrases.
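A comparison of this kind could be scored as in the minimal sketch below, which assumes binary masks stored as nested lists of 0/1 and uses mean intersection-over-union as the quality measure. The metric choice, the function names, and the toy masks are illustrative assumptions, not the paper's evaluation protocol.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention (assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0

def mean_iou(preds, targets):
    """Average IoU over a list of (prediction, ground-truth) pairs."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)

def compare_methods(preds_direct, preds_agent, targets):
    """Score two systems on the same held-out instructions."""
    return {
        "direct": mean_iou(preds_direct, targets),
        "agent": mean_iou(preds_agent, targets),
    }

if __name__ == "__main__":
    target = [[1, 1], [0, 0]]    # the instance the instruction refers to
    specific = [[1, 1], [0, 0]]  # a mask that isolates that instance
    coarse = [[1, 1], [1, 1]]    # a mask that covers every instance of the concept
    print(compare_methods([specific], [coarse], [target]))
```

In the toy case the coarse mask is penalized for covering pixels outside the referred instance, which is the loss of specificity the test is meant to expose.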

Figures

Figures reproduced from arXiv: 2512.04585 by Huchuan Lu, Jincai Huang, Jingjing Li, Li Cheng, Miao Zhang, Qiang Chen, Qi Bi, Shihao Zou, Wei Ji, Xiaoqi Zhao, Yongri Piao, Yuchen Guo, Yue Feng.

**Figure 1.** Evolution of promptable segmentation within the SAM family. (a) Promptable Visual Segmentation (PVS) in SAM1/2 [9, 19] requires users’ interactive prompts, with each prompt segmenting only one object. (b) Promptable Concept Segmentation (PCS) in SAM3 [3] supports short noun-phrase prompts (e.g., “soccer player”) to segment all instances that match the given concept. (c) To process more complex natural-lang…

**Figure 2.** Comparison of SAM3-series pipelines. (a) SAM3 takes an image and a short NP concept as input and directly produces segmentation outputs. (b) SAM3 + Agent handles longer instructions by prompting an MLLM1 to rewrite them into NPs and using an MLLM2 to validate and refine SAM3’s predictions, often through multiple rounds of mask filtering. (MLLM1 and MLLM2 denote two separate calls to the same MLLM.) Althoug…

**Figure 3.** Overview of the proposed SAM3-I framework. (Sec. 4) used exclusively for simple instructions, whereas complex instructions activate the full cascade so that C-Adapter can refine and enrich the representation produced by S-Adapter. This progressive cascaded learning mechanism mirrors the natural hierarchy of linguistic difficulty, stabilizes optimization, mitigates catastrophic interference, and yields mor…

**Figure 4.** Overview of the scalable instructional data construction pipeline. (Sec. 5) enable the model with functional, and reasoning-level grounding ability. • iii) Stage 3: Joint alignment refinement. All adapters are activated and jointly fine-tuned using the alignment objectives, harmonizing the two branches and ensuring consistent instruction-conditioned predictions. This progressive curriculum reflects the inh…

**Figure 5.** Qualitative comparison between SAM3-I and SAM3 with an MLLM agent. The left column shows examples with simple referring instructions, while the right column presents complex instructions that require contextual or functional reasoning. In all cases, we use Qwen3-VL-8B as the external agent for SAM3+Agent. Across both simple and complex settings, SAM3-I more reliably grounds the intended target without rely…
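The routing described in the Figure 3 text (S-Adapter alone for simple instructions, the full S-Adapter then C-Adapter cascade for complex ones) can be pictured with the toy sketch below. The adapters here are placeholder residual scalings rather than learned modules, and the `is_complex` flag is an assumption, since this page does not say how an instruction is classified as simple or complex.

```python
from typing import Callable, List

Vector = List[float]

def make_residual_adapter(scale: float) -> Callable[[Vector], Vector]:
    """Build a toy adapter: output = input + scale * input.
    A real adapter would be a small trained network (assumption)."""
    def adapter(features: Vector) -> Vector:
        return [x + scale * x for x in features]
    return adapter

# Hypothetical stand-ins for the two adapters named in the Figure 3 text.
s_adapter = make_residual_adapter(0.10)
c_adapter = make_residual_adapter(0.05)

def adapt_instruction(text_features: Vector, is_complex: bool) -> Vector:
    """Route text features through the cascade.

    Simple instructions use the S-Adapter only; complex instructions also
    pass through the C-Adapter, which refines the S-Adapter output.
    """
    features = s_adapter(text_features)
    if is_complex:
        features = c_adapter(features)
    return features

if __name__ == "__main__":
    feats = [1.0, 2.0]
    print(adapt_instruction(feats, is_complex=False))
    print(adapt_instruction(feats, is_complex=True))
```

Because the C-Adapter only ever sees the S-Adapter output, the simple-instruction path is unaffected by whatever the complex branch does, which is one way to read the caption's point about mitigating interference.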

Abstract

Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SAM3-I adds cascaded adaptation and alignment losses to SAM3 plus a new HMPL-Instruct dataset for direct instruction following, but the no-sacrifice claim on concept recall lacks direct before/after comparisons.

Hi,

The punchline on SAM3-I is that it adds cascaded adaptation and alignment losses to SAM3 so the model can follow complex natural-language instructions, backed by the new HMPL-Instruct dataset. The claim of preserving concept recall, however, isn't supported by direct before-and-after comparisons on the original tasks.

What the paper does well is identify the limitation of SAM3's reliance on external agents for rich prompts and propose a unified framework that interprets instructions involving attributes, relations, and reasoning. The dataset provides systematic coverage of hierarchical semantics and different granularities, which is a solid addition for training such models. This could simplify real-world uses in robotics and content creation.

The soft spots are in the experimental validation of the no-sacrifice part. The abstract states that SAM3-I achieves appealing performance without losing its original strengths, but there's no side-by-side data on standard concept-segmentation benchmarks such as COCO with noun-phrase prompts comparing the base model to the adapted one. The stress-test concern holds up based on what's described: the alignment losses could introduce trade-offs or out-of-distribution (OOD) failures that aren't quantified. Without those details or ablations, it's hard to assess whether the central argument holds. The work is a straightforward empirical extension with no obvious circular reasoning.

This is for people in the segmentation and vision-language community who want to make foundation models more instruction-friendly. A reader looking for practical extensions of SAM would get value from the method and resources. It deserves serious peer review because it introduces new data and a mechanism on top of a popular base model. I'd say send it for review with feedback on adding those preservation metrics.

Cheers,

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SAM3-I, an extension of SAM3 for direct interpretation of complex natural-language instructions in segmentation. It proposes an instruction-aware cascaded adaptation mechanism with dedicated alignment losses to align expressive instruction semantics to SAM3's vision-language representations. A new HMPL-Instruct dataset is introduced to support instruction-centric training covering hierarchical semantics and target granularities. Experiments are reported to show appealing performance on referring and reasoning-based segmentation tasks while preserving SAM3's original concept-driven strengths.

Significance. If the no-sacrifice claim on concept recall is substantiated, the work would meaningfully advance open-vocabulary segmentation by unifying concept-level grounding and instruction-level reasoning in a single model, reducing reliance on external multi-modal agents. The public release of code and the HMPL-Instruct dataset at the GitHub link is a clear strength supporting reproducibility.

major comments (1)

[Experiments] The central claim that instruction following is achieved without sacrificing concept recall (abstract and §1) is load-bearing for the contribution yet lacks direct support. No before/after quantitative comparisons on standard concept-segmentation benchmarks (e.g., COCO or LVIS with pure noun-phrase prompts) are provided to demonstrate that noun-phrase concept recall remains intact after the cascaded adaptation and alignment losses.

minor comments (2)

[Abstract] The phrase 'appealing performance' in the abstract is vague; specific metrics, baselines, and error bars should be summarized early.
[Method] A diagram or pseudocode for the cascaded adaptation mechanism would clarify the progressive alignment process described in the method.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The point raised is important for strengthening the central claim, and we address it directly below.

Referee: [Experiments] The central claim that instruction following is achieved without sacrificing concept recall (abstract and §1) is load-bearing for the contribution yet lacks direct support. No before/after quantitative comparisons on standard concept-segmentation benchmarks (e.g., COCO or LVIS with pure noun-phrase prompts) are provided to demonstrate that noun-phrase concept recall remains intact after the cascaded adaptation and alignment losses.

Authors: We agree that explicit before-and-after quantitative comparisons on standard concept-segmentation benchmarks would provide stronger, direct support for the no-sacrifice claim. The cascaded adaptation and alignment losses were specifically designed to progressively align instruction semantics while preserving SAM3's original vision-language representations, and we have observed that concept-level performance is retained in our instruction-following experiments. However, we acknowledge that the current manuscript does not include the requested direct comparisons using pure noun-phrase prompts on COCO or LVIS. In the revised manuscript, we will add these evaluations, reporting relevant metrics (e.g., mIoU or mask AP) to quantify that noun-phrase concept recall remains comparable to the original SAM3. This addition will be placed in the experiments section alongside the existing results.

Revision: yes
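The before-and-after check discussed in this exchange could be summarized along the lines of the sketch below. It assumes per-prompt IoU scores have already been computed for the base and adapted models on the same noun-phrase prompts; the function name, the returned fields, and the 0.01 tolerance are all illustrative assumptions, not values from the paper.

```python
def retention_report(base_iou, adapted_iou, tolerance=0.01):
    """Compare per-prompt IoU of a base model and its adapted version.

    base_iou and adapted_iou are equal-length lists of scores in [0, 1],
    one entry per noun-phrase prompt. tolerance is the largest drop that
    is still treated as "retained" (an arbitrary illustrative threshold).
    """
    if not base_iou or len(base_iou) != len(adapted_iou):
        raise ValueError("score lists must be non-empty and the same length")
    n = len(base_iou)
    base_mean = sum(base_iou) / n
    adapted_mean = sum(adapted_iou) / n
    delta = adapted_mean - base_mean
    # Count prompts where the adapted model is worse by more than the tolerance.
    regressed = sum(1 for b, a in zip(base_iou, adapted_iou) if a < b - tolerance)
    return {
        "base_miou": base_mean,
        "adapted_miou": adapted_mean,
        "delta": delta,
        "regressed_fraction": regressed / n,
        "retained": delta >= -tolerance,
    }

if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(retention_report([0.80, 0.60], [0.80, 0.50]))
```

Reporting the fraction of regressed prompts next to the mean difference helps separate a small uniform drop from a large drop concentrated on a few concepts.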

Circularity Check

0 steps flagged

No circularity: empirical model extension on external benchmarks

full rationale

The paper describes an empirical extension of SAM3 via cascaded adaptation, alignment losses, and a newly introduced HMPL-Instruct dataset. Claims rest on experimental results across referring/reasoning segmentation tasks and external benchmarks rather than any mathematical derivation, fitted-parameter prediction, or self-citation chain. No equations appear in the provided text, and no load-bearing step reduces a result to its own inputs by construction. This is the standard case of a self-contained empirical CV paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions about representation alignment and the sufficiency of the new dataset for covering hierarchical instruction semantics; no new physical entities or ad-hoc constants are introduced.

axioms (1)

Domain assumption: Neural network representations can be progressively aligned across modalities using dedicated losses without catastrophic forgetting of prior capabilities.
Invoked when describing the instruction-aware cascaded adaptation mechanism that preserves concept recall.

pith-pipeline@v0.9.0 · 5582 in / 1194 out tokens · 28252 ms · 2026-05-17T02:02:45.670649+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
cs.CV · 2026-04 · unverdicted · novelty 7.0

Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.

LumiVideo: An Intelligent Agentic System for Video Color Grading
cs.CV · 2026-04 · unverdicted · novelty 6.0

LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix... Qwen3-VL Technical Report. arXiv, 2025.

[2] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. Neural Information Processing Systems, 2025.

[3] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Lil... SAM 3: Segment Anything with Concepts. arXiv, 2025.

[4] Tianlong Chen, Xuxi Chen, Xianzhi Du, Abdullah Rashwan, Fan Yang, Huizhong Chen, Zhangyang Wang, and Yeqing Li. Adamv-moe: Adaptive multi-task vision mixture-of-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17346–17357, 2023.

[5] Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz Hertlein, Claudius Glaeser, Fabian Timm, Werner Wiesbeck, and Klaus Dietmayer. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems, 22:1341–1360, 2020.

[6] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision, pages 108–124, 2016.

[7] Wei Ji, Jingjing Li, Qi Bi, Tingwei Liu, Wenbo Li, and Li Cheng. Segment anything is not always perfect: An investigation of sam on different real-world applications. Machine Intelligence Research, 21(4):617–630, 2024.

[8] Wei Ji, Jingjing Li, Cheng Bian, Zhicheng Zhang, and Li Cheng. Semanticrt: A large-scale dataset and method for robust semantic segmentation in multispectral images. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3307–3316, 2023.

[9] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

[10] Tae-young Ko and Seung-ho Lee. Novel method of semantic segmentation applicable to augmented reality. Sensors, 20(6):1737, 2020.

[11] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

[12] Yong Li, Zhiqiang Guo, Feng Shuang, Man Zhang, and Xiuhua Li. Key technologies of machine vision for weeding robots: A review and benchmark. Computers and Electronics in Agriculture, 196:106880, 2022.

[13] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.

[14] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[15] Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence Review, 42(2):275–293, 2014.

[16] Yun Peng, Xiao Lin, Nachuan Ma, Jiayuan Du, Chuangwei Liu, Chengju Liu, and Qijun Chen. Sam-lad: Segment anything model meets zero-shot logic anomaly detection. Knowledge-Based Systems, 314:113176, 2025.

[17] Benedict Quartey, Eric Rosen, Stefanie Tellex, and George Konidaris. Verifiably following complex robot instructions with foundation models. In IEEE International Conference on Robotics and Automation, pages 1–8. IEEE, 2025.

[18] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023.

[19] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. International Conference on Learning Representations, 2025.

[20] Azhar Wahid, Miftachul Huda, Moh Abdul Rohim, Abdul Halim Ali, Khairul Ghufran Kaspin, Maskanatul Fiqiyah, and Muhammad Talhah Ajmain Jima’ain. Augmented reality model in supporting instruction process: a critical review. In International Congress on Information and Communication Technology, pages 69–83. Springer, 2024.

[21] Junde Wu, Ziyue Wang, Mingxuan Hong, Wei Ji, Huazhu Fu, Yanwu Xu, Min Xu, and Yueming Jin. Medical sam adapter: Adapting segment anything model for medical image segmentation. Medical Image Analysis, 102:103547,
