AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
Pith reviewed 2026-05-10 05:07 UTC · model grok-4.3
The pith
AnchorSeg decouples semantic reasoning from spatial localization using structured language grounded query banks to improve reasoning segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. AnchorSeg reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding.
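To make the query-bank construction concrete, below is a minimal sketch of one way such a bank could be wired up. It is a reading of the abstract, not the authors' code: the class name QueryBank, the token count num_reasoning, the shared cross-attention, and the dot-product localization score are all assumptions.

```python
# Minimal sketch (assumptions, not the authors' implementation): an ordered
# set of latent reasoning queries plus one anchor query, cross-attending to
# image tokens. The anchor output is used for localization; the reasoning
# outputs would feed semantic modulation of the mask decoder.
import torch
import torch.nn as nn

class QueryBank(nn.Module):
    def __init__(self, d_model: int = 256, num_reasoning: int = 4):
        super().__init__()
        # Ordered latent reasoning tokens: intermediate semantic states.
        self.reasoning = nn.Parameter(torch.randn(num_reasoning, d_model))
        # Single anchor token: explicit spatial grounding.
        self.anchor = nn.Parameter(torch.randn(1, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, image_tokens: torch.Tensor, text_emb: torch.Tensor):
        # image_tokens: (B, N, d); text_emb: (B, d) pooled query embedding.
        B = image_tokens.size(0)
        queries = torch.cat([self.reasoning, self.anchor], dim=0)       # (K+1, d)
        queries = queries.unsqueeze(0).expand(B, -1, -1) + text_emb.unsqueeze(1)
        out, _ = self.attn(queries, image_tokens, image_tokens)         # (B, K+1, d)
        return out[:, :-1], out[:, -1]  # semantic states, anchor state

# Usage: the anchor state scores each image token for localization.
bank = QueryBank()
img = torch.randn(2, 196, 256)   # e.g. 14x14 patch tokens
txt = torch.randn(2, 256)
semantic, anchor = bank(img, txt)
loc_logits = torch.einsum("bd,bnd->bn", anchor, img)  # per-token scores
```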
What carries the argument
Structured language grounded query banks: an ordered sequence of latent reasoning tokens for intermediate semantic states plus a segmentation anchor token for explicit spatial grounding, which together factorize the conditioning over image tokens.
If this is right
- The model reaches state-of-the-art performance of 67.7% gIoU and 68.1% cIoU on the ReasonSeg test set.
- Semantic reasoning and spatial localization are handled by separate query components rather than compressed into one embedding.
- The factorized distribution lets the anchor token supply localization while other tokens modulate semantics (one possible formalization is sketched after this list).
- Bidirectional cycle consistency enforces alignment between token predictions and pixel-level masks across resolutions.
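As referenced above, one plausible formalization of the factorized conditioning (a reading of the abstract, not the paper's notation): with image tokens $v_1,\dots,v_N$, anchor query $a$, and contextual queries $c_{1:K}$,

```latex
% Hedged reconstruction; the paper's exact notation is not in the excerpt.
p\left(m_{1:N} \mid V,\, a,\, c_{1:K}\right)
  \;=\; \prod_{i=1}^{N}
    p\left(m_i \,\middle|\,
      \underbrace{a^{\top} v_i}_{\text{localization}},\;
      \underbrace{g\left(c_{1:K},\, v_i\right)}_{\text{semantic modulation}}\right)
```

Here $m_i$ is the mask decision at image token $i$; the bilinear localization score $a^{\top} v_i$ and the fusion function $g$ are both assumptions.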
Where Pith is reading between the lines
- Examining the states of the latent reasoning tokens could provide a window into the model's step-by-step interpretation of the query.
- The query bank structure might transfer to other vision-language tasks such as visual grounding or referring expression segmentation that also need both reasoning and precise placement.
- Varying the number of latent tokens according to query complexity could allow adaptive handling of instructions without changing the base architecture.
- The results suggest that information compression in single embeddings is a bottleneck for fine-grained output tasks in current vision-language models.
Load-bearing premise
The ordered sequence of latent reasoning tokens and the segmentation anchor token can reliably disentangle semantic states from spatial localization in a way that a single token cannot, without the separation introducing new misalignment or failure modes.
What would settle it
If replacing the query banks with a single token or removing the cycle consistency loss causes performance on the ReasonSeg test set to fall to or below existing single-token baselines, the claimed benefit of the decoupling would be refuted.
Original abstract
Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt{<SEG>}$, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token--Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language grounded query banks, AnchorSeg achieves state-of-the-art results on ReasonSeg test set (67.7\% gIoU and 68.1\% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AnchorSeg, which reformulates reasoning segmentation as structured conditional generation over image tokens using language-grounded query banks: an ordered sequence of latent reasoning tokens to capture intermediate semantic states and a segmentation anchor token for explicit spatial grounding. It models spatial conditioning as a factorized distribution and introduces Token-Mask Cycle Consistency (TMCC) to enforce bidirectional alignment between token-level predictions and pixel-level masks. The central claim is that this explicit decoupling of semantic reasoning from spatial localization yields state-of-the-art performance on the ReasonSeg test set (67.7% gIoU, 68.1% cIoU).
Significance. If the results hold under rigorous verification, the structured query-bank formulation offers a promising direction for improving interpretability and performance in reasoning segmentation by avoiding implicit compression into a single <SEG> token. The public release of code and models strengthens the contribution by enabling direct reproducibility and follow-up work.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments (likely §4): The SOTA claim (67.7% gIoU / 68.1% cIoU) is reported without any ablations that isolate the effect of the ordered query banks (latent reasoning tokens + anchor) versus a capacity-matched single-<SEG> baseline, versus TMCC alone, or versus the factorized distribution. This is load-bearing for the central claim of reliable semantic-spatial disentanglement.
- [Method / Experiments] §3 (method) and §4: No probing or analysis is described that demonstrates the reasoning tokens remain insensitive to spatial perturbations while the anchor token drives localization, nor any visualization or metric showing that the sequence enforces the intended factorization beyond aggregate metrics. Without such evidence, gains could arise from extra parameters or TMCC rather than the claimed decoupling.
- [Experiments] §4: The manuscript provides no details on baselines, data splits, error bars, statistical tests, or variance across runs, which is required to substantiate the reported improvements on ReasonSeg.
minor comments (1)
- [Method] Notation for the factorized distribution over image tokens and the exact definition of TMCC could be clarified with an explicit equation or pseudocode to aid reproducibility.
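For reference, the abstract pins down TMCC only as a bidirectional objective aligning token-level predictions with pixel-level masks across resolutions. One hedged pseudocode reading follows; the loss form (MSE between sigmoids) and both projection directions are assumptions, not the authors' definition.

```python
# Hedged sketch of what a Token-Mask Cycle Consistency (TMCC) loss *could*
# look like; the paper's exact definition is not given in the excerpt.
# token_logits: (B, N) per-image-token mask scores at low resolution;
# mask_logits: (B, H, W) decoder output at pixel resolution; (h, w) is the
# token grid with N = h * w. Both directions are assumed, hence "cycle".
import torch
import torch.nn.functional as F

def tmcc_loss(token_logits, mask_logits, h, w):
    B, N = token_logits.shape
    # Forward direction: upsample token predictions to pixel resolution
    # and ask them to agree with the high-resolution mask.
    token_map = token_logits.view(B, 1, h, w)
    up = F.interpolate(token_map, size=mask_logits.shape[-2:],
                       mode="bilinear", align_corners=False).squeeze(1)
    fwd = F.mse_loss(torch.sigmoid(up), torch.sigmoid(mask_logits))
    # Backward direction: pool the pixel mask down to the token grid and
    # ask it to agree with the token-level predictions.
    down = F.adaptive_avg_pool2d(mask_logits.unsqueeze(1), (h, w)).flatten(1)
    bwd = F.mse_loss(torch.sigmoid(down), torch.sigmoid(token_logits))
    return fwd + bwd
```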
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional ablations, diagnostic analyses, and experimental details are needed to strengthen the claims regarding semantic-spatial disentanglement. We will revise the manuscript to incorporate these elements as outlined below.
Point-by-point responses
Referee: [Abstract / Experiments] Abstract and Experiments (likely §4): The SOTA claim (67.7% gIoU / 68.1% cIoU) is reported without any ablations that isolate the effect of the ordered query banks (latent reasoning tokens + anchor) versus a capacity-matched single-<SEG> baseline, versus TMCC alone, or versus the factorized distribution. This is load-bearing for the central claim of reliable semantic-spatial disentanglement.
Authors: We agree that isolating the contributions of the ordered query banks, anchor token, TMCC, and factorized distribution is necessary to support the central claim. In the revised manuscript, we will add a dedicated ablation study in §4 comparing the full model to a capacity-matched single-<SEG> baseline, variants without ordered reasoning tokens, ablations with/without TMCC, and variants of the factorized distribution. These results will quantify the incremental gains attributable to each component. revision: yes
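For illustration, the promised grid might be organized along four axes; every variant name below and the capacity-matching strategy are hypothetical, not taken from the paper.

```python
# Hypothetical ablation grid; variant names are illustrative only.
# "single_seg_matched" assumes widening the lone <SEG> embedding so the
# baseline matches the full query bank's parameter count.
ABLATIONS = {
    "full":               dict(reasoning_tokens=4, anchor=True, tmcc=True,  factorized=True),
    "single_seg_matched": dict(reasoning_tokens=0, anchor=True, tmcc=False, factorized=False),
    "no_ordered_tokens":  dict(reasoning_tokens=0, anchor=True, tmcc=True,  factorized=True),
    "no_tmcc":            dict(reasoning_tokens=4, anchor=True, tmcc=False, factorized=True),
    "no_factorization":   dict(reasoning_tokens=4, anchor=True, tmcc=True,  factorized=False),
}
```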
Referee: [Method / Experiments] §3 (method) and §4: No probing or analysis is described that demonstrates the reasoning tokens remain insensitive to spatial perturbations while the anchor token drives localization, nor any visualization or metric showing that the sequence enforces the intended factorization beyond aggregate metrics. Without such evidence, gains could arise from extra parameters or TMCC rather than the claimed decoupling.
Authors: We acknowledge that the current manuscript lacks explicit probing or diagnostic analysis to verify the intended factorization. In the revision, we will add sensitivity experiments measuring the impact of spatial perturbations on reasoning tokens versus the anchor token, along with visualizations of token activations, attention maps, and quantitative metrics (such as alignment scores between token predictions and mask regions) to demonstrate that the sequence enforces the claimed decoupling beyond aggregate metrics. revision: yes
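One concrete shape for such a probe, sketched against the hypothetical QueryBank interface above (`model.encode_image` is likewise assumed): shift the input image, re-encode, and measure how far each query state moves. Under the decoupling claim, the anchor state should track the shift while the reasoning states stay comparatively stable.

```python
# Hedged probe sketch; a `model` exposing encode_image and query_bank is an
# assumption, matching the hypothetical QueryBank sketched earlier.
import torch
import torch.nn.functional as F

@torch.no_grad()
def spatial_sensitivity(model, image, text_emb, shift_px=16):
    sem0, anc0 = model.query_bank(model.encode_image(image), text_emb)
    shifted = torch.roll(image, shifts=(shift_px, shift_px), dims=(-2, -1))
    sem1, anc1 = model.query_bank(model.encode_image(shifted), text_emb)
    # 1 - cosine similarity: higher means more sensitive to the shift.
    d_sem = 1 - F.cosine_similarity(sem0.flatten(1), sem1.flatten(1)).mean()
    d_anc = 1 - F.cosine_similarity(anc0, anc1).mean()
    return d_sem.item(), d_anc.item()  # expect d_anc >> d_sem if decoupled
```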
Referee: [Experiments] §4: The manuscript provides no details on baselines, data splits, error bars, statistical tests, or variance across runs, which is required to substantiate the reported improvements on ReasonSeg.
Authors: We will revise §4 to include full details on the baselines, the precise data splits for ReasonSeg, error bars from multiple runs with different random seeds, and any statistical tests used. Variance across runs will be reported to allow rigorous assessment of the improvements. revision: yes
Circularity Check
No significant circularity; the architectural proposal and the empirical results do not reduce to the paper's own inputs.
Full rationale
The paper presents AnchorSeg as a new architecture that explicitly constructs ordered query banks (latent reasoning tokens plus segmentation anchor token) and introduces the TMCC bidirectional loss to enforce cross-resolution alignment. These are described as design choices that reformulate the task as factorized conditional generation, with SOTA metrics (67.7% gIoU, 68.1% cIoU) reported as outcomes. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the claimed disentanglement or performance gains to fitted parameters, prior self-referential results, or definitional equivalences. The central claims rest on the proposed components and external benchmark evaluation rather than any load-bearing reduction to the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: transformer-based vision-language models can be trained end-to-end with pixel-level supervision.
invented entities (2)
- Language grounded query banks: no independent evidence
- Token-Mask Cycle Consistency (TMCC): no independent evidence