AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
Pith reviewed 2026-05-10 05:07 UTC · model grok-4.3
The pith
AnchorSeg decouples semantic reasoning from spatial localization using structured language grounded query banks to improve reasoning segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. AnchorSeg reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding.
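To make the query-bank construction concrete, below is a minimal sketch of one way such a bank could be wired up. It is a reading of the abstract, not the authors' code: the class name QueryBank, the token count num_reasoning, the shared cross-attention, and the dot-product localization score are all assumptions.

```python
# Minimal sketch (assumptions, not the authors' implementation): an ordered
# set of latent reasoning queries plus one anchor query, cross-attending to
# image tokens. The anchor output is used for localization; the reasoning
# outputs would feed semantic modulation of the mask decoder.
import torch
import torch.nn as nn

class QueryBank(nn.Module):
    def __init__(self, d_model: int = 256, num_reasoning: int = 4):
        super().__init__()
        # Ordered latent reasoning tokens: intermediate semantic states.
        self.reasoning = nn.Parameter(torch.randn(num_reasoning, d_model))
        # Single anchor token: explicit spatial grounding.
        self.anchor = nn.Parameter(torch.randn(1, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, image_tokens: torch.Tensor, text_emb: torch.Tensor):
        # image_tokens: (B, N, d); text_emb: (B, d) pooled query embedding.
        B = image_tokens.size(0)
        queries = torch.cat([self.reasoning, self.anchor], dim=0)       # (K+1, d)
        queries = queries.unsqueeze(0).expand(B, -1, -1) + text_emb.unsqueeze(1)
        out, _ = self.attn(queries, image_tokens, image_tokens)         # (B, K+1, d)
        return out[:, :-1], out[:, -1]  # semantic states, anchor state

# Usage: the anchor state scores each image token for localization.
bank = QueryBank()
img = torch.randn(2, 196, 256)   # e.g. 14x14 patch tokens
txt = torch.randn(2, 256)
semantic, anchor = bank(img, txt)
loc_logits = torch.einsum("bd,bnd->bn", anchor, img)  # per-token scores
```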
What carries the argument
Structured language grounded query banks: an ordered sequence of latent reasoning tokens for intermediate semantic states plus a segmentation anchor token for explicit spatial grounding, which together factorize the conditioning over image tokens.
If this is right
- The model reaches state-of-the-art performance of 67.7% gIoU and 68.1% cIoU on the ReasonSeg test set.
- Semantic reasoning and spatial localization are handled by separate query components rather than compressed into one embedding.
- The factorized distribution lets the anchor token supply localization while other tokens modulate semantics (one possible formalization is sketched after this list).
- Bidirectional cycle consistency enforces alignment between token predictions and pixel-level masks across resolutions.
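As referenced above, one plausible formalization of the factorized conditioning (a reading of the abstract, not the paper's notation): with image tokens $v_1,\dots,v_N$, anchor query $a$, and contextual queries $c_{1:K}$,

```latex
% Hedged reconstruction; the paper's exact notation is not in the excerpt.
p\left(m_{1:N} \mid V,\, a,\, c_{1:K}\right)
  \;=\; \prod_{i=1}^{N}
    p\left(m_i \,\middle|\,
      \underbrace{a^{\top} v_i}_{\text{localization}},\;
      \underbrace{g\left(c_{1:K},\, v_i\right)}_{\text{semantic modulation}}\right)
```

Here $m_i$ is the mask decision at image token $i$; the bilinear localization score $a^{\top} v_i$ and the fusion function $g$ are both assumptions.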
Where Pith is reading between the lines
- Examining the states of the latent reasoning tokens could provide a window into the model's step-by-step interpretation of the query.
- The query bank structure might transfer to other vision-language tasks such as visual grounding or referring expression segmentation that also need both reasoning and precise placement.
- Varying the number of latent tokens according to query complexity could allow adaptive handling of instructions without changing the base architecture.
- The results suggest that information compression in single embeddings is a bottleneck for fine-grained output tasks in current vision-language models.
Load-bearing premise
The ordered sequence of latent reasoning tokens and the segmentation anchor token can reliably disentangle semantic states from spatial localization in a way that a single token cannot, without the separation introducing new misalignment or failure modes.
What would settle it
If replacing the query banks with a single token or removing the cycle consistency loss causes performance on the ReasonSeg test set to fall to or below existing single-token baselines, the claimed benefit of the decoupling would be refuted.
Original abstract
Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt{<SEG>}$, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token--Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language grounded query banks, AnchorSeg achieves state-of-the-art results on ReasonSeg test set (67.7\% gIoU and 68.1\% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AnchorSeg, which reformulates reasoning segmentation as structured conditional generation over image tokens using language-grounded query banks: an ordered sequence of latent reasoning tokens to capture intermediate semantic states and a segmentation anchor token for explicit spatial grounding. It models spatial conditioning as a factorized distribution and introduces Token-Mask Cycle Consistency (TMCC) to enforce bidirectional alignment between token-level predictions and pixel-level masks. The central claim is that this explicit decoupling of semantic reasoning from spatial localization yields state-of-the-art performance on the ReasonSeg test set (67.7% gIoU, 68.1% cIoU).
Significance. If the results hold under rigorous verification, the structured query-bank formulation offers a promising direction for improving interpretability and performance in reasoning segmentation by avoiding implicit compression into a single <SEG> token. The public release of code and models strengthens the contribution by enabling direct reproducibility and follow-up work.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments (likely §4): The SOTA claim (67.7% gIoU / 68.1% cIoU) is reported without any ablations that isolate the effect of the ordered query banks (latent reasoning tokens + anchor) versus a capacity-matched single-<SEG> baseline, versus TMCC alone, or versus the factorized distribution. This is load-bearing for the central claim of reliable semantic-spatial disentanglement.
- [Method / Experiments] §3 (method) and §4: No probing or analysis is described that demonstrates the reasoning tokens remain insensitive to spatial perturbations while the anchor token drives localization, nor any visualization or metric showing that the sequence enforces the intended factorization beyond aggregate metrics. Without such evidence, gains could arise from extra parameters or TMCC rather than the claimed decoupling.
- [Experiments] §4: The manuscript provides no details on baselines, data splits, error bars, statistical tests, or variance across runs, which is required to substantiate the reported improvements on ReasonSeg.
minor comments (1)
- [Method] Notation for the factorized distribution over image tokens and the exact definition of TMCC could be clarified with an explicit equation or pseudocode to aid reproducibility.
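For reference, the abstract pins down TMCC only as a bidirectional objective aligning token-level predictions with pixel-level masks across resolutions. One hedged pseudocode reading follows; the loss form (MSE between sigmoids) and both projection directions are assumptions, not the authors' definition.

```python
# Hedged sketch of what a Token-Mask Cycle Consistency (TMCC) loss *could*
# look like; the paper's exact definition is not given in the excerpt.
# token_logits: (B, N) per-image-token mask scores at low resolution;
# mask_logits: (B, H, W) decoder output at pixel resolution; (h, w) is the
# token grid with N = h * w. Both directions are assumed, hence "cycle".
import torch
import torch.nn.functional as F

def tmcc_loss(token_logits, mask_logits, h, w):
    B, N = token_logits.shape
    # Forward direction: upsample token predictions to pixel resolution
    # and ask them to agree with the high-resolution mask.
    token_map = token_logits.view(B, 1, h, w)
    up = F.interpolate(token_map, size=mask_logits.shape[-2:],
                       mode="bilinear", align_corners=False).squeeze(1)
    fwd = F.mse_loss(torch.sigmoid(up), torch.sigmoid(mask_logits))
    # Backward direction: pool the pixel mask down to the token grid and
    # ask it to agree with the token-level predictions.
    down = F.adaptive_avg_pool2d(mask_logits.unsqueeze(1), (h, w)).flatten(1)
    bwd = F.mse_loss(torch.sigmoid(down), torch.sigmoid(token_logits))
    return fwd + bwd
```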
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional ablations, diagnostic analyses, and experimental details are needed to strengthen the claims regarding semantic-spatial disentanglement. We will revise the manuscript to incorporate these elements as outlined below.
Point-by-point responses
Referee: [Abstract / Experiments] Abstract and Experiments (likely §4): The SOTA claim (67.7% gIoU / 68.1% cIoU) is reported without any ablations that isolate the effect of the ordered query banks (latent reasoning tokens + anchor) versus a capacity-matched single-<SEG> baseline, versus TMCC alone, or versus the factorized distribution. This is load-bearing for the central claim of reliable semantic-spatial disentanglement.
Authors: We agree that isolating the contributions of the ordered query banks, anchor token, TMCC, and factorized distribution is necessary to support the central claim. In the revised manuscript, we will add a dedicated ablation study in §4 comparing the full model to a capacity-matched single-<SEG> baseline, variants without ordered reasoning tokens, ablations with/without TMCC, and variants of the factorized distribution. These results will quantify the incremental gains attributable to each component. revision: yes
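For illustration, the promised grid might be organized along four axes; every variant name below and the capacity-matching strategy are hypothetical, not taken from the paper.

```python
# Hypothetical ablation grid; variant names are illustrative only.
# "single_seg_matched" assumes widening the lone <SEG> embedding so the
# baseline matches the full query bank's parameter count.
ABLATIONS = {
    "full":               dict(reasoning_tokens=4, anchor=True, tmcc=True,  factorized=True),
    "single_seg_matched": dict(reasoning_tokens=0, anchor=True, tmcc=False, factorized=False),
    "no_ordered_tokens":  dict(reasoning_tokens=0, anchor=True, tmcc=True,  factorized=True),
    "no_tmcc":            dict(reasoning_tokens=4, anchor=True, tmcc=False, factorized=True),
    "no_factorization":   dict(reasoning_tokens=4, anchor=True, tmcc=True,  factorized=False),
}
```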
Referee: [Method / Experiments] §3 (method) and §4: No probing or analysis is described that demonstrates the reasoning tokens remain insensitive to spatial perturbations while the anchor token drives localization, nor any visualization or metric showing that the sequence enforces the intended factorization beyond aggregate metrics. Without such evidence, gains could arise from extra parameters or TMCC rather than the claimed decoupling.
Authors: We acknowledge that the current manuscript lacks explicit probing or diagnostic analysis to verify the intended factorization. In the revision, we will add sensitivity experiments measuring the impact of spatial perturbations on reasoning tokens versus the anchor token, along with visualizations of token activations, attention maps, and quantitative metrics (such as alignment scores between token predictions and mask regions) to demonstrate that the sequence enforces the claimed decoupling beyond aggregate metrics. revision: yes
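One concrete shape for such a probe, sketched against the hypothetical QueryBank interface above (`model.encode_image` is likewise assumed): shift the input image, re-encode, and measure how far each query state moves. Under the decoupling claim, the anchor state should track the shift while the reasoning states stay comparatively stable.

```python
# Hedged probe sketch; a `model` exposing encode_image and query_bank is an
# assumption, matching the hypothetical QueryBank sketched earlier.
import torch
import torch.nn.functional as F

@torch.no_grad()
def spatial_sensitivity(model, image, text_emb, shift_px=16):
    sem0, anc0 = model.query_bank(model.encode_image(image), text_emb)
    shifted = torch.roll(image, shifts=(shift_px, shift_px), dims=(-2, -1))
    sem1, anc1 = model.query_bank(model.encode_image(shifted), text_emb)
    # 1 - cosine similarity: higher means more sensitive to the shift.
    d_sem = 1 - F.cosine_similarity(sem0.flatten(1), sem1.flatten(1)).mean()
    d_anc = 1 - F.cosine_similarity(anc0, anc1).mean()
    return d_sem.item(), d_anc.item()  # expect d_anc >> d_sem if decoupled
```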
Referee: [Experiments] §4: The manuscript provides no details on baselines, data splits, error bars, statistical tests, or variance across runs, which is required to substantiate the reported improvements on ReasonSeg.
Authors: We will revise §4 to include full details on the baselines, the precise data splits for ReasonSeg, error bars from multiple runs with different random seeds, and any statistical tests used. Variance across runs will be reported to allow rigorous assessment of the improvements. revision: yes
Circularity Check
No significant circularity; the architectural proposal and the empirical results do not reduce to the paper's own inputs.
Full rationale
The paper presents AnchorSeg as a new architecture that explicitly constructs ordered query banks (latent reasoning tokens plus segmentation anchor token) and introduces the TMCC bidirectional loss to enforce cross-resolution alignment. These are described as design choices that reformulate the task as factorized conditional generation, with SOTA metrics (67.7% gIoU, 68.1% cIoU) reported as outcomes. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the claimed disentanglement or performance gains to fitted parameters, prior self-referential results, or definitional equivalences. The central claims rest on the proposed components and external benchmark evaluation rather than any load-bearing reduction to the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: transformer-based vision-language models can be trained end-to-end with pixel-level supervision.
invented entities (2)
- Language grounded query banks: no independent evidence
- Token-Mask Cycle Consistency (TMCC): no independent evidence