UGround: Towards Unified Visual Grounding with Unrolled Transformers

Chuanhang Deng; Dejing Dou; Jian Xiong; Rui Qian; Wei Zhai; Xin Yin; Zhiyuan Peng

arxiv: 2510.03853 · v4 · submitted 2025-10-04 · 💻 cs.CV

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou

Pith reviewed 2026-05-18 10:08 UTC · model grok-4.3

keywords: visual grounding, referring expression segmentation, reasoning segmentation, unrolled transformers, mask as prompt, stochastic skip connection, reinforcement learning policy, SAM prompting

The pith

UGround dynamically selects intermediate transformer layers to turn token similarities into explicit spatial masks for prompting SAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a paradigm for visual grounding that unrolls the transformer and lets a reinforcement learning policy pick which intermediate layer to use for each <SEG> token. Instead of always taking the final hidden state and feeding the <SEG> token itself as the prompt, the method derives a similarity map between the <SEG> token and the image tokens at the chosen layer and uses that map as a soft mask to guide the segmentation model. This supplies direct spatial activation regions and allows correction at an intermediate layer before errors accumulate through every remaining layer. If the approach holds, a single framework can handle referring expressions, reasoning-based descriptions, single or multiple targets, and queries that have no matching object at all.

Core claim

UGround is a unified visual grounding paradigm that dynamically selects intermediate layers across unrolled transformers as mask as prompt. Central to it is Policy-Prompted Masking, which comprises Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that via stochastic sampling lets each <SEG> token slide across layers and connect to the vision model in a skip-connection fashion. Given the selected layer, MasP computes the similarity map between the <SEG> token and image tokens and supplies it as a soft logit mask to SAM, offering explicit spatial cues through its activation regions.
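The MasP step described above can be illustrated with a toy computation. This is a minimal sketch, not the authors' implementation: the function name `similarity_map`, the scaled dot product, and the list-based tensors are all assumptions made for illustration, and how the map is resized and handed to SAM is not shown.

```python
import math

def similarity_map(seg_vec, image_tokens, grid_h, grid_w):
    """Return a grid_h x grid_w map of similarity logits.

    seg_vec:      hidden state of the <SEG> token at the selected layer
    image_tokens: hidden states of the image tokens at the same layer,
                  in row-major patch order
    """
    if len(image_tokens) != grid_h * grid_w:
        raise ValueError("image token count must equal grid_h * grid_w")
    # Assumption: scaled dot product as the similarity; the paper may use
    # a different similarity or normalisation.
    scale = 1.0 / math.sqrt(len(seg_vec))
    logits = [scale * sum(s * x for s, x in zip(seg_vec, tok))
              for tok in image_tokens]
    # Fold the flat list back into the 2D patch grid.
    return [logits[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

# Toy example: 4 image patches on a 2x2 grid, hidden size 4.
seg = [1.0, 0.0, 0.0, 0.0]
tokens = [
    [4.0, 0.0, 0.0, 0.0],   # patch aligned with the <SEG> direction
    [0.0, 1.0, 0.0, 0.0],   # orthogonal patch
    [0.0, 0.0, 1.0, 0.0],   # orthogonal patch
    [-4.0, 0.0, 0.0, 0.0],  # patch pointing away
]
soft_mask = similarity_map(seg, tokens, 2, 2)
print(soft_mask)  # [[2.0, 0.0], [0.0, -2.0]]
```

Positive logits mark patches the <SEG> token attends to, which is what gives the prompt its explicit spatial activation regions.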

What carries the argument

Policy-Prompted Masking, which uses a reinforcement learning policy for stochastic layer selection and converts the resulting token similarity map into a soft mask prompt for the segmentation model.
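Stochastic layer selection of this kind can be sketched as sampling from a categorical policy over layer indices. The sketch below is hypothetical: the softmax parameterisation, the names `sample_layer` and `layer_logits`, and the four-layer toy setup are assumptions, not details taken from the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_layer(layer_logits, rng):
    """Sample one layer index and return it with its log-probability.

    The log-probability is what a policy-gradient update would need.
    """
    probs = softmax(layer_logits)
    layer = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return layer, math.log(probs[layer])

rng = random.Random(0)                 # seeded for repeatability
layer_logits = [0.0, 0.5, 1.0, 0.5]    # hypothetical preferences over 4 layers
layer, log_prob = sample_layer(layer_logits, rng)
print(layer, log_prob)
```

The sampled index decides which layer's <SEG> and image hidden states are used to build the mask prompt; at inference time one might instead take the most probable layer, though the source does not say which is used.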

If this is right

Cumulative layer-by-layer errors can be corrected at chosen intermediate points rather than propagating unchecked to the final layer.
Text embeddings receive explicit spatial information through the activation regions of the similarity map instead of relying on implicit projection.
A single training framework covers referring expression segmentation, reasoning segmentation, single-target and multi-target cases, and positive versus false-premise queries.
The <SEG> token can connect to any selected unrolled layer in a skip-connection manner, allowing flexible integration with segmentation models such as SAM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same policy-driven layer selection could be tested on other vision-language backbones to check whether the benefit is specific to the unrolled transformer setup used here.
If the similarity map already encodes useful location cues, downstream models might be simplified by removing separate coordinate-encoding modules.
Extending the unification to additional query attributes, such as temporal or 3D references, would test whether the dynamic masking principle generalizes beyond the four attribute axes demonstrated.

Load-bearing premise

A reinforcement learning policy can reliably choose intermediate layers so that the similarity map gives useful spatial cues without creating training instabilities or new error modes.

What would settle it

A controlled comparison in which fixed last-layer prompting with the standard <SEG> token achieves higher mask accuracy than the dynamic selection policy on a benchmark containing both positive and false-premise queries.
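Such a comparison could be scored along the following lines. This is a minimal sketch under stated assumptions: masks are nested lists of 0/1, an empty prediction on an empty (false-premise) target is scored as 1.0, and the toy masks for `method_a` and `method_b` are invented purely to exercise the code. None of these conventions or numbers come from the paper.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equal-shaped nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: predicting nothing for an empty target is perfect.
    return 1.0 if union == 0 else inter / union

def mean_iou(preds, targets):
    """Average per-sample IoU over a benchmark."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)

# Two toy samples: one positive query, one false-premise query (empty target).
targets  = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
method_a = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]  # hallucinates on empty target
method_b = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]  # matches both targets

print(mean_iou(method_a, targets))  # 0.25
print(mean_iou(method_b, targets))  # 1.0
```

Running both prompting variants through the same scorer on the same samples keeps the comparison controlled; only the prompting mechanism should differ between the two runs.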

Original abstract

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt," diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt." UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional refer expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UGround unifies several visual grounding variants with unrolled transformers and an RL policy for dynamic layer selection via stochastic skips, but the gains over fixed last-layer baselines rest on unproven stability of that policy.

The core idea is to stop always taking the final transformer layer for the <SEG> token and instead let an RL policy stochastically pick intermediate layers for skip connections. At the selected layer, the similarity between the <SEG> token and the image tokens forms a map that serves as a soft spatial prompt to SAM. The authors also fold traditional referring segmentation, reasoning segmentation, single-target, multi-target, and false-premise cases into one training setup, which they present as a first from an attribute perspective.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes UGround, a unified visual grounding framework that replaces the standard fixed-last-layer <SEG> prompt with a dynamic mechanism: a reinforcement learning policy (Stochastic Skip Connection, SSC) stochastically samples intermediate layers across unrolled transformers, then feeds the resulting <SEG>-image similarity map as a soft logit mask (Mask as Prompt, MasP) to SAM. This is claimed to correct cumulative layer-wise errors and supply explicit spatial cues. The work unifies grounding tasks spanning referring expression segmentation, reasoning segmentation, single- and multi-target, and positive- to false-premise queries, with public code and models released.

Significance. If the empirical claims hold, the approach could meaningfully improve robustness in language-to-vision grounding pipelines by enabling intermediate correction rather than relying on the final hidden state. The attribute-based unification across query types and the public release of code/models are concrete strengths that aid reproducibility and further research.

major comments (2)

§3.2 (Stochastic Skip Connection): The central claim that SSC corrects cumulative layer-wise errors via dynamic selection rests on the RL policy converging to useful layers. However, the manuscript supplies neither the explicit reward formulation, nor baseline comparisons (e.g., fixed-layer, random, or heuristic selection), nor statistics on layer-selection distribution and variance during training. Without these, it is impossible to verify that the stochastic sampling produces reliable similarity maps for MasP rather than introducing additional instability.
§4 (Experiments): Reported gains on the unified task suite are presented, yet no ablation isolates the contribution of the RL-driven SSC component (e.g., SSC vs. fixed last layer with otherwise identical MasP). This omission makes it difficult to attribute performance differences specifically to the dynamic layer selection that underpins the error-correction argument.
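The layer-selection statistics requested in the first comment could be gathered with something like the following. This is an illustrative sketch only: the logged `selections` list is invented, and the choice of a normalised histogram plus Shannon entropy as the summary is an assumption about what such a report might contain.

```python
import math
from collections import Counter

def layer_histogram(selected_layers, num_layers):
    """Fraction of samples that selected each layer index."""
    counts = Counter(selected_layers)
    n = len(selected_layers)
    return [counts.get(i, 0) / n for i in range(num_layers)]

def entropy(probs):
    """Shannon entropy in nats; zero-probability bins contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical log of layer indices chosen by the policy over 8 samples.
selections = [2, 2, 3, 2, 1, 2, 3, 2]
hist = layer_histogram(selections, num_layers=4)
print(hist)  # [0.0, 0.125, 0.625, 0.25]
print(entropy(hist), "vs uniform", math.log(4))
```

An entropy well below that of the uniform distribution would indicate the policy has concentrated on a few layers; tracking it across training steps and seeds is one way to expose the variance the referee asks about.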

minor comments (2)

Figure 2: The unrolled-transformer diagram would benefit from explicit arrows or labels indicating where the sampled skip connection attaches to SAM and how the similarity map is computed.
§5.1: Notation for the similarity map (e.g., S vs. M) is used inconsistently between the method description and the loss equations; a single consistent symbol would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that will strengthen the presentation of the Stochastic Skip Connection mechanism and its empirical validation.

Referee: §3.2 (Stochastic Skip Connection): The central claim that SSC corrects cumulative layer-wise errors via dynamic selection rests on the RL policy converging to useful layers. However, the manuscript supplies neither the explicit reward formulation, nor baseline comparisons (e.g., fixed-layer, random, or heuristic selection), nor statistics on layer-selection distribution and variance during training. Without these, it is impossible to verify that the stochastic sampling produces reliable similarity maps for MasP rather than introducing additional instability.

Authors: We agree that greater transparency on the RL policy is warranted. The reward for the SSC policy is the negative of the final segmentation loss (Dice loss plus cross-entropy) computed on the mask produced by MasP, directly incentivizing layer selections that improve grounding quality. We will add this explicit formulation to §3.2. In addition, we have run the requested baselines (fixed last layer, uniform random selection, and depth-heuristic selection) and will report them together with layer-selection histograms and variance statistics across training runs. These results confirm that SSC converges to a stable, non-uniform distribution over intermediate layers and yields higher-quality similarity maps than the alternatives, thereby supporting rather than undermining the error-correction argument.
revision: yes

Referee: §4 (Experiments): Reported gains on the unified task suite are presented, yet no ablation isolates the contribution of the RL-driven SSC component (e.g., SSC vs. fixed last layer with otherwise identical MasP). This omission makes it difficult to attribute performance differences specifically to the dynamic layer selection that underpins the error-correction argument.

Authors: We acknowledge the value of isolating SSC’s contribution. The revised manuscript will include a dedicated ablation in §4 that replaces SSC with a fixed last-layer connection while keeping MasP and all other components identical. Internal results already show a consistent drop in mIoU and precision across the unified task suite when dynamic selection is removed, directly attributing gains to the ability to correct cumulative errors at intermediate layers. These numbers and the corresponding analysis will be added to the main paper or supplementary material.
revision: yes
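The reward described in the simulated rebuttal (negative of Dice loss plus cross-entropy) can be paired with a basic policy-gradient step to show how layer preferences would shift. This is a hedged sketch, not the paper's training code: binary cross-entropy on per-pixel probabilities, the REINFORCE update, the `baseline` and `lr` values, and all function names are assumptions.

```python
import math

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss on flat lists of mask probabilities and 0/1 labels."""
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(target) + eps)

def bce_loss(probs, target, eps=1e-7):
    """Mean binary cross-entropy (assumed form of the cross-entropy term)."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def reward(probs, target):
    """Reward = negative segmentation loss, as stated in the rebuttal."""
    return -(dice_loss(probs, target) + bce_loss(probs, target))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(layer_logits, chosen, r, baseline=0.0, lr=0.1):
    """One REINFORCE update on the logits of a softmax layer policy.

    d log pi(chosen) / d logit_k = 1[k == chosen] - pi(k)
    """
    probs = softmax(layer_logits)
    advantage = r - baseline
    return [z + lr * advantage * ((1.0 if k == chosen else 0.0) - probs[k])
            for k, z in enumerate(layer_logits)]

target = [1, 1, 0, 0]
good = [0.9, 0.9, 0.1, 0.1]   # mask close to the target
bad = [0.1, 0.1, 0.9, 0.9]    # mask far from the target
print(reward(good, target) > reward(bad, target))  # True

# A layer whose reward beats the baseline gains probability mass.
updated = reinforce_step([0.0, 0.0, 0.0], chosen=1, r=-0.2, baseline=-0.5)
print(softmax(updated)[1] > 1.0 / 3.0)  # True
```

Because the reward is always negative here, a baseline (for example a running mean of recent rewards) is what lets good selections be reinforced rather than merely penalised less.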

Circularity Check

0 steps flagged

No significant circularity; method proposal is self-contained

The paper proposes UGround as a novel architecture with Stochastic Skip Connection (RL policy for dynamic layer selection) and Mask as Prompt (similarity map to SAM). It directly addresses two stated limitations of prior pipelines (fixed last hidden layer and <SEG> prompt) without any derivation that reduces to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. Unification across grounding tasks is presented as empirical validation, not a mathematical result forced by the inputs. No equations or claims exhibit the enumerated circular patterns; the central mechanism is an independent design choice evaluated externally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that an RL-trained stochastic policy can select useful intermediate layers and that the resulting similarity map supplies explicit spatial information; both are introduced without independent verification in the abstract.

free parameters (1)

Reinforcement learning policy parameters
The Stochastic Skip Connection component is described as an RL policy that samples layer choices; such policies contain parameters fitted during training.

axioms (1)

Domain assumption: intermediate hidden states from unrolled transformer layers can be meaningfully used as prompts for a separate vision model such as SAM.
This assumption underpins the skip-connection design of SSC and the mask generation in MasP.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
cs.CV · 2026-05 · unverdicted · novelty 7.0
VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV · 2026-05 · unverdicted · novelty 7.0
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
cs.CV · 2026-04 · unverdicted · novelty 7.0
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

SE-GA: Memory-Augmented Self-Evolution for GUI Agents
cs.LG · 2026-05 · unverdicted · novelty 5.0
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.