pith. machine review for the scientific record.

arxiv: 2605.07154 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

Jing Zhang, Yuchen He

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords referring audio-visual segmentation · modality suppression · biased competition · multimodal fusion · video object segmentation · adaptive attention · cross-modal reasoning · referring expression

The pith

By estimating modality priorities from referring expressions, PRIMED adaptively suppresses misleading audio or visual cues for more accurate video object segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve Referring Audio-Visual Segmentation by addressing how the relevance of the audio and visual modalities varies across referring expressions and scenes. Existing approaches treat all multimodal cues equally during fusion, which makes them susceptible to noise or irrelevant information from one modality. PRIMED introduces a Modality Prior Decoder that estimates the primary modality reliance from the referring expression and uses it to guide adaptive suppression, drawing on biased competition theory. This is supported by additional components for global context sharing and a contrastive loss for sharper foreground-background discrimination, yielding state-of-the-art overall performance on the Ref-AVS benchmark.
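As a concrete illustration of the mechanism being claimed, the following PyTorch sketch shows one way a text-derived prior over {audio, vision, joint} could gate cross-modal attention. It is a minimal reading of the summary above, not the authors' implementation: the module names, tensor shapes, and the additive gating form are all assumptions.

```python
# Minimal sketch (not the authors' code): a prior predicted from the
# referring expression reweights how much the audio and visual streams
# contribute to the fused representation. Shapes and gating are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityPriorDecoder(nn.Module):
    """Maps a pooled text embedding to a 3-way prior (audio / vision / joint)."""
    def __init__(self, text_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, text_dim) -> prior: (B, 3), rows sum to 1
        return F.softmax(self.mlp(text_emb), dim=-1)

class PriorGatedFusion(nn.Module):
    """Text queries attend to visual and audio tokens; the prior scales each branch."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, txt, vis, aud, prior):
        # txt: (B, Nt, D), vis: (B, Nv, D), aud: (B, Na, D), prior: (B, 3)
        from_vis, _ = self.attn_vis(txt, vis, vis)
        from_aud, _ = self.attn_aud(txt, aud, aud)
        w_aud, w_vis, w_joint = prior[:, 0:1, None], prior[:, 1:2, None], prior[:, 2:3, None]
        # A vision-reliant expression suppresses the audio branch, an
        # audio-reliant one suppresses the visual branch; "joint" keeps both.
        return self.norm(txt + (w_vis + w_joint) * from_vis + (w_aud + w_joint) * from_aud)

# Toy usage with random tensors.
B, Nt, Nv, Na, D = 2, 8, 196, 10, 256
txt, vis, aud = torch.randn(B, Nt, D), torch.randn(B, Nv, D), torch.randn(B, Na, D)
prior = ModalityPriorDecoder(text_dim=D)(txt.mean(dim=1))  # (B, 3)
fused = PriorGatedFusion(dim=D)(txt, vis, aud, prior)      # (B, Nt, D)
```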

Core claim

PRIMED explicitly models visual perception and language-driven prior modulation for adaptive modality suppression in Ref-AVS. The Modality Prior Decoder estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, and generates a prior that guides high-level attention. A Token Distiller extracts compact global visual tokens shared across Competition-aware Cross-modal Fusion modules for hierarchical context, while a Spatial-Aware Semantic Alignment loss enhances foreground-background discrimination via contrastive learning.
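The Spatial-Aware Semantic Alignment loss is described only as contrastive foreground-background discrimination, so the sketch below shows the general shape of such an objective: a generic pixel-to-prototype contrastive loss, assuming per-pixel embeddings and a binary ground-truth mask. It is an illustration of the loss family, not the paper's exact formulation.

```python
# Generic foreground/background contrastive loss, assuming per-pixel
# embeddings `feat` (B, D, H, W) and a binary mask `gt` (B, H, W).
# Illustrative only; not the paper's SASA loss.
import torch
import torch.nn.functional as F

def fg_bg_contrastive_loss(feat: torch.Tensor, gt: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Pull every pixel embedding toward its own region prototype
    (foreground or background) and push it away from the other one."""
    B, D, H, W = feat.shape
    pix = F.normalize(feat, dim=1).flatten(2)             # (B, D, HW), unit-norm per pixel
    mask = gt.float().flatten(1)                          # (B, HW), 1 = foreground
    fg_cnt = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    bg_cnt = (1.0 - mask).sum(dim=1, keepdim=True).clamp(min=1.0)
    fg_proto = F.normalize((pix * mask.unsqueeze(1)).sum(dim=2) / fg_cnt, dim=1)          # (B, D)
    bg_proto = F.normalize((pix * (1.0 - mask).unsqueeze(1)).sum(dim=2) / bg_cnt, dim=1)  # (B, D)
    sim_fg = torch.einsum("bd,bdn->bn", fg_proto, pix) / tau   # similarity to fg prototype
    sim_bg = torch.einsum("bd,bdn->bn", bg_proto, pix) / tau   # similarity to bg prototype
    logits = torch.stack([sim_fg, sim_bg], dim=-1)             # (B, HW, 2)
    target = (1.0 - mask).long()                               # fg pixel -> class 0, bg -> class 1
    return F.cross_entropy(logits.reshape(-1, 2), target.reshape(-1))
```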

What carries the argument

The Modality Prior Decoder, which estimates from the referring expression whether reliance falls primarily on audio, vision, or their joint interaction and turns that estimate into a guiding prior for adaptive suppression, in line with biased competition theory.

Load-bearing premise

The Modality Prior Decoder accurately estimates the dominant modality or interaction type from the referring expression to guide suppression effectively.

What would settle it

Observing cases where the estimated modality prior is incorrect, leading to suppressed useful information and segmentation performance worse than standard fusion methods on the same benchmark.
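A minimal diagnostic along these lines is sketched below under assumed interfaces: `predict_prior`, `segment_with_prior`, and a per-sample `modality_label` annotation are hypothetical, not the paper's API. It would report a confusion matrix and per-class accuracy for the prior itself, alongside mIoU under the predicted, oracle, and uniform priors.

```python
# Hypothetical diagnostic: how much does prior misclassification cost?
# `model.predict_prior`, `model.segment_with_prior`, and the per-sample
# `modality_label` annotation are assumed interfaces, not the paper's API.
import numpy as np

CLASSES = ["audio", "vision", "joint"]

def prior_diagnostics(model, dataset, iou_fn):
    confusion = np.zeros((3, 3), dtype=int)
    iou_pred, iou_oracle, iou_uniform = [], [], []
    uniform = np.full(3, 1.0 / 3.0)
    for sample in dataset:
        prior = model.predict_prior(sample.expression)            # (3,) softmax over classes
        oracle = np.eye(3)[CLASSES.index(sample.modality_label)]  # ground-truth one-hot prior
        confusion[CLASSES.index(sample.modality_label), int(prior.argmax())] += 1
        iou_pred.append(iou_fn(model.segment_with_prior(sample, prior), sample.mask))
        iou_oracle.append(iou_fn(model.segment_with_prior(sample, oracle), sample.mask))
        iou_uniform.append(iou_fn(model.segment_with_prior(sample, uniform), sample.mask))
    per_class_acc = confusion.diagonal() / confusion.sum(axis=1).clip(min=1)
    return {
        "confusion": confusion,                       # rows = true class, cols = predicted
        "per_class_acc": per_class_acc,
        "mIoU_predicted": float(np.mean(iou_pred)),
        "mIoU_oracle": float(np.mean(iou_oracle)),    # upper bound for the prior's contribution
        "mIoU_uniform": float(np.mean(iou_uniform)),  # no-suppression baseline
    }
```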

Figures

Figures reproduced from arXiv: 2605.07154 by Jing Zhang, Yuchen He.

Figure 1: In cognitive neuroscience, multiple candidate regions compete for …
Figure 2: Overview of PRIMED. Multimodal features extracted from encoders, together with distilled visual tokens, are fused through stage-wise CBCF modules …
Figure 4: Prompts used for the label annotation audit.
Figure 5: Illustration of the CBCF module at stage 1 and 2. The structure of …
Figure 6: Attention visualization for the biased competition ablation. Without …
Figure 7: Visualization results of referred objects in the Ref-AVS benchmark. Please zoom in to see more details. Speaker icons indicate the sounding objects.
Figure 8: Failure cases of our proposed PRIMED. Please zoom in to see more details. Speaker icons indicate the sounding objects.
Original abstract

Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PRIMED for Referring Audio-Visual Segmentation (Ref-AVS). It models adaptive modality suppression via biased competition theory from neuroscience: a Modality Prior Decoder estimates whether a referring expression relies primarily on audio, vision, or joint interaction to guide high-level attention; a Token Distiller extracts compact global visual tokens shared across Competition-aware Cross-modal Fusion modules; and a Spatial-Aware Semantic Alignment loss enhances foreground-background discrimination via contrastive learning. The central claim is that this yields state-of-the-art overall performance on the Ref-AVS benchmark by avoiding homogeneous treatment of multimodal cues.

Significance. If the results hold with proper validation, the work offers a principled, cognitively-inspired mechanism for dynamic modality weighting in multimodal video segmentation, which could improve robustness when cues are noisy or conflicting. The explicit separation of language-driven prior modulation from perception is a conceptual strength, and the contrastive alignment loss addresses a practical discrimination issue. Reproducible code or parameter-free derivations are not mentioned, so significance rests on the empirical demonstration.

major comments (2)
  1. [§3.2] §3.2 (Modality Prior Decoder): The decoder's ability to accurately infer audio/vision/joint reliance from the referring expression is assumed without direct validation such as per-class accuracy, a confusion matrix, or an ablation that isolates decoder misclassification errors. This is load-bearing for the central claim, because incorrect priors would trigger harmful suppression in the Competition-aware Cross-modal Fusion, negating any benefit from biased-competition modeling. A quantitative evaluation of the prior (or an oracle-prior upper bound) is required.
  2. [Experiments] Experiments section: The abstract asserts SOTA overall performance, yet no quantitative numbers, error bars, per-modality breakdowns, or statistical significance tests appear in the provided text. Without these, it is impossible to assess whether gains are consistent across referring-expression types or merely average-case improvements.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two key quantitative results (e.g., mIoU delta over the prior SOTA) to substantiate the SOTA claim for readers who do not immediately access the full experiments.
  2. [§3.1–3.2] Notation for the modality prior (e.g., how the three-way classification is encoded and passed to the fusion modules) should be introduced with an equation or diagram reference in §3.1–3.2 for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Modality Prior Decoder): The decoder's ability to accurately infer audio/vision/joint reliance from the referring expression is assumed without direct validation such as per-class accuracy, a confusion matrix, or an ablation that isolates decoder misclassification errors. This is load-bearing for the central claim, because incorrect priors would trigger harmful suppression in the Competition-aware Cross-modal Fusion, negating any benefit from biased-competition modeling. A quantitative evaluation of the prior (or an oracle-prior upper bound) is required.

    Authors: We agree that validating the Modality Prior Decoder is essential, as its accuracy directly impacts the effectiveness of adaptive suppression in the Competition-aware Cross-modal Fusion. The current manuscript describes the decoder architecture and its role but lacks a dedicated quantitative assessment of its inference accuracy (e.g., per-class accuracy or confusion matrix) or an ablation isolating misclassification effects. We will add this evaluation in the revised version, including per-class accuracy metrics, a confusion matrix on the Ref-AVS benchmark, and an oracle-prior upper-bound experiment to measure the gap between predicted and ideal priors. These results will be incorporated into §3.2 and the experiments section. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts SOTA overall performance, yet no quantitative numbers, error bars, per-modality breakdowns, or statistical significance tests appear in the provided text. Without these, it is impossible to assess whether gains are consistent across referring-expression types or merely average-case improvements.

    Authors: We acknowledge that the text excerpt reviewed may not have displayed the full experimental details. The complete manuscript includes Section 4 with tables reporting quantitative SOTA results on the Ref-AVS benchmark, including overall metrics and some breakdowns by referring expression categories. However, to strengthen the evidence, we will revise the experiments section to explicitly include error bars (standard deviations across runs), detailed per-modality and per-expression-type breakdowns, and statistical significance tests (e.g., paired t-tests against baselines). This will allow clearer assessment of consistency across conditions. revision: yes
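For reference, the promised significance testing reduces to a paired comparison over per-video scores. The sketch below assumes two aligned arrays of per-video J/IoU values (placeholders, not results from the paper) and runs a paired t-test with a Wilcoxon signed-rank test as a non-parametric check.

```python
# Sketch of the promised significance testing: paired tests over per-video
# scores of PRIMED vs. a baseline. The score arrays are placeholders.
import numpy as np
from scipy import stats

def compare_methods(scores_primed: np.ndarray, scores_baseline: np.ndarray) -> dict:
    assert scores_primed.shape == scores_baseline.shape  # one score per video, aligned
    diff = scores_primed - scores_baseline
    t_stat, t_p = stats.ttest_rel(scores_primed, scores_baseline)   # paired t-test
    w_stat, w_p = stats.wilcoxon(scores_primed, scores_baseline)    # non-parametric check
    return {
        "mean_delta": float(diff.mean()),
        "std_delta": float(diff.std(ddof=1)),
        "paired_t_p": float(t_p),
        "wilcoxon_p": float(w_p),
    }
```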

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

Full rationale

The paper proposes an engineering architecture (Modality Prior Decoder, Token Distiller, Competition-aware Cross-modal Fusion, and Spatial-Aware Semantic Alignment loss) trained on the Ref-AVS benchmark and evaluated for SOTA performance. No mathematical derivations, uniqueness theorems, or parameter-fitting steps are described that reduce a claimed prediction back to its own inputs by construction. The central performance claim rests on external benchmark results rather than self-referential definitions or self-citation chains. The Modality Prior Decoder is a learned module whose accuracy is an empirical question, not a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the untested applicability of biased competition theory to multimodal fusion and on the effectiveness of newly introduced model components whose behavior is not independently verified in the abstract.

axioms (1)
  • domain assumption: Biased competition theory from cognitive neuroscience can be directly translated into an adaptive suppression mechanism for audio-visual-text fusion.
    The paper states it is inspired by the theory but offers no derivation or empirical justification for the mapping.
invented entities (2)
  • Modality Prior Decoder (no independent evidence)
    purpose: Estimates primary modality reliance to guide suppression
    New component introduced to produce the modality prior.
  • Token Distiller (no independent evidence)
    purpose: Extracts and shares compact global visual tokens
    New component for providing hierarchical context.

pith-pipeline@v0.9.0 · 5500 in / 1417 out tokens · 65304 ms · 2026-05-11T02:27:04.015356+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 3 internal anchors

  1. [1] K. Gavrilyuk, A. Ghodrati, Z. Li, and C. G. Snoek, “Actor and action video segmentation from a sentence,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5958–5966.
  2. [2] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” in European Conference on Computer Vision. Springer, 2018, pp. 247–263.
  3. [3] A. Khoreva, A. Rohrbach, and B. Schiele, “Video object segmentation with language referring expressions,” in Asian Conference on Computer Vision. Springer, 2018, pp. 123–141.
  4. [4] S. Seo, J.-Y. Lee, and B. Han, “Urvos: Unified referring video object segmentation network with a large-scale benchmark,” in European Conference on Computer Vision. Springer, 2020, pp. 208–223.
  5. [5] X. Li, J. Wang, X. Xu, X. Li, B. Raj, and Y. Lu, “Robust referring video object segmentation with cyclic structural consensus,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22236–22245.
  6. [6] Z. Zhang, Z. Dai, L. Qin, and W. Wang, “Effived: Efficient video editing via text-instruction diffusion models,” arXiv preprint arXiv:2403.11568, 2024.
  7. [7] J. Yoon, S. Yu, and M. Bansal, “Raccoon: Versatile instructional video editing with auto-generated narratives,” in Conference on Empirical Methods in Natural Language Processing, 2025, pp. 27960–27996.
  8. [8] X. Fang, L. P. Kaelbling, and T. Lozano-Pérez, “Embodied uncertainty-aware object segmentation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 2639–2646.
  9. [9] H. Huang, X. Chen, Y. Chen, H. Li, X. Han, Z. Wang, T. Wang, J. Pang, and Z. Zhao, “Roboground: Robotic manipulation with grounded vision-language priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 22540–22550.
  10. [10] Y. Wang, P. Sun, D. Zhou, G. Li, H. Zhang, and D. Hu, “Ref-avs: Refer and segment objects in audio-visual scenes,” in European Conference on Computer Vision. Springer, 2024, pp. 196–213.
  11. [11] C. Guo, H. Huang, Y.-H. Zhou, C. Yuan, and D. Han, “Leveraging mamba for reference audio-visual segmentation with vote and cache mechanism,” Expert Systems with Applications, p. 129371, 2025.
  12. [12] Y. Wang, H. Xu, Y. Liu, J. Li, and Y. Tang, “Sam2-love: Segment anything model 2 in language-aided audio-visual scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 28932–28941.
  13. [13] A. Radman and J. Laaksonen, “Tsam: Temporal sam augmented with multimodal prompts for referring audio-visual segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 23947–23956.
  14. [14] H. Du, G. Li, C. Zhou, C. Zhang, A. Zhao, and D. Hu, “Crab: A unified audio-visual scene understanding model with explicit cooperation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 18804–18814.
  15. [15] Z. Luo, N. Liu, F. S. Khan, and J. Han, “Aurora: Augmented understanding via structured reasoning and reinforcement learning for reference audio-visual segmentation,” arXiv preprint arXiv:2508.02149, 2025.
  16. [16] J. Zhou, Y. Zhou, M. Han, T. Wang, X. Chang, H. Cholakkal, and R. M. Anwer, “Think before you segment: An object-aware reasoning agent for referring audio-visual segmentation,” arXiv preprint arXiv:2508.04418, 2025.
  17. [17] R. Desimone, J. Duncan et al., “Neural mechanisms of selective visual attention,” Annual Review of Neuroscience, vol. 18, no. 1, pp. 193–222, 1995.
  18. [18] D. M. Beck and S. Kastner, “Top-down and bottom-up mechanisms in biasing competition in the human brain,” Vision Research, vol. 49, no. 10, pp. 1154–1165, 2009.
  19. [19] J. Zhou, J. Wang, J. Zhang, W. Sun, J. Zhang, S. Birchfield, D. Guo, L. Kong, M. Wang, and Y. Zhong, “Audio-visual segmentation,” in European Conference on Computer Vision. Springer, 2022, pp. 386–403.
  20. [20] J. Zhou, X. Shen, J. Wang, J. Zhang, W. Sun, J. Zhang, S. Birchfield, D. Guo, L. Kong, M. Wang et al., “Audio-visual segmentation with semantics,” International Journal of Computer Vision, vol. 133, no. 4, pp. 1644–1664, 2025.
  21. [21] K. Li, Z. Yang, L. Chen, Y. Yang, and J. Xiao, “Catr: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation,” in Proceedings of the ACM International Conference on Multimedia, 2023, pp. 1485–1494.
  22. [22] S. Gao, Z. Chen, G. Chen, W. Wang, and T. Lu, “Avsegformer: Audio-visual segmentation with transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12155–12163.
  23. [23] Y. Ling, Y. Li, Z. Gan, J. Zhang, M. Chi, and Y. Wang, “Transavs: End-to-end audio-visual segmentation with transformer,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7845–7849.
  24. [24] S. Huang, H. Li, Y. Wang, H. Zhu, J. Dai, J. Han, W. Rong, and S. Liu, “Discovering sounding objects by audio queries for audio visual segmentation,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2023, pp. 876–884.
  25. [25] Y. Lv, Z. Liu, and X. Chang, “Consistency-queried transformer for audio-visual segmentation,” IEEE Transactions on Image Processing, 2025.
  26. [26] X. Li, J. Wang, X. Xu, X. Peng, R. Singh, Y. Lu, and B. Raj, “Qdformer: Towards robust audiovisual segmentation in complex environments with quantization-based semantic decomposition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3402–3413.
  27. [27] S. Mo and Y. Tian, “Av-sam: Segment anything model meets audio-visual localization and segmentation,” arXiv preprint arXiv:2305.01836, 2023.
  28. [28] J. Liu, Y. Wang, C. Ju, C. Ma, Y. Zhang, and W. Xie, “Annotation-free audio-visual segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5604–5614.
  29. [29] Y. Wang, W. Liu, G. Li, J. Ding, D. Hu, and X. Li, “Prompting segmentation with sound is generalizable audio-visual source localizer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 5669–5677.
  30. [30] H. Zhong, M. Zhu, Z. Du, Z. Huang, C. Zhao, M. Liu, W. Wang, H. Chen, and C. Shen, “Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration,” arXiv preprint arXiv:2505.20256, 2025.
  31. [31] K. Ying, H. Ding, G. Jie, and Y.-G. Jiang, “Towards omnimodal expressions and reasoning in referring audio-visual segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22575–22585.
  32. [32] C. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman, J. Malik, Y. Li, and C. Feichtenhofer, “Hiera: A hierarchical vision transformer without the bells-and-whistles,” International Conference on Machine Learning, 2023.
  33. [33] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  34. [34] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650.
  35. [35] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  36. [36] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
  37. [37] J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025.
  38. [38] J. Wu, Y. Jiang, P. Sun, Z. Yuan, and P. Luo, “Language as queries for referring video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4974–4984.