pith. machine review for the scientific record.

arxiv: 2605.07154 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

Jing Zhang, Yuchen He

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords referring audio-visual segmentation · modality suppression · biased competition · multimodal fusion · video object segmentation · adaptive attention · cross-modal reasoning · referring expression

The pith

By estimating modality priorities from referring expressions, PRIMED adaptively suppresses misleading audio or visual cues for more accurate video object segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve Referring Audio-Visual Segmentation by addressing how the relevance of the audio and visual modalities varies across referring expressions and scenes. Existing approaches treat all multimodal cues equally during fusion, which makes them susceptible to noise or irrelevant information from one modality. PRIMED introduces a Modality Prior Decoder that estimates the primary modality reliance from the referring expression and uses it to guide adaptive suppression, drawing on biased competition theory. This is supported by additional components for global context sharing and a contrastive loss for sharper foreground-background discrimination, yielding state-of-the-art overall performance on the Ref-AVS benchmark.
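As a concrete illustration of the mechanism being claimed, the following PyTorch sketch shows one way a text-derived prior over {audio, vision, joint} could gate cross-modal attention. It is a minimal reading of the summary above, not the authors' implementation: the module names, tensor shapes, and the additive gating form are all assumptions.

```python
# Minimal sketch (not the authors' code): a prior predicted from the
# referring expression reweights how much the audio and visual streams
# contribute to the fused representation. Shapes and gating are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityPriorDecoder(nn.Module):
    """Maps a pooled text embedding to a 3-way prior (audio / vision / joint)."""
    def __init__(self, text_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, text_dim) -> prior: (B, 3), rows sum to 1
        return F.softmax(self.mlp(text_emb), dim=-1)

class PriorGatedFusion(nn.Module):
    """Text queries attend to visual and audio tokens; the prior scales each branch."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, txt, vis, aud, prior):
        # txt: (B, Nt, D), vis: (B, Nv, D), aud: (B, Na, D), prior: (B, 3)
        from_vis, _ = self.attn_vis(txt, vis, vis)
        from_aud, _ = self.attn_aud(txt, aud, aud)
        w_aud, w_vis, w_joint = prior[:, 0:1, None], prior[:, 1:2, None], prior[:, 2:3, None]
        # A vision-reliant expression suppresses the audio branch, an
        # audio-reliant one suppresses the visual branch; "joint" keeps both.
        return self.norm(txt + (w_vis + w_joint) * from_vis + (w_aud + w_joint) * from_aud)

# Toy usage with random tensors.
B, Nt, Nv, Na, D = 2, 8, 196, 10, 256
txt, vis, aud = torch.randn(B, Nt, D), torch.randn(B, Nv, D), torch.randn(B, Na, D)
prior = ModalityPriorDecoder(text_dim=D)(txt.mean(dim=1))  # (B, 3)
fused = PriorGatedFusion(dim=D)(txt, vis, aud, prior)      # (B, Nt, D)
```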

Core claim

PRIMED explicitly models visual perception and language-driven prior modulation for adaptive modality suppression in Ref-AVS. The Modality Prior Decoder estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, and generates a prior that guides high-level attention. A Token Distiller extracts compact global visual tokens shared across Competition-aware Cross-modal Fusion modules for hierarchical context, while a Spatial-Aware Semantic Alignment loss enhances foreground-background discrimination via contrastive learning.
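The Spatial-Aware Semantic Alignment loss is described only as contrastive foreground-background discrimination, so the sketch below shows the general shape of such an objective: a generic pixel-to-prototype contrastive loss, assuming per-pixel embeddings and a binary ground-truth mask. It is an illustration of the loss family, not the paper's exact formulation.

```python
# Generic foreground/background contrastive loss, assuming per-pixel
# embeddings `feat` (B, D, H, W) and a binary mask `gt` (B, H, W).
# Illustrative only; not the paper's SASA loss.
import torch
import torch.nn.functional as F

def fg_bg_contrastive_loss(feat: torch.Tensor, gt: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Pull every pixel embedding toward its own region prototype
    (foreground or background) and push it away from the other one."""
    B, D, H, W = feat.shape
    pix = F.normalize(feat, dim=1).flatten(2)             # (B, D, HW), unit-norm per pixel
    mask = gt.float().flatten(1)                          # (B, HW), 1 = foreground
    fg_cnt = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    bg_cnt = (1.0 - mask).sum(dim=1, keepdim=True).clamp(min=1.0)
    fg_proto = F.normalize((pix * mask.unsqueeze(1)).sum(dim=2) / fg_cnt, dim=1)          # (B, D)
    bg_proto = F.normalize((pix * (1.0 - mask).unsqueeze(1)).sum(dim=2) / bg_cnt, dim=1)  # (B, D)
    sim_fg = torch.einsum("bd,bdn->bn", fg_proto, pix) / tau   # similarity to fg prototype
    sim_bg = torch.einsum("bd,bdn->bn", bg_proto, pix) / tau   # similarity to bg prototype
    logits = torch.stack([sim_fg, sim_bg], dim=-1)             # (B, HW, 2)
    target = (1.0 - mask).long()                               # fg pixel -> class 0, bg -> class 1
    return F.cross_entropy(logits.reshape(-1, 2), target.reshape(-1))
```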

What carries the argument

The Modality Prior Decoder, which estimates from the referring expression whether reliance falls primarily on audio, vision, or their joint interaction and turns that estimate into a guiding prior for adaptive suppression, in line with biased competition theory.

Load-bearing premise

The Modality Prior Decoder accurately estimates the dominant modality or interaction type from the referring expression to guide suppression effectively.

What would settle it

Observing cases where the estimated modality prior is incorrect, leading to suppressed useful information and segmentation performance worse than standard fusion methods on the same benchmark.
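A minimal diagnostic along these lines is sketched below under assumed interfaces: `predict_prior`, `segment_with_prior`, and a per-sample `modality_label` annotation are hypothetical, not the paper's API. It would report a confusion matrix and per-class accuracy for the prior itself, alongside mIoU under the predicted, oracle, and uniform priors.

```python
# Hypothetical diagnostic: how much does prior misclassification cost?
# `model.predict_prior`, `model.segment_with_prior`, and the per-sample
# `modality_label` annotation are assumed interfaces, not the paper's API.
import numpy as np

CLASSES = ["audio", "vision", "joint"]

def prior_diagnostics(model, dataset, iou_fn):
    confusion = np.zeros((3, 3), dtype=int)
    iou_pred, iou_oracle, iou_uniform = [], [], []
    uniform = np.full(3, 1.0 / 3.0)
    for sample in dataset:
        prior = model.predict_prior(sample.expression)            # (3,) softmax over classes
        oracle = np.eye(3)[CLASSES.index(sample.modality_label)]  # ground-truth one-hot prior
        confusion[CLASSES.index(sample.modality_label), int(prior.argmax())] += 1
        iou_pred.append(iou_fn(model.segment_with_prior(sample, prior), sample.mask))
        iou_oracle.append(iou_fn(model.segment_with_prior(sample, oracle), sample.mask))
        iou_uniform.append(iou_fn(model.segment_with_prior(sample, uniform), sample.mask))
    per_class_acc = confusion.diagonal() / confusion.sum(axis=1).clip(min=1)
    return {
        "confusion": confusion,                       # rows = true class, cols = predicted
        "per_class_acc": per_class_acc,
        "mIoU_predicted": float(np.mean(iou_pred)),
        "mIoU_oracle": float(np.mean(iou_oracle)),    # upper bound for the prior's contribution
        "mIoU_uniform": float(np.mean(iou_uniform)),  # no-suppression baseline
    }
```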

Figures

Figures reproduced from arXiv: 2605.07154 by Jing Zhang, Yuchen He.

Figure 1: In cognitive neuroscience, multiple candidate regions compete for …
Figure 2: Overview of PRIMED. Multimodal features extracted from encoders, together with distilled visual tokens, are fused through stage-wise CBCF modules …
Figure 4: Prompts used for the label annotation audit.
Figure 5: Illustration of the CBCF module at stage 1 and 2. The structure of …
Figure 6: Attention visualization for the biased competition ablation. Without …
Figure 7: Visualization results of referred objects in the Ref-AVS benchmark. Please zoom in to see more details. Speaker icons indicate the sounding objects.
Figure 8: Failure cases of our proposed PRIMED. Please zoom in to see more details. Speaker icons indicate the sounding objects.
Original abstract

Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PRIMED for Referring Audio-Visual Segmentation (Ref-AVS). It models adaptive modality suppression via biased competition theory from neuroscience: a Modality Prior Decoder estimates whether a referring expression relies primarily on audio, vision, or joint interaction to guide high-level attention; a Token Distiller extracts compact global visual tokens shared across Competition-aware Cross-modal Fusion modules; and a Spatial-Aware Semantic Alignment loss enhances foreground-background discrimination via contrastive learning. The central claim is that this yields state-of-the-art overall performance on the Ref-AVS benchmark by avoiding homogeneous treatment of multimodal cues.

Significance. If the results hold with proper validation, the work offers a principled, cognitively-inspired mechanism for dynamic modality weighting in multimodal video segmentation, which could improve robustness when cues are noisy or conflicting. The explicit separation of language-driven prior modulation from perception is a conceptual strength, and the contrastive alignment loss addresses a practical discrimination issue. Reproducible code or parameter-free derivations are not mentioned, so significance rests on the empirical demonstration.

major comments (2)
  1. [§3.2] §3.2 (Modality Prior Decoder): The decoder's ability to accurately infer audio/vision/joint reliance from the referring expression is assumed without direct validation such as per-class accuracy, a confusion matrix, or an ablation that isolates decoder misclassification errors. This is load-bearing for the central claim, because incorrect priors would trigger harmful suppression in the Competition-aware Cross-modal Fusion, negating any benefit from biased-competition modeling. A quantitative evaluation of the prior (or an oracle-prior upper bound) is required.
  2. [Experiments] Experiments section: The abstract asserts SOTA overall performance, yet no quantitative numbers, error bars, per-modality breakdowns, or statistical significance tests appear in the provided text. Without these, it is impossible to assess whether gains are consistent across referring-expression types or merely average-case improvements.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two key quantitative results (e.g., mIoU delta over the prior SOTA) to substantiate the SOTA claim for readers who do not immediately access the full experiments.
  2. [§3.1–3.2] Notation for the modality prior (e.g., how the three-way classification is encoded and passed to the fusion modules) should be introduced with an equation or diagram reference in §3.1–3.2 for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Modality Prior Decoder): The decoder's ability to accurately infer audio/vision/joint reliance from the referring expression is assumed without direct validation such as per-class accuracy, a confusion matrix, or an ablation that isolates decoder misclassification errors. This is load-bearing for the central claim, because incorrect priors would trigger harmful suppression in the Competition-aware Cross-modal Fusion, negating any benefit from biased-competition modeling. A quantitative evaluation of the prior (or an oracle-prior upper bound) is required.

    Authors: We agree that validating the Modality Prior Decoder is essential, as its accuracy directly impacts the effectiveness of adaptive suppression in the Competition-aware Cross-modal Fusion. The current manuscript describes the decoder architecture and its role but lacks a dedicated quantitative assessment of its inference accuracy (e.g., per-class accuracy or confusion matrix) or an ablation isolating misclassification effects. We will add this evaluation in the revised version, including per-class accuracy metrics, a confusion matrix on the Ref-AVS benchmark, and an oracle-prior upper-bound experiment to measure the gap between predicted and ideal priors. These results will be incorporated into §3.2 and the experiments section. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts SOTA overall performance, yet no quantitative numbers, error bars, per-modality breakdowns, or statistical significance tests appear in the provided text. Without these, it is impossible to assess whether gains are consistent across referring-expression types or merely average-case improvements.

    Authors: We acknowledge that the text excerpt reviewed may not have displayed the full experimental details. The complete manuscript includes Section 4 with tables reporting quantitative SOTA results on the Ref-AVS benchmark, including overall metrics and some breakdowns by referring expression categories. However, to strengthen the evidence, we will revise the experiments section to explicitly include error bars (standard deviations across runs), detailed per-modality and per-expression-type breakdowns, and statistical significance tests (e.g., paired t-tests against baselines). This will allow clearer assessment of consistency across conditions. revision: yes
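For reference, the promised significance testing reduces to a paired comparison over per-video scores. The sketch below assumes two aligned arrays of per-video J/IoU values (placeholders, not results from the paper) and runs a paired t-test with a Wilcoxon signed-rank test as a non-parametric check.

```python
# Sketch of the promised significance testing: paired tests over per-video
# scores of PRIMED vs. a baseline. The score arrays are placeholders.
import numpy as np
from scipy import stats

def compare_methods(scores_primed: np.ndarray, scores_baseline: np.ndarray) -> dict:
    assert scores_primed.shape == scores_baseline.shape  # one score per video, aligned
    diff = scores_primed - scores_baseline
    t_stat, t_p = stats.ttest_rel(scores_primed, scores_baseline)   # paired t-test
    w_stat, w_p = stats.wilcoxon(scores_primed, scores_baseline)    # non-parametric check
    return {
        "mean_delta": float(diff.mean()),
        "std_delta": float(diff.std(ddof=1)),
        "paired_t_p": float(t_p),
        "wilcoxon_p": float(w_p),
    }
```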

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

Full rationale

The paper proposes an engineering architecture (Modality Prior Decoder, Token Distiller, Competition-aware Cross-modal Fusion, and Spatial-Aware Semantic Alignment loss) trained on the Ref-AVS benchmark and evaluated for SOTA performance. No mathematical derivations, uniqueness theorems, or parameter-fitting steps are described that reduce a claimed prediction back to its own inputs by construction. The central performance claim rests on external benchmark results rather than self-referential definitions or self-citation chains. The Modality Prior Decoder is a learned module whose accuracy is an empirical question, not a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the untested applicability of biased competition theory to multimodal fusion and on the effectiveness of newly introduced model components whose behavior is not independently verified in the abstract.

axioms (1)
  • domain assumption: Biased competition theory from cognitive neuroscience can be directly translated into an adaptive suppression mechanism for audio-visual-text fusion.
    The paper states it is inspired by the theory but offers no derivation or empirical justification for the mapping.
invented entities (2)
  • Modality Prior Decoder (no independent evidence)
    purpose: Estimates primary modality reliance to guide suppression
    New component introduced to produce the modality prior.
  • Token Distiller (no independent evidence)
    purpose: Extracts and shares compact global visual tokens
    New component for providing hierarchical context.

pith-pipeline@v0.9.0 · 5500 in / 1417 out tokens · 65304 ms · 2026-05-11T02:27:04.015356+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 3 internal anchors

  1. [1] K. Gavrilyuk, A. Ghodrati, Z. Li, and C. G. Snoek, “Actor and action video segmentation from a sentence,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5958–5966.
  2. [2] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” in European Conference on Computer Vision. Springer, 2018, pp. 247–263.
  3. [3] A. Khoreva, A. Rohrbach, and B. Schiele, “Video object segmentation with language referring expressions,” in Asian Conference on Computer Vision. Springer, 2018, pp. 123–141.
  4. [4] S. Seo, J.-Y. Lee, and B. Han, “Urvos: Unified referring video object segmentation network with a large-scale benchmark,” in European Conference on Computer Vision. Springer, 2020, pp. 208–223.
  5. [5] X. Li, J. Wang, X. Xu, X. Li, B. Raj, and Y. Lu, “Robust referring video object segmentation with cyclic structural consensus,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22236–22245.
  6. [6] Z. Zhang, Z. Dai, L. Qin, and W. Wang, “Effived: Efficient video editing via text-instruction diffusion models,” arXiv preprint arXiv:2403.11568, 2024.
  7. [7] J. Yoon, S. Yu, and M. Bansal, “Raccoon: Versatile instructional video editing with auto-generated narratives,” in Conference on Empirical Methods in Natural Language Processing, 2025, pp. 27960–27996.
  8. [8] X. Fang, L. P. Kaelbling, and T. Lozano-Pérez, “Embodied uncertainty-aware object segmentation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 2639–2646.
  9. [9] H. Huang, X. Chen, Y. Chen, H. Li, X. Han, Z. Wang, T. Wang, J. Pang, and Z. Zhao, “Roboground: Robotic manipulation with grounded vision-language priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 22540–22550.
  10. [10] Y. Wang, P. Sun, D. Zhou, G. Li, H. Zhang, and D. Hu, “Ref-avs: Refer and segment objects in audio-visual scenes,” in European Conference on Computer Vision. Springer, 2024, pp. 196–213.
  11. [11] C. Guo, H. Huang, Y.-H. Zhou, C. Yuan, and D. Han, “Leveraging mamba for reference audio-visual segmentation with vote and cache mechanism,” Expert Systems with Applications, p. 129371, 2025.
  12. [12] Y. Wang, H. Xu, Y. Liu, J. Li, and Y. Tang, “Sam2-love: Segment anything model 2 in language-aided audio-visual scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 28932–28941.
  13. [13] A. Radman and J. Laaksonen, “Tsam: Temporal sam augmented with multimodal prompts for referring audio-visual segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 23947–23956.
  14. [14] H. Du, G. Li, C. Zhou, C. Zhang, A. Zhao, and D. Hu, “Crab: A unified audio-visual scene understanding model with explicit cooperation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 18804–18814.
  15. [15] Z. Luo, N. Liu, F. S. Khan, and J. Han, “Aurora: Augmented understanding via structured reasoning and reinforcement learning for reference audio-visual segmentation,” arXiv preprint arXiv:2508.02149, 2025.
  16. [16] J. Zhou, Y. Zhou, M. Han, T. Wang, X. Chang, H. Cholakkal, and R. M. Anwer, “Think before you segment: An object-aware reasoning agent for referring audio-visual segmentation,” arXiv preprint arXiv:2508.04418, 2025.
  17. [17] R. Desimone, J. Duncan et al., “Neural mechanisms of selective visual attention,” Annual Review of Neuroscience, vol. 18, no. 1, pp. 193–222, 1995.
  18. [18] D. M. Beck and S. Kastner, “Top-down and bottom-up mechanisms in biasing competition in the human brain,” Vision Research, vol. 49, no. 10, pp. 1154–1165, 2009.
  19. [19] J. Zhou, J. Wang, J. Zhang, W. Sun, J. Zhang, S. Birchfield, D. Guo, L. Kong, M. Wang, and Y. Zhong, “Audio-visual segmentation,” in European Conference on Computer Vision. Springer, 2022, pp. 386–403.
  20. [20] J. Zhou, X. Shen, J. Wang, J. Zhang, W. Sun, J. Zhang, S. Birchfield, D. Guo, L. Kong, M. Wang et al., “Audio-visual segmentation with semantics,” International Journal of Computer Vision, vol. 133, no. 4, pp. 1644–1664, 2025.
  21. [21] K. Li, Z. Yang, L. Chen, Y. Yang, and J. Xiao, “Catr: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation,” in Proceedings of the ACM International Conference on Multimedia, 2023, pp. 1485–1494.
  22. [22] S. Gao, Z. Chen, G. Chen, W. Wang, and T. Lu, “Avsegformer: Audio-visual segmentation with transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12155–12163.
  23. [23] Y. Ling, Y. Li, Z. Gan, J. Zhang, M. Chi, and Y. Wang, “Transavs: End-to-end audio-visual segmentation with transformer,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7845–7849.
  24. [24] S. Huang, H. Li, Y. Wang, H. Zhu, J. Dai, J. Han, W. Rong, and S. Liu, “Discovering sounding objects by audio queries for audio visual segmentation,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2023, pp. 876–884.
  25. [25] Y. Lv, Z. Liu, and X. Chang, “Consistency-queried transformer for audio-visual segmentation,” IEEE Transactions on Image Processing, 2025.
  26. [26] X. Li, J. Wang, X. Xu, X. Peng, R. Singh, Y. Lu, and B. Raj, “Qdformer: Towards robust audiovisual segmentation in complex environments with quantization-based semantic decomposition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3402–3413.
  27. [27] S. Mo and Y. Tian, “Av-sam: Segment anything model meets audio-visual localization and segmentation,” arXiv preprint arXiv:2305.01836, 2023.
  28. [28] J. Liu, Y. Wang, C. Ju, C. Ma, Y. Zhang, and W. Xie, “Annotation-free audio-visual segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5604–5614.
  29. [29] Y. Wang, W. Liu, G. Li, J. Ding, D. Hu, and X. Li, “Prompting segmentation with sound is generalizable audio-visual source localizer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 5669–5677.
  30. [30] H. Zhong, M. Zhu, Z. Du, Z. Huang, C. Zhao, M. Liu, W. Wang, H. Chen, and C. Shen, “Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration,” arXiv preprint arXiv:2505.20256, 2025.
  31. [31] K. Ying, H. Ding, G. Jie, and Y.-G. Jiang, “Towards omnimodal expressions and reasoning in referring audio-visual segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22575–22585.
  32. [32] C. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman, J. Malik, Y. Li, and C. Feichtenhofer, “Hiera: A hierarchical vision transformer without the bells-and-whistles,” International Conference on Machine Learning, 2023.
  33. [33] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  34. [34] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650.
  35. [35] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  36. [36] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
  37. [37] J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025.
  38. [38] J. Wu, Y. Jiang, P. Sun, Z. Yuan, and P. Luo, “Language as queries for referring video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4974–4984.