
arxiv: 2510.18346 · v2 · submitted 2025-10-21 · 💻 cs.CV

AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

Pith reviewed 2026-05-18 05:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords audio-visual question answering · AVQA · dynamic sampling · modality preference · contrastive learning · multimodal perception · temporal focus · cross-modal reasoning

The pith

AV-Master uses dynamic adaptive sampling and modality preferences to better answer questions about complex audio-visual scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing audio-visual question answering models often fail to adaptively select relevant time segments or modalities when scenes contain lots of redundant information. The paper proposes AV-Master to dynamically focus on the most question-relevant parts across time and across visual versus audio channels. It does this through a new sampling method that narrows in on key segments and a strategy that weighs each modality's usefulness separately. A dual-path contrastive loss then helps the model learn consistent and complementary representations. If this works, it should lead to stronger results on real-world AVQA tasks that require reasoning rather than simple pattern matching.

Core claim

The central claim is that modeling both temporal and modality dimensions dynamically allows AV-Master to extract key information from redundant audio-visual scenes more effectively than prior approaches. The temporal path uses dynamic adaptive focus sampling to progressively select relevant segments. The modality path uses a preference-aware strategy to activate critical features selectively. These are tied together by a dual-path contrastive loss that promotes question-specific cross-modal collaboration. This results in better performance on four large-scale benchmarks, particularly for complex reasoning questions.
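To make the temporal path concrete, here is a minimal sketch of what progressive, question-guided narrowing could look like. It is not the paper's mechanism, which the abstract does not specify. The function name `progressive_focus`, the dot-product relevance score, the fixed `keep_ratio` schedule, and the query-refinement step are all illustrative assumptions.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def progressive_focus(
    segments: List[Sequence[float]],
    question: Sequence[float],
    rounds: int = 2,
    keep_ratio: float = 0.5,
) -> List[int]:
    """Hypothetical sketch: keep the most question-relevant segments over several rounds."""
    if not segments:
        return []
    kept = list(range(len(segments)))
    query = list(question)
    for _ in range(rounds):
        k = max(1, int(len(kept) * keep_ratio))
        # Rank the surviving segments by similarity to the current query.
        ranked = sorted(kept, key=lambda i: dot(segments[i], query), reverse=True)
        kept = sorted(ranked[:k])  # restore temporal order
        # Assumed refinement: pull the query halfway toward the mean kept feature.
        mean = [sum(segments[i][d] for i in kept) / len(kept) for d in range(len(query))]
        query = [0.5 * q + 0.5 * m for q, m in zip(query, mean)]
    return kept


# Toy per-segment features (2-D) and a question embedding, both made up.
segments = [
    [1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.8],
    [0.2, 0.2], [0.8, 0.0], [0.1, 0.9], [0.0, 0.1],
]
focus = progressive_focus(segments, [1.0, 0.0])
print(focus)
```

Unlike fixed uniform sampling, the selection here depends on the question, and each round works only on the segments that survived the previous one.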

What carries the argument

A dual-path framework consisting of dynamic adaptive focus sampling for temporal selection and a preference-aware strategy for modeling each modality's contribution, reinforced by a dual-path contrastive loss.
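One way to read "models each modality's contribution independently" is a separate gate per modality rather than a softmax that forces the modalities to compete. The sketch below illustrates that reading only. The names `modality_gates` and `fuse`, the sigmoid gate, and the dot-product score are assumptions, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def modality_gates(
    question: Sequence[float], audio: Sequence[float], visual: Sequence[float]
) -> Dict[str, float]:
    """Hypothetical: score each modality against the question on its own."""
    return {
        "audio": sigmoid(dot(question, audio)),
        "visual": sigmoid(dot(question, visual)),
    }


def fuse(
    question: Sequence[float], audio: Sequence[float], visual: Sequence[float]
) -> List[float]:
    """Weight each pooled modality feature by its own gate, then sum."""
    gates = modality_gates(question, audio, visual)
    return [gates["audio"] * a + gates["visual"] * v for a, v in zip(audio, visual)]


# Toy pooled features: both modalities agree with the question here.
both_relevant = modality_gates([1.0, 0.0], [2.0, 0.0], [2.0, 0.0])
# Here only the audio feature agrees with the question.
audio_only = modality_gates([1.0, 0.0], [2.0, 0.0], [-2.0, 0.0])
```

Because the gates are computed independently, both can be high at once (or both low), which a softmax over modalities would not allow.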

If this is right

  • Traditional fixed sampling methods are replaced by progressive focus on question-relevant segments to reduce redundancy.
  • Independent modeling of modality contributions enables selective feature activation for better accuracy.
  • The dual-path contrastive loss ensures consistency and complementarity in cross-modal representations.
  • Overall reasoning capability improves especially on complex questions about audio-visual scenes.
  • The paper reports gains over prior methods across four large-scale benchmarks.
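The contrastive term mentioned in the list above can be illustrated with a standard InfoNCE loss. This is a generic stand-in, not the paper's dual-path loss: the abstract does not state which pairs are contrasted, so summing one term per path in `dual_path_loss`, and every name below, is an assumption.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Standard InfoNCE: anchors[i] should match positives[i] and no other row."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def dual_path_loss(temporal, modality, question, temperature: float = 0.1) -> float:
    """Assumed combination: one InfoNCE term per path, each against the question."""
    return info_nce(temporal, question, temperature) + info_nce(modality, question, temperature)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each representation is closest to its own question and large when the pairing is scrambled, which is the behavior a consistency objective relies on.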

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This sampling and preference approach might generalize to other multimodal tasks that involve noisy or redundant inputs.
  • Efficient implementation of the focus mechanism could support deployment in resource-constrained environments.
  • Combining this perception module with larger reasoning models could further enhance performance on open-ended questions.

Load-bearing premise

The dynamic adaptive focus sampling and preference-aware modality strategy can be trained to reliably identify and prioritize question-relevant information without introducing selection biases or requiring dataset-specific tuning.

What would settle it

Running AV-Master on a new audio-visual benchmark with high redundancy and complex reasoning questions and finding no significant improvement over existing methods would falsify the central claim.
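Whether an improvement counts as "significant" on a new benchmark can be checked with a paired bootstrap over per-question correctness. The sketch below is one common way to do that, not a procedure from the paper. The function name, the 95% percentile interval, and the toy data are assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    correct_new: List[int],
    correct_base: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy gain, 95% CI low, 95% CI high) from per-question 0/1 results."""
    if len(correct_new) != len(correct_base) or not correct_new:
        raise ValueError("need two non-empty result lists of equal length")
    diffs = [a - b for a, b in zip(correct_new, correct_base)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    # Resample questions with replacement, keeping each pair of results together.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low = means[int(0.025 * n_resamples)]
    high = means[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, low, high


# Made-up results on 100 questions: 80 correct versus 60 correct.
new_results = [1] * 80 + [0] * 20
base_results = [1] * 60 + [0] * 40
gain, low, high = paired_bootstrap(new_results, base_results)
```

If the interval contains zero on a high-redundancy, reasoning-heavy benchmark, that would be the "no significant improvement" outcome described above.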

Figures

Figures reproduced from arXiv: 2510.18346 by Jiayu Zhang, Qilang Ye, Shuo Ye, Xun Lin, Zihan Song, Zitong Yu.

Figure 1: Illustration of the AVQA task and the comparison of our method
Figure 2: Overview of AV-Master. We utilize three separate pre-trained encoders to extract features from video, audio, and question inputs. The encoded features
Figure 3: The pipeline of (a) audio-visual focus capture and (c) audio-visual key fusion in the temporal dynamic perception path, where (b) represents the specific implementation process of focus sampling in (a) audio-visual focus capture. SAB and CAB represent the self-attention block and the cross-attention block, respectively. represent the input predefined CLS tokens (serve as audio-visual templates), represent …
Figure 5: The ablation study on input modalities and comparison with other
Figure 6: Visualization of the audio-visual focus capturing process, including
Figure 8: The attention visualization for video-question (upper) and audio
Figure 9: Qualitative demonstration of our proposed AV-Master and comparison with MLLM (VideoLLaMa2-7B and Qwen3-Max) and AVQA expert model (QA
Original abstract

Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes AV-Master, a framework for Audio-Visual Question Answering that introduces dynamic adaptive focus sampling to progressively select question-relevant audio-visual segments, a preference-aware modality strategy that models each modality's contribution independently, and a dual-path contrastive loss to enforce consistency and complementarity. Experiments on four large-scale benchmarks are reported to show significant outperformance over prior methods, especially on complex reasoning tasks.

Significance. If the reported gains hold under rigorous validation, the dual-path design could meaningfully advance AVQA by addressing temporal redundancy and modality imbalance in a question-adaptive manner, offering a more flexible alternative to fixed sampling strategies. The emphasis on end-to-end training of the focus policy is a potential strength if accompanied by controls for bias.

major comments (1)
  1. The central claim that AV-Master yields genuine cross-modal reasoning improvements (rather than artifacts of training-question correlation) depends on the dynamic adaptive focus sampling not collapsing to narrow temporal windows. No section demonstrates stability of the learned policy under shifts in question phrasing, scene complexity, or cross-dataset transfer (e.g., via adversarial rephrasing or held-out distributions), as required to rule out selection bias.
minor comments (2)
  1. The abstract states performance improvements but provides no quantitative results, error bars, or ablation details; these should be summarized with key numbers and dataset names for immediate clarity.
  2. Notation for the dual-path contrastive loss and the preference-aware weighting should be defined explicitly with equations in the method section to avoid ambiguity in how modality contributions are computed independently.
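The stability concern in the major comment could be checked with two simple diagnostics: how much the selected segments overlap when a question is paraphrased, and how much of the timeline the selection spans. The sketch below is an assumed evaluation aid, not something the paper reports. The function names and the toy indices are hypothetical.

```python
from typing import Iterable


def jaccard(selected_a: Iterable[int], selected_b: Iterable[int]) -> float:
    """Overlap between segment sets chosen for a question and for its paraphrase."""
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def temporal_spread(selected: Iterable[int], num_segments: int) -> float:
    """Fraction of the timeline between the first and last selected segment."""
    chosen = list(selected)
    if not chosen or num_segments <= 0:
        return 0.0
    return (max(chosen) - min(chosen) + 1) / num_segments


# Made-up selections for an original question and a rephrased version of it.
original = [0, 2, 5]
paraphrased = [0, 2, 6]
overlap = jaccard(original, paraphrased)
spread = temporal_spread([3, 4], 10)
```

Low overlap under paraphrase would point to sensitivity to question phrasing, and a consistently small spread would point to the collapse to narrow windows that the referee raises.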

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the stability of the dynamic adaptive focus sampling below.

Point-by-point responses
  1. Referee: The central claim that AV-Master yields genuine cross-modal reasoning improvements (rather than artifacts of training-question correlation) depends on the dynamic adaptive focus sampling not collapsing to narrow temporal windows. No section demonstrates stability of the learned policy under shifts in question phrasing, scene complexity, or cross-dataset transfer (e.g., via adversarial rephrasing or held-out distributions), as required to rule out selection bias.

    Authors: We appreciate the referee highlighting the need to verify that our dynamic adaptive focus sampling does not collapse to narrow temporal windows, which is essential for claiming genuine improvements in cross-modal reasoning. In the manuscript, we present qualitative visualizations of the sampled segments for various questions, showing that the policy selects multiple relevant segments rather than fixed narrow windows, and quantitative ablations demonstrate that removing the adaptive sampling degrades performance on complex tasks. Additionally, the dual-path contrastive loss is designed to promote both consistency and complementarity, discouraging trivial solutions. Nevertheless, we acknowledge that explicit evaluations under adversarial question rephrasing, shifts in scene complexity, or cross-dataset policy transfer are not included. We will incorporate such robustness analyses, including tests on paraphrased questions and held-out distributions, in the revised manuscript to strengthen this aspect.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; new architectural components trained end-to-end on external benchmarks

full rationale

The paper introduces AV-Master as a novel framework with explicitly new mechanisms: dynamic adaptive focus sampling for temporal focus, a preference-aware modality strategy, and a dual-path contrastive loss. These are presented as architectural innovations to mitigate redundancy and improve cross-modal reasoning, trained jointly on four large-scale external benchmarks. Reported gains are empirical performance improvements rather than quantities derived by construction from fitted parameters or prior self-citations. No equations or steps reduce the central claims to tautological inputs; the derivation chain relies on proposed components and standard training, remaining self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted or audited from the provided text.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    R²ScP recovers missing audio-visual data in question answering by retrieving semantically consistent examples and purifying noise, outperforming generative imputation in incomplete scenarios.
