pith. machine review for the scientific record.

arxiv: 2605.09468 · v1 · submitted 2026-05-10 · 💻 cs.MM

Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition

Kai Gao, Peiwu Wang, Yifan Wang, Yunxian Chi, Zhinan Gou

Pith reviewed 2026-05-12 04:20 UTC · model grok-4.3

classification 💻 cs.MM
keywords multimodal intent recognition · dual-pathway reasoning · multimodal inconsistency · representation disentanglement · inconsistency perception mechanism · cognitive reasoning · intent recognition

The pith

A dual-pathway reasoning system resolves conflicts between text, video, and audio signals to recognize user intent more accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that multimodal intent recognition can be improved by explicitly separating the processing of consistent and inconsistent cues across modalities. It introduces a framework that builds a stable foundation from shared features while using a separate path to detect and adjust for conflicts. This matters because real-world multimodal data often contains contradictions, such as mismatched emotional tones, which current methods handle poorly by averaging or canceling signals. The approach uses feature disentanglement and dynamic weighting to maintain performance even under inconsistency.

Core claim

The Cognitive Dual-Pathway Reasoning framework constructs a stable semantic foundation via an intuition pathway that aggregates cross-modal consensus from shared features. In parallel, a reasoning pathway mitigates high-level semantic conflicts by quantifying inconsistency severity through semantic prototype matching and statistical probability calibration. The weights between the two pathways are adjusted dynamically, and a multi-view loss encourages the model to learn structured features.
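
As a concrete reading of the dynamic weight adjustment, the pathway fusion could look like the minimal sketch below. All names, the sigmoid gate, and the constants `k` and `c0` are our assumptions for illustration, not the paper's equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(h_intuition, h_reasoning, conflict, k=5.0, c0=0.3):
    # Gate grows with detected conflict severity: high conflict shifts
    # weight from the intuition pathway to the reasoning pathway.
    w = sigmoid(k * (conflict - c0))
    return (1.0 - w) * h_intuition + w * h_reasoning, w

h_int = np.array([0.9, 0.1])  # hypothetical intuition-pathway features
h_rea = np.array([0.2, 0.8])  # hypothetical reasoning-pathway features

low, w_low = fuse(h_int, h_rea, conflict=0.05)   # consistent sample
high, w_high = fuse(h_int, h_rea, conflict=0.9)  # conflicting sample
assert w_high > w_low  # more conflict -> more trust in the reasoning pathway
```

The convex combination keeps the fused representation inside the span of the two pathway outputs, so a miscalibrated gate degrades gracefully rather than producing out-of-range features.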

What carries the argument

The Cognitive Dual-Pathway Reasoning (CDPR) mechanism, consisting of an intuition pathway for consensus aggregation and a reasoning pathway for inconsistency perception and dynamic adjustment, built on top of representation disentanglement into modality-invariant and specific features.
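
A minimal sketch of the disentanglement step, assuming one shared encoder for modality-invariant features, per-modality encoders for specific features, and an orthogonality surrogate as the separation pressure. The paper's actual encoders and losses may differ; every name and dimension here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Stand-in for a learned encoder.
    return np.tanh(x @ W)

d_in, d_feat = 16, 8
W_shared = rng.normal(size=(d_in, d_feat))  # shared (modality-invariant) encoder
W_spec = {m: rng.normal(size=(d_in, d_feat)) for m in ("text", "video", "audio")}

x = {m: rng.normal(size=(d_in,)) for m in ("text", "video", "audio")}

invariant = {m: encode(x[m], W_shared) for m in x}   # fed to the intuition pathway
specific = {m: encode(x[m], W_spec[m]) for m in x}   # retains modality-private cues

# Orthogonality penalty pushing invariant and specific subspaces apart,
# a common disentanglement surrogate (e.g., in MISA-style models).
ortho = sum(float(np.dot(invariant[m], specific[m]) ** 2) for m in x)
```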

If this is right

  • CDPR achieves state-of-the-art performance on two multimodal intent recognition benchmarks.
  • The framework demonstrates superior robustness when handling cases of multimodal inconsistency.
  • The multi-view loss function helps prevent modality laziness by encouraging learning of structured features at different stages.
  • Dynamic adjustment of weights between pathways allows the model to prioritize the more reliable path based on detected conflict levels.
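
The last two bullets lean on the inconsistency perception mechanism. One plausible instantiation, assuming cosine prototype matching and Jensen–Shannon divergence as the conflict measure (both our guesses, not the paper's formulas):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def js_divergence(p, q, eps=1e-12):
    # Jensen–Shannon divergence: symmetric, bounded by ln 2.
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(1)
n_classes, d = 5, 8
prototypes = rng.normal(size=(n_classes, d))  # hypothetical per-intent semantic prototypes

def match(feat):
    # Prototype matching: cosine similarity to each intent prototype,
    # converted into a per-modality label distribution.
    sims = prototypes @ feat / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(feat) + 1e-12
    )
    return softmax(sims)

feats = {m: rng.normal(size=(d,)) for m in ("text", "video", "audio")}
dists = {m: match(f) for m, f in feats.items()}

# Conflict severity: mean pairwise divergence between modality distributions.
pairs = [("text", "video"), ("text", "audio"), ("video", "audio")]
conflict = np.mean([js_divergence(dists[a], dists[b]) for a, b in pairs])
```

A scalar like `conflict` is exactly the kind of signal a dynamic gate between pathways could consume.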

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of consensus and conflict processing could extend to other multimodal tasks such as emotion recognition or event detection where signals may conflict.
  • Explicit quantification of inconsistency might enable better explainability in multimodal models by highlighting which modality is causing issues.
  • Future systems might incorporate similar dual pathways to handle real-time streaming data with varying levels of noise or disagreement.

Load-bearing premise

That the disentanglement of modality-invariant and specific features accurately captures the necessary information without introducing distortions, and that the inconsistency perception mechanism correctly identifies and quantifies conflicts.

What would settle it

Testing the model on a modified benchmark in which high-conflict examples are artificially amplified, to see whether its performance falls below that of baseline methods that lack dual pathways.
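
Such an amplified-conflict benchmark could be built by blending non-text modalities toward a distractor sample with a different intent label while keeping the text fixed. A hypothetical perturbation sketch (the function name and the blending scheme are our invention):

```python
import numpy as np

rng = np.random.default_rng(2)

def amplify_conflict(sample, distractor, severity):
    # Keep the text signal; blend video/audio toward a sample with a
    # different label. severity=0 reproduces the original, 1 replaces it.
    out = {"text": sample["text"]}
    for m in ("video", "audio"):
        out[m] = (1.0 - severity) * sample[m] + severity * distractor[m]
    return out

sample = {m: rng.normal(size=(8,)) for m in ("text", "video", "audio")}
distractor = {m: rng.normal(size=(8,)) for m in ("text", "video", "audio")}
hard = amplify_conflict(sample, distractor, severity=0.8)
```

Sweeping `severity` from 0 to 1 would yield a conflict curve per method; a dual-pathway model should degrade more slowly than single-fusion baselines if the paper's claim holds.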

Figures

Figures reproduced from arXiv: 2605.09468 by Kai Gao, Peiwu Wang, Yifan Wang, Yunxian Chi, Zhinan Gou.

Figure 1: Comparison of CDPR and Existing Paradigms.
Figure 2: The overall architecture of CDPR. Our approach comprises three key steps: (1) Dual-Pathway Reasoning, which
Figure 3: Qualitative analysis of representative samples from the MIntRec and MIntRec2.0 datasets.
Figure 4: t-SNE visualization of feature distributions in the
Original abstract

Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for distinguishing consistent and inconsistent cues, and (2) ineffectively modeling multimodal conflicts, leading to semantic cancellation. To address these, we propose a novel Cognitive Dual-Pathway Reasoning (CDPR) framework, which constructs a stable semantic foundation via the intuition pathway and mitigates high-level semantic conflicts through the reasoning pathway, cooperatively establishing deep semantic relations. Specifically, we first employ a representation disentanglement strategy to extract modality-invariant and specific features. Subsequently, the intuition pathway aggregates cross-modal consensus using shared features for solid global representations. The reasoning pathway introduces an inconsistency perception mechanism, combining semantic prototype matching with statistical probability calibration to precisely quantify conflict severity, and dynamically adjusting the weights between both pathways. Furthermore, a multi-view loss function is adopted to alleviate modality laziness and learn structured features at different stages. Extensive experiments on two benchmarks show that CDPR achieves SOTA performance and superior robustness in mitigating multimodal inconsistency. The code is available at https://github.com/Hebust-NLP/CDPR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes the Cognitive Dual-Pathway Reasoning (CDPR) framework for Multimodal Intent Recognition (MIR). It addresses cross-modal interaction and conflict challenges via a representation disentanglement strategy that extracts modality-invariant and specific features, an intuition pathway that aggregates cross-modal consensus from shared features, a reasoning pathway that employs semantic prototype matching combined with statistical probability calibration to quantify inconsistency severity and dynamically modulate pathway weights, and a multi-view loss to mitigate modality laziness. Experiments on two benchmarks are reported to demonstrate SOTA performance and improved robustness against multimodal inconsistency, with code released publicly.

Significance. If the empirical claims hold, the work introduces a cognitively motivated dual-pathway architecture that explicitly separates consensus-building from conflict resolution, offering a principled way to prevent semantic cancellation in multimodal settings. This could meaningfully advance MIR systems in noisy or conflicting real-world scenarios. The public code release is a clear strength that supports reproducibility and extension by the community.

minor comments (3)
  1. [Abstract] The SOTA and robustness claims would be more compelling if the abstract included at least one or two key quantitative results (e.g., accuracy deltas over baselines) rather than leaving all numbers to the main text.
  2. [Method] The description of the inconsistency perception mechanism (prototype matching plus probability calibration) and the multi-view loss would benefit from explicit equations or pseudocode in the main body to allow readers to verify the claimed parameter-free or calibration properties.
  3. [Abstract] The two benchmarks are referenced but not named in the abstract; adding their standard names (the figures suggest MIntRec and MIntRec2.0) would improve immediate clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our Cognitive Dual-Pathway Reasoning (CDPR) framework and for recommending minor revision. We appreciate the recognition of the work's potential to advance multimodal intent recognition by separating consensus-building from conflict resolution, as well as the value placed on the public code release for reproducibility.

Circularity Check

0 steps flagged

No significant circularity; framework is descriptive and empirical

full rationale

The provided abstract and high-level description contain no equations, derivations, or first-principles claims that could reduce to their inputs by construction. The CDPR framework is introduced as a novel architecture with components (disentanglement, intuition/reasoning pathways, inconsistency perception, multi-view loss) whose correctness is asserted via empirical SOTA results on benchmarks rather than any self-referential mathematical reduction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear. The central claims remain externally falsifiable through experiments and do not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no mathematical details, derivations, or explicit assumptions; the framework appears to rest on standard deep-learning practices for feature extraction and fusion.

pith-pipeline@v0.9.0 · 5520 in / 944 out tokens · 32356 ms · 2026-05-12T04:20:29.437219+00:00 · methodology

