Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition
Pith reviewed 2026-05-12 04:20 UTC · model grok-4.3
The pith
A dual-pathway reasoning system resolves conflicts between text, video, and audio signals to recognize user intent more accurately.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Cognitive Dual-Pathway Reasoning framework builds a stable semantic foundation through an intuition pathway that aggregates cross-modal consensus from shared features, while a reasoning pathway mitigates high-level semantic conflicts by quantifying inconsistency severity via semantic prototype matching and statistical probability calibration. The weights between the two pathways are adjusted dynamically, and a multi-view loss encourages the model to learn structured features.
What carries the argument
The Cognitive Dual-Pathway Reasoning (CDPR) mechanism: an intuition pathway for consensus aggregation and a reasoning pathway for inconsistency perception and dynamic adjustment, built on top of a representation disentanglement step that separates modality-invariant from modality-specific features.
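A minimal sketch of what that disentanglement step could look like, assuming the common shared-encoder/private-encoder recipe from the multimodal representation literature; every module name, width, and the orthogonality penalty below are assumptions, not the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class Disentangle(nn.Module):
    """Split each modality into a shared (invariant) and a private (specific) view."""

    def __init__(self, dims={"text": 768, "video": 1024, "audio": 768}, d=256):
        super().__init__()
        # Per-modality projections into a common width, then one encoder
        # shared across modalities for the invariant subspace.
        self.proj = nn.ModuleDict({m: nn.Linear(k, d) for m, k in dims.items()})
        self.shared = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.private = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(k, d), nn.ReLU()) for m, k in dims.items()}
        )

    def forward(self, feats):
        # feats: {"text": (B, 768), "video": (B, 1024), "audio": (B, 768)}
        inv = {m: self.shared(self.proj[m](x)) for m, x in feats.items()}
        spec = {m: self.private[m](x) for m, x in feats.items()}
        # Orthogonality penalty keeps the two views from encoding the same thing.
        ortho = sum(
            (F.normalize(inv[m], dim=-1) * F.normalize(spec[m], dim=-1))
            .sum(-1).pow(2).mean()
            for m in feats
        )
        return inv, spec, ortho
```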
If this is right
- CDPR achieves state-of-the-art performance on two multimodal intent recognition benchmarks.
- The framework demonstrates superior robustness when handling cases of multimodal inconsistency.
- The multi-view loss function helps prevent modality laziness by encouraging learning of structured features at different stages.
- Dynamic adjustment of the weights between pathways lets the model favor the more reliable path given the detected conflict level (a minimal gating sketch follows this list).
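How that dynamic weighting might look in code, assuming a per-sample conflict score in [0, 1]; the function and its signature are hypothetical, not the paper's formulation.

```python
import torch

def combine_pathways(intuition_logits: torch.Tensor,
                     reasoning_logits: torch.Tensor,
                     conflict: torch.Tensor) -> torch.Tensor:
    """Blend the pathways per sample: low conflict trusts the consensus
    (intuition) prediction, high conflict leans on the reasoning pathway."""
    w = conflict.clamp(0.0, 1.0).unsqueeze(-1)  # (B, 1)
    return (1.0 - w) * intuition_logits + w * reasoning_logits
```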
Where Pith is reading between the lines
- The separation of consensus and conflict processing could extend to other multimodal tasks such as emotion recognition or event detection where signals may conflict.
- Explicit quantification of inconsistency might enable better explainability in multimodal models by highlighting which modality is causing issues.
- Future systems might incorporate similar dual pathways to handle real-time streaming data with varying levels of noise or disagreement.
Load-bearing premise
That the disentanglement of modality-invariant and specific features accurately captures the necessary information without introducing distortions, and that the inconsistency perception mechanism correctly identifies and quantifies conflicts.
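One plausible reading of "semantic prototype matching with statistical probability calibration", with every detail assumed rather than taken from the paper: match each modality's feature against learned class prototypes, then score disagreement between the resulting per-modality class distributions with a bounded divergence.

```python
from itertools import combinations
import torch
import torch.nn.functional as F

def class_probs(feat, prototypes, tau=0.1):
    # feat: (B, d); prototypes: (C, d) learned class anchors
    sim = F.normalize(feat, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return F.softmax(sim / tau, dim=-1)            # (B, C)

def js_divergence(p, q, eps=1e-8):
    # Jensen-Shannon divergence, symmetric and bounded by log 2
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(-1)
    return 0.5 * (kl(p, m) + kl(q, m))

def conflict_score(probs):                          # probs: {modality: (B, C)}
    dists = list(probs.values())
    d = torch.stack([js_divergence(p, q) for p, q in combinations(dists, 2)])
    return d.mean(0) / torch.log(torch.tensor(2.0))  # rescaled to [0, 1]
```

A score near 0 means the modalities tell the same story; a score near 1 means at least two of them confidently disagree, which is exactly the regime where the reasoning pathway should take over.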
What would settle it
Testing the model on a modified benchmark in which high-conflict examples are artificially amplified, to see whether its performance falls below that of baseline methods without dual pathways.
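A concrete way to build such a stress set, sketched under the assumption that each example carries text/video/audio fields and an intent label; the function and schema are illustrative, and no such split ships with the benchmarks.

```python
import random

def conflict_amplified(dataset, seed=0):
    """Re-pair each example's video/audio with a donor from a different
    intent class, so the modalities actively disagree."""
    rng = random.Random(seed)
    by_label = {}
    for ex in dataset:
        by_label.setdefault(ex["label"], []).append(ex)
    labels = list(by_label)
    out = []
    for ex in dataset:
        donor_label = rng.choice([l for l in labels if l != ex["label"]])
        donor = rng.choice(by_label[donor_label])
        out.append({**ex, "video": donor["video"], "audio": donor["audio"]})
    return out
```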
Original abstract
Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for distinguishing consistent and inconsistent cues, and (2) ineffectively modeling multimodal conflicts, leading to semantic cancellation. To address these, we propose a novel Cognitive Dual-Pathway Reasoning (CDPR) framework, which constructs a stable semantic foundation via the intuition pathway and mitigates high-level semantic conflicts through the reasoning pathway, cooperatively establishing deep semantic relations. Specifically, we first employ a representation disentanglement strategy to extract modality-invariant and specific features. Subsequently, the intuition pathway aggregates cross-modal consensus using shared features for solid global representations. The reasoning pathway introduces an inconsistency perception mechanism, combining semantic prototype matching with statistical probability calibration to precisely quantify conflict severity, and dynamically adjusting the weights between both pathways. Furthermore, a multi-view loss function is adopted to alleviate modality laziness and learn structured features at different stages. Extensive experiments on two benchmarks show that CDPR achieves SOTA performance and superior robustness in mitigating multimodal inconsistency. The code is available at https://github.com/Hebust-NLP/CDPR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Cognitive Dual-Pathway Reasoning (CDPR) framework for Multimodal Intent Recognition (MIR). It addresses cross-modal interaction and conflict challenges via a representation disentanglement strategy that extracts modality-invariant and specific features, an intuition pathway that aggregates cross-modal consensus from shared features, a reasoning pathway that employs semantic prototype matching combined with statistical probability calibration to quantify inconsistency severity and dynamically modulate pathway weights, and a multi-view loss to mitigate modality laziness. Experiments on two benchmarks are reported to demonstrate SOTA performance and improved robustness against multimodal inconsistency, with code released publicly.
Significance. If the empirical claims hold, the work introduces a cognitively motivated dual-pathway architecture that explicitly separates consensus-building from conflict resolution, offering a principled way to prevent semantic cancellation in multimodal settings. This could meaningfully advance MIR systems in noisy or conflicting real-world scenarios. The public code release is a clear strength that supports reproducibility and extension by the community.
Minor comments (3)
- [Abstract] The SOTA and robustness claims would be more compelling if the abstract included at least one or two key quantitative results (e.g., accuracy deltas over baselines) rather than leaving all numbers to the main text.
- [Method] The description of the inconsistency perception mechanism (prototype matching plus probability calibration) and of the multi-view loss would benefit from explicit equations or pseudocode in the main body, so readers can verify the claimed parameter-free and calibration properties; a hedged sketch of one possible multi-view loss follows these comments.
- [Abstract] The two benchmarks are referenced but not named in the abstract; adding their standard names (presumably MIntRec and MIntRec2.0, the usual MIR benchmarks, if those are the ones used) would improve immediate clarity.
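To make the second comment concrete, here is one hedged guess at the shape such pseudocode could take; the paper's actual multi-view loss, its staging, and the weight `alpha` may differ substantially.

```python
import torch.nn.functional as F

def multi_view_loss(fused_logits, per_modality_logits, labels, alpha=0.3):
    """Supervise the fused prediction and each modality's own prediction,
    so no single modality can free-ride ("modality laziness")."""
    loss = F.cross_entropy(fused_logits, labels)
    for logits in per_modality_logits.values():
        loss = loss + alpha * F.cross_entropy(logits, labels)
    return loss
```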
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our Cognitive Dual-Pathway Reasoning (CDPR) framework and for recommending minor revision. We appreciate the recognition of the work's potential to advance multimodal intent recognition by separating consensus-building from conflict resolution, as well as the value placed on the public code release for reproducibility.
Circularity Check
No significant circularity; the framework is descriptive and empirical.
Full rationale
The provided abstract and high-level description contain no equations, derivations, or first-principles claims that could reduce to their inputs by construction. The CDPR framework is introduced as a novel architecture with components (disentanglement, intuition/reasoning pathways, inconsistency perception, multi-view loss) whose correctness is asserted via empirical SOTA results on benchmarks rather than any self-referential mathematical reduction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear. The central claims remain externally falsifiable through experiments and do not collapse into tautology.