DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning
Pith reviewed 2026-06-30 05:58 UTC · model grok-4.3
The pith
Dynamic agent scheduling with compressed communication outperforms static multimodal fusion models across five benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DAIN reconceptualizes multimodal fusion as a dynamic, multi-agent collaborative process. It employs a context-aware Meta-Controller that dynamically schedules sparse activation of specialized interaction agents and orchestrates compressed inter-agent communication for consensus-building. The framework is guided by a multi-objective loss function that jointly optimizes task accuracy, agent specialization, and operational efficiency through sparse activation and communication regularization.
What carries the argument
The context-aware Meta-Controller, which dynamically schedules sparse activation of specialized interaction agents and orchestrates their compressed communication.
If this is right
- Performance improvements including a 2.6 percent accuracy gain on ADNI and gains on MIMIC-IV, MM-IMDB, CMU-MOSI, and ENRICO.
- Enhanced interpretability through exposure of context-dependent agent roles and collaboration patterns.
- Computational efficiency maintained by sample-wise sparse agent activation.
- Ablation studies confirm the necessity of both dynamic scheduling and agent communication for the observed results.
Where Pith is reading between the lines
- The same meta-controller approach could be tested on additional multimodal combinations such as video and text to check if the scheduling mechanism generalizes.
- Compressed inter-agent communication may lower latency in deployed systems where multiple modalities arrive in real time.
- If agent specialization emerges reliably, the framework could be extended to automatically discover new agent types for previously unseen data patterns.
Load-bearing premise
The reported accuracy gains arise specifically from the dynamic scheduling and agent communication mechanisms rather than from other unstated implementation details or dataset factors.
What would settle it
An ablation that freezes agent activation to a static pattern while keeping all other components identical and measures whether the 2.6 percent ADNI gain and other benchmark improvements disappear.
Figures
read the original abstract
Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world applications. We introduce the Dynamic Agent-based Interaction Network (DAIN), which reconceptualizes multimodal fusion as a dynamic, multi-agent collaborative process. DAIN employs a context-aware Meta-Controller that dynamically schedules sparse activation of specialized interaction agents and orchestrates compressed inter-agent communication for consensus-building. The framework is guided by a multi-objective loss function that jointly optimizes task accuracy, agent specialization, and operational efficiency through sparse activation and communication regularization. Comprehensive evaluations across five diverse benchmarks -- ADNI, MIMIC-IV, MM-IMDB, CMU-MOSI, and ENRICO -- establish DAIN as a new state-of-the-art, delivering significant performance improvements including a 2.6\% accuracy gain on ADNI. Ablation studies verify the critical roles of both dynamic scheduling and agent communication. Furthermore, DAIN offers enhanced interpretability by exposing context-dependent agent roles and collaboration patterns while maintaining computational efficiency through sample-wise sparse agent activation. Our work demonstrates the promise of dynamic, agent-based paradigms for multimodal reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DAIN, a Dynamic Agent-based Interaction Network that reconceptualizes multimodal fusion as a dynamic multi-agent process. A context-aware Meta-Controller dynamically schedules sparse activation of specialized interaction agents and orchestrates compressed inter-agent communication for consensus. A multi-objective loss jointly optimizes task accuracy, agent specialization, and efficiency via sparsity and communication regularization. The work claims new state-of-the-art results across five benchmarks (ADNI, MIMIC-IV, MM-IMDB, CMU-MOSI, ENRICO), including a 2.6% accuracy gain on ADNI, with ablation studies confirming the roles of dynamic scheduling and agent communication, plus gains in interpretability and sample-wise efficiency.
Significance. If the reported gains can be shown to arise specifically from the context-aware dynamic scheduling and compressed communication rather than from changes in effective compute or regularization, the framework would offer a substantive alternative to static MoE architectures, with potential advantages in adaptability, efficiency, and interpretability for complex multimodal reasoning.
major comments (2)
- [Ablation Studies] Ablation Studies section: the reported ablations that disable the Meta-Controller or remove the communication term necessarily alter average agent activation count and total message volume. No control is described that holds per-sample agent count, communication budget, and loss weights fixed while replacing only the scheduling policy (context-aware vs. static/random), leaving the attribution of the 2.6% ADNI gain to the dynamic mechanism unsecured.
- [Experimental Results] Experimental Results section (benchmark tables): the manuscript provides no error bars, standard deviations across runs, or statistical significance tests for the claimed improvements (including the 2.6% ADNI gain). Without these, it is impossible to assess whether the SOTA claims are robust or sensitive to random seeds and hyperparameter choices.
minor comments (2)
- [Introduction] The abstract and introduction use the term 'parameter-free' in reference to certain ratios; if this is intended, the relevant equations should be checked for hidden dependencies on fitted hyperparameters.
- [Method] Notation for the multi-objective loss components (accuracy, specialization, efficiency terms) should be introduced with explicit equation numbers rather than descriptive text only.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments below and will incorporate revisions to strengthen the attribution of results and the robustness of experimental reporting.
read point-by-point responses
-
Referee: [Ablation Studies] Ablation Studies section: the reported ablations that disable the Meta-Controller or remove the communication term necessarily alter average agent activation count and total message volume. No control is described that holds per-sample agent count, communication budget, and loss weights fixed while replacing only the scheduling policy (context-aware vs. static/random), leaving the attribution of the 2.6% ADNI gain to the dynamic mechanism unsecured.
Authors: We agree this is a valid concern and that the existing ablations do not fully isolate the contribution of the context-aware scheduling policy. In the revised manuscript we will add a new controlled ablation experiment that fixes per-sample agent activation count, total message volume, and loss weights while varying only the scheduling policy (context-aware Meta-Controller versus static or random baselines). This will provide a clearer attribution of the reported gains to the dynamic mechanism. revision: yes
-
Referee: [Experimental Results] Experimental Results section (benchmark tables): the manuscript provides no error bars, standard deviations across runs, or statistical significance tests for the claimed improvements (including the 2.6% ADNI gain). Without these, it is impossible to assess whether the SOTA claims are robust or sensitive to random seeds and hyperparameter choices.
Authors: We acknowledge that the absence of variability measures and significance testing weakens the strength of the SOTA claims. In the revision we will rerun all benchmark experiments across multiple random seeds (minimum of five runs), report means with standard deviations in the tables, and include paired statistical significance tests against the strongest baselines to substantiate the reported improvements. revision: yes
Circularity Check
No circularity detected; claims rest on empirical benchmarks without self-referential derivations
full rationale
The provided abstract and description contain no equations, derivations, or mathematical claims that reduce to fitted parameters or self-citations. Performance gains (e.g., 2.6% on ADNI) are asserted via external benchmark evaluations and ablations, with no load-bearing steps that equate outputs to inputs by construction. The multi-objective loss and Meta-Controller are described at a high level without any reduction to prior self-defined quantities. This is the common case of an empirical architecture paper whose central assertions are falsifiable against held-out data rather than internally forced.
Axiom & Free-Parameter Ledger
invented entities (2)
-
Meta-Controller
no independent evidence
-
specialized interaction agents
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Alzheimer’s disease neuroimaging initiative (adni)
ADNI Consortium. Alzheimer’s disease neuroimaging initiative (adni). https://adni.loni. usc.edu/, 2004
2004
-
[2]
González
John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A. González. Gated multi- modal units for information fusion. InInternational Conference on Learning Representations Workshop, 2017
2017
-
[3]
Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers
Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 397–406, 2021. 11 Table 6: Sensitivity analysis of key hyperparameters on ADNI validation set. Default values (bold) are used in all experi...
2021
-
[4]
Transformer interpretability beyond attention visualization
Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 782–791, 2021
2021
-
[5]
Mvi-bench: A comprehensive benchmark for evaluating robustness to misleading visual inputs in lvlms
Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, and Lu Cheng. Mvi-bench: A comprehensive benchmark for evaluating robustness to misleading visual inputs in lvlms. InProceedings of the 43rd International Conference on Machine Learning (ICML 2026), 2025
2026
-
[6]
R2i-bench: Benchmarking reasoning-driven text-to-image generation
Kaijie Chen, Zihao Lin, Zhiyang Xu, Ying Shen, Yuguang Yao, Joy Rimchala, Jiaxin Zhang, and Lifu Huang. R2i-bench: Benchmarking reasoning-driven text-to-image generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12606–12641, 2025
2025
-
[7]
Ke Chen, Lei Xu, and H. Chi. Improved learning algorithms for mixture of experts in multiclass classification.Neural Networks, 12(9):1229–1252, 1999
1999
-
[8]
Shapleyvic: Shapley values for multimodal importance scoring
Giulia Dominici, Huy Dang, Lara Silini, Karsten Roth, Fabio Chatelain, and Louis-Philippe Morency. Shapleyvic: Shapley values for multimodal importance scoring. InProceedings of the AAAI Conference on Artificial Intelligence, 2023. 12 ADNI: CN ADNI: MCI ADNI: Dementia IMDB: Animation IMDB: Biography IMDB: Thriller 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7Average Ac...
2023
-
[9]
Dsadf: Thinking fast and slow for decision making.International Journal of Computer Vision, 134(6):270, 2026
Zhihao Dou, Dongfei Cui, Jun Yan, Weida Wang, Benteng Chen, Haoming Wang, Zeke Xie, and Shufei Zhang. Dsadf: Thinking fast and slow for decision making.International Journal of Computer Vision, 134(6):270, 2026
2026
-
[10]
CoRe-Code: Collaborative Reinforcement Learning for Code Generation
Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Xiaoyu Xia, and Sumon Biswas. Core-code: Collaborative reinforcement learning for code generation.arXiv preprint arXiv:2605.24812, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Zhiqiang Gao, et al. Plan then action: High-level planning guidance reinforcement learning for llm reasoning.arXiv preprint arXiv:2510.01833, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Multimodal contrastive learning for brain imaging.Medical Image Analysis, 2024
Benoit Dufumier, Pietro Gori, Danielle Battaglia, Antoine Grigis, and Edouard Duchesnay. Multimodal contrastive learning for brain imaging.Medical Image Analysis, 2024
2024
-
[13]
Multimodal fusion for explainable ai: A survey
Xiang Fan, Paul Pu Liang, and Louis-Philippe Morency. Multimodal fusion for explainable ai: A survey. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[14]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022
2022
-
[15]
Xiang Fei, Jinghui Lu, Qi Sun, Hao Feng, Yanjie Wang, Wei Shi, An-Lan Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models.arXiv preprint arXiv:2505.13077, 2025. 13 10 20 30 40 50 Epoch 0.0 0.5 1.0 1.5 2.0 2.5Validation Loss Training Convergence Comparison DAIN (Full) DAIN-Static Static MoE Figure 8: Traini...
-
[16]
Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding.Science China Information Sciences, 67(12):1–14, 2024
Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wengang Zhou, Houqiang Li, and Can Huang. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding.Science China Information Sciences, 67(12):1–14, 2024
2024
-
[17]
Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding.arXiv preprint arXiv:2308.11592, 2023
-
[18]
Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting.arXiv preprint arXiv:2505.14059, 2025
-
[19]
Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxin Huang, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, and Qingming Huang. Exploring hallucination of large multimodal models in video understanding: Benchmark, analysis and mitigation.arXiv preprint arXiv:2503.19622, 2025
-
[21]
Exploiting modality-specific features for multi-modal manipulation detection and grounding
Pallabi Ghosh, Shreya Nag, Aman Bardhan, and Saumik Deb. Exploiting modality-specific features for multi-modal manipulation detection and grounding. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024
2024
-
[22]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
GUI Agents for Continual Game Generation
Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, and Hongcheng Guo. Gui agents for continual game generation. arXiv preprint arXiv:2605.28258, 2026. 14
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
Improving deep learning interpretability by saliency guided training.Advances in Neural Information Processing Systems, 34, 2021
Aya Abdelsalam Ismail, Mohamed Gunady, Héctor Corrada Bravo, and Soheil Feizi. Improving deep learning interpretability by saliency guided training.Advances in Neural Information Processing Systems, 34, 2021
2021
-
[25]
Jacobs, Michael I
Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts.Neural Computation, 3(1):79–87, 1991
1991
-
[26]
Multimodal representation learning and fusion
Qihang Jin, Enze Ge, Yuhang Xie, Hongying Luo, Junhao Song, Ziqian Bi, Chia Xin Liang, Jibin Guan, Joe Yeong, and Junfeng Hao. Multimodal representation learning and fusion. arXiv:2506.20494, 2025
-
[27]
Alistair E. W. Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J. Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, Li-wei H. Lehman, Leo A. Celi, and Roger G. Mark. Mimic-iv, a freely accessible electronic health record dataset.Scientific Data, 10:1, 2023
2023
-
[28]
Jordan and Robert A
Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural Computation, 6(2):181–214, 1994
1994
-
[29]
Missing modality imagination network for emotion recognition with uncertain missing modalities
Jinwoo Kim, Hyungjoon Kim, Geonhui Lee, and Jongwoo Lee. Missing modality imagination network for emotion recognition with uncertain missing modalities. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023
2023
-
[30]
Leiva, Asutosh Hota, and Antti Oulasvirta
Luis A. Leiva, Asutosh Hota, and Antti Oulasvirta. Enrico: A dataset for topic modeling of mobile ui designs. In22nd International Conference on Human-Computer Interaction with Mobile Devices and Services, 2020
2020
-
[31]
Guohui Li, Ling Bai, Heng Zhang, Qiang Xu, Yuanze Zhou, Yuan Gao, Mujiangshan Wang, and Zihao Li. Velocity anomalies around the mantle transition zone beneath the qiangtang terrane, central tibetan plateau from triplicated p waveforms.Earth and Space Science, 9(2):e2021EA002060, 2022
2022
-
[32]
Treble counterfactual vlms: A causal approach to hallucination
Li Li, Jiashu Qu, Linxin Song, Yuxiao Zhou, Yuehan Qin, Tiankai Yang, and Yue Zhao. Treble counterfactual vlms: A causal approach to hallucination. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 18423–18434, 2025
2025
-
[33]
Chia Xin Liang, Pu Tian, Caitlyn Heqi Yin, Yao Yua, Wei An-Hou, Li Ming, Tianyang Wang, Ziqian Bi, and Ming Liu. A comprehensive survey and guide to multimodal large language models in vision-language tasks.arXiv:2411.06284, 2024
-
[34]
Quantifying & modeling multimodal interactions: An information decomposition framework
Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, Ruslan Salakhutdinov, and Louis-Philippe Morency. Quantifying & modeling multimodal interactions: An information decomposition framework. InAdvances in Neural Information Processing Systems, volume 36, 2023
2023
-
[35]
Lee, Yuke Zhu, Ruslan Salakhutdinov, and Louis-Philippe Morency
Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Multi- bench: Multiscale benchmarks for multimodal representation learning. InAdvances in Neural Information Processing Systems, volume 34, 2021
2021
-
[36]
Moe-llava: Mixture of experts for large vision-language models
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, 2024
2024
-
[37]
Spts v2: single-point scene text spotting
Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chunhua Shen, Xiang Bai, et al. Spts v2: single-point scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023
2023
-
[38]
Efficient low-rank multimodal fusion with modality- specific factors
Zhun Liu, Ying Shen, Varun Bharadwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Efficient low-rank multimodal fusion with modality- specific factors. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2247–2256, 2018. 15
2018
-
[39]
Multimodal learning for healthcare: A survey
Junxiang Long, Kun Huang, and Mengling Chen. Multimodal learning for healthcare: A survey. Innpj Digital Medicine, 2024
2024
-
[40]
Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding.arXiv preprint arXiv:2407.01976, 2024
-
[41]
Jinghui Lu, Haiyang Yu, Siliang Xu, Shiwei Ran, Guozhi Tang, Siqi Wang, Bin Shan, Teng Fu, Hao Feng, Jingqun Tang, et al. Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient llm/mllm reasoning.arXiv preprint arXiv:2505.15154, 2025
-
[42]
André F. T. Martins and Ramón Fernandez Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. InProceedings of the 33rd International Conference on Machine Learning, pages 1614–1623. PMLR, 2016
2016
-
[43]
Multi- modal contrastive learning with limoe: the language-image mixture of experts
Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multi- modal contrastive learning with limoe: the language-image mixture of experts. InAdvances in Neural Information Processing Systems, volume 35, 2022
2022
-
[44]
Hybridgnn: A self-supervised graph neural network for efficient maximum matching in bipartite graphs.Symmetry, 16(12):1631, 2024
Chun-Hui Pan, Yi Qu, Yao Yao, and Mu-Jiang-Shan Wang. Hybridgnn: A self-supervised graph neural network for efficient maximum matching in bipartite graphs.Symmetry, 16(12):1631, 2024
2024
-
[45]
Multimodal explanations: Justifying decisions and pointing to the evidence
Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8779–8788, 2018
2018
-
[46]
Mctbench: Multimodal cognition towards text-rich visual scenes benchmark
Bin Shan, Xiang Fei, Wei Shi, An-Lan Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, and Can Huang. Mctbench: Multimodal cognition towards text-rich visual scenes benchmark. arXiv preprint arXiv:2410.11538, 2024
-
[47]
Shanghai AI Lab, Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, et al. Safework-r1: Coevolving safety and intelligence under the ai-45◦ law.arXiv preprint arXiv:2507.18576, 2025
-
[48]
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017
2017
-
[49]
Attentive eraser: Unleashing dif- fusion model’s object removal potential via self-attention redirection guidance
Wenhao Sun, Xue-Mei Dong, Benlei Cui, and Jingqun Tang. Attentive eraser: Unleashing dif- fusion model’s object removal potential via self-attention redirection guidance. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20734–20742, 2025
2025
-
[50]
Interpretable modality-aware knowledge distillation for multimodal learning
Vinitra Swamy, Pol Gomez, Julien Beuret, Martin Jaggi, and Tanja Käser. Interpretable modality-aware knowledge distillation for multimodal learning. InProceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024
2024
-
[51]
Character recognition competition for street view shop signs.National Science Review, 10(6):nwad141, 2023
Jingqun Tang, Weidong Du, Bin Wang, Wenyang Zhou, Shuqi Mei, Tao Xue, Xing Xu, and Hai Zhang. Character recognition competition for street view shop signs.National Science Review, 10(6):nwad141, 2023
2023
-
[52]
Textsquare: Scaling up text-centric visual instruction tuning.arXiv preprint arXiv:2404.12803, 2024
Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, et al. Textsquare: Scaling up text-centric visual instruction tuning.arXiv preprint arXiv:2404.12803, 2024
-
[53]
Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, et al. Mtvqa: Benchmarking multilingual text-centric visual question answering.arXiv preprint arXiv:2405.11985, 2024. 16
-
[54]
Optimal boxes: boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning
Jingqun Tang, Wenming Qian, Luchuan Song, Xiena Dong, Lan Li, and Xiang Bai. Optimal boxes: boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. InEuropean Conference on Computer Vision, pages 233–248. Springer, 2022
2022
-
[55]
You can even annotate text with voice: Transcription-only-supervised text spotting
Jingqun Tang, Su Qiao, Benlei Cui, Yuhang Ma, Sheng Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. InProceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 4154–4163, New York, NY , USA, 2022. Association for Computing Machinery
2022
-
[56]
Few could be better than all: Feature sampling and grouping for scene text detection
Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022
2022
-
[57]
Learning Individual Styles of Conversational Gesture
Shiry Tsai, Yu Cheng, Paul Pu Liang, Louis-Philippe Morency, and Ruslan Salakhutdinov. Learning individual styles of conversational gesture.arXiv preprint arXiv:1906.04160, 2020
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[58]
Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, 2019
2019
-
[59]
Pargo: Bridging vision-language with partial and global views
An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, and Wei-Shi Zheng. Pargo: Bridging vision-language with partial and global views. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7491–7499, 2025
2025
-
[60]
An-Lan Wang, Jingqun Tang, Liao Lei, Hao Feng, Qi Liu, Xiang Fei, Jinghui Lu, Han Wang, Weiwei Liu, Hao Liu, et al. Wilddoc: How far are we from achieving comprehensive and robust document understanding in the wild?arXiv preprint arXiv:2505.11015, 2025
-
[61]
Vision as lora.arXiv preprint arXiv:2503.20680, 2025
Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, and Can Huang. Vision as lora.arXiv preprint arXiv:2503.20680, 2025
-
[62]
Sufficient conditions for graphs to be maximally 4-restricted edge connected.Australas
Mujiangshan Wang, Yuqing Lin, Shiying Wang, and Meiyu Wang. Sufficient conditions for graphs to be maximally 4-restricted edge connected.Australas. J Comb., 70:123–136, 2018
2018
-
[63]
Connectivity and diagnosability of center k-ary n-cubes
Mujiangshan Wang and Shiying Wang. Connectivity and diagnosability of center k-ary n-cubes. Discrete Applied Mathematics, 294:98–107, 2021
2021
-
[64]
The diagnosability of interconnection networks.Discrete Applied Mathematics, 357:413–428, 2024
Mujiangshan Wang, Dong Xiang, Yi Qu, and Guohui Li. The diagnosability of interconnection networks.Discrete Applied Mathematics, 357:413–428, 2024
2024
-
[65]
Connectivity and diagnosability of leaf-sort graphs.Parallel Processing Letters, 30(03):2040004, 2020
Mujiangshan Wang, Dong Xiang, and Shiying Wang. Connectivity and diagnosability of leaf-sort graphs.Parallel Processing Letters, 30(03):2040004, 2020
2020
-
[66]
Global reliable diagnosis of networks based on self-comparative diagnosis model and g-good-neighbor property.Journal of Computer and System Sciences, page 103698, 2025
Mujiangshan Wang, Shuhao Xu, Jincheng Jiang, Dong Xiang, and Sun-Yuan Hsieh. Global reliable diagnosis of networks based on self-comparative diagnosis model and g-good-neighbor property.Journal of Computer and System Sciences, page 103698, 2025
2025
-
[67]
A note on the connectivity of m-ary n-dimensional hypercubes.Parallel Processing Letters, 29(04):1950017, 2019
Shiying Wang and Mujiangshan Wang. A note on the connectivity of m-ary n-dimensional hypercubes.Parallel Processing Letters, 29(04):1950017, 2019
2019
-
[68]
g-good-neighbor conditional diagnosability of star graph networks under pmc model and mm* model.Frontiers of Mathematics in China, 12(5):1221–1234, 2017
Shiying Wang, Zhenhua Wang, Mujiangshan Wang, and Weiping Han. g-good-neighbor conditional diagnosability of star graph networks under pmc model and mm* model.Frontiers of Mathematics in China, 12(5):1221–1234, 2017
2017
-
[69]
Zichen Wen, Jiashu Qu, Zhaorun Chen, Xiaoya Lu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, and Linfeng Zhang. The devil behind the mask: An emergent safety vulnerability of diffusion llms.arXiv preprint arXiv:2507.11097, 2025. 17
-
[70]
Explainable multimodal ai: A survey
Sandra Wenderoth, Yiming Yang, and Manfred Hauswirth. Explainable multimodal ai: A survey. arXiv preprint arXiv:2405.00000, 2024
-
[71]
Bahr, and Louis-Philippe Morency
Torsten Wörtwein, Gunnar S. Bahr, and Louis-Philippe Morency. Multimodal interaction modeling via self-supervised multi-task learning. InProceedings of the International Conference on Affective Computing and Intelligent Interaction, 2024
2024
-
[72]
Beyond additive fusion: Learning non-additive multimodal interactions
Torsten Wörtwein, Louis-Philippe Morency, and Stefan Scherer. Beyond additive fusion: Learning non-additive multimodal interactions. InFindings of the Association for Computational Linguistics: EMNLP 2022, 2022
2022
-
[73]
G-good-neighbor diagnosability under the modified comparison model for multiprocessor systems.Theoretical Computer Science, 1028:115027, 2025
Dong Xiang, Sun-Yuan Hsieh, et al. G-good-neighbor diagnosability under the modified comparison model for multiprocessor systems.Theoretical Computer Science, 1028:115027, 2025
2025
-
[74]
Dynamic multimodal fusion
Zihui Xue and Radu Marculescu. Dynamic multimodal fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023
2023
-
[75]
Multimodal embeddings for representation learning
Zheyu Yao, Yichao Zhang, Ningyuan Deng, Xinyuan Song, Ziqian Bi, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, et al. Multimodal embeddings for representation learning. 2025
2025
-
[76]
Drdgrl: Dual-relational dynamic graph repre- sentation learning for delay-sensitive stock trend prediction
Mingjie You, Kaijie Chen, and Dawei Cheng. Drdgrl: Dual-relational dynamic graph repre- sentation learning for delay-sensitive stock trend prediction. InInternational Conference on Database Systems for Advanced Applications, pages 35–50. Springer, 2026
2026
-
[77]
Mmoe: Enhancing multimodal models with mixtures of multimodal interaction experts
Haofei Yu, Zhengyang Qi, Lawrence Jang, Ruslan Salakhutdinov, Louis-Philippe Morency, and Paul Pu Liang. Mmoe: Enhancing multimodal models with mixtures of multimodal interaction experts. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10006–10030, 2024
2024
-
[78]
Wilson, and Paul D
Seniha Esen Yüksel, Joseph N. Wilson, and Paul D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012
2012
-
[79]
Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos
Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. InIEEE Intelligent Systems, 2016
2016
-
[80]
Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2236–2246, 2018
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.