Recognition: no theorem link
CAMPA: Efficient and Aligned Multimodal Graph Learning via Decoupled Propagation and Aggregation
Pith reviewed 2026-05-13 01:49 UTC · model grok-4.3
The pith
Decoupled multimodal graph networks gain accuracy by aligning cross-modal propagation and aggregation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAMPA resolves modal conflict in decoupled MGNN pipelines through a two-stage alignment: cross-modal aligned propagation injects cross-modal similarity priors into message passing to maintain semantic consistency, and trajectory aligned aggregation applies trajectory-level self-attention and cross-attention to capture and align long-range dependencies across modalities and hops. Extensive experiments confirm that this yields consistent outperformance over strong coupled and decoupled baselines while preserving the computational advantages of decoupling.
What carries the argument
Two-stage alignment mechanism of cross-modal aligned propagation (injecting similarity priors into message passing) and trajectory aligned aggregation (using self-attention and cross-attention on multi-hop trajectories).
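The propagation-stage idea, reweighting one modality's message passing by the other modality's node similarity, can be sketched in a few lines. Everything below (the cosine prior, clipping negative similarities to zero, the row-normalized update, the toy graph) is an illustrative assumption about how a parameter-free similarity prior could enter message passing, not the paper's exact formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def aligned_propagate(feats, other_feats, edges):
    """One hop of parameter-free propagation: each edge (u, v) is
    reweighted by the similarity of the OTHER modality's features,
    so diffusion in one modality is steered by the other."""
    n, d = len(feats), len(feats[0])
    out = [[0.0] * d for _ in range(n)]
    weight_sum = [0.0] * n
    for u, v in edges:
        for src, dst in ((u, v), (v, u)):  # treat edges as undirected
            # cross-modal similarity prior, clipped to be non-negative
            w = max(cosine(other_feats[src], other_feats[dst]), 0.0)
            weight_sum[dst] += w
            for k in range(d):
                out[dst][k] += w * feats[src][k]
    # normalize by total incoming weight; isolated nodes keep their features
    for i in range(n):
        out[i] = ([x / weight_sum[i] for x in out[i]]
                  if weight_sum[i] > 0 else feats[i][:])
    return out

# toy graph: nodes 0 and 1 agree in the image modality, node 2 disagrees,
# so text diffusion flows along edge (0, 1) but is suppressed on (0, 2)
text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
image = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
out = aligned_propagate(text, image, edges=[(0, 1), (0, 2)])
```

Because the prior is computed from the features themselves, this stage adds no learnable parameters, consistent with the abstract's "without additional parameter overhead" claim.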
Load-bearing premise
Modal conflict is the primary bottleneck in existing decoupled MGNNs and the proposed two-stage alignment resolves it without introducing new inconsistencies or overhead.
What would settle it
Running CAMPA on additional large-scale multimodal graph datasets and finding either no accuracy gains over strong decoupled baselines or a loss of efficiency advantages would falsify the central claim.
Figures
Original abstract
Multimodal Graph Neural Networks (MGNNs) have shown strong potential for learning from multimodal attributed graphs, yet most existing approaches rely on tightly coupled architectures that suffer from prohibitive computational overhead. In this paper, we present a systematic empirical analysis showing that decoupled MGNNs are substantially more efficient and scalable for large-scale graph learning. However, we identify a critical bottleneck in existing decoupled pipelines, namely modal conflict, which arises in both the propagation and aggregation stages. Specifically, independent multi-hop diffusion causes cross-modal semantic divergence during propagation, while naive fusion fails to align multi-hop feature trajectories during aggregation, jointly limiting effective representation learning. To address this challenge, we propose CAMPA, a Cross-modal Aligned Multimodal Propagation & Aggregation framework for decoupled multimodal graph learning. Concretely, CAMPA introduces a two-stage alignment mechanism: (1) cross-modal aligned propagation, which injects cross-modal similarity priors into message passing to preserve semantic consistency without additional parameter overhead; (2) trajectory aligned aggregation, which leverages trajectory-level self-attention and cross-attention to capture and align long-range dependencies across modalities and hops. Extensive experiments on diverse benchmark datasets and tasks demonstrate that CAMPA consistently outperforms strong coupled and decoupled baselines while preserving the efficiency advantages of the decoupled paradigm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that decoupled MGNNs are substantially more efficient and scalable than coupled ones for large-scale multimodal graph learning, but suffer from a critical 'modal conflict' bottleneck: independent multi-hop diffusion causes cross-modal semantic divergence in propagation, while naive fusion fails to align multi-hop feature trajectories in aggregation. To resolve this, CAMPA introduces a two-stage alignment: (1) cross-modal aligned propagation that injects cross-modal similarity priors into message passing with no extra parameters, and (2) trajectory aligned aggregation that uses trajectory-level self-attention and cross-attention to capture long-range dependencies across modalities and hops. Extensive experiments are said to show consistent outperformance over strong coupled and decoupled baselines while preserving decoupled efficiency.
Significance. If the empirical claims hold with full verification, this would be a useful contribution to scalable multimodal graph learning by showing how to add alignment to the decoupled paradigm without sacrificing its asymptotic advantages. The identification of modal conflict as a joint propagation-aggregation issue provides a concrete diagnostic lens, and the parameter-free prior injection in propagation is a clean design choice that could influence future MGNN architectures.
Major comments (2)
- [Method description of trajectory aligned aggregation] The central efficiency claim (preserving the decoupled paradigm's advantages) is load-bearing, yet the abstract and method description provide no analysis or bounds on the computational complexity of trajectory aligned aggregation. Attention over multi-hop trajectories across modalities is quadratic in the number of trajectory tokens (hops × modalities) unless linearized, sparsified, or windowed; without explicit confirmation that the implementation avoids this cost on large graphs, the headline claim that efficiency is retained cannot be assessed.
- [Experimental results summary] The abstract asserts that CAMPA 'consistently outperforms strong coupled and decoupled baselines' on 'diverse benchmark datasets and tasks,' but supplies no information on the datasets used, exact baseline implementations, evaluation metrics, statistical tests, number of runs, or ablation studies isolating the two alignment stages. This absence prevents verification of the central empirical claim and the weakest assumption that the proposed alignment resolves modal conflict without introducing new inconsistencies.
Minor comments (1)
- [Introduction] The term 'modal conflict' is introduced as a new bottleneck without a formal definition, mathematical characterization, or references to related concepts in multimodal or multi-view graph learning.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment in detail below, providing clarifications from the manuscript and committing to targeted revisions that strengthen the presentation of our efficiency claims and experimental details without altering the core contributions.
Point-by-point responses
Referee: [Method description of trajectory aligned aggregation] The central efficiency claim (preserving the decoupled paradigm's advantages) is load-bearing, yet the abstract and method description provide no analysis or bounds on the computational complexity of trajectory aligned aggregation. Attention over multi-hop trajectories across modalities is quadratic in the number of trajectory tokens (hops × modalities) unless linearized, sparsified, or windowed; without explicit confirmation that the implementation avoids this cost on large graphs, the headline claim that efficiency is retained cannot be assessed.
Authors: We agree that an explicit complexity analysis is necessary to fully substantiate the efficiency claim. In the manuscript, trajectory aligned aggregation operates on fixed-length per-node trajectories (with small constants for hops H and modalities M, typically H=2-3 and M=2-3 in our experiments), using standard scaled dot-product attention followed by efficient linear projections. The per-node cost is O((H*M)^2 * d) where d is the feature dimension, which remains negligible relative to the linear propagation cost of the decoupled paradigm. We will add a dedicated subsection in the revised Methods (Section 3.3) providing both asymptotic bounds and empirical wall-clock runtime comparisons on the largest benchmark graphs to confirm that the overhead does not compromise the overall scalability advantage. revision: yes
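A minimal sketch of why the per-node cost stays at O((H·M)² · d): attention runs over only T = H·M trajectory tokens per node, not over the graph. The token layout and the unparameterized Q = K = V choice below are simplifying assumptions for illustration, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def trajectory_self_attention(traj):
    """Scaled dot-product self-attention over one node's trajectory
    tokens (H hops x M modalities = T tokens of dimension d).
    Per-node cost: T queries x T keys x d = O((H*M)^2 * d)."""
    T, d = len(traj), len(traj[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in traj:  # here Q = K = V = traj (no learned projections)
        scores = [scale * sum(qk * kk for qk, kk in zip(q, k)) for k in traj]
        attn = softmax(scores)
        out.append([sum(a * v[j] for a, v in zip(attn, traj))
                    for j in range(d)])
    return out

H, M, d = 3, 2, 4  # small constants, matching the rebuttal's typical values
traj = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(H * M)]
out = trajectory_self_attention(traj)
```

With H·M around 4 to 6, the quadratic term is a small per-node constant, so the dominant cost remains the linear (in edges) precomputed propagation that makes the decoupled paradigm scalable.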
Referee: [Experimental results summary] The abstract asserts that CAMPA 'consistently outperforms strong coupled and decoupled baselines' on 'diverse benchmark datasets and tasks,' but supplies no information on the datasets used, exact baseline implementations, evaluation metrics, statistical tests, number of runs, or ablation studies isolating the two alignment stages. This absence prevents verification of the central empirical claim and the weakest assumption that the proposed alignment resolves modal conflict without introducing new inconsistencies.
Authors: The full manuscript contains a comprehensive Experiments section (Section 4) that details the benchmark datasets, baseline implementations (with citations and hyperparameter settings), evaluation metrics, number of runs (5 random seeds with reported means and standard deviations), and ablation studies that isolate the contributions of cross-modal aligned propagation and trajectory aligned aggregation separately. We will revise the abstract to include a brief reference to these details and expand the ablation analysis in the revision to explicitly demonstrate that each alignment stage independently mitigates modal conflict without introducing inconsistencies, including additional statistical significance tests (paired t-tests) across all tasks. revision: partial
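The statistical protocol the authors commit to (5 seeds, paired t-tests) can be expressed as a one-function check over matched per-seed scores. The accuracy numbers below are hypothetical placeholders, not results from the paper.

```python
import math

def paired_t(xs, ys):
    """Paired t-statistic over matched runs (e.g. the same 5 random
    seeds for two models). Returns (t, degrees of freedom)."""
    n = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# hypothetical per-seed accuracies (NOT the paper's numbers)
campa    = [0.842, 0.851, 0.847, 0.839, 0.845]
baseline = [0.821, 0.830, 0.828, 0.815, 0.824]
t, df = paired_t(campa, baseline)
```

With 5 seeds (df = 4), the two-sided 5% critical value is about 2.776, so |t| above that threshold would support the claimed significance.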
Circularity Check
No circularity; empirical proposal with independent mechanisms
Full rationale
The paper's core contribution is an empirical identification of modal conflict in decoupled MGNNs followed by a proposed two-stage alignment (cross-modal priors in propagation and trajectory attention in aggregation). No equations, derivations, or first-principles results are presented that reduce to fitted parameters, self-definitions, or self-citation chains. The efficiency and performance claims rest on benchmark experiments rather than any tautological reduction, satisfying the criteria for a self-contained non-circular analysis.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: independent multi-hop diffusion causes cross-modal semantic divergence during propagation.
- Domain assumption: naive fusion fails to align multi-hop feature trajectories during aggregation.
Invented entities (2)
- cross-modal aligned propagation (no independent evidence)
- trajectory aligned aggregation (no independent evidence)
Reference graph
Works this paper leans on
- [1] Tian Bian, Xi Xiao, Tingyang Xu, Peilin Zhao, Wenbing Huang, Yu Rong, and Junzhou Huang. Rumor detection on social media with bi-directional graph convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2020.
- [2] Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? In International Conference on Learning Representations, ICLR, 2022.
- [3] Jie Cai, Xin Wang, Haoyang Li, Ziwei Zhang, and Wenwu Zhu. Multimodal graph neural architecture search under distribution shifts. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2024.
- [4] Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. NAGphormer: A tokenized graph transformer for node classification in large graphs. In International Conference on Learning Representations, ICLR, 2023.
- [5] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, ICML, 2020.
- [6] Karan Desai, Gaurav Kaul, Zubin Trivadi Aysola, and Justin Johnson. RedCaps: Web-curated image-text data created by the people, for the people. In Advances in Neural Information Processing Systems, NeurIPS Datasets and Benchmarks Track, 2021.
- [7] Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, and Qiaoyu Tan. MLaGA: Multimodal large language and graph assistant. arXiv preprint arXiv:2506.02568, 2025.
- [8] Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, and Jiawei Han. GraphGPT-o: Synergistic multimodal comprehension and generation on graphs. In Proceedings of the Computer Vision and Pattern Recognition Conference, CVPR, pages 19467–19476, 2025.
- [9] Fabrizio Frasca, Emanuele Rossi, Davide Eynard, Ben Chamberlain, Michael Bronstein, and Federico Monti. SIGN: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.
- [10] Noa Garcia and George Vogiatzis. How to read paintings: Semantic art understanding with multi-modal retrieval. In Proceedings of the European Conference on Computer Vision Workshops, ECCV, 2018.
- [11] Zhaochen Guo, Zhixiang Shen, Xuanting Xie, Liangjian Wen, and Zhao Kang. Disentangling homophily and heterophily in multimodal graph clustering. CoRR, abs/2507.15253, 2025.
- [12] Zhiqiang Guo, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, and Bin Ruan. LGMRec: Local and global graph learning for multimodal recommendation. In AAAI, pages 8454–8462. AAAI Press, 2024.
- [13] Yufei He, Yuan Sui, Xiaoxin He, Yue Liu, Yifei Sun, and Bryan Hooi. UniGraph2: Learning a unified embedding space to bind multimodal graphs. arXiv preprint arXiv:2502.00806, 2025.
- [14] Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation. arXiv preprint arXiv:2403.03952, 2024.
- [15] Jun Hu, Yufei He, Yuan Li, Bryan Hooi, and Bingsheng He. NTSFormer: A self-teaching graph transformer for multimodal isolated cold-start node classification. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2025.
- [16] Jun Hu, Bryan Hooi, Bingsheng He, and Yinwei Wei. Modality-independent graph neural networks with global transformers for multimodal recommendation. In AAAI, pages 11790–11798. AAAI Press, 2025.
- [17] Jun Hu, Bryan Hooi, Shengsheng Qian, Quan Fang, and Changsheng Xu. MGDCF: Distance learning via Markov graph diffusion for neural collaborative filtering. IEEE Transactions on Knowledge and Data Engineering, 36(7):3281–3296, 2024.
- [18] Bowen Jin, Ziqi Pang, Bingjun Guo, Yu-Xiong Wang, Jiaxuan You, and Jiawei Han. InstructG2I: Synthesizing images from multimodal attributed graphs. Advances in Neural Information Processing Systems, 37:117614–117635, 2024.
- [19] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, 2017.
- [20] Yunqing Li, Zixiang Tang, Jiaying Zhuang, Zhenyu Yang, Farhad Ameri, and Jianbang Zhang. C-MAG: Cascade multimodal attributed graphs for supply chain link prediction. In Proceedings of the KDD Workshop on AI for Supply Chain, 2025.
- [21] Jiajin Liu, Dongzhe Fan, Jiacheng Shen, Chuanhao Ji, Daochen Zha, and Qiaoyu Tan. Graph-MLLM: Harnessing multimodal large language models for multimodal graph learning. arXiv preprint arXiv:2506.10282, 2025.
- [22] Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.
- [23] Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, 2019.
- [24] Xuying Ning, Dongqi Fu, Tianxin Wei, Wujiang Xu, and Jingrui He. Graph4MM: Weaving multimodal learning with structural information. In Proceedings of the International Conference on Machine Learning, ICML, 2025.
- [25] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaa El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, TMLR, 2024.
- [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, ICML, 2021.
- [27] Zhulin Tao, Yinwei Wei, Xiang Wang, Xiangnan He, Xianglin Huang, and Tat-Seng Chua. MGAT: Multimodal graph attention network for recommendation. Information Processing & Management, 57(5):102277, 2020.
- [28] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, ICLR, 2018.
- [29] Chenxi Wan, Xunkai Li, Yilong Zuo, Haokun Deng, Sihan Li, Bowen Fan, Hongchao Qin, Ronghua Li, and Guoren Wang. OpenMAG: A comprehensive benchmark for multimodal-attributed graph. arXiv preprint arXiv:2602.05576, 2026.
- [30] Mengting Wan and Julian McAuley. Item recommendation on monotonic behavior chains. In Proceedings of the ACM Conference on Recommender Systems, RecSys, 2018.
- [31] Mengting Wan, Rishabh Misra, Ndapandula Nakashole, and Julian McAuley. Fine-grained spoiler detection from large-scale review corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL, 2019.
- [32] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution, 2024.
- [33] Xin Wang, Zeyang Zhang, Linxin Xiao, Haibo Chen, Chendi Ge, and Wenwu Zhu. Towards multi-modal graph large language model. arXiv preprint arXiv:2506.09738, 2025.
- [34] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In ACM Multimedia, pages 1437–1445. ACM, 2019.
- [35] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, ICML, 2019.
- [36] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.
- [37] Hao Yan, Chaozhuo Li, Jun Yin, Zhigang Yu, Weihao Han, Mingzheng Li, Zhengxin Zeng, Hao Sun, and Senzhang Wang. When graph meets multimodal: Benchmarking and meditating on multimodal attributed graph learning. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD, 2025.
- [38] Minji Yoon, Jing Yu Koh, Bryan Hooi, and Ruslan Salakhutdinov. Multimodal graph learning for generative tasks. arXiv preprint arXiv:2310.07478, 2023.
- [39] Wentao Zhang, Mingyu Yang, Zeang Sheng, Yang Li, Wen Ouyang, Yangyu Tao, Zhi Yang, and Bin Cui. Node dependent local smoothing for scalable graph learning. Advances in Neural Information Processing Systems, NeurIPS, 2021.
- [40] Wentao Zhang, Ziqi Yin, Zeang Sheng, Yang Li, Wen Ouyang, Xiaosen Li, Yangyu Tao, Zhi Yang, and Bin Cui. Graph attention multi-layer perceptron. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD, 2022.
- [41] Haoran Zheng, Renchi Yang, Hongtao Wang, and Jianliang Xu. Cross-contrastive clustering for multimodal attributed graphs with dual graph filtering, 2025.
- [42] Yu Zhou, Haixia Zheng, Xin Huang, Shufeng Hao, Dengao Li, and Jumin Zhao. Graph neural networks: Taxonomy, advances, and trends. ACM Transactions on Intelligent Systems and Technology, 13(1):1–54, 2022.
- [43] Jing Zhu, Yuhang Zhou, Shengyi Qian, Zhongmou He, Tong Zhao, Neil Shah, and Danai Koutra. Mosaic of modalities: A comprehensive benchmark for multimodal graph learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, CVPR, 2025.
Discussion (0)