Behavior-Guided Candidate Calibration for Multimodal Recommendation

Chengchang Pan; Honggang Qi; Zesheng Li

arxiv: 2605.22073 · v1 · pith:BJS5Q6HYnew · submitted 2026-05-21 · 💻 cs.IR

Zesheng Li, Chengchang Pan, Honggang Qi

Pith reviewed 2026-05-22 04:16 UTC · model grok-4.3

classification 💻 cs.IR

keywords multimodal recommendation · candidate calibration · co-user overlap · spectral analysis · behavior-guided model · ranking pipeline · Amazon datasets · content signals

The pith

A calibration model uses training-only co-user overlaps to adjust shortlists from multimodal recommenders without altering their core representation space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal recommenders gain from content signals only when those signals do not suppress ranking-specific variation. Spectral analysis reveals that low-frequency components hold shared structure while higher-frequency components keep discriminative details. This split supports a new calibration step that turns co-user overlap seen only in training into signed evidence and applies it solely to the shortlist the multimodal backbone produces. The backbone therefore stays stable while the added evidence targets the point where final ranking occurs. Tests on Amazon Baby, Sports, and Electronics datasets record steady lifts over strong multimodal baselines.

Core claim

The central claim is that a behavior-guided candidate calibration model converts training-only co-user overlap into signed candidate evidence and restricts its use to the shortlist generated by the multimodal backbone. The backbone preserves the representation space while behavior evidence influences only the ranking decision stage. Spectral analysis of cross-view agreement underpins the selective application by showing low-frequency components capture shared structure and higher-frequency components retain more discriminative signal.

What carries the argument

The behavior-guided candidate calibration model that turns training-only co-user overlap into signed candidate evidence applied exclusively to the multimodal backbone's shortlist.

If this is right

Moderate cross-view agreement improves recommendations while stronger agreement reduces recommendation-specific variation.
The multimodal backbone maintains stable representations because behavior evidence touches only the final shortlist.
Consistent accuracy gains appear across Amazon Baby, Sports, and Electronics over strong multimodal baselines.
Behavior evidence from training data can be injected without retraining the content backbone or introducing test-time signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective calibration pattern could transfer to other fusion methods that currently blend all signals at once.
Training-only overlap signals may lower the risk of leakage compared with approaches that rely on overlaps observed at inference time.
Extending the calibration to update shortlists dynamically during online serving could further reduce ranking errors in live systems.

Load-bearing premise

The spectral analysis split between low-frequency shared structure and higher-frequency discriminative signal remains reliable enough to guide selective use of behavior evidence without creating new instabilities in the ranking pipeline.
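To make the low-frequency versus higher-frequency distinction concrete, here is a minimal sketch of a graph frequency split. It does not reproduce the paper's analysis: instead of an eigendecomposition it uses one step of neighbor averaging with a symmetric-normalized adjacency (with self-loops) as a cheap low-pass filter and treats the residual as the higher-frequency part. The function names, the toy graph, and the scalar per-node signal are all assumptions made for illustration.

```python
import math

def normalized_adjacency(n, edges):
    """Symmetric-normalized adjacency with self-loops for an undirected graph."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1.0  # self-loop keeps every degree positive
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    deg = [sum(row) for row in A]
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def split_frequencies(x, A_hat):
    """Return (low, high): the smoothed signal and the residual, with low + high == x."""
    n = len(x)
    low = [sum(A_hat[i][j] * x[j] for j in range(n)) for i in range(n)]
    high = [x[i] - low[i] for i in range(n)]
    return low, high

# Hypothetical toy graph: a 4-cycle.
A_hat = normalized_adjacency(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
smooth_low, smooth_high = split_frequencies([1.0, 1.0, 1.0, 1.0], A_hat)
rough_low, rough_high = split_frequencies([1.0, -1.0, 1.0, -1.0], A_hat)
```

On this toy graph a constant signal lands entirely in the low part, while a signal that alternates between neighbors lands mostly in the residual. That is the sense in which shared structure and item-level discriminative detail can separate.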

What would settle it

Applying the same signed evidence to the full candidate list instead of only the shortlist and measuring whether ranking stability or accuracy degrades would directly test the selective-application premise.
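A rough sketch of how that comparison could be set up follows. The helper names, the Jaccard overlap of top-k sets as a stability proxy, and all numbers are placeholders chosen for illustration; none of them come from the paper.

```python
def apply_residual(scores, evidence, lam, allowed=None):
    """Add lam * evidence[i] to each score; if `allowed` is given, only to items in it."""
    out = {}
    for item, score in scores.items():
        use = allowed is None or item in allowed
        out[item] = score + (lam * evidence.get(item, 0.0) if use else 0.0)
    return out

def top_k(scores, k):
    """Items with the k highest scores, ties broken by item id."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

def jaccard(a, b):
    """Set overlap in [0, 1]; used here as a crude stability proxy."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Placeholder backbone scores and signed evidence for one user.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2, "e": 0.1}
evidence = {"a": -1.0, "b": 1.0, "c": 0.0, "d": 0.5, "e": 3.0}
shortlist = top_k(scores, 3)

restricted = apply_residual(scores, evidence, lam=0.3, allowed=set(shortlist))
unrestricted = apply_residual(scores, evidence, lam=0.3)

stability_restricted = jaccard(shortlist, top_k(restricted, 3))
stability_unrestricted = jaccard(shortlist, top_k(unrestricted, 3))
```

With these placeholder values the shortlist-only variant reorders items inside the backbone's top 3, whereas the unrestricted variant pulls an item from outside the shortlist into it. A real test would compare both variants on held-out accuracy as well as on a stability measure.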

Figures

Figures reproduced from arXiv: 2605.22073 by Chengchang Pan, Honggang Qi, Zesheng Li.

**Figure 1.** Overview of BRIDGE. DFGE constructs dual-frequency graph evidence; BEN computes normalized behavior support;

**Figure 2.** Representation-spectral frequency diagnostic on Amazon Baby. Error bars denote the observed ranges. Leading

**Figure 3.** Compact interpretability controls for BRIDGE. BEN

Original abstract

Multimodal recommendation benefits from content signals, but the gain depends on how those signals interact with the ranking pipeline. We find that moderate cross-view agreement helps, while stronger agreement suppresses recommendation-specific variation. Spectral analysis shows a clear split: low-frequency components capture shared structure, and higher-frequency components preserve more discriminative signal. Based on this finding, we introduce a behavior-guided candidate calibration model that converts training-only co-user overlap into signed candidate evidence and applies it only to the shortlist produced by the multimodal backbone. The backbone keeps the representation space stable; behavior evidence acts only where ranking is decided. Results on Amazon Baby, Sports, and Electronics show consistent gains over strong multimodal baselines. Code is available at https://github.com/LIZESHENG13/bridge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that multimodal recommendation can be improved by a behavior-guided candidate calibration that converts training-only co-user overlap into signed candidate evidence and applies it selectively to the shortlist from a multimodal backbone. This is motivated by a spectral analysis showing that low-frequency components capture shared structure while higher-frequency components preserve more discriminative signal. The backbone maintains stable representations, and the calibration intervenes only at ranking time. Experiments on Amazon Baby, Sports, and Electronics datasets report consistent gains over strong multimodal baselines, with code released.

Significance. The selective integration of behavioral evidence based on spectral properties addresses a key challenge in multimodal recsys where content signals can suppress variation. The training-only data usage avoids test leakage, and the code availability is a positive for reproducibility. If validated, this could offer a practical calibration technique for ranking pipelines.

major comments (2)

The spectral split observation is load-bearing for the design choice of selective application, but the manuscript lacks details on the construction of the graph or matrix for spectral decomposition and the precise definition of low vs high frequency components.
The results section reports consistent gains but does not include error bars, number of random seeds, or statistical significance tests, which weakens the support for the central claim of consistent improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The comments are constructive and we address each one below, indicating the changes we will make to strengthen the manuscript.

Referee: The spectral split observation is load-bearing for the design choice of selective application, but the manuscript lacks details on the construction of the graph or matrix for spectral decomposition and the precise definition of low vs high frequency components.

Authors: We agree that additional detail is needed to make the spectral analysis reproducible and to fully justify the selective calibration design. In the revised manuscript we will expand the relevant section to describe the exact construction of the graph (a normalized user-item interaction matrix derived solely from the training split), the Laplacian used for decomposition, and the precise cutoff criterion separating low-frequency (shared structure) from high-frequency (discriminative) components based on eigenvalue thresholds. This will clarify why behavior-guided evidence is applied only to the higher-frequency regime. revision: yes
Referee: The results section reports consistent gains but does not include error bars, number of random seeds, or statistical significance tests, which weakens the support for the central claim of consistent improvements.

Authors: We acknowledge that variability reporting strengthens empirical claims. In the revised version we will report mean performance and standard deviation over five independent random seeds for all methods, explicitly state the seed count, and include paired t-test p-values against the strongest baselines to demonstrate that the observed gains are statistically significant. These additions will be placed in the main results tables and the experimental setup subsection. revision: yes
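The variability reporting promised in the second response could look roughly like the sketch below. It computes the per-method mean and sample standard deviation plus the paired t statistic over matched seeds. The function name and the per-seed numbers are placeholders, not results from the paper, and the p-value is deliberately left out because the standard library has no t-distribution CDF (a routine such as `scipy.stats.ttest_rel` would supply it).

```python
import math
from statistics import mean, stdev

def paired_t_statistic(ours, baseline):
    """Paired t statistic and degrees of freedom for per-seed metric values."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two matched seeds")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = stdev(diffs)  # sample standard deviation of the per-seed differences
    if sd == 0:
        raise ValueError("zero variance in differences; t statistic undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder Recall@20 values for five seeds (NOT taken from the paper).
ours = [0.10, 0.12, 0.11, 0.13, 0.12]
baseline = [0.09, 0.10, 0.10, 0.11, 0.10]

summary = {
    "ours": (mean(ours), stdev(ours)),
    "baseline": (mean(baseline), stdev(baseline)),
}
t_stat, dof = paired_t_statistic(ours, baseline)
```

Pairing by seed matters here: both methods share the same data split and initialization per seed, so the test is run on per-seed differences rather than on two independent samples.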

Circularity Check

0 steps flagged

No circularity: spectral observation and training-only evidence are independent of reported gains

The paper first conducts spectral analysis on the multimodal representations to identify the low-frequency shared structure versus higher-frequency discriminative signal split; this is an empirical measurement performed on the data prior to model design. It then constructs a behavior-guided calibration that converts training-only co-user overlap statistics into signed evidence and restricts application to the backbone shortlist. Because the overlap evidence is drawn exclusively from training interactions and the spectral split is an observed property rather than a fitted parameter renamed as a prediction, neither step reduces to the final performance numbers by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain. The reported improvements on Amazon datasets are therefore measured against external baselines rather than being tautological with the input construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the validity of the spectral frequency split observation and the modeling choice that behavior evidence can be isolated to the candidate stage without destabilizing the backbone.

free parameters (1)

signed evidence conversion rules
Rules or thresholds for turning co-user overlap into positive or negative signed evidence are likely chosen or tuned to data.

axioms (1)

domain assumption: Moderate cross-view agreement helps, while stronger agreement suppresses recommendation-specific variation.
This observation from spectral analysis is used to motivate the selective application of behavior evidence.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


Behavior Evidence Normalizer (BEN) builds an item-item behavior graph from co-user similarity... b_ui = (e_bui - μ_u) / (σ_u + ε)... Candidate Residual Integrator (CRI) uses BEN evidence as a signed score residual... Δ_bridge(u,i) = I[i ∈ C_tr_u] λ_b b_ui
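The two formulas in the quoted passage can be sketched in a few lines. This is an illustrative reading, not the authors' implementation. It assumes that raw evidence for a candidate is the summed co-user overlap with the user's training items, that μ_u and σ_u are taken over the user's shortlist, and that C_tr_u denotes that shortlist. All function names and the value of λ_b are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def co_user_overlap(train):
    """train: user -> set of items. Returns (i, j) -> number of users who have both."""
    item_users = defaultdict(set)
    for user, items in train.items():
        for item in items:
            item_users[item].add(user)
    overlap = {}
    for a in item_users:
        for b in item_users:
            shared = len(item_users[a] & item_users[b])
            if a != b and shared:
                overlap[(a, b)] = shared
    return overlap

def calibrate(backbone_scores, shortlist, user_items, overlap, lam=0.1, eps=1e-8):
    """Add lam * b_ui to shortlist items only; every other score is left untouched."""
    out = dict(backbone_scores)
    if not shortlist:
        return out
    # Assumed raw evidence: total overlap between the candidate and the user's items.
    raw = {i: sum(overlap.get((i, j), 0) for j in user_items) for i in shortlist}
    mu, sigma = mean(raw.values()), pstdev(raw.values())
    for i in shortlist:
        b_ui = (raw[i] - mu) / (sigma + eps)  # signed: above or below the user's mean
        out[i] = backbone_scores[i] + lam * b_ui
    return out

# Toy training interactions (hypothetical).
train = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"a", "c"}, "u4": {"b", "d"}}
overlap = co_user_overlap(train)
scores = calibrate({"c": 0.5, "d": 0.5, "e": 0.9}, ["c", "d"], train["u1"], overlap)
```

In the toy data, candidates "c" and "d" tie under the backbone. "c" shares more co-users with the user's history, so it receives a positive residual and "d" a negative one, while "e" sits outside the shortlist and keeps its backbone score.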

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

