arxiv: 2604.23522 · v1 · submitted 2026-04-26 · 💻 cs.IR · cs.MM

Recognition: unknown

Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

Daoyuan Wang, Fuji Ren, Hongyang Wang, Jun Wang, Songhao Ni, Wenwu Ou, Xu Yuan, Yongsen Pan, Yuting Yin, Yuxin Chen, Zheng Hu

Pith reviewed 2026-05-08 05:37 UTC · model grok-4.3

keywords: semantic IDs · multimodal recommendation · collision regulation · adaptive learning · codebook utilization · industrial recommendation · discrete item representations · overlap handling

The pith

AdaSID adaptively regulates semantic ID overlaps by relaxing repulsion for compatible multimodal items and dynamically allocating pressure based on collision load and training progress.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AdaSID to improve semantic ID learning in large-scale multimodal recommendation systems. Traditional methods struggle with collisions by either over-penalizing all overlaps or using fixed rules, which can waste codebook space or lose semantic information. AdaSID uses a two-stage approach: first relaxing penalties when items share semantic compatibility inferred from multimodal data, and second adjusting regulation strength according to local congestion and learning stage. This leads to better codebook utilization, higher diversity in IDs, and improved recommendation performance on benchmarks and in production. A reader should care because it allows more efficient and accurate item representations for retrieval and ranking without increasing model size.

Core claim

AdaSID regulates SID overlaps through a two-stage process. First, it relaxes repulsion for observed overlaps when the involved items are semantically compatible, preserving admissible sharing rather than uniformly separating all collisions. Second, it allocates the remaining regulation pressure according to local collision load and training progress, strengthening control in congested regions while gradually rebalancing optimization toward recommendation alignment. This design adaptively decides which overlaps to penalize, how strongly to regulate them, and when to shift the learning focus.
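The abstract gives no equations for either stage, so the following is only a minimal sketch of how a two-stage weighting of this kind could look. Everything in it is an assumption made for illustration: the function name `overlap_penalty_weights`, the threshold `tau`, the soft gate `1 - compat`, the mean-normalized load factor, and the cosine decay over `progress` are hypothetical choices, not the authors' formulation.

```python
import math


def overlap_penalty_weights(compat, load, progress, tau=0.8, base=1.0):
    """Hypothetical per-pair repulsion weights for overlapping item pairs.

    compat   : compatibility score in [0, 1] per overlapping pair (assumed given)
    load     : number of items sharing each pair's code (assumed load measure)
    progress : fraction of training completed, in [0, 1]
    tau      : assumed compatibility threshold
    base     : assumed base regulation strength
    """
    mean_load = sum(load) / len(load)
    # Stage 2b (assumed): total regulation pressure decays over training,
    # so the recommendation objective gradually dominates.
    schedule = 0.5 * (1.0 + math.cos(math.pi * progress))
    weights = []
    for c, n in zip(compat, load):
        # Stage 1 (assumed): relax repulsion when the pair looks compatible.
        gate = 1.0 - c if c >= tau else 1.0
        # Stage 2a (assumed): more pressure where the code is more crowded.
        weights.append(base * schedule * gate * (n / mean_load))
    return weights


if __name__ == "__main__":
    # One compatible pair in a lightly loaded code, one incompatible pair
    # in a crowded code, evaluated at the start of training.
    print(overlap_penalty_weights([0.9, 0.2], [2, 6], progress=0.0))
```

Under these assumed choices, a compatible pair in a lightly loaded code receives a small weight, an incompatible pair in a crowded code receives a large one, and all weights shrink toward zero late in training.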

What carries the argument

The AdaSID framework, which combines two mechanisms: semantic compatibility detection from multimodal signals, used to selectively relax overlap repulsion, and dynamic allocation of the remaining regulation pressure based on local collision load and training progress.

If this is right

Recall and NDCG improve by about 4.5% on average over strong baselines on two public benchmarks.
Codebook utilization and SID diversity both improve.
An online A/B test on Kuaishou e-commerce short-video retrieval shows statistically significant gains, including a 0.98% GMV lift.
Industrial ranking evaluation shows consistent AUC improvements.
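For readers less familiar with the offline metrics named above, here is a small sketch of Recall@K and NDCG@K under binary relevance. These are the standard textbook definitions; the paper's actual evaluation protocol (choice of K, candidate set, data splits) is not given on this page, so the function names and inputs below are illustrative assumptions only.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]   # hypothetical model output
    relevant = {"b", "d"}           # hypothetical held-out interactions
    print(recall_at_k(ranked, relevant, k=3))
    print(ndcg_at_k(ranked, relevant, k=3))
```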

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the compatibility inference works well, AdaSID could allow smaller codebooks for the same performance level in very large catalogs.
The dynamic rebalancing might help in other sequential or generative recommendation setups by focusing optimization better over time.
This adaptive idea might apply to other discrete tokenization tasks in recommendation or retrieval systems.

Load-bearing premise

That semantic compatibility between items can be reliably inferred from their multimodal signals to correctly decide when overlaps should be preserved rather than penalized.

What would settle it

A test showing that items incorrectly identified as compatible lead to lower recommendation accuracy than a uniform repulsion baseline, or evidence of training instability from the dynamic pressure allocation.
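One way such a comparison could be run is a paired bootstrap over per-user metric differences between an adaptive variant and a uniform-repulsion baseline. The sketch below is a generic percentile bootstrap, not a procedure taken from the paper; the function name, the per-user metric lists, and the defaults for `n_boot`, `alpha`, and `seed` are all assumptions.

```python
import random


def paired_bootstrap_ci(metric_a, metric_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-user difference a - b.

    metric_a, metric_b : per-user metric values for the two variants,
                         aligned so index i refers to the same user.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample users with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


if __name__ == "__main__":
    # Hypothetical per-user Recall@K for two variants on the same users.
    adaptive = [1, 0, 1, 1, 0, 1]
    uniform = [0, 0, 1, 0, 0, 1]
    print(paired_bootstrap_ci(adaptive, uniform))
```

An interval that lies entirely below zero on the subset of items flagged as compatible would be evidence against the relaxation stage; an interval that straddles zero would leave the question open.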

Figures

Figures reproduced from arXiv: 2604.23522 by Daoyuan Wang, Fuji Ren, Hongyang Wang, Jun Wang, Songhao Ni, Wenwu Ou, Xu Yuan, Yongsen Pan, Yuting Yin, Yuxin Chen, Zheng Hu.

**Figure 1.** Motivation of AdaSID. Static overlap regulation ...

**Figure 2.** Overall framework of AdaSID. AdaSID maps collaborative trigger–target item pairs into quantized representations ...

**Figure 3.** Discrete SID space landscape on two datasets. Each point denotes a tokenizer. The x-axis is inverse normalized Top-1 ...

**Figure 4.** Hyperparameter sensitivity of AdaSID on the ...

Original abstract

Modern recommendation systems involve massive catalogs of multimodal items, where scalable item identification must balance compactness, semantic fidelity, and downstream effectiveness. Semantic IDs (SIDs) address this need by representing items as short discrete token sequences derived from multimodal signals, providing a compact interface for retrieval, ranking, and generative recommendation. However, effective SID learning is hindered by collisions, where different items are assigned identical or highly confusable codes. Existing methods mainly rely on improved quantization or fixed overlap regularization, but they do not adaptively distinguish whether an overlap should be suppressed or preserved. We propose AdaSID, an adaptive semantic ID learning framework for recommendation. AdaSID regulates SID overlaps through a two-stage process. First, it relaxes repulsion for observed overlaps when the involved items are semantically compatible, preserving admissible sharing rather than uniformly separating all collisions. Second, it allocates the remaining regulation pressure according to local collision load and training progress, strengthening control in congested regions while gradually rebalancing optimization toward recommendation alignment. This design adaptively decides which overlaps to penalize, how strongly to regulate them, and when to shift the learning focus. Extensive offline and online experiments validate AdaSID. On two public benchmarks, AdaSID improves Recall and NDCG by about 4.5% on average over strong baselines, while improving codebook utilization and SID diversity. In Kuaishou e-commerce, an online A/B test on short-video retrieval covering tens of millions of users achieves statistically significant gains, including a 0.98% GMV improvement, and industrial ranking evaluation shows consistent AUC improvements.
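To make "collisions" and "codebook utilization" concrete, the sketch below counts items that share an identical SID and measures how many codes are in use at each level. It covers only exact-match collisions, not the "highly confusable" codes the abstract also mentions, and it assumes every level uses a codebook of the same size. The function name and the returned keys are hypothetical and do not come from the paper.

```python
from collections import Counter


def sid_stats(sids, codebook_size):
    """Summarize exact collisions and per-level code usage.

    sids          : list of equal-length tuples of ints, one SID per item
    codebook_size : number of codes per level (assumed equal for all levels)
    """
    counts = Counter(sids)
    n_items = len(sids)
    # Items whose full SID is shared with at least one other item.
    colliding = sum(c for c in counts.values() if c > 1)
    n_levels = len(sids[0])
    utilization = [
        len({sid[level] for sid in sids}) / codebook_size
        for level in range(n_levels)
    ]
    return {
        "collision_rate": colliding / n_items,
        "unique_sid_ratio": len(counts) / n_items,
        "utilization_per_level": utilization,
    }


if __name__ == "__main__":
    # Four hypothetical items with two-level SIDs; the first two collide.
    print(sid_stats([(0, 1), (0, 1), (2, 3), (1, 3)], codebook_size=4))
```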

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AdaSID adds a two-stage adaptive rule for relaxing or tightening SID collision penalties based on semantic compatibility and load, which helps codebook use and metrics in their tests but risks circularity in the compatibility step.

The main takeaway is that AdaSID makes collision handling in semantic ID learning adaptive rather than static. It first relaxes the repulsion force on overlaps when items appear semantically compatible from the multimodal features, then spreads the leftover penalty pressure according to how crowded each code is and how far training has progressed. This decides on the fly which overlaps to keep, how hard to push on the rest, and when to ease off toward pure recommendation loss. That combination is the actual new piece compared to the fixed regularization or quantization tweaks cited in the abstract.

The offline results on two public benchmarks show roughly 4.5% average lift in Recall and NDCG plus better codebook utilization, and the Kuaishou online test reports a 0.98% GMV gain with statistical significance. For teams already running large multimodal catalogs through retrieval or generative models, those numbers plus the industrial ranking AUC improvements are the practical signal.

The soft spot sits in the compatibility decision. Because it draws from the same multimodal signals that produce the IDs, the rule can end up preserving overlaps that simply match the model's own clustering rather than any external notion of admissible sharing. Without an independent anchor such as held-out human labels or a separate ontology, it is hard to separate genuine adaptation from lighter effective regularization that happens to suit their data distribution. The load-and-progress stage then builds on whatever the first stage passed through.

This paper is aimed at practitioners who already use or are building semantic ID pipelines at scale. A reader working on discrete representations for recommendation would get concrete value from the adaptive mechanism and the production-scale evidence. It deserves a serious referee because the problem is real, the empirical footprint is large, and the central claim is testable even if the compatibility step needs extra scrutiny.

Recommendation: send it for review and ask specifically for ablations that isolate each stage plus any external check on the compatibility labels.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdaSID, an adaptive framework for learning Semantic IDs (SIDs) from multimodal item signals in large-scale recommendation systems. It addresses SID collisions via a two-stage regulation process: (1) relaxing repulsion on observed overlaps when items are deemed semantically compatible (inferred from multimodal signals), and (2) dynamically allocating remaining regulation pressure according to local collision load and training progress. The method is claimed to improve codebook utilization and SID diversity while boosting downstream performance, with reported average gains of ~4.5% on Recall/NDCG over strong baselines on two public benchmarks and a statistically significant 0.98% GMV lift plus AUC gains in an industrial online A/B test on Kuaishou covering tens of millions of users.

Significance. If the two-stage adaptive mechanism can be shown to preserve admissible sharing without circularity or instability, AdaSID would advance scalable discrete item representations beyond static quantization or fixed regularization in multimodal recsys. The inclusion of large-scale online A/B testing with GMV and ranking metrics is a concrete strength that grounds practical relevance. The work could influence industrial pipelines for compact, semantically faithful IDs if the compatibility inference is externally validated.

major comments (2)

§3.2 (Adaptive Overlap Regulation): The relaxation rule for 'semantically compatible' overlaps is derived from the same multimodal signals used to construct the SIDs. This creates a load-bearing circularity risk: the decision to preserve an overlap may simply reflect the model's own embedding clusters rather than an independent semantic fact, which undermines the claim that AdaSID adaptively distinguishes admissible from harmful collisions. An external anchor (e.g., held-out ontology, human labels, or causal test) is needed to substantiate the central adaptive advantage.
§4.1–4.3 (Offline and Online Experiments): The reported 4.5% average Recall/NDCG gains and 0.98% GMV lift are presented without sufficient detail on exact baselines, ablation results isolating the two regulation stages, statistical significance tests, or controls for optimization instability from the dynamic pressure allocation. These omissions prevent verification that improvements arise from adaptive preservation rather than reduced effective regularization, which is central to the paper's contribution.

minor comments (2)

Abstract: The phrases 'strong baselines' and 'statistically significant gains' should be expanded with at least the names of the primary competing methods and the exact p-value thresholds used in the online test.
§3 (Notation and reproducibility): Define 'local collision load' and the hyperparameters controlling regulation allocation more explicitly (including any free parameters) to support replication of the dynamic allocation stage.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our work. We address each major comment point by point below, providing clarifications and indicating where revisions have been made to strengthen the manuscript.

Referee: §3.2 (Adaptive Overlap Regulation): The relaxation rule for 'semantically compatible' overlaps is derived from the same multimodal signals used to construct the SIDs. This creates a load-bearing circularity risk: the decision to preserve an overlap may simply reflect the model's own embedding clusters rather than an independent semantic fact, which undermines the claim that AdaSID adaptively distinguishes admissible from harmful collisions. An external anchor (e.g., held-out ontology, human labels, or causal test) is needed to substantiate the central adaptive advantage.

Authors: We acknowledge the referee's concern regarding potential circularity. In AdaSID, compatibility inference operates on raw multimodal feature embeddings prior to codebook quantization and SID assignment, using a separate similarity threshold computed across modalities that is independent of the discrete token outputs. This temporal and architectural separation allows the model to identify admissible overlaps based on continuous feature similarity rather than post-hoc code clusters. Nevertheless, we agree that external validation would provide stronger substantiation. In the revised manuscript, §3.2 has been expanded with a formal description of this separation and an additional ablation using a temporally held-out data split to validate compatibility decisions. Full human annotation or causal testing remains outside the scope of the present industrial-scale study.
Revision: partial

Referee: §4.1–4.3 (Offline and Online Experiments): The reported 4.5% average Recall/NDCG gains and 0.98% GMV lift are presented without sufficient detail on exact baselines, ablation results isolating the two regulation stages, statistical significance tests, or controls for optimization instability from the dynamic pressure allocation. These omissions prevent verification that improvements arise from adaptive preservation rather than reduced effective regularization, which is central to the paper's contribution.

Authors: We appreciate this observation and have revised the experimental sections accordingly. The updated §4.1–4.3 now include: (i) exhaustive specifications of all baselines with hyperparameter settings and implementation details, (ii) dedicated ablations that isolate the relaxation stage from the dynamic pressure allocation stage with incremental performance breakdowns, (iii) paired statistical significance tests (including p-values and confidence intervals) for all offline and online metrics, and (iv) controls for optimization stability comprising training dynamics plots, sensitivity analysis on the load-based allocation schedule, and comparisons against fixed-regularization variants. These additions confirm that the observed gains derive from the adaptive distinction of admissible overlaps rather than from an overall reduction in regularization strength.
Revision: yes
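The separation described in the first simulated response (compatibility decided on continuous embeddings before any codes are assigned) can be illustrated with a short sketch. It assumes cosine similarity per modality and a rule that a pair is compatible only when its weakest modality clears a threshold. The function names, the `min` aggregation, and the default `threshold` are hypothetical, since the page does not state how the cross-modality threshold is computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def compatible_pairs(embeddings_by_modality, pairs, threshold=0.8):
    """Flag item pairs as compatible using only pre-quantization embeddings.

    embeddings_by_modality : dict of modality name -> list of item vectors
    pairs                  : list of (i, j) item index pairs to check
    threshold              : assumed cutoff applied to the weakest modality
    """
    flags = []
    for i, j in pairs:
        weakest = min(
            cosine(emb[i], emb[j]) for emb in embeddings_by_modality.values()
        )
        flags.append(weakest >= threshold)
    return flags


if __name__ == "__main__":
    embeddings = {  # hypothetical two-dimensional features for three items
        "text": [[1, 0], [1, 0], [0, 1]],
        "image": [[1, 0], [0.9, 0.1], [1, 0]],
    }
    print(compatible_pairs(embeddings, [(0, 1), (0, 2)]))
```

No discrete code appears anywhere in this check, which is the point of the authors' reply. It does not by itself answer the referee's concern, because the same embeddings would still feed the quantizer that produces the SIDs.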

Standing simulated objections (not resolved)

Providing an external anchor such as human labels, a held-out ontology, or dedicated causal tests for semantic compatibility would require new data collection and annotation efforts beyond the resources and scope of the current study.

Circularity Check

0 steps flagged

No significant circularity detected; framework presented as empirical design without self-referential reductions

The abstract describes AdaSID's two-stage overlap regulation (relaxing repulsion for semantically compatible items, then allocating pressure by load and progress) but provides no equations, loss functions, or derivation steps. No self-citation chains, fitted parameters renamed as predictions, or ansatzes are quoted that would reduce the adaptive decisions to tautologies with the multimodal inputs. The method is validated via offline benchmarks and online A/B tests showing gains in Recall, NDCG, and GMV, indicating an independent empirical contribution rather than a closed loop. The potential dependence of compatibility inference on the same signals is a design assumption, not a demonstrated reduction by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about inferring semantic compatibility and the benefits of dynamic regulation; no free parameters or invented entities are explicitly listed in the abstract, but the approach implicitly depends on tunable regulation mechanisms.

free parameters (1)

regulation allocation parameters
Strength, timing, and thresholds for relaxing repulsion and allocating pressure are likely chosen or fitted during development.

axioms (1)

domain assumption: Semantic compatibility of items can be determined from multimodal signals to decide whether an overlap is admissible.
This underpins the first stage of relaxing repulsion for compatible overlaps.

pith-pipeline@v0.9.0 · 5614 in / 1285 out tokens · 43604 ms · 2026-05-08T05:37:30.933033+00:00 · methodology
