DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling
Pith reviewed 2026-06-28 04:37 UTC · model grok-4.3
The pith
DSIRM injects query-item interaction supervision into residual quantization to learn relevance-aware discrete semantic identifiers for e-commerce ranking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a query-bridged contrastive quantization procedure on the item side, together with LLM-based SID prediction on the query side, yields discrete relevance features whose hierarchical prefix matching improves ranking over purely unsupervised SIDs or dense embeddings alone.
What carries the argument
Query-bridged contrastive quantization inside residual quantization, which uses query-item pairs to supervise the formation of semantic partitions.
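As a rough illustration of the residual quantization step that the supervision is injected into, the sketch below assigns a multi-level code to a single item vector. It shows only the plain, unsupervised assignment; in DSIRM the query-item signal would shape the codebooks during training, which is not shown here. The function name `residual_quantize`, the toy codebooks, and the vector are all hypothetical and are not taken from the paper.

```python
import numpy as np

def residual_quantize(vec, codebooks):
    """Assign one code per level by repeatedly quantizing the leftover residual.

    vec:        1-D array of shape (d,)
    codebooks:  list of arrays, each of shape (K_level, d) (assumed layout)
    Returns (codes, reconstruction).
    """
    residual = np.asarray(vec, dtype=float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        # Pick the codeword closest to what is still unexplained.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return tuple(codes), recon

# Toy, hand-written codebooks (illustrative only): a coarse level, then a fine level.
toy_codebooks = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
]
sid, recon = residual_quantize(np.array([10.0, 11.0]), toy_codebooks)
print(sid, recon)
```

The returned tuple plays the role of a SID: the first entry is the coarsest partition and each later entry refines it, which is what makes prefix comparison meaningful downstream.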
If this is right
- Hierarchical prefix matching between predicted query SIDs and item SIDs supplies discriminative signals that complement dense embeddings.
- The hybrid architecture can be deployed efficiently while still improving offline AUC by 1.54 percent.
- Online deployment yields measurable lifts of 0.13 percent UCTR and 0.25 percent UCTCVR.
- Tail queries and ambiguous intents are handled by explicit LLM-based SID prediction rather than relying solely on unsupervised item clustering.
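The hierarchical prefix matching mentioned in the first point above can be sketched as a comparison of two code tuples, level by level. This is a minimal sketch under assumptions: the feature names `match_depth` and `level_flags`, and the three-level SIDs, are hypothetical, and the paper may derive its discrete features differently.

```python
def prefix_match_features(query_sid, item_sid):
    """Compare a predicted query SID with an item SID from the coarsest level down."""
    depth = 0
    for q_code, i_code in zip(query_sid, item_sid):
        if q_code != i_code:
            break
        depth += 1
    n_levels = min(len(query_sid), len(item_sid))
    # One binary flag per level: 1 if the two SIDs agree up to and including that level.
    flags = [1 if level < depth else 0 for level in range(n_levels)]
    return {"match_depth": depth, "level_flags": flags}

print(prefix_match_features((3, 7, 1), (3, 7, 5)))
# → {'match_depth': 2, 'level_flags': [1, 1, 0]}
```

Stopping at the first mismatch is deliberate: agreement at a deeper level carries no meaning if the coarser levels already differ, since each level only refines the partition chosen above it.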
Where Pith is reading between the lines
- The same query-bridged supervision pattern could be tested on non-e-commerce retrieval tasks where discrete codes are already in use.
- If the learned partitions prove stable across time, they might reduce the frequency of full re-quantization runs.
- Combining the discrete signals with other forms of weak supervision, such as click-through data from different surfaces, is a natural next measurement.
Load-bearing premise
The assumption that contrastive supervision from logged query-item interactions will produce partitions that reflect true relevance rather than merely fitting noise in the training distribution.
What would settle it
A controlled test in which the DSIRM-derived discrete features are added to the ranking model on a fresh, temporally held-out slice of Tmall logs and produce no AUC lift, or cause a regression in online metrics.
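The offline half of such a test could be set up roughly as follows: keep only rows logged after a cutoff, then compare AUC for scores with and without the discrete features. This is a sketch under assumptions. The row fields (`day`, `label`, `base_score`, `dsirm_score`) and the toy data are hypothetical, and the pairwise AUC is quadratic in the number of rows, so it only suits small samples.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly, ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_lift_on_holdout(rows, cutoff_day, base_key, new_key):
    """AUC difference (new minus base) on rows logged strictly after cutoff_day."""
    held_out = [r for r in rows if r["day"] > cutoff_day]
    labels = [r["label"] for r in held_out]
    base_auc = auc(labels, [r[base_key] for r in held_out])
    new_auc = auc(labels, [r[new_key] for r in held_out])
    return new_auc - base_auc

# Toy rows (made up for illustration): day 1 is training-era data and is ignored.
toy_rows = [
    {"day": 1, "label": 1, "base_score": 0.1, "dsirm_score": 0.1},
    {"day": 2, "label": 1, "base_score": 0.6, "dsirm_score": 0.9},
    {"day": 2, "label": 0, "base_score": 0.7, "dsirm_score": 0.3},
    {"day": 3, "label": 1, "base_score": 0.8, "dsirm_score": 0.8},
    {"day": 3, "label": 0, "base_score": 0.2, "dsirm_score": 0.2},
]
lift = auc_lift_on_holdout(toy_rows, cutoff_day=1, base_key="base_score", new_key="dsirm_score")
print(lift)
```

A lift at or below zero on the held-out slice would be the negative outcome described above; a real evaluation would also need confidence intervals, which this sketch omits.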
Original abstract
Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturing fine-grained attribute distinctions. While discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it more difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. To address the issue of unsupervised SIDs, we propose to explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). Specifically, we present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. On the other hand, we explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), proving its massive industrial value.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DSIRM, which learns discrete semantic identifiers (SIDs) for items via a query-bridged contrastive residual quantization approach that injects query-item interaction supervision into the quantization process. On the query side, it uses generative LLMs to predict item SIDs from text. Hierarchical prefix matching between query and item SIDs produces features that complement dense embeddings for e-commerce relevance ranking. The central empirical claim is a +1.54% offline AUC lift on Tmall production data together with online lifts of +0.13% UCTR and +0.25% UCTCVR after deployment in a hybrid architecture.
Significance. If the reported gains are robust to the controls described in the full experiments, the work would demonstrate a practical route to making discrete identifiers relevance-aware rather than purely unsupervised, which is a recurring limitation in SID-based retrieval systems. The combination of contrastive supervision, LLM-based query SID prediction, and efficient prefix matching, together with production deployment results, would constitute a concrete advance for industrial e-commerce search.
minor comments (2)
- [Abstract] The reported +1.54% AUC improvement would be easier to evaluate if the abstract briefly named the primary baseline model and noted that the lift is measured on a held-out production slice with ablations isolating the contrastive term.
- [Experiments] The loss formulation and hierarchical matching procedure are described clearly, but a short table summarizing the ablation variants (with and without the contrastive bridge) would improve readability of the experimental claims.
Simulated Author's Rebuttal
We thank the referee for the positive summary of DSIRM, the assessment of its significance for relevance-aware discrete identifiers, and the recommendation of minor revision. We note that no specific major comments were provided in the report.
Circularity Check
No significant circularity detected
The paper introduces a query-bridged contrastive residual quantization method plus LLM-based query SID prediction, with the loss explicitly combining reconstruction and contrastive terms. Ablations isolate the contribution of the contrastive bridge, and gains are measured on held-out production slices. No equations, self-citations, or fitted parameters are shown to reduce any reported prediction or partition to the input data by construction. The approach adds external supervision signals and is evaluated against independent benchmarks, so the derivation is self-contained.
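To make the shape of a loss that combines reconstruction and contrastive terms concrete, here is a minimal NumPy sketch. It is not the paper's formulation, which is not reproduced on this page: the in-batch InfoNCE term is a standard stand-in, and `temperature`, `weight`, and both function names are assumptions. Gradient handling through the quantizer (for example a straight-through estimator) is also left out.

```python
import numpy as np

def info_nce(query_emb, item_emb, temperature=0.1):
    """In-batch InfoNCE: row i of query_emb is treated as the positive for row i of item_emb."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    v = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    logits = q @ v.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def combined_loss(item_emb, item_quantized, query_emb, weight=1.0):
    """Reconstruction error of the quantizer plus a weighted query-item contrastive term."""
    recon = float(np.mean(np.sum((item_emb - item_quantized) ** 2, axis=1)))
    contrast = info_nce(query_emb, item_quantized)
    return recon + weight * contrast

# Toy check: queries aligned with their own items score a lower loss than shuffled pairs.
eye = np.eye(3)
aligned = info_nce(eye, eye)
shuffled = info_nce(eye, np.roll(eye, 1, axis=0))
print(aligned < shuffled)
```

The reconstruction term keeps the quantized vector close to the original item embedding, while the contrastive term pulls it toward queries that interacted with the item; `weight` trades the two off.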