pith. sign in

arxiv: 2606.21911 · v1 · pith:BM7IK3DAnew · submitted 2026-06-20 · 💻 cs.IR · cs.LG

The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders

Pith reviewed 2026-06-26 11:34 UTC · model grok-4.3

classification 💻 cs.IR cs.LG
keywords popularity biastransformer recommendersmodel scalingspectral collapselong-tail fairnesssequential recommendationattention mechanisms
0
0 comments X

The pith

Scaling transformer recommenders amplifies popularity bias because deeper attention and feed-forward layers cause spectral collapse in predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that increasing the depth of transformer-based sequential recommenders improves accuracy but simultaneously worsens popularity bias by favoring popular items over niche ones. It traces the amplification to a synergistic effect between attention aggregation and feed-forward projections that produces spectral collapse in the output predictions. The authors introduce SPRINT to counteract this by limiting the maximum column sums of attention score matrices and the spectral norms of feed-forward parameters. Experiments show that this approach preserves long-tail fairness and yields improved accuracy even when model size grows from 0.05 million to 0.34 billion parameters.

Core claim

As model depth increases, the two core components of the transformer architecture, attention aggregation and feed-forward projections, synergistically induce severe spectral collapse in model predictions, which directly translates to the amplification of popularity bias. SPRINT mitigates spectral collapse during scaling by constraining the maximum column-sums of the attention score matrices and the spectral norms of the feed-forward parameters, resulting in better accuracy and long-tail fairness.

What carries the argument

SPRINT regularization, which constrains maximum column-sums of attention score matrices and spectral norms of feed-forward parameters to block spectral collapse.

If this is right

  • Larger transformer recommenders can be trained without proportional increases in popularity bias.
  • Long-tail items receive more balanced exposure as model capacity grows.
  • Recommendation ecosystems experience reduced reinforcement of the Matthew effect.
  • Both accuracy and fairness metrics improve together when the constraints are applied during scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral-collapse mechanism could be checked in non-sequential or non-transformer recommenders to test generality.
  • Monitoring the column sums of attention matrices during training may serve as an early warning for emerging bias.
  • If spectral collapse proves causal, similar norm-based constraints might stabilize scaling in other attention-heavy models such as language models used for ranking.

Load-bearing premise

Spectral collapse induced specifically by attention and feed-forward components is the root cause of bias amplification rather than other factors like data distribution or training dynamics.

What would settle it

Training a series of deeper transformers while measuring prediction spectral properties and bias metrics, then checking whether bias still rises when the proposed column-sum and spectral-norm constraints are enforced.

Figures

Figures reproduced from arXiv: 2606.21911 by Can Wang, Chongming Gao, Jiawei Chen, Sheng Zhou, Weiqin Yang, Xiang Wang, Yue Pan.

Figure 1
Figure 1. Figure 1: Scaling laws of accuracy (NDCG@5) and fairness [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of increasingly severe spectral collapse [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Popular items tend to receive disproportionately [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Layer scaling results of SPRINT and representative baselines on the SASRec++ backbone, where the number of layers [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pareto frontier of accuracy and fairness for SPRINT and representative baselines on the SASRec++ backbone. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study investigating the impact of atten [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Runtime comparison (seconds per epoch) between [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Layer scaling results of SPRINT and representative baselines on the SASRec++ backbone, where the number of layers [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Hyperparameter sensitivity analysis of SPRINT on the SASRec++ backbone with respect to the attention regularization [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
read the original abstract

We identify a critical pitfall in scaling transformer-based sequential recommenders: while increasing model size improves recommendation accuracy, it simultaneously amplifies popularity bias. This bias drives systems to over-recommend popular items at the expense of niche ones, which not only undermines fairness but also degrades the broader ecosystem by reinforcing the Matthew effect and filter bubbles. Consequently, this bias amplification emerges as a fundamental obstacle to sustainable model scaling. Through comprehensive theoretical and empirical analyses, we uncover the root cause of this amplification. Our findings reveal that as model depth increases, the two core components of the transformer architecture, i.e., attention aggregation and feed-forward projections, synergistically induce severe spectral collapse in model predictions, which directly translates to the amplification of popularity bias. To address this challenge, we propose SPRINT (Scalable Popularity Regularization IN Transformers), which mitigates spectral collapse during scaling by constraining (i) the maximum column-sums of the attention score matrices and (ii) the spectral norms of the feed-forward parameters. Extensive experiments demonstrate that SPRINT significantly improves both accuracy and long-tail fairness. Crucially, it yields more favorable scaling behaviors when expanding model sizes from 0.05M to 0.34B parameters. The code is available at https://github.com/Tiny-Snow/GenRec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that scaling transformer-based sequential recommenders improves accuracy but amplifies popularity bias, with the root cause being synergistic spectral collapse induced by attention aggregation and feed-forward projections as model depth increases. It proposes SPRINT to mitigate this via constraints on the maximum column-sums of attention score matrices and the spectral norms of feed-forward parameters, reporting improved accuracy, long-tail fairness, and more favorable scaling behavior from 0.05M to 0.34B parameters, with code released.

Significance. If the causal mechanism is rigorously established, the work would be significant for recommender systems by identifying a depth-dependent scaling obstacle and providing targeted regularization that jointly benefits accuracy and fairness at large scales. The open-sourced code supports reproducibility and is a clear strength.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis section: the assertion that spectral collapse 'directly translates' to popularity bias amplification is load-bearing for the central claim, yet the argument links depth-induced spectral properties to bias metrics via observed correlation rather than a closed-form derivation showing necessity (e.g., no explicit mapping from collapsed eigenvalues of the prediction operator to the popularity skew metric). This leaves open the possibility that other factors (optimization dynamics, embedding geometry) drive the bias.
  2. [SPRINT method] SPRINT method description: the proposed constraints on max column-sums of attention matrices and spectral norms of FFN weights are presented as addressing the synergistic root cause, but without a derivation showing how these bounds specifically prevent the collapse that produces the popularity metric (as opposed to generic regularization), the targeting of the mechanism remains under-supported.
minor comments (2)
  1. [Abstract] Abstract: states 'comprehensive theoretical and empirical analyses' but supplies no equations, dataset names, or quantitative results; adding one or two key equations or headline metrics would improve clarity.
  2. [Experiments] Experiments section: while scaling from 0.05M to 0.34B parameters is highlighted, specific dataset statistics, exact long-tail fairness metrics, and baseline implementations should be detailed to allow direct replication of the scaling curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work identifying spectral collapse as a scaling obstacle in transformer recommenders and proposing SPRINT. We address the two major comments below, providing clarifications on the theoretical linkage and method motivation while noting where the manuscript can be strengthened for clarity.

read point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis section: the assertion that spectral collapse 'directly translates' to popularity bias amplification is load-bearing for the central claim, yet the argument links depth-induced spectral properties to bias metrics via observed correlation rather than a closed-form derivation showing necessity (e.g., no explicit mapping from collapsed eigenvalues of the prediction operator to the popularity skew metric). This leaves open the possibility that other factors (optimization dynamics, embedding geometry) drive the bias.

    Authors: We appreciate this observation. Section 3 derives the synergistic effect of attention column aggregation and FFN projections on the eigenvalues of the effective prediction operator, showing progressive rank collapse with depth. This is then linked to popularity bias because the resulting low-rank operator disproportionately weights high-norm popular items under the recommendation softmax. While a fully closed-form necessity mapping from specific eigenvalues to the exact popularity skew metric is not derived (as it would require strong distributional assumptions on item popularities that do not generalize), we isolate the spectral mechanism via controlled ablations that hold optimization and embeddings fixed. We will revise the manuscript to expand the logical chain from operator spectrum to bias metric and add explicit discussion ruling out confounding factors. revision: partial

  2. Referee: [SPRINT method] SPRINT method description: the proposed constraints on max column-sums of attention matrices and spectral norms of FFN weights are presented as addressing the synergistic root cause, but without a derivation showing how these bounds specifically prevent the collapse that produces the popularity metric (as opposed to generic regularization), the targeting of the mechanism remains under-supported.

    Authors: The constraints in SPRINT are derived from the spectral analysis: the max column-sum bound on attention scores limits the aggregation operator's ability to concentrate mass and induce collapse, while the FFN spectral-norm bound directly caps the amplification of low-rank components in the projection. This is not generic regularization; ablations in the paper show that alternative regularizers (e.g., weight decay alone) fail to preserve spectrum or fairness at scale, whereas SPRINT maintains both. We will add a short derivation sketch in Section 4 explicitly connecting each bound to the eigenvalues of the composite operator to make the targeting clearer. revision: partial

Circularity Check

0 steps flagged

No circularity; theoretical link and new regularizers are independent of fitted outputs

full rationale

The paper's central claim rests on a theoretical analysis showing that depth-induced spectral collapse in attention and FFN components translates to popularity bias amplification, followed by introduction of SPRINT constraints on column-sums and spectral norms. These constraints are newly proposed regularizers, not quantities defined from or fitted to the target popularity metric. No self-citation is invoked as a load-bearing uniqueness theorem, no parameter is fitted to a data subset and then relabeled a prediction, and no ansatz is smuggled via prior work. The derivation chain is self-contained against external benchmarks and does not reduce by construction to its inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Based on abstract only; full theoretical derivation and experimental setup unavailable, so ledger is necessarily incomplete.

free parameters (1)
  • constraint strengths for attention column-sums and FFN spectral norms
    Likely hyperparameters tuned during experiments to achieve reported accuracy-fairness trade-offs.
axioms (1)
  • domain assumption Spectral collapse in predictions can be directly measured via attention and FFN properties and is causally linked to popularity bias
    Central premise of the theoretical analysis described in the abstract.

pith-pipeline@v0.9.1-grok · 5784 in / 1170 out tokens · 28622 ms · 2026-06-26T11:34:13.271828+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

101 extracted references · 17 linked inside Pith

  1. [1]

    Sanjeev Arora, Nadav Cohen, and Elad Hazan. 2018. On the optimization of deep networks: Implicit acceleration by overparameterization. InInternational conference on machine learning. PMLR, 244–253

  2. [2]

    Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. 2019. Implicit regular- ization in deep matrix factorization.Advances in neural information processing systems32 (2019)

  3. [3]

    Arlen Brown, Paul R Halmos, and Allen L Shields. 1965. Cesaro operators.Acta Sci. Math.(Szeged)26, 125-137 (1965), 81–82

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

  5. [5]

    Zheng Chai, Qin Ren, Xijun Xiao, Huizhi Yang, Bo Han, Sijun Zhang, Di Chen, Hui Lu, Wenlin Zhao, Lele Yu, et al . 2025. Longer: Scaling up long sequence modeling in industrial recommenders. InProceedings of the Nineteenth ACM Conference on Recommender Systems. 247–256

  6. [6]

    Jianxin Chang, Chen Gao, Yu Zheng, Yiqun Hui, Yanan Niu, Yang Song, Depeng Jin, and Yong Li. 2021. Sequential recommendation with graph neural networks. InProceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 378–387

  7. [7]

    Jiawei Chen, Hande Dong, Yang Qiu, Xiangnan He, Xin Xin, Liang Chen, Guli Lin, and Keping Yang. 2021. AutoDebias: Learning to debias for recommendation. InProceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 21–30

  8. [8]

    Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. 2023. Bias and debias in recommender system: A survey and future directions. ACM Transactions on Information Systems41, 3 (2023), 1–39

  9. [9]

    Ruijun Chen, Chongming Gao, Jiawei Chen, Weiqin Yang, and Xiangnan He

  10. [10]

    Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation.arXiv preprint arXiv:2605.04559(2026)

  11. [11]

    Sirui Chen, Jiawei Chen, Sheng Zhou, Bohao Wang, Shen Han, Chanfei Su, Yuqing Yuan, and Can Wang. 2024. SIGformer: Sign-aware graph transformer for recommendation. InProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval. 1274–1284

  12. [12]

    Sirui Chen, Shen Han, Jiawei Chen, Binbin Hu, Sheng Zhou, Gang Wang, Yan Feng, Chun Chen, and Can Wang. 2025. Rankformer: A graph transformer for recommendation based on ranking objective. InProceedings of the ACM on Web Conference 2025. 3037–3048

  13. [13]

    Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. 2025. Arrows of math reasoning data synthesis for large language models: Diversity, complexity and correctness. InProceedings of the 34th ACM International Conference on Information and Knowledge Management. 4665–4669

  14. [14]

    Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. InProceedings of the eleventh ACM international conference on web search and data mining. 108–116

  15. [15]

    Erica Coppolillo, Marco Minici, Ettore Ritacco, Luciano Caroprese, Francesco Pisani, and Giuseppe Manco. 2024. Balanced Quality Score: Measuring Popularity Debiasing in Recommendation.ACM Transactions on Intelligent Systems and Technology15, 4 (2024), 1–27

  16. [16]

    Yu Cui, Feng Liu, Jiawei Chen, Canghong Jin, Xingyu Lou, Changwang Zhang, Jun Wang, Yuegang Sun, and Can Wang. 2025. HatLLM: Hierarchical Attention Masking for Enhanced Collaborative Modeling in LLM-based Recommendation. arXiv preprint arXiv:2510.10955(2025)

  17. [17]

    Yu Cui, Feng Liu, Jiawei Chen, Xingyu Lou, Changwang Zhang, Jun Wang, Yue- gang Sun, Xiaohu Yang, and Can Wang. 2026. Field matters: A lightweight LLM-enhanced method for CTR prediction. InProceedings of the ACM Web Con- ference 2026. 6365–6376

  18. [18]

    Yu Cui, Feng Liu, Pengbo Wang, Bohao Wang, Heng Tang, Yi Wan, Jun Wang, and Jiawei Chen. 2024. Distillation matters: empowering sequential recommenders to match the performance of large language models. InProceedings of the 18th ACM Conference on Recommender Systems. 507–517

  19. [19]

    Yu Cui, Feng Liu, Zhaoxiang Wang, Changwang Zhang, Jun Wang, Can Wang, and Jiawei Chen. 2026. SpecTran: Spectral-Aware Transformer-based Adapter for LLM-Enhanced Sequential Recommendation.arXiv preprint arXiv:2601.21986 (2026)

  20. [20]

    Tri Dao. 2024. Flashattention-2: Faster attention with better parallelism and work partitioning. InInternational Conference on Learning Representations, Vol. 2024. 35549–35562

  21. [21]

    Sihao Ding, Peng Wu, Fuli Feng, Yitong Wang, Xiangnan He, Yong Liao, and Yongdong Zhang. 2022. Addressing unmeasured confounder for recommendation with sensitivity analysis. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 305–315

  22. [22]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv e-prints(2024), arXiv–2407

  23. [23]

    Gene H Golub and Henk A Van der Vorst. 2000. Eigenvalue computation in the 20th century.J. Comput. Appl. Math.123, 1-2 (2000), 35–65

  24. [24]

    2013.Matrix computations

    Gene H Golub and Charles F Van Loan. 2013.Matrix computations. JHU press

  25. [25]

    Alois Gruson, Praveen Chandar, Christophe Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, and Ben Carterette. 2019. Offline evalu- ation to make decisions about playlistrecommendation algorithms. InProceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 420–428

  26. [26]

    Wei Guo, Hao Wang, Luankang Zhang, Jin Yao Chin, Zhongzhou Liu, Kai Cheng, Qiushi Pan, Yi Quan Lee, Wanqi Xue, Tingjia Shen, et al. 2024. Scaling new fron- tiers: Insights into large recommendation models.arXiv preprint arXiv:2412.00714 (2024)

  27. [27]

    Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2023. On the embedding collapse when scaling up recommendation models. arXiv preprint arXiv:2310.04400(2023)

  28. [28]

    Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, and Jiawei Chen. 2025. Rethinking entropy interventions in rlvr: An entropy change perspective.arXiv preprint arXiv:2510.10150(2025)

  29. [29]

    Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. 2026. ReCreate: Reasoning and Creating Domain Agents Driven by Experience.arXiv preprint arXiv:2601.11100 (2026)

  30. [30]

    Zhezheng Hao, Tianfu Wang, Huanshuo Dong, Ziyan Liu, Hong Wang, Xi- ankun Lin, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. 2026. Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems. arXiv:2605.29790 [cs.MA] https://arxiv.org/abs/2605.29790

  31. [31]

    F Maxwell Harper and Joseph A Konstan. 2015. The movielens datasets: History and context.Acm transactions on interactive intelligent systems (tiis)5, 4 (2015), 1–19

  32. [32]

    Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. Inproceedings of the 25th international conference on world wide web. 507–517

  33. [33]

    Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk

  34. [34]

    Session-based recommendations with recurrent neural networks.arXiv preprint arXiv:1511.06939(2015)

  35. [35]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556(2022)

  36. [36]

    Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. 2021. Understanding dimensional collapse in contrastive self-supervised learning.arXiv preprint arXiv:2110.09348(2021)

  37. [37]

    Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recom- mendation. In2018 IEEE international conference on data mining (ICDM). IEEE, 197–206

  38. [38]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361(2020)

  39. [39]

    Anastasiia Klimashevskaia, Dietmar Jannach, Mehdi Elahi, and Christoph Trat- tner. 2024. A survey on popularity bias in recommender systems.User Modeling and User-Adapted Interaction34, 5 (2024), 1777–1834

  40. [40]

    Haoxuan Li, Yanghao Xiao, Chunyuan Zheng, and Peng Wu. 2023. Balancing un- observed confounding with a few unbiased ratings in debiased recommendations. InProceedings of the ACM web conference 2023. 1305–1313

  41. [41]

    Zihao Li, Yakun Chen, Tong Zhang, and Xianzhi Wang. 2025. Reembedding and Reweighting are Needed for Tail Item Sequential Recommendation. InProceedings of the ACM on Web Conference 2025. 4925–4936

  42. [42]

    Siyi Lin, Chongming Gao, Jiawei Chen, Sheng Zhou, Binbin Hu, Yan Feng, Chun Chen, and Can Wang. 2025. How do recommendation models amplify popularity bias? An analysis from the spectral perspective. InProceedings of the Eighteenth ACM International Conference on Web Search and Data Mining. 659–668

  43. [43]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Cheng- gang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437(2024)

  44. [44]

    Kanglin Liu, Wenming Tang, Fei Zhou, and Guoping Qiu. 2019. Spectral regu- larization for combating mode collapse in gans. InProceedings of the IEEE/CVF international conference on computer vision. 6382–6390

  45. [45]

    Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101(2017)

  46. [46]

    Sijin Lu, Zhibo Man, Fangyuan Luo, and Jun Wu. 2025. Dual Debiasing in LLM-based Recommendation. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2685–2689

  47. [47]

    Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel

  48. [48]

    InProceedings of the 38th international ACM SIGIR conference on research and development in information retrieval

    Image-based recommendations on styles and substitutes. InProceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. 43–52

  49. [49]

    Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normalization for generative adversarial networks.arXiv preprint The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders KDD 2026, August 9–13, 2026, Jeju Island, Republic of Korea. arXiv:1802.05957(2018)

  50. [50]

    Wentao Ning, Reynold Cheng, Xiao Yan, Ben Kao, Nan Huo, Nur Al Hasan Haldar, and Bo Tang. 2024. Debiasing recommendation with personal popularity. In Proceedings of the ACM Web Conference 2024. 3400–3409

  51. [51]

    Naoto Ohsaka and Riku Togashi. 2023. Curse of" low" dimensionality in recom- mender systems. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 537–547

  52. [52]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research21, 140 (2020), 1–67

  53. [53]

    Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al

  54. [54]

    Recommender systems with generative retrieval.Advances in Neural Information Processing Systems36 (2023), 10299–10315

  55. [55]

    William T Ross. 2022. The Cesaro operator.arXiv preprint arXiv:2210.08091 (2022)

  56. [56]

    Tingjia Shen, Hao Wang, Chuhan Wu, Jin Yao Chin, Wei Guo, Yong Liu, Huifeng Guo, Defu Lian, Ruiming Tang, and Enhong Chen. 2024. Optimizing sequential recommendation models with scaling laws and approximate entropy.arXiv preprint arXiv:2412.00430(2024)

  57. [57]

    Wentao Shi, Xiangnan He, Yang Zhang, Chongming Gao, Xinyue Li, Jizhi Zhang, Qifan Wang, and Fuli Feng. 2024. Large language models are learnable planners for long-term recommendation. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1893–1903

  58. [58]

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing568 (2024), 127063

  59. [59]

    Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang

  60. [60]

    InProceedings of the 28th ACM international conference on information and knowledge management

    BERT4Rec: Sequential recommendation with bidirectional encoder rep- resentations from transformer. InProceedings of the 28th ACM international conference on information and knowledge management. 1441–1450

  61. [61]

    Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommenda- tion via convolutional sequence embedding. InProceedings of the eleventh ACM international conference on web search and data mining. 565–573

  62. [62]

    Viet Anh Tran, Guillaume Salha-Galvan, Bruno Sguerra, and Romain Hennequin

  63. [63]

    InProceed- ings of the 46th international ACM SIGIR conference on research and development in information retrieval

    Attention mixtures for time-aware sequential recommendation. InProceed- ings of the 46th international ACM SIGIR conference on research and development in information retrieval. 1821–1826

  64. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

  65. [65]

    Bohao Wang, Jiawei Chen, Changdong Li, Sheng Zhou, Qihao Shi, Yang Gao, Yan Feng, Chun Chen, and Can Wang. 2024. Distributionally robust graph- based recommendation system. InProceedings of the ACM web conference 2024. 3777–3788

  66. [66]

    Bohao Wang, Jiawei Chen, Feng Liu, Changwang Zhang, Jun Wang, Canghong Jin, Chun Chen, and Can Wang. 2026. Does LLM Focus on the Right Words? Mitigating Context Bias in LLM-based Recommenders. InProceedings of the ACM Web Conference 2026. 6688–6699

  67. [67]

    Bohao Wang, Yu Cui, Zhenxiang Xu, Jujia Zhao, Chenxiao Fan, Jizhi Zhang, Weiqin Yang, Shengjia Zhang, Sirui Chen, Yang Zhang, Xiaoyan Zhao, Wenjie Wang, Chongming Gao, Fuli Feng, Xiangnan He, and Jiawei Chen. 2026. Trust- worthy Recommendation in the Era of Large Language Models: Opportunities and Challenges. (2026)

  68. [68]

    Bohao Wang, Feng Liu, Jiawei Chen, Xingyu Lou, Changwang Zhang, Jun Wang, Yuegang Sun, Yan Feng, Chun Chen, and Can Wang. 2025. Msl: Not all tokens are what you need for tuning llm as a recommender. InProceedings of the 48th international ACM SIGIR conference on research and development in information retrieval. 1912–1922

  69. [69]

    Bohao Wang, Feng Liu, Changwang Zhang, Jiawei Chen, Yudi Wu, Sheng Zhou, Xingyu Lou, Jun Wang, Yan Feng, Chun Chen, et al. 2025. Llm4dsr: Leveraging large language model for denoising sequential recommendation.ACM Transac- tions on Information Systems44, 1 (2025), 1–32

  70. [70]

    Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, and Jiawei Chen. 2025. Scheduling Your LLM Reinforcement Learning with Reasoning Trees.arXiv preprint arXiv:2510.24832(2025)

  71. [71]

    Lei Wang, Chen Ma, Xian Wu, Zhaopeng Qiu, Yefeng Zheng, and Xu Chen. 2024. Causally debiased time-aware recommendation. InProceedings of the ACM Web Conference 2024. 3331–3342

  72. [72]

    Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See- Kiong Ng, and Tat-Seng Chua. 2024. Learnable item tokenization for generative recommendation. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management. 2400–2409

  73. [73]

    Wenjie Wang, Fuli Feng, Xiangnan He, Xiang Wang, and Tat-Seng Chua. 2021. Deconfounded recommendation for alleviating bias amplification. InProceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 1717– 1725

  74. [74]

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al

  75. [75]

    Emergent abilities of large language models.arXiv preprint arXiv:2206.07682 (2022)

  76. [76]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837

  77. [77]

    Tianxin Wei, Fuli Feng, Jiawei Chen, Ziwei Wu, Jinfeng Yi, and Xiangnan He

  78. [78]

    InProceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining

    Model-agnostic counterfactual reasoning for eliminating popularity bias in recommender system. InProceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 1791–1800

  79. [79]

    Songpei Xu, Shijia Wang, Da Guo, Xianwen Guo, Qiang Xiao, Bin Huang, Guanlin Wu, and Chuanjiang Luo. 2025. Climber: Toward efficient scaling laws for large recommendation models. InProceedings of the 34th ACM International Conference on Information and Knowledge Management. 6193–6200

  80. [80]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

Showing first 80 references.