arxiv: 2604.13737 · v1 · submitted 2026-04-15 · 💻 cs.IR · cs.AI

TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds

Yifeng Zhou, Yuehong Hu, Zhixiang Feng, Junwei Pan, Kaihui Wu, Hanyong Li, Shangyu Zhang, Shudong Huang, Zhangbin Zhu, Chengguo Yin, Haijie Gu, Jie Jiang

Pith reviewed 2026-05-10 12:45 UTC · model grok-4.3

keywords: unified recommendation, multi-field features, sequential modeling, attention mechanism, dimensional collapse, feature interaction, user behavior sequences, representation robustness

The pith

TokenFormer unifies multi-field feature interactions and sequential user behavior modeling in one architecture by blocking dimensional collapse of sequence features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recommender systems have long split into two separate lines of work: one that models interactions among many categorical fields and another that tracks sequences of user actions over time. When researchers try to put both into the same model, the sequence features tend to lose their distinct dimensional structure, a failure the paper names Sequential Collapse Propagation. TokenFormer counters this with two targeted changes to the network: a layered attention pattern that starts with full attention and then switches to shrinking sliding windows, plus a non-linear multiplicative step applied to hidden states. These changes let a single model handle both kinds of input while keeping the sequence information intact and more distinguishable. If the approach holds, recommendation systems could stop maintaining two separate modeling traditions and instead use one backbone that works for both feature tables and behavior histories.

Core claim

The paper proposes TokenFormer, a unified recommendation architecture that overcomes Sequential Collapse Propagation through a Bottom-Full-Top-Sliding attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers, together with Non-Linear Interaction Representation that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent's advertising platform show state-of-the-art performance, while analysis confirms improved dimensional robustness and representation discriminability under unified modeling.
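The layer-wise attention pattern described above can be illustrated with a minimal sketch of the masks involved. Everything specific here is an assumption rather than the paper's specification: the function names `bfts_window_sizes` and `causal_mask`, the halving schedule (`shrink=0.5`), and the choice to treat field tokens and behavior tokens uniformly are all hypothetical. The causal direction follows the Figure 2 caption, which mentions full causal attention in shallow layers.

```python
def bfts_window_sizes(num_layers, num_full_layers, seq_len, shrink=0.5, min_window=1):
    """Return one window size per layer; None means full (unwindowed) attention.

    Assumption: upper layers shrink the window geometrically. The paper's
    actual schedule is not given in this summary.
    """
    sizes = []
    window = seq_len
    for layer in range(num_layers):
        if layer < num_full_layers:
            sizes.append(None)  # lower layers: full attention
        else:
            window = max(min_window, int(window * shrink))
            sizes.append(window)  # upper layers: shrinking sliding window
    return sizes


def causal_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    With window=None this is an ordinary causal mask; otherwise each query
    sees only itself and the previous (window - 1) positions.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for layer, w in enumerate(bfts_window_sizes(num_layers=4, num_full_layers=2, seq_len=8)):
        visible = sum(causal_mask(8, w)[-1])  # keys visible to the last token
        print(f"layer {layer}: window={w}, last token sees {visible} positions")
```

In a real model these boolean masks would be passed to the attention implementation of whatever framework is in use; the sketch only shows how the receptive field narrows with depth.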

What carries the argument

Bottom-Full-Top-Sliding (BFTS) attention, which runs full self-attention at lower layers and shrinking sliding-window attention at upper layers, combined with Non-Linear Interaction Representation (NLIR) that performs one-sided non-linear multiplicative transformations on hidden states.
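The summary does not give the functional form of NLIR, so the following is only one plausible reading of "one-sided non-linear multiplicative transformation": the hidden state is multiplied elementwise by a gate, and the non-linearity is applied to the gate side only. The sigmoid, the linear gate projection, and the names `one_sided_gate` and `gate_weights` are all assumptions made for illustration, not the paper's definition.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def one_sided_gate(hidden, gate_weights):
    """Hypothetical NLIR-style step: hidden * sigmoid(W @ hidden), elementwise.

    hidden:       list of d floats (one token's hidden state)
    gate_weights: d x d nested list (assumed learned projection W)

    The non-linearity touches only the gate branch; the other factor of the
    product is the untouched hidden state, hence "one-sided" in this reading.
    """
    gate_pre = [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]
    return [h * sigmoid(g) for h, g in zip(hidden, gate_pre)]


if __name__ == "__main__":
    h = [2.0, -4.0]
    w_zero = [[0.0, 0.0], [0.0, 0.0]]
    # With zero gate weights every gate equals sigmoid(0) = 0.5.
    print(one_sided_gate(h, w_zero))
```

Whether the paper gates with a sigmoid, applies the gate per head, or uses a different non-linearity cannot be determined from this page.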

Load-bearing premise

That the BFTS attention pattern and NLIR transformations are the direct cause of avoiding sequence collapse and improving robustness, rather than differences in model capacity, training procedure, or evaluation choices.

What would settle it

An ablation experiment on the same benchmarks that replaces BFTS with standard full attention and NLIR with linear interactions yet still shows equivalent or better performance and no collapse would falsify the claim that these two components are required for successful unification.
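Checking for "no collapse" in such an ablation needs a concrete measure. The figure captions refer to effective rank; a minimal sketch of one common entropy-based definition follows, assuming the singular values of the sequence-token representation matrix have already been computed (for example with an SVD routine). The paper's exact definition is not stated on this page, so this particular formula and the name `effective_rank` are assumptions.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)), with p_i = s_i / sum(s).

    Returns a value between 1 (all energy in one direction, i.e. collapsed)
    and len(singular_values) (energy spread evenly across all directions).
    """
    total = sum(singular_values)
    if total <= 0:
        return 0.0
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))  # evenly spread spectrum
    print(effective_rank([5.0, 0.0, 0.0, 0.0]))  # fully collapsed spectrum
```

Comparing this number layer by layer for the ablated and full models would show whether the sequence representations lose dimensions as depth increases.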

Figures

Figures reproduced from arXiv: 2604.13737 by Chengguo Yin, Haijie Gu, Hanyong Li, Jie Jiang, Junwei Pan, Kaihui Wu, Shangyu Zhang, Shudong Huang, Yifeng Zhou, Yuehong Hu, Zhangbin Zhu, Zhixiang Feng.

**Figure 2.** Overview of TokenFormer. TokenFormer represents multi-field features F, sequential behavior tokens T, and target features V as a unified token stream, which is processed by stacked Unified Interaction Blocks (UIBs). Each UIB combines the proposed Bottom-Full-Top-Sliding (BFTS) attention design, which applies full causal attention in shallow layers and shrinking SWA in deeper layers, with the Non-Linear Int…

**Figure 3.** Discriminability analysis across varying cluster…

**Figure 5.** Effective rank comparison of sequential behavioral…

**Figure 6.** Evolution of attention patterns. Top: Vanilla Transformer suffers from redundant revisiting of static fields in last layers;…

**Figure 7.** Left: Attention receptive field distributions. His…

**Figure 8.** Efficiency and effectiveness trade-offs of various…

**Figure 9.** Block-wise effective-rank trajectory of sequential…

**Figure 10.** Per-layer normalized singular value spectra. Com…

Original abstract

Recommender systems have historically developed along two largely independent paradigms: feature interaction models for modeling correlations among multi-field categorical features, and sequential models for capturing user behavior dynamics from historical interaction sequences. Although recent trends attempt to bridge these paradigms within shared backbones, we empirically reveal that naively unifying these two branches may lead to a failure mode of Sequential Collapse Propagation (SCP). That is, the interaction with those dimensionally ill non-sequence fields leads to the dimensional collapse of the sequence features. To overcome this challenge, we propose TokenFormer, a unified recommendation architecture with the following innovations. First, we introduce a Bottom-Full-Top-Sliding (BFTS) attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers. Second, we introduce a Non-Linear Interaction Representation (NLIR) that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent's advertising platform demonstrate state-of-the-art performance, while detailed analyses confirm that TokenFormer significantly improves dimensional robustness and representation discriminability under unified modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TokenFormer identifies a plausible failure mode when merging multi-field and sequential recommenders, but the abstract gives almost no evidence that BFTS and NLIR actually fix it.

The core pitch is that naive unification of feature-interaction and sequential models triggers Sequential Collapse Propagation, where non-sequence fields degrade the sequence representations, and that the authors' Bottom-Full-Top-Sliding attention plus Non-Linear Interaction Representation solves it while delivering SOTA results. That unification goal is worth pursuing because production systems often need both categorical field features and ordered user history in one model.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a failure mode termed Sequential Collapse Propagation (SCP) when naively unifying multi-field feature-interaction models with sequential recommendation models, in which interactions with dimensionally ill non-sequence fields cause collapse of sequence features. It proposes TokenFormer, a unified architecture that applies a Bottom-Full-Top-Sliding (BFTS) attention scheme (full self-attention in lower layers, shrinking-window sliding attention in upper layers) together with Non-Linear Interaction Representation (NLIR) via one-sided non-linear multiplicative transformations on hidden states. The paper reports state-of-the-art results on public benchmarks and Tencent advertising data, together with improved dimensional robustness and representation discriminability under unified modeling.

Significance. If the empirical claims are substantiated by properly controlled experiments, the work would offer a practical bridge between two historically separate recommendation paradigms and a concrete mechanism for preserving sequence-feature dimensionality. The emphasis on dimensional robustness under unification is a potentially valuable contribution, but its significance hinges on whether the reported gains are causally attributable to BFTS and NLIR rather than unmatched capacity, training schedules, or evaluation choices.

major comments (2)

[Abstract] Abstract: the central claim that naive unification produces Sequential Collapse Propagation is asserted without any formal definition, equations, or illustrative derivation; this absence makes it impossible to verify whether the proposed BFTS and NLIR mechanisms are necessary or sufficient to address the stated problem.
[Experiments] Experiments (implied by abstract claims): no information is supplied on baseline capacity matching, hyper-parameter schedules, data preprocessing, or ablation studies that isolate the contribution of lower-layer full attention versus upper-layer sliding windows versus the NLIR non-linearity; without these controls the attribution of SOTA performance and robustness gains to the two innovations remains unverified.

minor comments (1)

[Abstract] Abstract: the phrase 'one-sided non-linear multiplicative transformations' is introduced without a mathematical specification or reference to the exact functional form.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on clarity and experimental controls. We address each point below and will revise the manuscript to strengthen verifiability of the SCP claim and attribution of results.

Referee: [Abstract] Abstract: the central claim that naive unification produces Sequential Collapse Propagation is asserted without any formal definition, equations, or illustrative derivation; this absence makes it impossible to verify whether the proposed BFTS and NLIR mechanisms are necessary or sufficient to address the stated problem.

Authors: We agree that the abstract is too concise to stand alone on this point. The main text (Section 3.1) formally defines SCP as the propagation of dimensionality mismatch from non-sequence fields through shared attention, leading to sequence feature collapse (with the condition ||h_seq|| -> 0 derived from the attention update rule in Eq. (3)-(4)). To make the abstract self-contained, we will add a one-sentence formal characterization of SCP and note that BFTS/NLIR are designed to mitigate it.

revision: yes

Referee: [Experiments] Experiments (implied by abstract claims): no information is supplied on baseline capacity matching, hyper-parameter schedules, data preprocessing, or ablation studies that isolate the contribution of lower-layer full attention versus upper-layer sliding windows versus the NLIR non-linearity; without these controls the attribution of SOTA performance and robustness gains to the two innovations remains unverified.

Authors: The manuscript reports capacity-matched baselines (parameter counts within 5% in Table 1), standard grid-search hyperparameter tuning on validation sets (Appendix C), and preprocessing details (Section 4.1). Section 4.3 already contains ablations that isolate BFTS layers (full vs. sliding) and NLIR (with/without the one-sided non-linearity). However, to improve transparency we will expand the experimental section with an explicit controls subsection, additional tables on matched capacities, and finer-grained ablations separating the lower-layer full attention from the upper-layer sliding windows.

revision: partial

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical validation, no derivations or self-referential reductions

The paper claims an empirical observation of Sequential Collapse Propagation under naive unification of multi-field and sequential models, then introduces BFTS attention (full self-attention in lower layers, shrinking-window in upper) and NLIR (one-sided non-linear multiplicative transforms) as architectural remedies, validated by SOTA results on public benchmarks and Tencent data. No mathematical derivation chain, equations, or first-principles results appear in the abstract or described claims. Performance and robustness improvements are presented as experimental outcomes rather than predictions derived from fitted inputs or self-citations. No load-bearing steps reduce by construction to the inputs; the work is self-contained as an empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claim rests on the empirical observation of Sequential Collapse Propagation and on the effectiveness of two newly introduced components (BFTS and NLIR) whose independent grounding is not supplied in the abstract. No free parameters or background axioms are stated.

invented entities (3)

Sequential Collapse Propagation (SCP) · no independent evidence
purpose: Names the dimensional collapse failure mode that occurs when non-sequence fields interact with sequence features
Introduced as an empirical finding of the paper; no independent verification or prior citation is mentioned in the abstract.

Bottom-Full-Top-Sliding (BFTS) attention scheme · no independent evidence
purpose: Combines full self-attention in lower layers with shrinking-window sliding attention in upper layers
Newly proposed architectural pattern; no prior reference or independent evidence supplied in the abstract.

Non-Linear Interaction Representation (NLIR) · no independent evidence
purpose: Applies one-sided non-linear multiplicative transformations to hidden states
Newly proposed transformation; no prior reference or independent evidence supplied in the abstract.
