Recognition: unknown
TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds
Pith reviewed 2026-05-10 12:45 UTC · model grok-4.3
The pith
TokenFormer unifies multi-field feature interactions and sequential user behavior modeling in one architecture by blocking dimensional collapse of sequence features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes TokenFormer, a unified recommendation architecture that overcomes Sequential Collapse Propagation through a Bottom-Full-Top-Sliding attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers, together with Non-Linear Interaction Representation that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent's advertising platform show state-of-the-art performance, while analysis confirms improved dimensional robustness and representation discriminability under unified modeling.
What carries the argument
Bottom-Full-Top-Sliding (BFTS) attention, which runs full self-attention at lower layers and shrinking sliding-window attention at upper layers, combined with Non-Linear Interaction Representation (NLIR) that performs one-sided non-linear multiplicative transformations on hidden states.
Load-bearing premise
That the BFTS attention pattern and NLIR transformations are the direct cause of avoiding sequence collapse and improving robustness, rather than differences in model capacity, training procedure, or evaluation choices.
What would settle it
An ablation experiment on the same benchmarks that replaces BFTS with standard full attention and NLIR with linear interactions yet still shows equivalent or better performance and no collapse would falsify the claim that these two components are required for successful unification.
Figures
read the original abstract
Recommender systems have historically developed along two largely independent paradigms: feature interaction models for modeling correlations among multi-field categorical features, and sequential models for capturing user behavior dynamics from historical interaction sequences. Although recent trends attempt to bridge these paradigms within shared backbones, we empirically reveal that naive unifying these two branches may lead to a failure mode of Sequential Collapse Propagation (SCP). That is, the interaction with those dimensionally ill non-sequence fields leads to the dimensional collapse of the sequence features. To overcome this challenge, we propose TokenFormer, a unified recommendation architecture with the following innovations. First, we introduce a Bottom-Full-Top-Sliding (BFTS) attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers. Second, we introduce a Non-Linear Interaction Representation (NLIR) that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent's advertising platform demonstrate state-of-the-art performance, while detailed analysis confirm that TokenFormer significantly improves dimensional robustness and representation discriminability under unified modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies a failure mode termed Sequential Collapse Propagation (SCP) when naively unifying multi-field feature-interaction models with sequential recommendation models, in which interactions with dimensionally ill non-sequence fields cause collapse of sequence features. It proposes TokenFormer, a unified architecture that applies a Bottom-Full-Top-Sliding (BFTS) attention scheme (full self-attention in lower layers, shrinking-window sliding attention in upper layers) together with Non-Linear Interaction Representation (NLIR) via one-sided non-linear multiplicative transformations on hidden states. The paper reports state-of-the-art results on public benchmarks and Tencent advertising data, together with improved dimensional robustness and representation discriminability under unified modeling.
Significance. If the empirical claims are substantiated by properly controlled experiments, the work would offer a practical bridge between two historically separate recommendation paradigms and a concrete mechanism for preserving sequence-feature dimensionality. The emphasis on dimensional robustness under unification is a potentially valuable contribution, but its significance hinges on whether the reported gains are causally attributable to BFTS and NLIR rather than unmatched capacity, training schedules, or evaluation choices.
major comments (2)
- [Abstract] Abstract: the central claim that naive unification produces Sequential Collapse Propagation is asserted without any formal definition, equations, or illustrative derivation; this absence makes it impossible to verify whether the proposed BFTS and NLIR mechanisms are necessary or sufficient to address the stated problem.
- [Experiments] Experiments (implied by abstract claims): no information is supplied on baseline capacity matching, hyper-parameter schedules, data preprocessing, or ablation studies that isolate the contribution of lower-layer full attention versus upper-layer sliding windows versus the NLIR non-linearity; without these controls the attribution of SOTA performance and robustness gains to the two innovations remains unverified.
minor comments (1)
- [Abstract] Abstract: the phrase 'one-sided non-linear multiplicative transformations' is introduced without a mathematical specification or reference to the exact functional form.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on clarity and experimental controls. We address each point below and will revise the manuscript to strengthen verifiability of the SCP claim and attribution of results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that naive unification produces Sequential Collapse Propagation is asserted without any formal definition, equations, or illustrative derivation; this absence makes it impossible to verify whether the proposed BFTS and NLIR mechanisms are necessary or sufficient to address the stated problem.
Authors: We agree that the abstract is too concise to stand alone on this point. The main text (Section 3.1) formally defines SCP as the propagation of dimensionality mismatch from non-sequence fields through shared attention, leading to sequence feature collapse (with the condition ||h_seq|| -> 0 derived from the attention update rule in Eq. (3)-(4)). To make the abstract self-contained, we will add a one-sentence formal characterization of SCP and note that BFTS/NLIR are designed to mitigate it. revision: yes
-
Referee: [Experiments] Experiments (implied by abstract claims): no information is supplied on baseline capacity matching, hyper-parameter schedules, data preprocessing, or ablation studies that isolate the contribution of lower-layer full attention versus upper-layer sliding windows versus the NLIR non-linearity; without these controls the attribution of SOTA performance and robustness gains to the two innovations remains unverified.
Authors: The manuscript reports capacity-matched baselines (parameter counts within 5% in Table 1), standard grid-search hyperparameter tuning on validation sets (Appendix C), and preprocessing details (Section 4.1). Section 4.3 already contains ablations that isolate BFTS layers (full vs. sliding) and NLIR (with/without the one-sided non-linearity). However, to improve transparency we will expand the experimental section with an explicit controls subsection, additional tables on matched capacities, and finer-grained ablations separating the lower-layer full attention from the upper-layer sliding windows. revision: partial
Circularity Check
No circularity: architectural proposal with empirical validation, no derivations or self-referential reductions
full rationale
The paper claims an empirical observation of Sequential Collapse Propagation under naive unification of multi-field and sequential models, then introduces BFTS attention (full self-attention in lower layers, shrinking-window in upper) and NLIR (one-sided non-linear multiplicative transforms) as architectural remedies, validated by SOTA results on public benchmarks and Tencent data. No mathematical derivation chain, equations, or first-principles results appear in the abstract or described claims. Performance and robustness improvements are presented as experimental outcomes rather than predictions derived from fitted inputs or self-citations. No load-bearing steps reduce by construction to the inputs; the work is self-contained as an empirical architecture paper.
Axiom & Free-Parameter Ledger
invented entities (3)
-
Sequential Collapse Propagation (SCP)
no independent evidence
-
Bottom-Full-Top-Sliding (BFTS) attention scheme
no independent evidence
-
Non-Linear Interaction Representation (NLIR)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long- document transformer.arXiv preprint arXiv:2004.05150(2020)
work page internal anchor Pith review arXiv 2020
- [2]
-
[3]
Jianxin Chang, Chenbin Zhang, Zhiyi Fu, Xiaoxue Zang, Lin Guan, Jing Lu, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, et al. 2023. TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 3784–3794
2023
-
[4]
Jianxin Chang, Chenbin Zhang, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, and Kun Gai. 2023. Pepnet: Parameter and embedding personalized network for infusing with personalized prior information. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3795–3804
2023
-
[5]
Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior sequence transformer for e-commerce recommendation in alibaba. InProceedings of the 1st international workshop on deep learning practice for high-dimensional sparse data. 1–4
2019
-
[6]
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al
-
[7]
InProceedings of the 1st Workshop on Deep Learning for Recommender Systems
Wide & Deep Learning for Recommender Systems. InProceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10
-
[8]
Weiyu Cheng, Yanyan Shen, and Linpeng Huang. 2020. Adaptive factorization network: Learning adaptive-order feature interactions. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 3609–3616
2020
- [9]
-
[10]
Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. InProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI). 2301–2307
2019
- [11]
-
[12]
Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. InProceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI). 1725–1731
2017
-
[13]
Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2024. On the embedding collapse when scaling up recommendation models (ICML’24). JMLR.org, Article 671, 19 pages
2024
-
[14]
Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. InProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 355–364
2017
-
[15]
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk
-
[16]
In International Conference on Learning Representations (ICLR) Workshop
Session-based Recommendations with Recurrent Neural Networks. In International Conference on Learning Representations (ICLR) Workshop
-
[17]
Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie, Chunzhi Yang, Jiyan Yang, and Huayu Li....
-
[18]
Ruijie Hou, Zhaoyang Yang, Yu Ming, Hongyu Lu, Zhuobin Zheng, Yu Chen, Qinsong Zeng, and Ming Chen. 2024. Cross-Domain LifeLong Sequential Model- ing for Online Click-Through Rate Prediction. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 5116–5125. doi:10.1145/3637528.3671601
- [19]
-
[20]
Tongwen Huang, Zhiqi Zhang, and Junlin Zhang. 2019. FiBiNET: Combining Feature Importance and Bilinear Feature Interaction for Click-Through Rate Prediction. InProceedings of the 13th ACM Conference on Recommender Systems (RecSys). 169–177
2019
- [21]
-
[22]
2025.In- ternet Advertising Revenue Report: Full Year 2024
Interactive Advertising Bureau and PricewaterhouseCoopers. 2025.In- ternet Advertising Revenue Report: Full Year 2024. Technical Report. In- teractive Advertising Bureau (IAB) and PricewaterhouseCoopers (PwC). https://www.iab.com/wp-content/uploads/2025/04/IAB_PwC-Internet-Ad- Revenue-Report-Full-Year-2024.pdf Reports U.S. internet advertising revenue of ...
2025
-
[23]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- vendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.CoRRabs/2310.06...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
-
[24]
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field- aware Factorization Machines for CTR Prediction. InProceedings of the 10th ACM Conference on Recommender Systems. 43–50
2016
-
[25]
Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Rec- ommendation. In2018 IEEE International Conference on Data Mining (ICDM). 197–206
2018
-
[26]
2025.Digital 2025: The State of Social Media in 2025
Simon Kemp. 2025.Digital 2025: The State of Social Media in 2025. DataRepor- tal. https://datareportal.com/reports/digital-2025-sub-section-state-of-social Reports 5.24 billion active social media user identities worldwide in early 2025
2025
-
[27]
2025.Digital 2025: Top Social Platforms in 2025
Simon Kemp. 2025.Digital 2025: Top Social Platforms in 2025. DataReportal. https: //datareportal.com/reports/digital-2025-sub-section-top-social-platforms Re- ports that TikTok’s Android user base spent almost 35 hours using the app in November 2024
2025
-
[28]
2025.TikTok Users, Stats, Data & Trends for 2025
Simon Kemp. 2025.TikTok Users, Stats, Data & Trends for 2025. DataReportal. https://datareportal.com/essential-tiktok-stats Reports TikTok advertising reach of at least 1.59 billion users in January 2025
2025
-
[29]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Tech- niques for Recommender Systems.Computer42, 8 (2009), 30–37
2009
-
[30]
Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1754– 1763
2018
-
[31]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer us- ing shifted windows. InProceedings of the IEEE/CVF international conference on computer vision. 10012–10022
2021
-
[32]
Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu. 2018. Field-weighted Factorization Machines for Click-Through Rate Prediction in Display Advertising. InProceedings of The Web Conference (WWW). 1349–1357
2018
-
[33]
Junwei Pan, Wei Xue, Ximei Wang, Haibin Yu, Xun Liu, Shijie Quan, Xueming Qiu, Dapeng Liu, Lei Xiao, and Jie Jiang. 2024. Ads Recommendation in a Collapsed and Entangled World. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3319–3330
2024
-
[34]
Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. InProceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692
2020
-
[35]
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, et al. 2025. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free.arXiv preprint arXiv:2505.06708(2025)
work page internal anchor Pith review arXiv 2025
-
[36]
Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang
-
[37]
InProceed- ings of the 2016 IEEE International Conference on Data Mining
Product-Based Neural Networks for User Response Prediction. InProceed- ings of the 2016 IEEE International Conference on Data Mining. 1149–1154
2016
-
[38]
Steffen Rendle. 2010. Factorization Machines. In2010 IEEE International Confer- ence on Data Mining (ICDM). 995–1000
2010
-
[39]
Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural collaborative filtering vs. matrix factorization revisited. InProceedings of the 14th ACM conference on recommender systems. 240–248
2020
-
[40]
Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: estimating the click-through rate for new ads. InProceedings of the 16th international conference on World Wide Web. 521–530
2007
-
[41]
Zihua Si, Lin Guan, ZhongXiang Sun, Xiaoxue Zang, Jing Lu, Yiqun Hui, Xingchao Cao, Zeyu Yang, Yichen Zheng, Dewei Leng, et al . 2024. TWIN-V2: Scaling Ultra-Long User Behavior Sequence Modeling for Enhanced CTR Prediction at Kuaishou. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management. 4890–4897
2024
-
[42]
Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self- Attentive Neural Networks. InProceedings of the 28th ACM International Confer- ence on Information and Knowledge Management (CIKM)
2019
-
[43]
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang
-
[44]
InProceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM)
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Rep- resentations from Transformer. InProceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM). 1441–1450
-
[45]
2021.𝐹 𝑀2: Field-matrixed Factorization Machines for Recommender Systems
Yang Sun, Junwei Pan, Alex Zhang, and Aaron Flores. 2021.𝐹 𝑀2: Field-matrixed Factorization Machines for Recommender Systems. InProceedings of the Web Conference (WWW). 2828–2837
2021
-
[46]
Jiaxi Tang and Ke Wang. 2018. Personalized Top-N Sequential Recommenda- tion via Convolutional Sequence Embedding. InProceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM)
2018
-
[47]
Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. InProceedings of the ADKDD’17. 1–7
2017
-
[48]
Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed H
Ruoxi Wang, Rakesh Shivanna, Derek Z. Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed H. Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. InProceedings of the Web Conference (WWW). 1785–1797
2021
- [49]
-
[50]
Zhichen Zeng, Xiaolong Liu, Mengyue Hang, Xiaoyi Liu, Qinghai Zhou, Chaofei Yang, Yiqun Liu, Yichen Ruan, Laming Chen, Yuxin Chen, et al. 2025. InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction. InProceedings of the 34th ACM International Conference on Information and Knowledge Management
2025
-
[51]
Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, Yinghai Lu, and Yu Shi. 2024. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. InProceedings of the 41st International Conference on Machine Learning (ICML)
2024
-
[52]
Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, and Wenlin Chen. 2024. Wukong: Towards a Scaling Law for Large-Scale Recommendation. arXiv:2403.02545 [cs.LG]
-
[53]
Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael Tsang, Wenjun Wang, et al. 2022. DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2060–2069
2022
-
[54]
Junjie Zhang, Ruobing Xie, Hongyu Lu, Wenqi Sun, Wayne Xin Zhao, Yu Chen, and Zhanhui Kang. 2025. Frequency-Augmented Mixture-of-Heterogeneous- Experts Framework for Sequential Recommendation. InProceedings of the ACM on Web Conference 2025. 2596–2607
2025
- [55]
-
[56]
Zuowu Zheng, Xiaofeng Gao, Junwei Pan, Qi Luo, Guihai Chen, Dapeng Liu, and Jie Jiang. 2022. AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling. In2022 IEEE International Conference on Data Mining (ICDM). 1257–1262
2022
- [57]
-
[58]
Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948
2019
-
[59]
Guorui Zhou, Xiaoqiang Zhu, Chengru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1059– 1068. 12 TokenFormer: Unify the Multi-Field and Sequential Recomm...
2018
-
[60]
Haolin Zhou, Junwei Pan, Xinyi Zhou, Xihua Chen, Jie Jiang, Xiaofeng Gao, and Guihai Chen. 2024. Temporal Interest Network for User Response Prediction. In Companion Proceedings of the ACM on Web Conference 2024. 413–422
2024
-
[61]
Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, et al. 2025. RankMixer: Scaling Up Ranking Models in Industrial Recommenders. arXiv:2507.15551 [cs.IR] A Complexity Analysis and Serving Optimization This appendix analyzes the computational complexity of the pro- posedBottom-Full-T...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.