IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems
Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3
The pith
Compressing each historical interaction into one token enables better sequence modeling in recommender systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that compressing all features of each historical interaction instance into a unified instance embedding in a first stage, via temporal-order or user-order schemes, and then applying standard sequence modeling to the compressed tokens in a second stage, allows the framework to significantly outperform state-of-the-art methods in modeling long-range preferences while exhibiting superior transferability.
What carries the argument
The two-stage Instance-As-Token (IAT) compression mechanism that reduces each multi-feature interaction to a single compact token for efficient sequence processing.
Load-bearing premise
That compressing all features of each historical interaction instance into a single unified instance embedding preserves sufficient information for downstream sequence modeling without critical loss.
What would settle it
Observing that a model using uncompressed raw features or alternative compression methods achieves equal or higher performance on the same industrial datasets would indicate that the IAT compression does not provide the claimed benefits.
Original abstract
Although sophisticated sequence modeling paradigms have achieved remarkable success in recommender systems, the information capacity of hand-crafted sequential features constrains the performance upper bound. To better enhance user experience by encoding historical interaction patterns, this paper presents a novel two-stage sequence modeling framework termed Instance-As-Token (IAT). The first stage of IAT compresses all features of each historical interaction instance into a unified instance embedding, which encodes the interaction characteristics in a compact yet informative token. Both temporal-order and user-order compression schemes are proposed, with the latter better aligning with the demands of downstream sequence modeling. The second stage involves the downstream task fetching fixed-length compressed instance tokens via timestamps and adopting standard sequence modeling approaches to learn long-range preference patterns. Extensive experiments demonstrate that IAT significantly outperforms state-of-the-art methods and exhibits superior in-domain and cross-domain transferability. IAT has been successfully deployed in real-world industrial recommender systems, including e-commerce advertising, shopping mall marketing, and live-streaming e-commerce, delivering substantial improvements in key business metrics.
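The review does not reproduce the paper's implementation. As a rough illustration of the two-stage shape the abstract describes, a minimal PyTorch sketch might look like the following; every module name, dimension, and architectural choice here is an assumption for illustration, not the authors' design.

```python
import torch
import torch.nn as nn

class InstanceCompressor(nn.Module):
    """Stage 1 sketch: compress all feature fields of one interaction
    instance into a single token embedding."""
    def __init__(self, num_fields: int, field_dim: int, token_dim: int):
        super().__init__()
        # Concatenate per-field embeddings, then project to one token.
        self.proj = nn.Sequential(
            nn.Linear(num_fields * field_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, field_embs: torch.Tensor) -> torch.Tensor:
        # field_embs: (batch, num_fields, field_dim) -> (batch, token_dim)
        return self.proj(field_embs.flatten(start_dim=1))

class IATSequenceModel(nn.Module):
    """Stage 2 sketch: a standard Transformer encoder over the
    fixed-length sequence of compressed instance tokens."""
    def __init__(self, token_dim: int, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        # token_dim must be divisible by num_heads.
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, token_dim); mean-pool to a user state.
        return self.encoder(tokens).mean(dim=1)
```

In use, Stage 1 would run once per interaction to populate a token store, and Stage 2 would consume the fetched token sequence, e.g. `user_state = seq_model(torch.stack([compressor(x) for x in history], dim=1))`.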
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Instance-As-Token (IAT), a two-stage framework for historical user sequence modeling in recommender systems. Stage 1 compresses all features of each historical interaction instance into a single unified instance embedding (token) via temporal-order or user-order schemes; Stage 2 feeds the resulting fixed-length token sequence to standard sequence models for long-range preference learning. The authors claim that IAT significantly outperforms state-of-the-art methods, shows superior in-domain and cross-domain transferability, and has been deployed in industrial systems (e-commerce advertising, shopping mall marketing, live-streaming e-commerce) with substantial business-metric gains.
Significance. If the first-stage compression demonstrably preserves intra-instance feature interactions, IAT would offer a practical route to higher-capacity sequence modeling in industrial recommenders, converting variable-length, multi-feature histories into compact, fixed-length token sequences while improving both accuracy and transfer. The reported real-world deployments are a notable practical strength, provided the experimental details support them.
major comments (2)
- [Abstract / first-stage description] The central claim that the unified instance embedding 'encodes the interaction characteristics in a compact yet informative token' is load-bearing for all performance and deployment assertions, yet no reconstruction error, mutual-information, or per-feature ablation results are supplied to test whether cross-feature dependencies within a single interaction are retained. Without such evidence the observed lifts could arise from downstream architecture choices or data differences rather than the IAT compression itself (a sketch of one such probe follows this list).
- [Experiments and deployment claims] The abstract asserts 'extensive experiments' and successful industrial deployment with 'substantial improvements in key business metrics,' but the provided text supplies neither baseline details, data-split protocols, statistical significance tests, nor comparisons against richer multi-feature sequence models. These omissions prevent verification that the reported gains are robust and attributable to IAT.
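One concrete form the missing evidence could take (a hypothetical probe, not reported in the paper): freeze the Stage-1 compressor and train a small decoder to reconstruct each input field from the instance token; persistently high error on a field would suggest the token discards that feature.

```python
import torch
import torch.nn as nn

def reconstruction_probe(compressor: nn.Module, field_embs: torch.Tensor,
                         token_dim: int, steps: int = 500) -> torch.Tensor:
    """Train a linear decoder to recover the per-field inputs from the
    frozen instance token; return per-field mean squared error."""
    batch, num_fields, field_dim = field_embs.shape
    with torch.no_grad():                       # keep Stage 1 frozen
        tokens = compressor(field_embs)         # (batch, token_dim)
    decoder = nn.Linear(token_dim, num_fields * field_dim)
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
    for _ in range(steps):
        recon = decoder(tokens).view(batch, num_fields, field_dim)
        loss = ((recon - field_embs) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        recon = decoder(tokens).view(batch, num_fields, field_dim)
        # High error on a field suggests the token discarded that feature.
        return ((recon - field_embs) ** 2).mean(dim=(0, 2))
```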
minor comments (1)
- [Method description] The distinction between the temporal-order and user-order compression schemes would be clearer if the paper included a short pseudocode or explicit algorithmic description of how timestamps and user-ordering are used to produce the fixed-length token sequence; a speculative sketch of what such pseudocode might look like follows.
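Since the paper's actual schemes are not given here, the following Python sketch is only a guess at what such pseudocode could look like: a token store keyed by (user, timestamp), with the fetch truncating or left-padding to a fixed length. Under a temporal-order scheme the store would be written in global event-time order and gathered per user at fetch time; a user-order scheme would instead group tokens per user at compression time so histories are read contiguously.

```python
from typing import Dict, List, Tuple

Token = List[float]  # one compressed instance embedding

def fetch_instance_tokens(token_store: Dict[Tuple[str, int], Token],
                          user_id: str, history_ts: List[int],
                          seq_len: int) -> List[Token]:
    """Fetch precomputed instance tokens by (user, timestamp), keep the
    most recent `seq_len`, and left-pad to a fixed length."""
    ordered = sorted(history_ts)[-seq_len:]      # oldest -> newest
    tokens = [token_store[(user_id, ts)] for ts in ordered]
    dim = len(tokens[0]) if tokens else 0
    padding = [[0.0] * dim] * (seq_len - len(tokens))
    return padding + tokens
```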
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and proposed revisions to strengthen the presentation of evidence for the IAT compression and experimental details.
Point-by-point responses
- Referee: [Abstract / first-stage description] The central claim that the unified instance embedding 'encodes the interaction characteristics in a compact yet informative token' is load-bearing for all performance and deployment assertions, yet no reconstruction error, mutual-information, or per-feature ablation results are supplied to test whether cross-feature dependencies within a single interaction are retained. Without such evidence the observed lifts could arise from downstream architecture choices or data differences rather than the IAT compression itself.
Authors: We acknowledge that explicit metrics such as reconstruction error or mutual information for the instance embeddings are not reported. The downstream performance gains across multiple datasets and the industrial results provide indirect support that the unified tokens retain key interaction characteristics, particularly under the user-order scheme, which aligns features for sequence modeling. To directly address this, we will add per-feature ablation studies and comparisons against models that retain individual features without compression in the revised manuscript (an illustrative sketch of such an ablation appears after these responses). revision: yes
- Referee: [Experiments and deployment claims] The abstract asserts 'extensive experiments' and successful industrial deployment with 'substantial improvements in key business metrics,' but the provided text supplies neither baseline details, data-split protocols, statistical significance tests, nor comparisons against richer multi-feature sequence models. These omissions prevent verification that the reported gains are robust and attributable to IAT.
Authors: The full manuscript details the experimental protocols, including baselines (e.g., DIN, DIEN, and other sequence models), chronological data splits, and statistical significance via paired t-tests. However, to improve clarity and verifiability, we will expand the experiments section with explicit comparisons to richer multi-feature sequence models and additional specifics on the industrial A/B tests, including exact business metrics and deployment settings. revision: partial
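To make the promised revisions concrete, two illustrative sketches follow; both are our assumptions about a plausible protocol, not the authors' code. First, the simplest form of a per-feature ablation: zero out one feature field at a time and measure the loss increase, where `model` is a hypothetical callable from field embeddings to per-example click logits.

```python
import torch
import torch.nn.functional as F

def per_feature_ablation(model, field_embs: torch.Tensor,
                         labels: torch.Tensor) -> dict:
    """Zero out one feature field at a time and record the increase in
    evaluation loss; a large increase suggests the instance token
    depends on that field."""
    with torch.no_grad():
        base = F.binary_cross_entropy_with_logits(model(field_embs), labels)
        deltas = {}
        for f in range(field_embs.shape[1]):
            ablated = field_embs.clone()
            ablated[:, f, :] = 0.0               # mask field f
            loss = F.binary_cross_entropy_with_logits(model(ablated), labels)
            deltas[f] = (loss - base).item()
    return deltas
```

Second, the paired significance test the authors describe is cheap to run once per-split metrics exist; the AUC values below are placeholders, not results from the paper.

```python
from scipy import stats

# Placeholder per-run AUCs (NOT from the paper): each pair shares a
# data split and random seed, hence a paired test is appropriate.
iat_auc      = [0.7421, 0.7435, 0.7418, 0.7440, 0.7429]
baseline_auc = [0.7388, 0.7401, 0.7395, 0.7399, 0.7392]

t_stat, p_value = stats.ttest_rel(iat_auc, baseline_auc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```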
Circularity Check
No significant circularity in IAT derivation or claims
Full rationale
The paper introduces an empirical two-stage framework: stage one compresses per-interaction features into a single instance embedding via temporal-order or user-order schemes, and stage two feeds the resulting token sequence into standard sequence models for preference learning. No equations, loss functions, or performance metrics reduce by construction to the compression step itself or to fitted parameters renamed as predictions. Claims of outperformance and industrial deployment rest on experimental results rather than self-referential definitions or load-bearing self-citations. The framework is evaluated against external benchmarks, with no self-definitional, fitted-input, or ansatz-smuggling patterns detected.
Axiom & Free-Parameter Ledger
free parameters (1)
- instance embedding dimension and compression parameters
axioms (1)
- domain assumption: All features of a historical interaction instance can be compressed into a single token that retains the interaction characteristics needed for downstream preference modeling.
Forward citations
Cited by 1 Pith paper
- Sample Is Feature: Beyond Item-Level, Toward Sample-Level Tokens for Unified Large Recommender Models
SIF encodes full historical raw samples as tokens via hierarchical quantization to preserve sample context and unify sequential/non-sequential features in large recommender models.
Reference graph
Works this paper leans on
- [1] Zheng Chai, Qin Ren, Xijun Xiao, Huizhi Yang, Bo Han, Sijun Zhang, Di Chen, Hui Lu, Wenlin Zhao, Lele Yu, et al. 2025. Longer: Scaling up long sequence modeling in industrial recommenders. In Proceedings of the Nineteenth ACM Conference on Recommender Systems. 247–256.
- [2] Jianxin Chang, Chenbin Zhang, Zhiyi Fu, Xiaoxue Zang, Lin Guan, Jing Lu, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, et al. 2023. TWIN: TWo-stage Interest Network for lifelong user behavior modeling in CTR prediction at Kuaishou. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3785–3794.
- [4] Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. 2025. OneRec: Unifying retrieve and rank with generative recommender and iterative preference alignment. arXiv preprint arXiv:2502.18965 (2025).
- [6] Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, et al. 2025. Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin. arXiv preprint arXiv:2511.06077 (2025).
- [8] Ruidong Han, Bin Yin, Shangyu Chen, He Jiang, Fei Jiang, Xiang Li, Chi Ma, Mincong Huang, Xiaoguang Li, Chunzhen Jing, et al. 2025. MTGR: Industrial-scale generative recommendation framework in Meituan. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 5731–5738.
- [11] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
- [12] Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
- [14] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International Conference on Machine Learning. PMLR, 4651–4664.
- [15] Pengyue Jia, Yejing Wang, Zhaocheng Du, Xiangyu Zhao, Yichao Wang, Bo Chen, Wanyu Wang, Huifeng Guo, and Ruiming Tang. 2024. ERASE: Benchmarking feature selection methods for deep recommender systems. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5194–5205.
- [17] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
- [18] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
- [20] Weijiang Lai, Beihong Jin, Jiongyan Zhang, Yiyuan Zheng, Jian Dong, Jia Cheng, Jun Lei, and Xingxing Wang. 2025. Exploring Scaling Laws of CTR Model for Online Performance Improvement. In Proceedings of the Nineteenth ACM Conference on Recommender Systems. 114–123.
- [22] Zhiwei Liu, Ziwei Fan, Yu Wang, and Philip S Yu. 2021. Augmenting sequential recommendation with pseudo-prior items via reversely pre-training transformer. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1608–1612.
- [23] Xiao Lv, Jiangxia Cao, Shijie Guan, Xiaoyou Zhou, Zhiguang Qi, Yaqiang Zang, Ben Wang, and Guorui Zhou. 2025. MARM: Unlocking the Recommendation Cache Scaling-Law through Memory Augmentation and Scalable Complexity. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 2022–2031.
- [24] Wenhan Lyu, Devashish Tyagi, Yihang Yang, Ziwei Li, Ajay Somani, Karthikeyan Shanmugasundaram, Nikola Andrejevic, Ferdi Adeputra, Curtis Zeng, Arun K Singh, et al. 2025. DV365: Extremely Long User History Modeling at Instagram. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 4717–4727.
- [25] Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al. 2023. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems 36 (2023), 10299–10315.
- [28] Benedikt Schifferer, Chris Deotte, and Even Oldridge. 2020. Tutorial: Feature engineering for recommender systems. In Proceedings of the 14th ACM Conference on Recommender Systems. 754–755.
- [29] Zihua Si, Lin Guan, ZhongXiang Sun, Xiaoxue Zang, Jing Lu, Yiqun Hui, Xingchao Cao, Zeyu Yang, Yichen Zheng, Dewei Leng, et al. 2024. TWIN V2: Scaling ultra-long user behavior sequence modeling for enhanced CTR prediction at Kuaishou. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. 4890–4897.
- [30] Uriel Singer, Haggai Roitman, Yotam Eshel, Alexander Nus, Ido Guy, Or Levi, Idan Hasson, and Eliyahu Kiperwasser. 2022. Sequential modeling with multiple attributes for watchlist recommendation in e-commerce. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 937–946.
- [31] Xin Song, Xiaochen Li, Jinxin Hu, Hong Wen, Zulong Chen, Yu Zhang, Xiaoyi Zeng, and Jing Zhang. 2025. LREA: Low-rank efficient attention on modeling long-term user behaviors for CTR prediction. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2843–2847.
- [32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
- [33] Hu Wan, Yun Huang, Shuhan Bai, Xuan Sun, Tei-Wei Kuo, and Chun Jason Xue. 2025. Rabbitail: A Tail Latency-Aware Scheduler for Deep Learning Recommendation Systems with Hierarchical Embedding Storage. In Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing. 279–287.
- [35] Shuhan Wang, Bin Shen, Xu Min, Yong He, Xiaolu Zhang, Liang Zhang, Jun Zhou, and Linjian Mo. 2024. Aligned side information fusion method for sequential recommendation. In Companion Proceedings of the ACM Web Conference 2024. 112–120.
- [36] Zhuoxing Wei, Qi Liu, and Qingchen Xie. 2025. Deep Multiple Quantization Network on Long Behavior Sequence for Click-Through Rate Prediction. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3090–3094.
- [38] Xue Xia, Saurabh Joshi, Kousik Rajesh, Kangnan Li, Yangyi Lu, Nikil Pancha, Dhruvil Badani, Jiajing Xu, and Pong Eksombatchai. 2025. TransAct V2: Lifelong User Action Sequence Modeling on Pinterest Recommendation. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 6881–6882.
- [39] Yueqi Xie, Peilin Zhou, and Sunghun Kim. 2022. Decoupled side information fusion for sequential recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1611–1621.
- [41] Songpei Xu, Shijia Wang, Da Guo, Xianwen Guo, Qiang Xiao, Bin Huang, Guanlin Wu, and Chuanjiang Luo. 2025. Climber: Toward efficient scaling laws for large recommendation models. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 6193–6200.
- [42] Jinho Yang, Ji-Hoon Kim, and Joo-Young Kim. 2025. SCRec: A Scalable Computational Storage System with Statistical Sharding and Tensor-train Decomposition for Recommendation Models. IEEE Trans. Comput. (2025).
- [43] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. 2025. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 23078–23097.
- [45] Zhichen Zeng, Xiaolong Liu, Mengyue Hang, Xiaoyi Liu, Qinghai Zhou, Chaofei Yang, Yiqun Liu, Yichen Ruan, Laming Chen, Yuxin Chen, et al. 2025. InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 6225–6233.
- [46] Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, et al. 2024. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations. arXiv preprint arXiv:2402.17152 (2024).
- [49] Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level deeper self-attention network for sequential recommendation. In IJCAI. 4320–4326.
- [51] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
- [52] Wen-Ji Zhou, Yuhang Zheng, Yinfu Feng, Yunan Ye, Rong Xiao, Long Chen, Xiaosong Yang, and Jun Xiao. 2024. ENCODE: Breaking the trade-off between performance and efficiency in long-term user behavior modeling. IEEE Transactions on Knowledge and Data Engineering 37, 1 (2024), 265–277.
- [53] Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, et al. 2025. RankMixer: Scaling up ranking models in industrial recommenders. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 6309–6316.