pith. machine review for the scientific record.

arxiv: 2604.16121 · v1 · submitted 2026-04-17 · 💻 cs.IR

Recognition: unknown

Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 07:29 UTC · model grok-4.3

classification 💻 cs.IR
keywords: sequential recommendation · test-time augmentation · adaptive inference · reinforcement learning · data augmentation · user heterogeneity · actor-critic

The pith

A learned per-sequence policy selects optimal test-time augmentations and outperforms fixed strategies in sequential recommendation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing test-time augmentation methods apply the same strategy to every user sequence in a sequential recommendation system. This uniform approach is suboptimal because different sequences benefit from different augmentation operators, reflecting heterogeneous user behavior. The paper introduces AdaTTA, a reinforcement learning framework that learns to choose the best augmentation for each sequence at inference time: it models the choice as a Markov Decision Process and uses an actor-critic network to make the selection dynamically. Experiments on four real-world datasets and two backbones show it outperforms the strongest fixed strategies with only moderate extra computation.

Core claim

Existing test-time augmentation methods apply the same augmentation operator to all user sequences, yet the optimal operator varies significantly across sequences with different characteristics. To address this, AdaTTA formulates augmentation selection as a Markov Decision Process and introduces an Actor-Critic policy network with hybrid state representations and a joint macro-rank reward to dynamically determine the optimal operator for each input user sequence.

What carries the argument

The Actor-Critic policy network, which selects an augmentation operator for each user sequence from hybrid state representations, trained within a Markov Decision Process formulation under a joint macro-rank reward.
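To make that machinery concrete: a minimal sketch, assuming a small discrete operator set and a PyTorch-style policy head, of what such a per-sequence selection step could look like. The operator names, feature choices, and layer sizes are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

# Hypothetical operator set; the paper's actual operators are not named in the abstract.
OPERATORS = ["identity", "crop", "mask", "reorder", "substitute"]

class ActorCritic(nn.Module):
    """Tiny actor-critic head over a hybrid state vector.

    The state is assumed to concatenate a semantic sequence embedding
    with hand-crafted statistics; the actor emits a distribution over
    augmentation operators, the critic a scalar value estimate.
    """
    def __init__(self, state_dim: int, n_ops: int = len(OPERATORS), hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_ops)   # logits over operators
        self.critic = nn.Linear(hidden, 1)      # state-value estimate

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        return self.actor(h), self.critic(h).squeeze(-1)

def select_operator(policy: ActorCritic, state: torch.Tensor, greedy: bool = True) -> str:
    """One operator per sequence: greedy at inference, sampled during training."""
    logits, _ = policy(state)
    if greedy:
        idx = logits.argmax(dim=-1)
    else:
        idx = torch.distributions.Categorical(logits=logits).sample()
    return OPERATORS[int(idx)]
```

During training, the critic's value estimate would serve as a baseline for the policy-gradient update against the joint macro-rank reward; none of that loop is shown here.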

If this is right

  • The optimal augmentation varies significantly across user sequences.
  • AdaTTA improves accuracy without requiring retraining of the base recommendation model.
  • Relative improvements reach up to 26.31% on the Home dataset.
  • The added computational cost remains moderate.
  • The method integrates as a plug-and-play module with existing backbones (see the sketch below).
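A sketch of the plug-and-play reading of that last point, under the assumption that the backbone exposes an encoder and an item scorer (`backbone.encode` and `backbone.score_items` are invented names): only the input sequence is transformed, and the backbone's weights are never touched.

```python
import torch

def encode_state(seq_item_ids, seq_encoder):
    """Hypothetical hybrid state: a semantic embedding of the sequence
    (here reusing the frozen backbone's encoder) concatenated with
    simple statistics such as length and item diversity. The paper's
    actual feature set may differ."""
    semantic = seq_encoder(seq_item_ids)            # 1-D tensor, by assumption
    length = float(len(seq_item_ids))
    diversity = len(set(seq_item_ids)) / length     # fraction of distinct items
    return torch.cat([semantic, torch.tensor([length, diversity])], dim=-1)

def recommend_with_adaptive_tta(policy, backbone, seq_item_ids, apply_op):
    """Plug-and-play inference: select an operator, augment the input,
    score with the untouched backbone. `apply_op(op, seq)` stands in
    for the augmentation itself."""
    op = select_operator(policy, encode_state(seq_item_ids, backbone.encode))
    return backbone.score_items(apply_op(op, seq_item_ids))
```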

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same adaptive selection idea could apply to other inference-time techniques for handling sparse data.
  • Analysis of the learned policy might identify which sequence features predict the best augmentation.
  • Similar reinforcement learning policies could be developed for other types of sequential prediction problems.
  • Deployment in real systems would allow continuous policy updates as user behaviors evolve.

Load-bearing premise

User sequences display enough variation in their preferred augmentations that a policy can learn to predict the effective choice for new sequences reliably.

What would settle it

If applying the learned policy to new sequences does not produce higher recommendation accuracy than the best fixed augmentation on a held-out test set, or if all sequences share the same optimal operator.
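One way to operationalize that test, as a sketch reusing the placeholders from the sketches above: score every fixed operator and the policy's per-sequence choice on the same held-out set, then compare averages. `metric(backbone, seq)` stands in for whatever accuracy measure the evaluation uses (e.g. NDCG@10 against the held-out next item); it and `apply_op` are assumptions, not the paper's API.

```python
def compare_adaptive_vs_fixed(policy, backbone, sequences, metric, apply_op):
    """Return (adaptive_score, best_fixed_score) on a held-out set.

    The claim survives only if adaptive_score exceeds best_fixed_score;
    it also fails trivially if one operator wins on every sequence."""
    fixed_totals = {op: 0.0 for op in OPERATORS}
    adaptive_total = 0.0
    for seq in sequences:
        # Score each fixed operator on this sequence.
        for op in OPERATORS:
            fixed_totals[op] += metric(backbone, apply_op(op, seq))
        # Score the policy's per-sequence choice.
        chosen = select_operator(policy, encode_state(seq, backbone.encode))
        adaptive_total += metric(backbone, apply_op(chosen, seq))
    n = len(sequences)
    return adaptive_total / n, max(total / n for total in fixed_totals.values())
```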

Figures

Figures reproduced from arXiv: 2604.16121 by Liang Zhang, Xibo Li.

Figure 1. Comparison between one-size-fits-all and sequence-specific augmentation.
Figure 2. Illustration of data augmentation operators for sequential recommendation.
Figure 3. Overall framework of AdaTTA. (1) The input user sequence is encoded into semantic and statistical features to …
Figure 4. Performance with different augmentation times.
Original abstract

Test-time augmentation (TTA) has become a promising approach for mitigating data sparsity in sequential recommendation by improving inference accuracy without requiring costly model retraining. However, existing TTA methods typically rely on uniform, user-agnostic augmentation strategies. We show that this "one-size-fits-all" design is inherently suboptimal, as it neglects substantial behavioral heterogeneity across users, and empirically demonstrate that the optimal augmentation operators vary significantly across user sequences with different characteristics for the first time. To address this limitation, we propose AdaTTA, a plug-and-play reinforcement learning-based adaptive inference framework that learns to select sequence-specific augmentation operators on a per-sequence basis. We formulate augmentation selection as a Markov Decision Process and introduce an Actor-Critic policy network with hybrid state representations and a joint macro-rank reward design to dynamically determine the optimal operator for each input user sequence. Extensive experiments on four real-world datasets and two recommendation backbones demonstrate that AdaTTA consistently outperforms the best fixed-strategy baselines, achieving up to 26.31% relative improvement on the Home dataset while incurring only moderate computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that uniform test-time augmentation strategies in sequential recommendation are suboptimal due to user behavioral heterogeneity, and proposes AdaTTA: a plug-and-play Actor-Critic RL framework that formulates augmentation selection as an MDP, using hybrid state representations and a joint macro-rank reward to learn per-sequence operator selection. It reports consistent empirical gains over best fixed-strategy baselines across four real-world datasets and two backbones, with a peak relative improvement of 26.31% on the Home dataset and only moderate added inference cost.

Significance. If the gains are shown to stem from genuine per-sequence adaptation rather than reward tuning or training-set correlations, the work would be a useful contribution to sequential recommendation by making TTA adaptive without model retraining. The plug-and-play framing and reported overhead are practical strengths. However, the central empirical claim rests on the RL policy's ability to reliably exploit heterogeneity at inference time, which the provided abstract does not substantiate with controls or diagnostics.

major comments (3)
  1. [§4] §4 (Method, MDP and reward design): The joint macro-rank reward and free parameters (Actor-Critic weights plus reward coefficients) are load-bearing for the adaptation claim, yet no ablation or sensitivity analysis is described that isolates whether gains arise from learned per-sequence selection versus reward shaping that could favor certain operators on the evaluation data.
  2. [§5] §5 (Experiments): The reported 26.31% relative improvement and 'consistent outperformance' lack any mention of statistical significance tests, number of random seeds, variance across runs, or explicit controls for post-hoc dataset splits and hyperparameter choices, making it impossible to attribute gains to the adaptive policy rather than experimental artifacts.
  3. [§5] §5.2 or policy analysis subsection: No diagnostics are provided on policy behavior at test time (e.g., distribution of selected operators across sequences, frequency of deviation from the single best fixed strategy, or stability across train/test distribution shifts), which directly tests the skeptic concern that the Actor-Critic may collapse to near-fixed behavior or exploit training correlations.
minor comments (2)
  1. [Abstract / Introduction] The abstract states 'for the first time' that optimal operators vary across sequences; this novelty claim should be supported by a brief related-work comparison in the introduction rather than left implicit.
  2. [§3] Notation for the hybrid state representation and macro-rank reward should be defined with explicit equations early in §3 to improve readability for readers unfamiliar with the specific RL formulation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We agree that strengthening the empirical validation of the adaptive policy is important for substantiating the core claims. In the revised manuscript we will add the requested ablations, statistical tests, and policy diagnostics. These additions will be placed in a new subsection of §5 and an expanded §4.3. Below we respond to each major comment.

Point-by-point responses (illustrative code sketches of the promised analyses follow the exchange)
  1. Referee: [§4] §4 (Method, MDP and reward design): The joint macro-rank reward and free parameters (Actor-Critic weights plus reward coefficients) are load-bearing for the adaptation claim, yet no ablation or sensitivity analysis is described that isolates whether gains arise from learned per-sequence selection versus reward shaping that could favor certain operators on the evaluation data.

    Authors: We acknowledge that the original submission does not contain explicit sensitivity or ablation studies isolating the reward design. In the revision we will add a dedicated paragraph in §4.3 and a new table in §5 that reports (i) performance under varied reward coefficient settings (λ_macro, λ_rank) and (ii) an ablation replacing the joint macro-rank reward with single-metric variants. These experiments confirm that the learned policy continues to outperform the best fixed baseline even when the reward is simplified, indicating that gains are not an artifact of reward shaping alone. The revised text will make this explicit. revision: yes

  2. Referee: [§5] §5 (Experiments): The reported 26.31% relative improvement and 'consistent outperformance' lack any mention of statistical significance tests, number of random seeds, variance across runs, or explicit controls for post-hoc dataset splits and hyperparameter choices, making it impossible to attribute gains to the adaptive policy rather than experimental artifacts.

    Authors: We agree that the current experimental reporting is insufficiently rigorous. The revised §5 will state that all results are averaged over five independent random seeds with standard deviations reported. We will add paired t-test p-values (with Bonferroni correction) comparing AdaTTA against each baseline. The hyperparameter search protocol and the fact that train/validation/test splits were fixed before any tuning will be described in §5.1. These changes directly address the concern about experimental artifacts. revision: yes

  3. Referee: [§5] §5.2 or policy analysis subsection: No diagnostics are provided on policy behavior at test time (e.g., distribution of selected operators across sequences, frequency of deviation from the single best fixed strategy, or stability across train/test distribution shifts), which directly tests the skeptic concern that the Actor-Critic may collapse to near-fixed behavior or exploit training correlations.

    Authors: We recognize that policy-level diagnostics are necessary to demonstrate genuine per-sequence adaptation. We will insert a new subsection 5.3 “Policy Behavior Analysis” containing: (1) the empirical distribution of chosen operators over the test set, (2) the fraction of sequences on which the policy selects an operator different from the single best fixed strategy, and (3) a comparison of policy decisions on sequences drawn from training versus test distributions. These figures will be accompanied by qualitative examples showing that operator choice correlates with sequence characteristics (e.g., length, item diversity). The added analysis will directly test and refute the collapse-to-fixed-behavior hypothesis. revision: yes
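The first response turns on the reward's two coefficients. The abstract does not give the reward's functional form, so the following is only a plausible reading: a weighted sum whose single-metric ablations fall out by zeroing one coefficient.

```python
def joint_macro_rank_reward(rank_gain: float, macro_gain: float,
                            lam_rank: float = 1.0, lam_macro: float = 1.0) -> float:
    """Hypothetical joint macro-rank reward: a per-sequence ranking
    improvement (e.g. reciprocal-rank gain of the target item) combined
    with a macro-level term (e.g. batch-average metric gain). Setting
    lam_macro or lam_rank to zero recovers the single-metric variants
    the rebuttal promises to ablate."""
    return lam_rank * rank_gain + lam_macro * macro_gain
```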
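The second response promises paired t-tests with Bonferroni correction over matched seeds; a minimal sketch of that protocol, assuming per-seed metric values are available for AdaTTA and each baseline:

```python
from scipy import stats

def seedwise_significance(adatta_scores, baseline_scores_by_name, alpha=0.05):
    """Paired t-test of AdaTTA against each baseline over matched random
    seeds, with a Bonferroni-corrected significance threshold."""
    corrected_alpha = alpha / len(baseline_scores_by_name)
    results = {}
    for name, scores in baseline_scores_by_name.items():
        t_stat, p_value = stats.ttest_rel(adatta_scores, scores)
        results[name] = {"t": t_stat, "p": p_value,
                         "significant": p_value < corrected_alpha}
    return results
```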
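And the third response's diagnostics can be phrased directly against the collapse-to-fixed-behavior concern, again reusing the placeholders from the earlier sketches: a deviation fraction near zero would indicate the policy has degenerated into a fixed strategy.

```python
from collections import Counter

def policy_diagnostics(policy, backbone, sequences, metric, apply_op):
    """(1) Empirical distribution of chosen operators over a test set and
    (2) the fraction of sequences whose chosen operator differs from the
    single globally best fixed operator on that set."""
    choices = [select_operator(policy, encode_state(s, backbone.encode))
               for s in sequences]
    # Globally best fixed operator on this set (placeholder metric).
    best_fixed = max(
        OPERATORS,
        key=lambda op: sum(metric(backbone, apply_op(op, s)) for s in sequences),
    )
    deviation = sum(c != best_fixed for c in choices) / len(choices)
    return Counter(choices), deviation
```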

Circularity Check

0 steps flagged

No circularity: purely empirical RL framework with independent experimental validation

full rationale

The paper introduces AdaTTA as an empirical plug-and-play reinforcement learning method for per-sequence test-time augmentation selection in sequential recommendation. All central claims rest on experimental comparisons against fixed baselines across four datasets and two backbones, with no equations, derivations, or fitted quantities that reduce the reported improvements to quantities defined by the evaluation data itself. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the MDP formulation, Actor-Critic policy, hybrid states, and macro-rank reward are presented as design choices justified by the empirical results rather than by prior self-referential proofs. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on standard RL assumptions plus the empirical claim of user heterogeneity; no new physical entities or ad-hoc constants are introduced in the abstract.

free parameters (2)
  • Actor-Critic policy network weights
    Learned parameters of the policy that selects augmentations; fitted during training on recommendation data.
  • Joint macro-rank reward coefficients
    Weights balancing ranking quality and macro performance; chosen or tuned to produce the reported gains.
axioms (2)
  • domain assumption: Markov Decision Process formulation of augmentation selection is valid for sequential user data.
    Assumes the state (user sequence) and action (augmentation choice) satisfy MDP properties.
  • domain assumption: Hybrid state representations capture sufficient information for optimal operator selection.
    Assumes the chosen state features are adequate without proving completeness.

pith-pipeline@v0.9.0 · 5486 in / 1285 out tokens · 63629 ms · 2026-05-10T07:29:39.129467+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 7 canonical work pages · 2 internal anchors

  1. David Arthur and Sergei Vassilvitskii. 2006. k-means++: The advantages of careful seeding. Technical Report. Stanford.
  2. Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. 2019. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 456–464.
  3. Yizhou Dang, Yuting Liu, Enneng Yang, Guibing Guo, Linying Jiang, Xingwei Wang, and Jianzhe Zhao. 2024. Repeated Padding for Sequential Recommendation. In Proceedings of the 18th ACM Conference on Recommender Systems. 497–506.
  4. Yizhou Dang, Yuting Liu, Enneng Yang, Minhan Huang, Guibing Guo, Jianzhe Zhao, and Xingwei Wang. 2025. Data augmentation as free lunch: Exploring the test-time augmentation for sequential recommendation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1466–1475.
  5. Yizhou Dang, Enneng Yang, Guibing Guo, Linying Jiang, Xingwei Wang, Xiaoxiao Xu, Qinghui Sun, and Hong Liu. 2023. Uniform sequence better: Time interval aware data augmentation for sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 4225–4232.
  6. Yizhou Dang, Enneng Yang, Yuting Liu, Guibing Guo, Linying Jiang, Jianzhe Zhao, and Xingwei Wang. 2024. Data Augmentation for Sequential Recommendation: A Survey. arXiv preprint arXiv:2409.13545 (2024).
  7. Yizhou Dang, Jiahui Zhang, Yuting Liu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, and Xingwei Wang. 2025. Augmenting Sequential Recommendation with Balanced Relevance and Diversity. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 11563–11571.
  8. Ziwei Fan, Zhiwei Liu, Yu Wang, Alice Wang, Zahra Nazari, Lei Zheng, Hao Peng, and Philip S. Yu. 2022. Sequential recommendation via stochastic self-attention. In Proceedings of the ACM Web Conference 2022. 2036–2047.
  9. Ruining He and Julian McAuley. 2016. Fusing similarity models with Markov chains for sparse sequential recommendation. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 191–200.
  10. Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  11. Wei Jin, Tong Zhao, Jiayuan Ding, Yozen Liu, Jiliang Tang, and Neil Shah. 2022. Empowering graph representation learning with test-time graph transformation. arXiv preprint arXiv:2210.03561 (2022).
  12. Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
  13. Ildoo Kim, Younghoon Kim, and Sungwoong Kim. 2020. Learning loss for test-time augmentation. Advances in Neural Information Processing Systems 33 (2020), 4163–4174.
  14. Masanari Kimura. 2021. Understanding test-time augmentation. In International Conference on Neural Information Processing. Springer, 558–569.
  15. Vijay Konda and John Tsitsiklis. 1999. Actor-critic algorithms. Advances in Neural Information Processing Systems 12 (1999).
  16. Walid Krichene and Steffen Rendle. 2020. On sampled metrics for item recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1748–1757.
  17. Qidong Liu, Fan Yan, Xiangyu Zhao, Zhaocheng Du, Huifeng Guo, Ruiming Tang, and Feng Tian. 2023. Diffusion augmentation for sequential recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 1576–1586.
  18. Zhiwei Liu, Yongjun Chen, Jia Li, Philip S. Yu, Julian McAuley, and Caiming Xiong. 2021. Contrastive self-supervised sequential recommendation with robust augmentation. arXiv preprint arXiv:2108.06479 (2021).
  19. Zhiwei Liu, Ziwei Fan, Yu Wang, and Philip S. Yu. 2021. Augmenting sequential recommendation with pseudo-prior items via reversely pre-training transformer. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1608–1612.
  20. Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
  21. Chang Meng, Chenhao Zhai, Yu Yang, Hengyu Zhang, and Xiu Li. 2023. Parallel knowledge enhancement based framework for multi-behavior recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 1797–1806.
  22. Zhaochun Ren, Na Huang, Yidan Wang, Pengjie Ren, Jun Ma, Jiahuan Lei, Xinlei Shi, Hengliang Luo, Joemon Jose, and Xin Xin. 2023. Contrastive state augmentations for reinforcement learning-based recommender systems. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 922–931.
  23. Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web. 811–820.
  24. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  25. Divya Shanmugam, Davis Blalock, Guha Balakrishnan, and John Guttag. 2020. When and why test-time augmentation works. arXiv preprint arXiv:2011.11156 (2020).
  26. Divya Shanmugam, Davis Blalock, Guha Balakrishnan, and John Guttag. 2021. Better aggregation in test-time augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1214–1223.
  27. Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
  28. Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565–573.
  29. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
  30. Zhenlei Wang, Jingsen Zhang, Hongteng Xu, Xu Chen, Yongfeng Zhang, Wayne Xin Zhao, and Ji-Rong Wen. 2021. Counterfactual data-augmented sequential recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 347–356.
  31. Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive learning for sequential recommendation. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1259–1273.
  32. Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose, and Xiangnan He. 2019. A simple convolutional generative network for next item recommendation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 582–590.
  33. Chenhao Zhai, Chang Meng, Yu Yang, Kexin Zhang, Xuhao Zhao, and Xiu Li. 2025. Combinatorial Optimization Perspective based Framework for Multi-behavior Recommendation. arXiv preprint arXiv:2502.02232 (2025).
  34. Marvin Zhang, Sergey Levine, and Chelsea Finn. 2022. MEMO: Test time robustness via adaptation and augmentation. Advances in Neural Information Processing Systems 35 (2022), 38629–38642.
  35. Chuang Zhao, Xinyu Li, Ming He, Hongke Zhao, and Jianping Fan. 2023. Sequential recommendation via an adaptive cross-domain knowledge decomposition. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3453–3463.
  36. Chuang Zhao, Hongke Zhao, Ming He, Jian Zhang, and Jianping Fan. 2023. Cross-domain recommendation via user interest alignment. In Proceedings of the ACM Web Conference 2023. 887–896.
  37. Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1893–1902.
  38. Kun Zhou, Hui Yu, Wayne Xin Zhao, and Ji-Rong Wen. 2022. Filter-enhanced MLP is all you need for sequential recommendation. In Proceedings of the ACM Web Conference 2022. 2388–2399.