pith. machine review for the scientific record.

arxiv: 2605.10991 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords test-time personalization · probabilistic reward model · best-of-n scaling · scaling law · user-level collapse · query-level reward hacking · LLM personalization · inference scaling

The pith

Probabilistic personalized reward models with learned variance fix scaling failures in test-time personalization, enabling logarithmic utility growth with more samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sampling many output candidates from a personalized policy model at inference time and selecting the best one with a reward model can improve results, but only if the reward model ranks candidates reliably. The authors prove that perfect (oracle) selection would make expected utility grow logarithmically with the number of candidates. Standard reward models fall short due to two problems they diagnose: user-level collapse, where scores stay nearly constant for some users, and query-level reward hacking, where scores correlate negatively with actual quality for some queries. They derive a scaling law that breaks any Best-of-N curve into four measurable quantities to identify these issues, then introduce a probabilistic reward model that learns its own variance to reduce both failures. Experiments on personalized text generation tasks confirm that this approach produces reliable scaling gains across different policy models.
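
For orientation, a minimal Best-of-N selection sketch in Python. The callables `policy_sample` and `reward_score` are placeholders standing in for the paper's personalized policy and reward models; their interfaces are assumptions, not the authors' implementation.

```python
import numpy as np

def best_of_n(query, user, n, policy_sample, reward_score):
    """Sample n candidates from a (personalized) policy and return the one
    the (personalized) reward model scores highest."""
    candidates = [policy_sample(query, user) for _ in range(n)]
    scores = [reward_score(query, cand, user) for cand in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage with stand-in callables: candidates are random numbers and the
# "reward" is the candidate itself, so selection simply returns the maximum.
rng = np.random.default_rng(0)
best = best_of_n("query", "user-42", n=8,
                 policy_sample=lambda q, u: rng.normal(),
                 reward_score=lambda q, y, u: y)
```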

Core claim

We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes: user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. TTP is studied along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model.

What carries the argument

The unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities; it diagnoses the failure modes and guides the probabilistic personalized reward model whose learned variance mitigates user-level collapse and query-level reward hacking.

If this is right

  • Oracle selection produces expected utility that grows logarithmically with the number of sampled candidates.
  • The probabilistic reward model mitigates both user-level collapse and query-level reward hacking.
  • TTP produces consistent scaling improvements across multiple policy models and personalized text generation tasks.
  • The scaling law closely matches observed Best-of-N curves for different reward model variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Uncertainty estimation via learned variance could improve reward models in broader LLM applications such as alignment or safety evaluation.
  • The diagnostic decomposition approach might extend to analyzing scaling in non-personalized generation or other output modalities.
  • Test-time methods of this form may combine with training-time personalization to produce larger overall gains for individual users.

Load-bearing premise

That the decomposition of any reward model's Best-of-N curve into four measurable quantities is exact, with quantities that depend on neither N nor one another, and that learning variance in the probabilistic reward model mitigates user-level collapse and query-level reward hacking without introducing new biases or overfitting to the measured quantities.

What would settle it

The framework would be undermined if the probabilistic reward model's Best-of-N curves in the personalized text generation experiments did not improve with additional samples, or if they deviated from the predictions of the derived scaling law.

Figures

Figures reproduced from arXiv: 2605.10991 by Linhai Zhang, Yulan He.

Figure 1
Figure 1. Figure 1: Test-time personalization on LaMP-4: News Headline Generation. Oracle shows logarithmic scaling that surpasses training-based baselines, while standard Reward Models (RMs) fail to scale, performing close to, or worse than, random selection. To diagnose this gap, we develop an analytical framework that connects a reward model's correlation with the golden score to its Best-of-N scaling behavior (Section 4)… view at source ↗
Figure 2
Figure 2. Figure 2: Oracle scaling on (a) LaMP-5 and (b) LongLaMP-Product. Oracle selection (red solid)… view at source ↗
Figure 3
Figure 3. Figure 3: Scaling curves for standard reward models. (a) On LaMP-5, Global RM performs no better… view at source ↗
Figure 4
Figure 4. Figure 4: Correlation diagnostics on LaMP-5 and LongLaMP-Product Review. view at source ↗
Figure 5
Figure 5. Figure 5: Main TTP results under the RAG policy across five tasks (a)–(e). view at source ↗
Figure 6
Figure 6. Figure 6: Failure-mode analysis on LaMP-4. (a) Ground-truth ROUGE standard deviation distinguishes collapsed users (ρu < 0.1 under Det RM) from normal users (ρu > 0.5). (b) Score histogram of a representative collapsed user: GT scores cluster near zero, Det RM degenerates to near-constant predictions (ρ = −0.10), while Prob RM preserves meaningful variation (ρ = 0.78). (c) Per-query correlation scatter (Det vs Prob): gr… view at source ↗
read the original abstract

Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies test-time personalization (TTP) for LLMs by sampling N candidates from a personalized policy model and selecting the best via a personalized reward model. It proves that oracle selection produces expected utility that grows logarithmically with N. It derives a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities, identifying two failure modes (user-level collapse and query-level reward hacking). A probabilistic personalized reward model with learned variance is proposed to mitigate these modes, and experiments are reported to confirm consistent scaling across policy models and tasks as well as close agreement between the derived law and observed curves.

Significance. If the logarithmic bound and the exact algebraic decomposition hold, the work supplies both a theoretical ceiling for test-time scaling in personalization and a practical diagnostic tool that can guide reward-model design. The combination of a parameter-free theoretical result, a falsifiable scaling law, and empirical validation across multiple models would constitute a substantive contribution to understanding inference-time compute in personalized generation.

major comments (2)
  1. [§2] §2 (theoretical analysis): the claim that oracle selection yields logarithmic growth in expected utility is load-bearing for the entire framework, yet the derivation steps, the precise definition of utility, and the distributional assumptions required to obtain the log(N) form are not supplied in sufficient detail to allow independent verification.
  2. [§3] §3, Eq. (scaling-law decomposition): the unified scaling law is presented as decomposing Best-of-N performance into four measurable quantities that diagnose the two failure modes. However, algebraic independence of these four quantities from N and from one another is not demonstrated; without this, the diagnosis risks being circular or approximate rather than exact.
minor comments (2)
  1. [§5] §5 (experiments): the description of the personalized text-generation tasks, the policy models, and the exact procedure for measuring the four quantities should be expanded to support reproducibility.
  2. Notation for the four quantities in the scaling law is introduced without an explicit table or equation block that lists their definitions and measurement protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to improve clarity in the theoretical sections, and we will revise the paper to address them directly while preserving the core contributions.

read point-by-point responses
  1. Referee: [§2] §2 (theoretical analysis): the claim that oracle selection yields logarithmic growth in expected utility is load-bearing for the entire framework, yet the derivation steps, the precise definition of utility, and the distributional assumptions required to obtain the log(N) form are not supplied in sufficient detail to allow independent verification.

    Authors: We agree that additional detail will strengthen verifiability. In the revised manuscript we will expand §2 with the full derivation: utility is defined as the expected personalized reward E[r(y)|x,u] for response y given query x and user u. Under the assumption that the N candidate utilities are i.i.d. draws from a distribution whose upper tail satisfies the conditions for extreme-value convergence (in the canonical exponential-tailed case, e.g. standard Gumbel utilities with finite mean), the expectation of the maximum is shown to be log(N) + γ + o(1), where γ is the Euler-Mascheroni constant. We will include the integral representation of E[max U_i], the asymptotic expansion, and the precise regularity conditions on the utility distribution. This will allow independent verification without altering the result (a numerical sketch of this asymptotic follows the point-by-point responses). revision: yes

  2. Referee: [§3] §3, Eq. (scaling-law decomposition): the unified scaling law is presented as decomposing Best-of-N performance into four measurable quantities that diagnose the two failure modes. However, algebraic independence of these four quantities from N and from one another is not demonstrated; without this, the diagnosis risks being circular or approximate rather than exact.

    Authors: We thank the referee for this point. The four quantities (user-level mean reward, user-level reward variance, query-level correlation between predicted and true utility, and a collapse indicator) are defined as expectations over the fixed data distribution and are therefore independent of N by construction. The decomposition itself follows from the law of total expectation applied to the utility of the argmax-selected sample, separating the reward-model statistics from the selection operator. In the revision we will add an appendix that derives the identity algebraically, showing term-by-term independence from N and mutual separation. This establishes that the diagnosis is exact rather than approximate (a measurement sketch for the per-user and per-query diagnostics follows the point-by-point responses). revision: yes
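
The extreme-value claim in response 1 can be sanity-checked numerically. A minimal sketch, assuming the simplest case named there (i.i.d. standard Gumbel utilities, for which the expected maximum of N draws is exactly ln N + γ); this illustrates the asymptotic and is not the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5772156649015329  # Euler-Mascheroni constant

for n in (2, 8, 32, 128, 512):
    # Empirical mean of the maximum of n i.i.d. standard Gumbel draws,
    # compared against the ln(N) + gamma prediction from extreme-value theory.
    draws = rng.gumbel(size=(20_000, n))
    print(f"N={n:4d}  empirical E[max] = {draws.max(axis=1).mean():.3f}"
          f"   ln N + γ = {np.log(n) + gamma:.3f}")
```

The diagnostics in response 2 can likewise be made concrete. Below is a rough sketch of how per-user and per-query correlations, and from them the collapse rate α and hacking rate β of Proposition 4.4, could be estimated from scored candidates; the grouping keys, the 0.1 collapse threshold (borrowed from Figure 6), and all names are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def per_group_corr(pred, gold, keys):
    """Pearson correlation between predicted and gold scores within each group
    (a user or a query). Near-constant predictions yield rho = 0 by convention."""
    pred, gold, keys = map(np.asarray, (pred, gold, keys))
    rhos = {}
    for k in np.unique(keys):
        m = keys == k
        if m.sum() > 1 and pred[m].std() > 0 and gold[m].std() > 0:
            rhos[k] = float(np.corrcoef(pred[m], gold[m])[0, 1])
        else:
            rhos[k] = 0.0  # degenerate group: treated as collapsed
    return rhos

def failure_rates(pred, gold, users, queries, collapse_thresh=0.1):
    """alpha: fraction of users whose per-user correlation falls below the
    collapse threshold; beta: fraction of queries with negative correlation."""
    rho_u = np.array(list(per_group_corr(pred, gold, users).values()))
    rho_q = np.array(list(per_group_corr(pred, gold, queries).values()))
    return float((rho_u < collapse_thresh).mean()), float((rho_q < 0.0).mean())
```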

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

full rationale

The abstract states a proof that oracle selection yields logarithmic expected utility growth with N candidates, followed by derivation of a unified scaling law decomposing Best-of-N curves into four measurable quantities. No equations are shown that reduce the law or the probabilistic reward model back to fitted parameters from the same data by construction. No self-citations are invoked as load-bearing for uniqueness or ansatz. The four-quantity decomposition is presented as derived rather than assumed or fitted, and the variance mitigation is proposed as a guided extension rather than a renaming of observed patterns. The chain therefore does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the probabilistic reward model introduces learned variance whose fitting details are not described.

pith-pipeline@v0.9.0 · 5489 in / 1334 out tokens · 126762 ms · 2026-05-13T06:26:36.476154+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

  1. [1]

    A survey of personalized large language models: Progress and future directions

    Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Wenhao Yu, Jieming Zhu, Minda Hu, Menglin Yang, Tat-Seng Chua, and Irwin King. A survey of personalized large language models: Progress and future directions, 2025. URL https://arxiv.org/abs/2502.11528

  2. [2]

    Integrating summarization and retrieval for enhanced personalization via large language models, 2023

    Chris Richardson, Yao Zhang, Kellen Gillespie, Sudipta Kar, Arshdeep Singh, Zeynab Raeesy, Omar Zia Khan, and Abhinav Sethy. Integrating summarization and retrieval for enhanced personalization via large language models, 2023. URL https://arxiv.org/abs/2310.20081

  3. [3]

    LaMP: When Large Language Models Meet Personalization

    Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. LaMP: When large language models meet personalization. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7370–7392, Bangkok, Thailand, August 2024. Association ...

  4. [4]

    Democratizing large language models via personalized parameter-efficient fine-tuning

    Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. Democratizing large language models via personalized parameter-efficient fine-tuning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6476–6491, Miami, Florida, USA, November

  5. [5]

    doi: 10.18653/v1/2024.emnlp-main.372

    Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.372. URL https://aclanthology.org/2024.emnlp-main.372/

  6. [6]

    PROPER: A progressive learning framework for personalized large language models with group-level adaptation

    Linhai Zhang, Jialong Wu, Deyu Zhou, and Yulan He. PROPER: A progressive learning framework for personalized large language models with group-level adaptation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  7. [7]

    Personalized soups: Personalized large language model alignment via post-hoc parameter merging

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, 2024. URL https://openreview...

  8. [8]

    Aligning LLMs by predicting preferences from user writing samples

    Stéphane Aroca-Ouellette, Natalie Mackraz, Barry-John Theobald, and Katherine Metcalf. Aligning LLMs by predicting preferences from user writing samples. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=eUMGCipgtE

  9. [9]

    Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving

    Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=VNckp7JEHn

  10. [10]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

  11. [11]

    Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

    Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar. Thinking vs. doing: Agents that reason by scaling test-time interaction. In Workshop on Scaling Environments for Agents, 2025. URL https://openreview.net/forum?id=uhigrPHBm5

  12. [12]

    T-pop: Test-time personalization with online preference feedback, 2025

    Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, and Zhongxiang Dai. T-pop: Test-time personalization with online preference feedback, 2025. URL https://arxiv.org/abs/2509.24696

  13. [13]

    P-genRM: Personalized generative reward model with test-time user-based scaling

    Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Xu Ze, Fei Huang, Yongbin Li, and Kai Zhang. P-genRM: Personalized generative reward model with test-time user-based scaling. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hXNApWLBZG

  14. [14]

    LongLaMP: A benchmark for personalized long-form text generation

    Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, and Hamed Zamani. Longlamp: A benchmark for personalized long-form text generation, 2024. URL https://arxiv.org/abs/2407.11016

  15. [15]

    SynthesizeMe! Inducing persona-guided prompts for personalized reward models in LLMs

    Michael J. Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, and Diyi Yang. SynthesizeMe! inducing persona-guided prompts for personalized reward models in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vo...

  16. [16]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  17. [17]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  18. [18]

    Fdlora: Personalized federated learning of large language model via dual lora tuning. arXiv preprint arXiv:2406.07925, 2024

    Jiaxing Qi, Zhongzhi Luan, Shaohan Huang, Carol Fung, Hailong Yang, and Depei Qian. Fdlora: Personalized federated learning of large language model via dual lora tuning. arXiv preprint arXiv:2406.07925, 2024

  19. [19]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

    Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=lSbbC2VyCu

  20. [20]

    Uncertainty-aware reward model: Teaching reward models to know what is unknown, 2025

    Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, and Junge Zhang. Uncertainty-aware reward model: Teaching reward models to know what is unknown, 2025. URL https://arxiv.org/abs/2410.00847

  21. [21]

    Uncertainty-penalized reinforcement learning from human feedback with diversified reward LoRA ensembles

    Yuanzhao Zhai, Yu Lei, Han Zhang, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, and Huaimin Wang. Uncertainty-penalized reinforcement learning from human feedback with diversified reward lora ensembles. Information Processing and Management, 63(3):104548, 2026. ISSN 0306-4573. doi: https://doi.org/10.1016/j.ipm.2025.104548. URL https://www.sciencedirect.com/scien...

  22. [22]

    Probabilistic uncertain reward model, 2025

    Wangtao Sun, Xiang Cheng, Xing Yu, Haotian Xu, Zhao Yang, Shizhu He, Jun Zhao, and Kang Liu. Probabilistic uncertain reward model, 2025. URL https://arxiv.org/abs/2503.22480

  23. [23]

    Test-time preference optimization: On-the-fly alignment via iterative textual feedback. arXiv preprint arXiv:2501.12895, 2025

    Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, and Yu Cheng. Test-time preference optimization: On-the-fly alignment via iterative textual feedback, 2025. URL https://arxiv.org/abs/2501.12895

  24. [24]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992,...

  25. [25]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machin...

  26. [26]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec

  27. [27]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  28. [28]

    Theorem 3.1 (Oracle Scaling Law, Section A.1): Establishes the theoretical ceiling for TTP, expected utility grows as O(√ln N) with oracle selection

  29. [29]

    Lemma 4.2 (Correlation-Scaling Relationship, Section A.2): Shows that reward model correlation directly determines scaling behavior, providing a diagnostic tool for analyzing RM quality

  30. [30]

    Proposition 4.4 (Unified Scaling Law, Section A.3): Derives how the two failure modes, collapse rate α and hacking rate β, jointly determine population-level scaling

  31. [31]

    Lemmas 5.1 and 5.2 (Gradient Buffering & Implicit Regularization, Section A.4)

    Lemmas 5.1 and 5.2 (Gradient Buffering & Implicit Regularization, Section A.4): Explains the mechanisms by which probabilistic reward modeling reduces both failure modes. Throughout the proofs, we introduce necessary assumptions and provide remarks connecting theoretical insights to empirical observations. Table A1 summarizes the key assumptions used in...

  32. [32]

    Training efficiency

    Training efficiency: The reward model trains 3.1× faster per sample than the policy model (19 ms vs. 59 ms), owing to the smaller backbone (1.5B vs. 4B parameters) and shorter sequence length

  33. [33]

    Learned Token Pruning for Transformers

    Inference cost structure: The RM scoring cost is negligible compared to generation cost. Scoring a single candidate takes only 2.3 ms, while generating one response takes 1498 ms—a 650× difference. Even with N = 30 candidates, the total scoring time (69 ms) remains less than 5% of a single generation. These findings suggest that the computational bottlenec...

  34. [34]

    Sample N candidates from the pre-generated candidate pool (up to 30 candidates per query)

  35. [35]

    Score each candidate using the user-specific reward model R_u

  36. [36]

    For probabilistic User RM, we use only the predicted mean µ(x, y) for selection

    Select the candidate with the highest predicted reward: ŷ = argmax_{y ∈ Y_N} R_u(x, y). For probabilistic User RM, we use only the predicted mean µ(x, y) for selection. Evaluation Protocol. Table A17 summarizes the inference configuration. For each (user, query, N) combination, we repeat the random candidate sampling 3 times and report the average to reduce varian...
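
A minimal sketch of the evaluation loop described above, assuming a dict-of-lists candidate pool and placeholder callables for the reward model's predicted mean and the gold utility. Only the protocol (sample up to N candidates from the pre-generated pool, select the argmax of the predicted mean, repeat the random sampling 3 times and average) follows the text; the interfaces and names are assumptions.

```python
import numpy as np

def evaluate_best_of_n(pool, rm_mean, gold_utility, n, repeats=3, seed=0):
    """pool: {(user, query): [candidate, ...]} with up to 30 pre-generated
    candidates per query. rm_mean(user, query, y) -> predicted mean mu(x, y);
    gold_utility(user, query, y) -> ground-truth score (e.g. ROUGE)."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(repeats):  # average over 3 random re-samplings, as above
        picks = []
        for (user, query), cands in pool.items():
            idx = rng.choice(len(cands), size=min(n, len(cands)), replace=False)
            scores = [rm_mean(user, query, cands[i]) for i in idx]
            best = cands[idx[int(np.argmax(scores))]]
            picks.append(gold_utility(user, query, best))
        runs.append(float(np.mean(picks)))
    return float(np.mean(runs))
```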