pith. machine review for the scientific record.

arxiv: 2605.10991 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords test-time personalization · probabilistic reward model · best-of-n scaling · scaling law · user-level collapse · query-level reward hacking · LLM personalization · inference scaling

The pith

Probabilistic personalized reward models with learned variance fix scaling failures in test-time personalization, enabling logarithmic utility growth with more samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sampling many output candidates from a personalized policy model at inference time and selecting the best one with a reward model can improve results, but only if the reward model ranks candidates reliably. The authors prove that perfect (oracle) selection would make expected utility grow logarithmically with the number of candidates. Standard reward models fall short due to two problems they diagnose: user-level collapse, where scores stay nearly constant for some users, and query-level reward hacking, where scores correlate negatively with actual quality for some queries. They derive a scaling law that breaks any Best-of-N curve into four measurable quantities to identify these issues, then introduce a probabilistic reward model that learns its own variance to reduce both failures. Experiments on personalized text generation tasks confirm that this approach produces reliable scaling gains across different policy models.
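
For orientation, a minimal Best-of-N selection sketch in Python. The callables `policy_sample` and `reward_score` are placeholders standing in for the paper's personalized policy and reward models; their interfaces are assumptions, not the authors' implementation.

```python
import numpy as np

def best_of_n(query, user, n, policy_sample, reward_score):
    """Sample n candidates from a (personalized) policy and return the one
    the (personalized) reward model scores highest."""
    candidates = [policy_sample(query, user) for _ in range(n)]
    scores = [reward_score(query, cand, user) for cand in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage with stand-in callables: candidates are random numbers and the
# "reward" is the candidate itself, so selection simply returns the maximum.
rng = np.random.default_rng(0)
best = best_of_n("query", "user-42", n=8,
                 policy_sample=lambda q, u: rng.normal(),
                 reward_score=lambda q, y, u: y)
```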

Core claim

We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes: user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. TTP is studied along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model.

What carries the argument

The unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities; it diagnoses the failure modes and guides the probabilistic personalized reward model whose learned variance mitigates user-level collapse and query-level reward hacking.

If this is right

  • Oracle selection produces expected utility that grows logarithmically with the number of sampled candidates.
  • The probabilistic reward model mitigates both user-level collapse and query-level reward hacking.
  • TTP produces consistent scaling improvements across multiple policy models and personalized text generation tasks.
  • The scaling law closely matches observed Best-of-N curves for different reward model variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Uncertainty estimation via learned variance could improve reward models in broader LLM applications such as alignment or safety evaluation.
  • The diagnostic decomposition approach might extend to analyzing scaling in non-personalized generation or other output modalities.
  • Test-time methods of this form may combine with training-time personalization to produce larger overall gains for individual users.

Load-bearing premise

That the decomposition of any reward model's Best-of-N curve into four measurable quantities is exact, with quantities that depend on neither N nor one another, and that learning variance in the probabilistic reward model mitigates user-level collapse and query-level reward hacking without introducing new biases or overfitting to the measured quantities.

What would settle it

The framework would be undermined if the probabilistic reward model's Best-of-N curves in the personalized text generation experiments did not improve with additional samples, or if they deviated from the predictions of the derived scaling law.

Figures

Figures reproduced from arXiv: 2605.10991 by Linhai Zhang, Yulan He.

Figure 1
Figure 1. Figure 1: Test-time personalization on LaMP-4: News Headline Generation. Oracle shows logarithmic scaling that surpasses training-based baselines, while standard Reward Models (RMs) fail to scale, performing close to, or worse than, random selection. To diagnose this gap, we develop an analytical framework that connects a reward model's correlation with the golden score to its Best-of-N scaling behavior (Section 4)… view at source ↗
Figure 2
Figure 2. Figure 2: Oracle scaling on (a) LaMP-5 and (b) LongLaMP-Product. Oracle selection (red solid)… view at source ↗
Figure 3
Figure 3. Figure 3: Scaling curves for standard reward models. (a) On LaMP-5, Global RM performs no better… view at source ↗
Figure 4
Figure 4. Figure 4: Correlation diagnostics on LaMP-5 and LongLaMP-Product Review. view at source ↗
Figure 5
Figure 5. Figure 5: Main TTP results under the RAG policy across five tasks (a)–(e). view at source ↗
Figure 6
Figure 6. Figure 6: Failure-mode analysis on LaMP-4. (a) Ground-truth ROUGE standard deviation distinguishes collapsed users (ρu < 0.1 under Det RM) from normal users (ρu > 0.5). (b) Score histogram of a representative collapsed user: GT scores cluster near zero, Det RM degenerates to near-constant predictions (ρ = −0.10), while Prob RM preserves meaningful variation (ρ = 0.78). (c) Per-query correlation scatter (Det vs Prob): gr… view at source ↗
read the original abstract

Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies test-time personalization (TTP) for LLMs by sampling N candidates from a personalized policy model and selecting the best via a personalized reward model. It proves that oracle selection produces expected utility that grows logarithmically with N. It derives a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities, identifying two failure modes (user-level collapse and query-level reward hacking). A probabilistic personalized reward model with learned variance is proposed to mitigate these modes, and experiments are reported to confirm consistent scaling across policy models and tasks as well as close agreement between the derived law and observed curves.

Significance. If the logarithmic bound and the exact algebraic decomposition hold, the work supplies both a theoretical ceiling for test-time scaling in personalization and a practical diagnostic tool that can guide reward-model design. The combination of a parameter-free theoretical result, a falsifiable scaling law, and empirical validation across multiple models would constitute a substantive contribution to understanding inference-time compute in personalized generation.

major comments (2)
  1. [§2] §2 (theoretical analysis): the claim that oracle selection yields logarithmic growth in expected utility is load-bearing for the entire framework, yet the derivation steps, the precise definition of utility, and the distributional assumptions required to obtain the log(N) form are not supplied in sufficient detail to allow independent verification.
  2. [§3] §3, Eq. (scaling-law decomposition): the unified scaling law is presented as decomposing Best-of-N performance into four measurable quantities that diagnose the two failure modes. However, algebraic independence of these four quantities from N and from one another is not demonstrated; without this, the diagnosis risks being circular or approximate rather than exact.
minor comments (2)
  1. [§5] §5 (experiments): the description of the personalized text-generation tasks, the policy models, and the exact procedure for measuring the four quantities should be expanded to support reproducibility.
  2. Notation for the four quantities in the scaling law is introduced without an explicit table or equation block that lists their definitions and measurement protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to improve clarity in the theoretical sections, and we will revise the paper to address them directly while preserving the core contributions.

read point-by-point responses
  1. Referee: [§2] §2 (theoretical analysis): the claim that oracle selection yields logarithmic growth in expected utility is load-bearing for the entire framework, yet the derivation steps, the precise definition of utility, and the distributional assumptions required to obtain the log(N) form are not supplied in sufficient detail to allow independent verification.

    Authors: We agree that additional detail will strengthen verifiability. In the revised manuscript we will expand §2 with the full derivation: utility is defined as the expected personalized reward E[r(y)|x,u] for response y given query x and user u. Under the assumption that the N candidate utilities are i.i.d. draws from a distribution whose upper tail satisfies the conditions for extreme-value convergence (in the canonical exponential-tailed case, e.g. standard Gumbel utilities with finite mean), the expectation of the maximum is shown to be log(N) + γ + o(1), where γ is the Euler-Mascheroni constant. We will include the integral representation of E[max U_i], the asymptotic expansion, and the precise regularity conditions on the utility distribution. This will allow independent verification without altering the result (a numerical sketch of this asymptotic follows the point-by-point responses). revision: yes

  2. Referee: [§3] §3, Eq. (scaling-law decomposition): the unified scaling law is presented as decomposing Best-of-N performance into four measurable quantities that diagnose the two failure modes. However, algebraic independence of these four quantities from N and from one another is not demonstrated; without this, the diagnosis risks being circular or approximate rather than exact.

    Authors: We thank the referee for this point. The four quantities (user-level mean reward, user-level reward variance, query-level correlation between predicted and true utility, and a collapse indicator) are defined as expectations over the fixed data distribution and are therefore independent of N by construction. The decomposition itself follows from the law of total expectation applied to the utility of the argmax-selected sample, separating the reward-model statistics from the selection operator. In the revision we will add an appendix that derives the identity algebraically, showing term-by-term independence from N and mutual separation. This establishes that the diagnosis is exact rather than approximate (a measurement sketch for the per-user and per-query diagnostics follows the point-by-point responses). revision: yes
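
The extreme-value claim in response 1 can be sanity-checked numerically. A minimal sketch, assuming the simplest case named there (i.i.d. standard Gumbel utilities, for which the expected maximum of N draws is exactly ln N + γ); this illustrates the asymptotic and is not the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5772156649015329  # Euler-Mascheroni constant

for n in (2, 8, 32, 128, 512):
    # Empirical mean of the maximum of n i.i.d. standard Gumbel draws,
    # compared against the ln(N) + gamma prediction from extreme-value theory.
    draws = rng.gumbel(size=(20_000, n))
    print(f"N={n:4d}  empirical E[max] = {draws.max(axis=1).mean():.3f}"
          f"   ln N + γ = {np.log(n) + gamma:.3f}")
```

The diagnostics in response 2 can likewise be made concrete. Below is a rough sketch of how per-user and per-query correlations, and from them the collapse rate α and hacking rate β of Proposition 4.4, could be estimated from scored candidates; the grouping keys, the 0.1 collapse threshold (borrowed from Figure 6), and all names are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def per_group_corr(pred, gold, keys):
    """Pearson correlation between predicted and gold scores within each group
    (a user or a query). Near-constant predictions yield rho = 0 by convention."""
    pred, gold, keys = map(np.asarray, (pred, gold, keys))
    rhos = {}
    for k in np.unique(keys):
        m = keys == k
        if m.sum() > 1 and pred[m].std() > 0 and gold[m].std() > 0:
            rhos[k] = float(np.corrcoef(pred[m], gold[m])[0, 1])
        else:
            rhos[k] = 0.0  # degenerate group: treated as collapsed
    return rhos

def failure_rates(pred, gold, users, queries, collapse_thresh=0.1):
    """alpha: fraction of users whose per-user correlation falls below the
    collapse threshold; beta: fraction of queries with negative correlation."""
    rho_u = np.array(list(per_group_corr(pred, gold, users).values()))
    rho_q = np.array(list(per_group_corr(pred, gold, queries).values()))
    return float((rho_u < collapse_thresh).mean()), float((rho_q < 0.0).mean())
```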

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

full rationale

The abstract states a proof that oracle selection yields logarithmic expected utility growth with N candidates, followed by derivation of a unified scaling law decomposing Best-of-N curves into four measurable quantities. No equations are shown that reduce the law or the probabilistic reward model back to fitted parameters from the same data by construction. No self-citations are invoked as load-bearing for uniqueness or ansatz. The four-quantity decomposition is presented as derived rather than assumed or fitted, and the variance mitigation is proposed as a guided extension rather than a renaming of observed patterns. The chain therefore does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the probabilistic reward model introduces learned variance whose fitting details are not described.

pith-pipeline@v0.9.0 · 5489 in / 1334 out tokens · 126762 ms · 2026-05-13T06:26:36.476154+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

  1. [1]

    A survey of personalized large language models: Progress and future directions

    Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Wenhao Yu, Jieming Zhu, Minda Hu, Menglin Yang, Tat-Seng Chua, and Irwin King. A survey of personalized large language models: Progress and future directions, 2025. URL https://arxiv.org/abs/2502.11528

  2. [2]

    Integrating summarization and retrieval for enhanced personalization via large language models, 2023

    Chris Richardson, Yao Zhang, Kellen Gillespie, Sudipta Kar, Arshdeep Singh, Zeynab Raeesy, Omar Zia Khan, and Abhinav Sethy. Integrating summarization and retrieval for enhanced personalization via large language models, 2023. URL https://arxiv.org/abs/2310.20081

  3. [3]

    LaMP: When Large Language Models Meet Personalization

    Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. LaMP: When large language models meet personalization. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7370–7392, Bangkok, Thailand, August 2024. Association ...

  4. [4]

    Democratizing large language models via personalized parameter-efficient fine-tuning

    Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. Democratizing large language models via personalized parameter-efficient fine-tuning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6476–6491, Miami, Florida, USA, November

  5. [5]

    doi: 10.18653/v1/2024.emnlp-main.372

    Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.372. URL https://aclanthology.org/2024.emnlp-main.372/

  6. [6]

    PROPER: A progressive learning framework for personalized large language models with group-level adaptation

    Linhai Zhang, Jialong Wu, Deyu Zhou, and Yulan He. PROPER: A progressive learning framework for personalized large language models with group-level adaptation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  7. [7]

    Personalized soups: Personalized large language model alignment via post-hoc parameter merging

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, 2024. URL https://openreview...

  8. [8]

    Aligning LLMs by predicting preferences from user writing samples

    Stéphane Aroca-Ouellette, Natalie Mackraz, Barry-John Theobald, and Katherine Metcalf. Aligning LLMs by predicting preferences from user writing samples. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=eUMGCipgtE

  9. [9]

    Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving

    Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=VNckp7JEHn

  10. [10]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

  11. [11]

    Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

    Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar. Thinking vs. doing: Agents that reason by scaling test-time interaction. In Workshop on Scaling Environments for Agents, 2025. URL https://openreview.net/forum?id=uhigrPHBm5

  12. [12]

    T-pop: Test-time personalization with online preference feedback, 2025

    Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, and Zhongxiang Dai. T-pop: Test-time personalization with online preference feedback, 2025. URL https://arxiv.org/abs/2509.24696

  13. [13]

    P-genRM: Personalized generative reward model with test-time user-based scaling

    Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Xu Ze, Fei Huang, Yongbin Li, and Kai Zhang. P-genRM: Personalized generative reward model with test-time user-based scaling. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hXNApWLBZG

  14. [14]

    LongLaMP: A benchmark for personalized long-form text generation

    Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, and Hamed Zamani. Longlamp: A benchmark for personalized long-form text generation, 2024. URL https://arxiv.org/abs/2407.11016

  15. [15]

    SynthesizeMe! Inducing persona-guided prompts for personalized reward models in LLMs

    Michael J. Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, and Diyi Yang. SynthesizeMe! inducing persona-guided prompts for personalized reward models in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vo...

  16. [16]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  17. [17]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  18. [18]

    Fdlora: Personalized federated learning of large language model via dual lora tuning. arXiv preprint arXiv:2406.07925, 2024

    Jiaxing Qi, Zhongzhi Luan, Shaohan Huang, Carol Fung, Hailong Yang, and Depei Qian. Fdlora: Personalized federated learning of large language model via dual lora tuning. arXiv preprint arXiv:2406.07925, 2024

  19. [19]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

    Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=lSbbC2VyCu

  20. [20]

    Uncertainty-aware reward model: Teaching reward models to know what is unknown, 2025

    Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, and Junge Zhang. Uncertainty-aware reward model: Teaching reward models to know what is unknown, 2025. URL https://arxiv.org/abs/2410.00847

  21. [21]

    Uncertainty-penalized reinforcement learning from human feedback with diversified reward LoRA ensembles

    Yuanzhao Zhai, Yu Lei, Han Zhang, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, and Huaimin Wang. Uncertainty-penalized reinforcement learning from human feedback with diversified reward lora ensembles. Information Processing and Management, 63(3):104548, 2026. ISSN 0306-4573. doi: https://doi.org/10.1016/j.ipm.2025.104548. URL https://www.sciencedirect.com/scien...

  22. [22]

    Probabilistic uncertain reward model, 2025

    Wangtao Sun, Xiang Cheng, Xing Yu, Haotian Xu, Zhao Yang, Shizhu He, Jun Zhao, and Kang Liu. Probabilistic uncertain reward model, 2025. URL https://arxiv.org/abs/2503.22480

  23. [23]

    Test-time preference optimization: On-the-fly alignment via iterative textual feedback. arXiv preprint arXiv:2501.12895, 2025

    Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, and Yu Cheng. Test-time preference optimization: On-the-fly alignment via iterative textual feedback, 2025. URL https://arxiv.org/abs/2501.12895

  24. [24]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992,...

  25. [25]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machin...

  26. [26]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec

  27. [27]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  28. [28]

    Theorem 3.1 (Oracle Scaling Law, Section A.1): Establishes the theoretical ceiling for TTP, expected utility grows as O(√ln N) with oracle selection

  29. [29]

    Lemma 4.2 (Correlation-Scaling Relationship, Section A.2): Shows that reward model correlation directly determines scaling behavior, providing a diagnostic tool for analyzing RM quality

  30. [30]

    Proposition 4.4 (Unified Scaling Law, Section A.3): Derives how the two failure modes, collapse rate α and hacking rate β, jointly determine population-level scaling

  31. [31]

    Lemmas 5.1 and 5.2 (Gradient Buffering & Implicit Regularization, Section A.4)

    Lemmas 5.1 and 5.2 (Gradient Buffering & Implicit Regularization, Section A.4): Explains the mechanisms by which probabilistic reward modeling reduces both failure modes. Throughout the proofs, we introduce necessary assumptions and provide remarks connecting theoretical insights to empirical observations. Table A1 summarizes the key assumptions used in...

  32. [32]

    Training efficiency

    Training efficiency: The reward model trains 3.1× faster per sample than the policy model (19 ms vs. 59 ms), owing to the smaller backbone (1.5B vs. 4B parameters) and shorter sequence length

  33. [33]

    Learned Token Pruning for Transformers

    Inference cost structure: The RM scoring cost is negligible compared to generation cost. Scoring a single candidate takes only 2.3 ms, while generating one response takes 1498 ms—a 650× difference. Even with N = 30 candidates, the total scoring time (69 ms) remains less than 5% of a single generation. These findings suggest that the computational bottlenec...

  34. [34]

    Sample N candidates from the pre-generated candidate pool (up to 30 candidates per query)

  35. [35]

    Score each candidate using the user-specific reward model R_u

  36. [36]

    For probabilistic User RM, we use only the predicted mean µ(x, y) for selection

    Select the candidate with the highest predicted reward: ŷ = argmax_{y ∈ Y_N} R_u(x, y). For probabilistic User RM, we use only the predicted mean µ(x, y) for selection. Evaluation Protocol. Table A17 summarizes the inference configuration. For each (user, query, N) combination, we repeat the random candidate sampling 3 times and report the average to reduce varian...
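
A minimal sketch of the evaluation loop described above, assuming a dict-of-lists candidate pool and placeholder callables for the reward model's predicted mean and the gold utility. Only the protocol (sample up to N candidates from the pre-generated pool, select the argmax of the predicted mean, repeat the random sampling 3 times and average) follows the text; the interfaces and names are assumptions.

```python
import numpy as np

def evaluate_best_of_n(pool, rm_mean, gold_utility, n, repeats=3, seed=0):
    """pool: {(user, query): [candidate, ...]} with up to 30 pre-generated
    candidates per query. rm_mean(user, query, y) -> predicted mean mu(x, y);
    gold_utility(user, query, y) -> ground-truth score (e.g. ROUGE)."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(repeats):  # average over 3 random re-samplings, as above
        picks = []
        for (user, query), cands in pool.items():
            idx = rng.choice(len(cands), size=min(n, len(cands)), replace=False)
            scores = [rm_mean(user, query, cands[i]) for i in idx]
            best = cands[idx[int(np.argmax(scores))]]
            picks.append(gold_utility(user, query, best))
        runs.append(float(np.mean(picks)))
    return float(np.mean(runs))
```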