Spectral Souping: A Unified Framework for Online Preference Alignment

Andre Barreto; Arthur Gretton; Bo Dai; Guy Tennenholtz; James Harrison; Ted Yun; Yinlam Chow

arxiv: 2605.20408 · v1 · pith:5PZWDUMQnew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-21 07:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords: spectral representation, model merging, preference alignment, RLHF, online adaptation, LLM policies, policy souping

The pith

Large language models contain a universal spectral representation that enables efficient merging of specialized preference policies at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs possess a universal spectral representation highly suitable for model merging. This supports a two-phase process: specialized policies for distinct preference dimensions are first learned offline as a basis, then combined online at inference by merging their outputs or parameters. The result is rapid adaptation to individual user preferences without retraining on tailored rewards. Standard RLHF optimizes only for average preferences and struggles with conflicting user needs, so this method offers a more scalable alternative for personalization.

Core claim

The authors claim the discovery of a universal spectral representation within LLMs that is proven to be highly amenable to model merging. This insight enables learning a basis of specialized policies offline, each focused on a distinct fine-grained preference dimension, followed by an online adaptation algorithm that efficiently soups these policies at inference time by merging outputs or parameters, without costly online retraining with respect to tailored preference rewards.
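
The output-merging variant described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes each basis policy exposes next-token logits over a shared vocabulary and that a user is summarized by a nonnegative weight vector. The function name `soup_logits` and the toy numbers are hypothetical.

```python
import numpy as np

def soup_logits(basis_logits: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Merge K basis policies' next-token logits into one distribution.

    basis_logits: shape (K, V), one row of logits per basis policy (assumed).
    weights: shape (K,), user-specific mixing weights (assumed nonnegative).
    Returns a probability vector of shape (V,).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalize to a convex combination
    merged = weights @ basis_logits        # weighted sum of logits, shape (V,)
    merged = merged - merged.max()         # shift for a numerically stable softmax
    probs = np.exp(merged)
    return probs / probs.sum()

# Toy example: two basis policies over a 3-token vocabulary.
basis = np.array([[2.0, 0.0, 0.0],    # policy 0 prefers token 0
                  [0.0, 0.0, 2.0]])   # policy 1 prefers token 2
probs = soup_logits(basis, np.array([1.0, 0.0]))  # user cares only about dimension 0
```

Changing the weight vector changes the merged distribution with no gradient step, which is the property the core claim relies on.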

What carries the argument

The universal spectral representation carries the argument: it serves as the basis for merging specialized policies for different preference dimensions without performance loss.

If this is right

A single set of offline-learned policies can serve many different users by online merging.
Adaptation to new preferences occurs instantly at inference without gradient updates.
The approach unifies offline basis learning with online merging in one framework.
Empirical results show gains over prior state-of-the-art methods on online alignment benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same spectral property might appear in non-LLM models, allowing similar merging for other tasks.
Scaling the number of basis policies could test how many distinct preference dimensions can be handled simultaneously.
Merging could extend beyond preferences to combine capabilities learned under different objectives or data distributions.

Load-bearing premise

A universal spectral representation exists in LLMs that allows policies for different preferences to be merged effectively without degrading performance or requiring retraining.

What would settle it

An experiment in which merging the specialized policies via the spectral method produces clear performance drops on preference alignment tasks compared to retrained models would disprove the central claim.

Original abstract

Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this issue, we introduce Spectral Souping, a unified framework for efficient, online preference alignment. Our contribution is the discovery of a universal spectral representation within LLMs, which is proven to be highly amenable to model merging. This theoretical insight enables a two-phase methodology: we first learn a basis of specialized policies offline, each focused on a distinct, fine-grained preference dimension. An online adaptation algorithm then efficiently "soups" these policies at inference time, either by merging their outputs or parameters, enabling rapid model adaptation without the need for costly online retraining w.r.t. tailored preference rewards. Experiments on online preference alignment benchmarks demonstrate that our method achieves significant performance improvements over existing state-of-the-art approaches, presenting a scalable and computationally efficient solution for dynamically adapting LLMs to individual user preferences.
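
The abstract also mentions merging parameters rather than outputs. A minimal sketch of that variant follows, assuming each basis policy is a dictionary of arrays with identical keys and shapes. The helper `soup_parameters` and the parameter names are hypothetical, and a plain weighted average stands in for whatever merging rule the paper actually uses.

```python
import numpy as np

def soup_parameters(basis_params, weights):
    """Weighted average of parameter dicts that share keys and shapes."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalize to a convex combination
    keys = basis_params[0].keys()
    return {
        name: sum(w * params[name] for w, params in zip(weights, basis_params))
        for name in keys
    }

# Two toy "policies" with one weight matrix and one bias vector each.
policy_a = {"head.weight": np.array([[1.0, 0.0], [0.0, 1.0]]),
            "head.bias": np.array([0.0, 2.0])}
policy_b = {"head.weight": np.array([[3.0, 0.0], [0.0, 3.0]]),
            "head.bias": np.array([2.0, 0.0])}
merged = soup_parameters([policy_a, policy_b], weights=[0.25, 0.75])
```

Parameter merging yields a single model to serve, whereas output merging needs a forward pass through every basis policy. The abstract does not say which trade-off the authors favor.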

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Spectral Souping, a unified framework for online preference alignment of LLMs. It claims the discovery of a universal spectral representation in LLMs that is proven amenable to model merging. The approach consists of an offline phase learning a basis of specialized policies each targeting a distinct fine-grained preference dimension, followed by an online adaptation algorithm that soups these policies at inference time via output or parameter merging to enable rapid adaptation without retraining. Experiments on online preference alignment benchmarks are reported to show significant gains over existing state-of-the-art methods.

Significance. If the claimed universal spectral representation holds and permits merging of basis policies without performance loss or interference, the framework would provide a computationally efficient route to dynamic, user-specific LLM alignment that avoids repeated online RLHF. The two-phase offline-to-online design could scale personalization while reducing retraining costs. However, the absence of explicit derivations for the spectral claim and verification of its key assumptions substantially limits the assessed significance of the contribution.

major comments (2)

[Abstract] The manuscript asserts that a universal spectral representation 'is proven to be highly amenable to model merging,' yet the provided text contains no equations, derivations, or proof details supporting this theoretical insight, which is load-bearing for the entire two-phase methodology and the claimed universality.
[Methodology] (implied in the abstract's description of basis policies and merging) The central merging step implicitly requires the learned basis policies to occupy approximately orthogonal directions in spectral space so that their combination remains additive without interference. No verification is reported (e.g., Gram matrix of basis vectors, eigenvalue spread, or correlation analysis), even though human preference data commonly exhibit correlated dimensions; this directly risks violating the additivity assumption and undermining the online souping algorithm's correctness.
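
The checks named in the second major comment can be sketched as follows. This is an illustration only: it assumes each basis policy is summarized by a single vector in some representation space (the abstract does not say which), and the function `orthogonality_report` and the random test data are hypothetical.

```python
import numpy as np

def orthogonality_report(basis: np.ndarray) -> dict:
    """basis: shape (K, d), one vector per basis policy (assumed representation)."""
    unit = basis / np.linalg.norm(basis, axis=1, keepdims=True)
    gram = unit @ unit.T                           # cosine-similarity Gram matrix
    off_diag = gram[~np.eye(len(gram), dtype=bool)]
    eigvals = np.linalg.eigvalsh(gram)             # ascending; gram is symmetric
    return {
        "gram": gram,
        "mean_abs_off_diag": float(np.abs(off_diag).mean()),
        "condition_number": float(eigvals[-1] / eigvals[0]),
    }

rng = np.random.default_rng(0)
basis = rng.normal(size=(4, 256))    # stand-in vectors, not real policy features
report = orthogonality_report(basis)
```

A mean off-diagonal value near zero and a condition number near one would indicate approximately orthogonal directions. Strongly correlated preference dimensions would push both upward.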

minor comments (1)

[Abstract] The abstract would benefit from naming the specific online preference alignment benchmarks and the exact baselines used in the reported experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below. Where the comments correctly identify gaps in the presentation of the theoretical claims and supporting analyses, we have revised the manuscript to incorporate additional details and verifications.

Point-by-point responses

Referee: [Abstract] The manuscript asserts that a universal spectral representation 'is proven to be highly amenable to model merging,' yet the provided text contains no equations, derivations, or proof details supporting this theoretical insight, which is load-bearing for the entire two-phase methodology and the claimed universality.

Authors: We agree that the abstract, as written, does not contain the supporting equations or derivations. This was an oversight in balancing brevity with completeness. In the revised manuscript we have expanded the abstract to include a concise reference to the key theoretical result and added a high-level proof sketch (based on the spectral decomposition of the preference covariance matrix) directly in the Methodology section, with the full derivation moved to the appendix for completeness.

Revision: yes

Referee: [Methodology] (implied in the abstract's description of basis policies and merging) The central merging step implicitly requires the learned basis policies to occupy approximately orthogonal directions in spectral space so that their combination remains additive without interference. No verification is reported (e.g., Gram matrix of basis vectors, eigenvalue spread, or correlation analysis), even though human preference data commonly exhibit correlated dimensions; this directly risks violating the additivity assumption and undermining the online souping algorithm's correctness.

Authors: The referee correctly notes the importance of verifying the approximate orthogonality assumption. The original submission did not report this analysis. We have now performed the requested checks on the learned basis policies and added the results to the Experiments section: the Gram matrix of the basis vectors, the eigenvalue spread of the spectral representation, and pairwise correlation coefficients across the preference dimensions. These results confirm low off-diagonal correlations (average 0.07) after an explicit orthogonalization step applied during offline training, which mitigates the impact of correlated human preferences. We have also clarified in the text how this step preserves the additivity property required by the online souping algorithm.

Revision: yes
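
The simulated rebuttal does not describe the orthogonalization step itself. As a stand-in, the sketch below uses a reduced QR decomposition (equivalent in effect to Gram-Schmidt) on hypothetical basis vectors and measures off-diagonal correlation before and after. The function names and the synthetic correlated data are assumptions made for illustration, not the procedure the authors applied.

```python
import numpy as np

def orthogonalize(basis: np.ndarray) -> np.ndarray:
    """Return K orthonormal rows spanning the same space as basis (K, d), K <= d."""
    q, _ = np.linalg.qr(basis.T)   # reduced QR: q is (d, K) with orthonormal columns
    return q.T

def mean_abs_off_diag(vectors: np.ndarray) -> float:
    """Mean absolute off-diagonal entry of the Gram matrix of the given rows."""
    gram = vectors @ vectors.T
    mask = ~np.eye(len(gram), dtype=bool)
    return float(np.abs(gram[mask]).mean())

# Synthetic correlated directions: a shared component plus independent noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=64)
correlated = np.stack([shared + 0.5 * rng.normal(size=64) for _ in range(3)])

unit = correlated / np.linalg.norm(correlated, axis=1, keepdims=True)
before = mean_abs_off_diag(unit)
ortho = orthogonalize(correlated)
after = mean_abs_off_diag(ortho)
```

Exact orthogonalization drives the off-diagonal terms to numerical zero, so a reported average of 0.07 would suggest a softer penalty or an approximate procedure. That reading is an inference, not something the rebuttal states.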

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against benchmarks

full rationale

The paper presents Spectral Souping as a two-phase framework: offline learning of basis policies for distinct preference dimensions, followed by online merging of outputs or parameters. The abstract frames the universal spectral representation as a discovered theoretical insight enabling this without retraining. No equations, fitted parameters, or self-citations are shown reducing the central claim to its inputs by construction. Performance is evaluated on external online preference alignment benchmarks, providing independent empirical content. The derivation does not exhibit self-definitional loops, renamed known results, or load-bearing self-citations in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed in the provided text.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Lemma 1. ... optimal Q function ... can be linearly parameterized with the reference LLM logit feature ψ ... $Q^*(s,a) = \psi((s,a))^\top \nu_{\beta,r,\mathrm{ref}}$

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: universal spectral representation within LLMs ... linear combination of a small number of basis logit functions
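
The two quoted passages fit together through a simple linear-algebra step, sketched here under stated assumptions. Suppose K basis rewards have optimal Q functions that are each linear in the same feature ψ with weight vectors ν_k (the subscripts β and ref are dropped for brevity), and let w_k be mixing weights. The symbols ν_k, w_k, and K are introduced here for illustration.

```latex
Q_k^*(s,a) = \psi((s,a))^\top \nu_k
\quad\Longrightarrow\quad
\sum_{k=1}^{K} w_k \, Q_k^*(s,a)
  = \psi((s,a))^\top \Big( \sum_{k=1}^{K} w_k \, \nu_k \Big)
```

This shows only that a weighted sum of such Q functions is again linear in ψ, with the weight vectors merged the same way. Whether that sum equals the optimal Q function for the mixed reward is a claim of the paper that the quoted fragments do not establish.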

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

