From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

Chen Zhao; Qi Sun; Ru Peng; Siyue Zhang; Yulin Chen; Yuxiang Xue

arxiv: 2606.00357 · v2 · pith:BM4SWQDEnew · submitted 2026-05-29 · 💻 cs.AI

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

Qi Sun , Siyue Zhang , Yulin Chen , Yuxiang Xue , Ru Peng , Chen Zhao This is my paper

Pith reviewed 2026-06-28 22:02 UTC · model grok-4.3

classification 💻 cs.AI

keywords preference delta aggregationLoRA mergingweak signalspreference optimizationgeometric alignmentLLM fine-tuningmodel compositionknowledge reasoning

0 comments

The pith

Aggregating multiple weak preference signals from model pairs strengthens large language models beyond any single signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that preference data from weak-weaker model pairs, even when individual responses are limited in quality, supplies usable relative quality signals for training stronger models. Each such signal is converted into its own LoRA adapter through preference optimization, after which the adapters are aligned geometrically and merged. This aggregation yields higher performance than any one signal alone, and the gains continue as more signals are included. The improvements appear on knowledge reasoning and agentic search tasks, where the merged result exceeds both single-delta and alternative multi-delta baselines.

Core claim

Preference deltas extracted from distinct weak-weaker model pairs are each turned into a LoRA adapter via preference optimization; these adapters are then aligned in their subspaces by Geometric Alignment Merging before being combined, producing a composite update that encodes complementary capabilities and improves the target strong model more than any individual delta.

What carries the argument

Preference Delta Aggregation (PDA) with Geometric Alignment Merging (GAM), which extracts deltas from weak pairs, represents each as a LoRA, aligns subspaces, and merges to compose signals.

If this is right

Performance on knowledge reasoning rises by 6.8 points on average for the strong model.
Performance on agentic search rises by 7.3 points on average for the strong model.
The merged result exceeds the best single-delta baseline by 2.1 points on reasoning and 4.3 points on search.
Performance continues to rise as more distinct weak signals are incorporated.
The gains are attributed to the composition of complementary capabilities across the separate deltas.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could reduce dependence on scarce high-quality preference data by scaling up the number of lower-quality pairs.
Geometric alignment before merging may prove useful in other adapter-combination settings where directional conflicts arise.
The same aggregation logic could be tested on tasks outside reasoning and search to check how broadly the complementarity holds.
If the alignment step proves robust, even weaker base models might supply usable signals when enough pairs are available.

Load-bearing premise

The preference deltas from different weak-weaker pairs contain complementary capabilities whose LoRA forms can be aligned and merged without substantial loss or conflict.

What would settle it

If the merged adapter shows no improvement over the single best delta on the knowledge reasoning or agentic search benchmarks, or if adding further signals produces no additional gain or causes decline.

Figures

Figures reproduced from arXiv: 2606.00357 by Chen Zhao, Qi Sun, Ru Peng, Siyue Zhang, Yulin Chen, Yuxiang Xue.

**Figure 1.** Figure 1: (a) Preference Delta Aggregation (PDA) independently preference-tunes a strong student model on preference datasets from different weak-weaker model pairs using LoRA fine-tuning, then merges the resulting adapters. (b) In parameter space, each adapter induces a distinct update direction, and PDA aggregates these deltas to compose complementary improvements, moving the student toward a better solution. (c) … view at source ↗

**Figure 2.** Figure 2: Head-level Directional Consistency prior to aggregation. Heatmaps display the subspace [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗

read the original abstract

Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (e.g., Qwen3 4B over 1.7B), despite the limited quality of individual responses, can provide an effective supervision signal through relative quality deltas, which we term a "weak" signal. This motivates a key research question: can multiple "weak" signals be constructively aggregated for improving strong models (e.g., Qwen3 8B)? To this end, we propose Preference Delta Aggregation (PDA), the first framework that derives a preference delta from each weak-weaker model pair, instantiates it as a LoRA adapter learned through preference optimization, and aggregates the resulting deltas via LoRA merging. To further mitigate directional interference during LoRA merging, we introduce Geometric Alignment Merging (GAM), a geometry-aware merging method that aligns adapter subspaces before aggregation, enabling more robust composition of diverse deltas. Evaluations on knowledge reasoning and agentic search benchmarks show that aggregating multiple "weak" signals pushes performance beyond any single signal, with further gains as additional signals are incorporated. Correspondingly, PDA with GAM improves the strong model by 6.8 and 7.3 points on average for knowledge reasoning and agentic search, respectively. It outperforms all single-delta and multi-delta baselines, exceeding the best single-delta baseline by 2.1 and 4.3 points. Further analysis attributes these gains to the effective composition of complementary capabilities encoded across distinct preference deltas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PDA with GAM aggregates multiple weak preference deltas via LoRA and geometric alignment to beat single-delta baselines on reasoning and search tasks.

read the letter

The main takeaway is that this paper gives a concrete way to turn several weak-weaker model pairs into separate preference deltas, each turned into a LoRA, then merged with a new geometric alignment step so the combination improves a stronger model more than any one delta alone.

What stands out is the framing of PDA as a framework that derives, optimizes, and aggregates these deltas, plus the GAM method that aligns subspaces before merging to cut down on interference. The results show consistent gains as more signals are added, with average lifts of 6.8 points on knowledge reasoning and 7.3 on agentic search, and the merged version beating the best single-delta baseline by 2.1 and 4.3 points. That pattern matches the claim that the deltas carry complementary information.

The soft spot is the reliance on the deltas being sufficiently distinct and mergeable; GAM is introduced to handle alignment, but the abstract leaves open how sensitive the gains are to pair selection, exact merging weights, or whether the baselines were implemented with the same optimization details. Without seeing the full ablations or variance numbers, the size of the advantage is hard to judge precisely.

This is aimed at people working on data-efficient alignment and LoRA-based merging for LLMs. Anyone already using preference optimization on weaker models would find the aggregation angle useful to test.

I would send it for peer review. The construction is new enough and the empirical pattern clear enough that referees should check the experiments and code.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes Preference Delta Aggregation (PDA), a framework that extracts a preference delta from each weak-weaker model pair (e.g., Qwen3 4B over 1.7B), instantiates the delta as a LoRA adapter via preference optimization, and aggregates multiple such deltas through LoRA merging. To mitigate directional interference, it introduces Geometric Alignment Merging (GAM), a geometry-aware method that aligns adapter subspaces prior to aggregation. On knowledge reasoning and agentic search benchmarks, PDA with GAM improves the strong model (e.g., Qwen3 8B) by 6.8 and 7.3 points on average, respectively, outperforming all single-delta and multi-delta baselines (exceeding the best single-delta baseline by 2.1 and 4.3 points) with further gains observed as more signals are added; the gains are attributed to complementary capabilities encoded in the distinct deltas.

Significance. If the reported empirical results hold under full experimental scrutiny, the work is significant because it demonstrates a practical route to improving strong models by constructively aggregating abundant but individually weak preference signals, reducing reliance on scarce high-quality supervision. The introduction of GAM for subspace alignment during merging is a concrete technical contribution to the adapter-merging literature. Credit is due for the paper's focus on empirical aggregation of non-redundant deltas and the explicit outperformance claims over baselines.

minor comments (2)

[Abstract] Abstract: the statement that 'further gains as additional signals are incorporated' would benefit from a brief quantitative indication of the scaling (e.g., number of deltas tested and marginal improvement per added signal) to strengthen the central aggregation claim.
The manuscript would be clearer if the precise definition of 'preference delta' (relative quality difference between weak and weaker responses) and the exact preference-optimization objective used to train each LoRA were stated in the first section that introduces PDA.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The description accurately captures the PDA framework, GAM method, and empirical gains on reasoning and search benchmarks.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claims rest on empirical results from aggregating preference deltas via LoRA adapters and the introduced GAM merging method, with reported gains over single-delta baselines on knowledge reasoning and agentic search benchmarks. No equations, derivations, or self-citations are present in the provided text that reduce performance improvements to quantities defined by fitted parameters from the same data or to self-referential constructions. The framework is presented as a new empirical aggregation approach without load-bearing uniqueness theorems or ansatzes imported from prior author work. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, domain axioms, or invented physical entities are stated; the work introduces two new named procedures (PDA and GAM) whose internal hyperparameters and assumptions are not detailed.

pith-pipeline@v0.9.1-grok · 5832 in / 1258 out tokens · 32780 ms · 2026-06-28T22:02:01.294285+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 3 canonical work pages

[1]

2025 , eprint=

The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains , author=. 2025 , eprint=

2025
[2]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[3]

International Conference on Learning Representations (ICLR) , url=

Discovering Latent Knowledge in Language Models Without Supervision , author=. International Conference on Learning Representations (ICLR) , url=
[4]

and Finn, Chelsea , title =

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D. and Finn, Chelsea , title =. 2023 , booktitle =

2023
[5]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024
[6]

2025 , eprint=

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play , author=. 2025 , eprint=

2025
[7]

2025 , eprint=

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning , author=. 2025 , eprint=

2025
[8]

2022 , eprint=

MuSiQue: Multihop Questions via Single-hop Question Composition , author=. 2022 , eprint=

2022
[9]

2024 , eprint=

ORPO: Monolithic Preference Optimization without Reference Model , author=. 2024 , eprint=

2024
[10]

Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Yitzhak Gadre and Hritik Bansal and Etash Kumar Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee F Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton ...

2024
[11]

Smith and Hannaneh Hajishirzi , booktitle=

Evan Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Allyson Ettinger and Michal Guerqu...

2025
[12]

The Thirteenth International Conference on Learning Representations , year=

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model , author=. The Thirteenth International Conference on Learning Representations , year=
[13]

Varying Shades of Wrong: Aligning

Jihan Yao and Wenxuan Ding and Shangbin Feng and Lucy Lu Wang and Yulia Tsvetkov , booktitle=. Varying Shades of Wrong: Aligning. 2025 , url=

2025
[14]

2018 , eprint=

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. 2018 , eprint=

2018
[15]

2020 , eprint=

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps , author=. 2020 , eprint=

2020
[16]

2023 , eprint=

Measuring and Narrowing the Compositionality Gap in Language Models , author=. 2023 , eprint=

2023
[17]

2023 , eprint=

TIES-Merging: Resolving Interference When Merging Models , author=. 2023 , eprint=

2023
[18]

2025 , eprint=

Tulu 3: Pushing Frontiers in Open Language Model Post-Training , author=. 2025 , eprint=

2025
[19]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav. Natura...

work page doi:10.1162/tacl_a_00276 2019
[20]

T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1147

work page doi:10.18653/v1/p17-1147 2017
[21]

2023 , eprint=

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories , author=. 2023 , eprint=

2023
[22]

2026 , eprint=

Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them , author=. 2026 , eprint=

2026
[23]

2025 , eprint=

ZeroSearch: Incentivize the Search Capability of LLMs without Searching , author=. 2025 , eprint=

2025
[24]

2022 , eprint=

Training language models to follow instructions with human feedback , author=. 2022 , eprint=

2022
[25]

2022 , eprint=

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

2022
[26]

Nature , author=

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , url=. Nature , author=
[27]

2023 , eprint=

Editing Models with Task Arithmetic , author=. 2023 , eprint=

2023
[28]

2022 , eprint=

WebGPT: Browser-assisted question-answering with human feedback , author=. 2022 , eprint=

2022
[29]

2023 , eprint=

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision , author=. 2023 , eprint=

2023
[30]

2024 , eprint=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. 2024 , eprint=

2024
[31]

2025 , eprint=

Search-o1: Agentic Search-Enhanced Large Reasoning Models , author=. 2025 , eprint=

2025
[32]

2024 , eprint=

Model merging with SVD to tie the Knots , author=. 2024 , eprint=

2024
[33]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021
[34]

2017 , eprint=

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension , author=. 2017 , eprint=

2017
[35]

2023 , eprint=

Let's Verify Step by Step , author=. 2023 , eprint=

2023
[36]

2023 , eprint=

GPQA: A Graduate-Level Google-Proof Q&A Benchmark , author=. 2023 , eprint=

2023
[37]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

2021
[38]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

2021
[39]

LoraHub: Efficient Cross-Task Generalization via Dynamic Lo

Chengsong Huang and Qian Liu and Bill Yuchen Lin and Tianyu Pang and Chao Du and Min Lin , booktitle=. LoraHub: Efficient Cross-Task Generalization via Dynamic Lo. 2024 , url=

2024
[40]

The Fourteenth International Conference on Learning Representations , year=

Weak-to-Strong Generalization with Failure Trajectories , author=. The Fourteenth International Conference on Learning Representations , year=
[41]

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Hase, Peter and Bansal, Mohit and Clark, Peter and Wiegreffe, Sarah. The Unreasonable Effectiveness of Easy Training Data for Hard Tasks. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024

2024
[42]

Meta-Rewarding Language Models: Self-Improving Alignment with LLM -as-a-Meta-Judge

Wu, Tianhao and Yuan, Weizhe and Golovneva, Olga and Xu, Jing and Tian, Yuandong and Jiao, Jiantao and Weston, Jason E and Sukhbaatar, Sainbayar. Meta-Rewarding Language Models: Self-Improving Alignment with LLM -as-a-Meta-Judge. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

2025
[43]

2020 , booktitle =

Yu, Tianhe and Kumar, Saurabh and Gupta, Abhishek and Levine, Sergey and Hausman, Karol and Finn, Chelsea , title =. 2020 , booktitle =

2020
[44]

Yuchen Fan and Kaiyan Zhang and Heng Zhou and Yuxin Zuo and Yu Fu and Yanxu Chen and Xinwei Long and Xuekai Zhu and Che Jiang and Yuchen Zhang and Li Kang and Cheng Huang and Gang Chen and Zhizhou He and Bingning Wang and LEI BAI and Ning Ding and Bowen Zhou , year=
[45]

2024 , url=

Nemotron-4 340B Technical Report , author=. 2024 , url=

2024
[46]

2026 , url=

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning , author=. 2026 , url=

2026
[47]

2025 , url=

ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset , author=. 2025 , url=

2025
[48]

The FineWeb datasets: decanting the web for the finest text data at scale , year =

Penedo, Guilherme and Kydl\'. The FineWeb datasets: decanting the web for the finest text data at scale , year =. Proceedings of the 38th International Conference on Neural Information Processing Systems , url=
[49]

The Fourteenth International Conference on Learning Representations , year=

Mapping Post-Training Forgetting in Language Models at Scale , author=. The Fourteenth International Conference on Learning Representations , year=
[50]

Diffusion vs

Zhang, Siyue and Zhao, Yilun and Geng, Liyuan and Cohan, Arman and Luu, Anh Tuan and Zhao, Chen. Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

2025
[51]

2026 , url=

Siyue Zhang and Yuan Gao and Xiao Zhou and Yilun Zhao and Tingyu Song and Arman Cohan and Anh Tuan Luu and Chen Zhao , booktitle=. 2026 , url=

2026
[52]

2024 , eprint=

HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback , author=. 2024 , eprint=

2024
[53]

2023 , eprint=

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

2023
[54]

2022 , eprint=

A Contrastive Framework for Neural Text Generation , author=. 2022 , eprint=

2022
[55]

2022 , eprint=

Measuring Progress on Scalable Oversight for Large Language Models , author=. 2022 , eprint=

2022
[56]

2025 , eprint=

Self-Rewarding Language Models , author=. 2025 , eprint=

2025
[57]

2022 , eprint=

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , author=. 2022 , eprint=

2022
[58]

2022 , eprint=

Merging Models with Fisher-Weighted Averaging , author=. 2022 , eprint=

2022
[59]

2025 , eprint=

Dataless Knowledge Fusion by Merging Weights of Language Models , author=. 2025 , eprint=

2025
[60]

2024 , eprint=

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch , author=. 2024 , eprint=

2024
[61]

Continual Learning and Catastrophic Forgetting

Chen, Zhiyuan and Liu, Bing. Continual Learning and Catastrophic Forgetting. Lifelong Machine Learning. 2018

2018
[62]

2025 , eprint=

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework , author=. 2025 , eprint=

2025
[63]

2024 , eprint=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. 2024 , eprint=

2024
[64]

2023 , eprint=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=

2023
[65]

1998 , eprint=

The Geometry of Algorithms with Orthogonality Constraints , author=. 1998 , eprint=

1998
[66]

and Cattell, Raymond B

Hurley, John R. and Cattell, Raymond B. , title =. Behavioral Science , volume =. doi:https://doi.org/10.1002/bs.3830070216 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1002/bs.3830070216 , year =

work page doi:10.1002/bs.3830070216

[1] [1]

2025 , eprint=

The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains , author=. 2025 , eprint=

2025

[2] [2]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[3] [3]

International Conference on Learning Representations (ICLR) , url=

Discovering Latent Knowledge in Language Models Without Supervision , author=. International Conference on Learning Representations (ICLR) , url=

[4] [4]

and Finn, Chelsea , title =

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D. and Finn, Chelsea , title =. 2023 , booktitle =

2023

[5] [5]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024

[6] [6]

2025 , eprint=

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play , author=. 2025 , eprint=

2025

[7] [7]

2025 , eprint=

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning , author=. 2025 , eprint=

2025

[8] [8]

2022 , eprint=

MuSiQue: Multihop Questions via Single-hop Question Composition , author=. 2022 , eprint=

2022

[9] [9]

2024 , eprint=

ORPO: Monolithic Preference Optimization without Reference Model , author=. 2024 , eprint=

2024

[10] [10]

Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Yitzhak Gadre and Hritik Bansal and Etash Kumar Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee F Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton ...

2024

[11] [11]

Smith and Hannaneh Hajishirzi , booktitle=

Evan Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Allyson Ettinger and Michal Guerqu...

2025

[12] [12]

The Thirteenth International Conference on Learning Representations , year=

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model , author=. The Thirteenth International Conference on Learning Representations , year=

[13] [13]

Varying Shades of Wrong: Aligning

Jihan Yao and Wenxuan Ding and Shangbin Feng and Lucy Lu Wang and Yulia Tsvetkov , booktitle=. Varying Shades of Wrong: Aligning. 2025 , url=

2025

[14] [14]

2018 , eprint=

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. 2018 , eprint=

2018

[15] [15]

2020 , eprint=

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps , author=. 2020 , eprint=

2020

[16] [16]

2023 , eprint=

Measuring and Narrowing the Compositionality Gap in Language Models , author=. 2023 , eprint=

2023

[17] [17]

2023 , eprint=

TIES-Merging: Resolving Interference When Merging Models , author=. 2023 , eprint=

2023

[18] [18]

2025 , eprint=

Tulu 3: Pushing Frontiers in Open Language Model Post-Training , author=. 2025 , eprint=

2025

[19] [19]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav. Natura...

work page doi:10.1162/tacl_a_00276 2019

[20] [20]

T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1147

work page doi:10.18653/v1/p17-1147 2017

[21] [21]

2023 , eprint=

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories , author=. 2023 , eprint=

2023

[22] [22]

2026 , eprint=

Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them , author=. 2026 , eprint=

2026

[23] [23]

2025 , eprint=

ZeroSearch: Incentivize the Search Capability of LLMs without Searching , author=. 2025 , eprint=

2025

[24] [24]

2022 , eprint=

Training language models to follow instructions with human feedback , author=. 2022 , eprint=

2022

[25] [25]

2022 , eprint=

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

2022

[26] [26]

Nature , author=

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , url=. Nature , author=

[27] [27]

2023 , eprint=

Editing Models with Task Arithmetic , author=. 2023 , eprint=

2023

[28] [28]

2022 , eprint=

WebGPT: Browser-assisted question-answering with human feedback , author=. 2022 , eprint=

2022

[29] [29]

2023 , eprint=

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision , author=. 2023 , eprint=

2023

[30] [30]

2024 , eprint=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. 2024 , eprint=

2024

[31] [31]

2025 , eprint=

Search-o1: Agentic Search-Enhanced Large Reasoning Models , author=. 2025 , eprint=

2025

[32] [32]

2024 , eprint=

Model merging with SVD to tie the Knots , author=. 2024 , eprint=

2024

[33] [33]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021

[34] [34]

2017 , eprint=

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension , author=. 2017 , eprint=

2017

[35] [35]

2023 , eprint=

Let's Verify Step by Step , author=. 2023 , eprint=

2023

[36] [36]

2023 , eprint=

GPQA: A Graduate-Level Google-Proof Q&A Benchmark , author=. 2023 , eprint=

2023

[37] [37]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

2021

[38] [38]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

2021

[39] [39]

LoraHub: Efficient Cross-Task Generalization via Dynamic Lo

Chengsong Huang and Qian Liu and Bill Yuchen Lin and Tianyu Pang and Chao Du and Min Lin , booktitle=. LoraHub: Efficient Cross-Task Generalization via Dynamic Lo. 2024 , url=

2024

[40] [40]

The Fourteenth International Conference on Learning Representations , year=

Weak-to-Strong Generalization with Failure Trajectories , author=. The Fourteenth International Conference on Learning Representations , year=

[41] [41]

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Hase, Peter and Bansal, Mohit and Clark, Peter and Wiegreffe, Sarah. The Unreasonable Effectiveness of Easy Training Data for Hard Tasks. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024

2024

[42] [42]

Meta-Rewarding Language Models: Self-Improving Alignment with LLM -as-a-Meta-Judge

Wu, Tianhao and Yuan, Weizhe and Golovneva, Olga and Xu, Jing and Tian, Yuandong and Jiao, Jiantao and Weston, Jason E and Sukhbaatar, Sainbayar. Meta-Rewarding Language Models: Self-Improving Alignment with LLM -as-a-Meta-Judge. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

2025

[43] [43]

2020 , booktitle =

Yu, Tianhe and Kumar, Saurabh and Gupta, Abhishek and Levine, Sergey and Hausman, Karol and Finn, Chelsea , title =. 2020 , booktitle =

2020

[44] [44]

Yuchen Fan and Kaiyan Zhang and Heng Zhou and Yuxin Zuo and Yu Fu and Yanxu Chen and Xinwei Long and Xuekai Zhu and Che Jiang and Yuchen Zhang and Li Kang and Cheng Huang and Gang Chen and Zhizhou He and Bingning Wang and LEI BAI and Ning Ding and Bowen Zhou , year=

[45] [45]

2024 , url=

Nemotron-4 340B Technical Report , author=. 2024 , url=

2024

[46] [46]

2026 , url=

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning , author=. 2026 , url=

2026

[47] [47]

2025 , url=

ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset , author=. 2025 , url=

2025

[48] [48]

The FineWeb datasets: decanting the web for the finest text data at scale , year =

Penedo, Guilherme and Kydl\'. The FineWeb datasets: decanting the web for the finest text data at scale , year =. Proceedings of the 38th International Conference on Neural Information Processing Systems , url=

[49] [49]

The Fourteenth International Conference on Learning Representations , year=

Mapping Post-Training Forgetting in Language Models at Scale , author=. The Fourteenth International Conference on Learning Representations , year=

[50] [50]

Diffusion vs

Zhang, Siyue and Zhao, Yilun and Geng, Liyuan and Cohan, Arman and Luu, Anh Tuan and Zhao, Chen. Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

2025

[51] [51]

2026 , url=

Siyue Zhang and Yuan Gao and Xiao Zhou and Yilun Zhao and Tingyu Song and Arman Cohan and Anh Tuan Luu and Chen Zhao , booktitle=. 2026 , url=

2026

[52] [52]

2024 , eprint=

HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback , author=. 2024 , eprint=

2024

[53] [53]

2023 , eprint=

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

2023

[54] [54]

2022 , eprint=

A Contrastive Framework for Neural Text Generation , author=. 2022 , eprint=

2022

[55] [55]

2022 , eprint=

Measuring Progress on Scalable Oversight for Large Language Models , author=. 2022 , eprint=

2022

[56] [56]

2025 , eprint=

Self-Rewarding Language Models , author=. 2025 , eprint=

2025

[57] [57]

2022 , eprint=

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , author=. 2022 , eprint=

2022

[58] [58]

2022 , eprint=

Merging Models with Fisher-Weighted Averaging , author=. 2022 , eprint=

2022

[59] [59]

2025 , eprint=

Dataless Knowledge Fusion by Merging Weights of Language Models , author=. 2025 , eprint=

2025

[60] [60]

2024 , eprint=

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch , author=. 2024 , eprint=

2024

[61] [61]

Continual Learning and Catastrophic Forgetting

Chen, Zhiyuan and Liu, Bing. Continual Learning and Catastrophic Forgetting. Lifelong Machine Learning. 2018

2018

[62] [62]

2025 , eprint=

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework , author=. 2025 , eprint=

2025

[63] [63]

2024 , eprint=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. 2024 , eprint=

2024

[64] [64]

2023 , eprint=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=

2023

[65] [65]

1998 , eprint=

The Geometry of Algorithms with Orthogonality Constraints , author=. 1998 , eprint=

1998

[66] [66]

and Cattell, Raymond B

Hurley, John R. and Cattell, Raymond B. , title =. Behavioral Science , volume =. doi:https://doi.org/10.1002/bs.3830070216 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1002/bs.3830070216 , year =

work page doi:10.1002/bs.3830070216