pith. machine review for the scientific record.

arxiv: 2605.10784 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords preference optimization · direct preference optimization · active sample selection · multi-negative DPO · Plackett-Luce model · Fisher information · negative sampling · policy alignment

The pith

MASS-DPO selects compact negative subsets for multi-negative DPO using a Plackett-Luce Fisher-information log-determinant objective that preserves full-pool information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MASS-DPO to address the high cost of optimizing over large pools of negative responses in multi-negative preference optimization. It derives a specific objective from the Fisher information matrix under the Plackett-Luce model to choose small subsets of negatives whose gradients provide complementary directions for policy updates. This selection reduces redundancy while keeping the training signal of the entire pool intact. Experiments across recommendation and QA benchmarks with multiple model families show the approach matches or exceeds full-pool performance in accuracy and alignment metrics while using far fewer negatives.

Core claim

MASS-DPO derives a log-determinant objective from the Plackett-Luce Fisher information to select negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool's information while reducing redundancy from similar candidates.
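Read literally, a selection objective of this shape can be written down in one line; the symbols below are ours, not the paper's ($g_i$ for the per-candidate gradient feature, $\lambda$ for a ridge term that keeps the matrix invertible when $|S|$ is smaller than the parameter dimension):

```latex
\mathcal{I}(S) \;\approx\; \sum_{i \in S} g_i g_i^{\top},
\qquad
S^{\star} \;=\; \operatorname*{arg\,max}_{|S| = k}\;
\log\det\!\Big(\lambda I + \sum_{i \in S} g_i g_i^{\top}\Big).
```

Objectives of this log-determinant form are monotone submodular, so a greedy selector carries the classical $(1 - 1/e)$ approximation guarantee; whether the paper leans on that guarantee is not stated in the text above.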

What carries the argument

The PL-specific Fisher-information log-determinant objective for active negative selection, which identifies subsets whose gradients span diverse update directions.
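As a concrete illustration (not the paper's code), greedy maximization of a regularized log-determinant over per-candidate gradient features can be done with rank-one Sherman-Morrison updates; the feature matrix `G` and ridge `lam` are our assumptions:

```python
import numpy as np

def greedy_dopt_select(G, k, lam=1e-3):
    """Greedily pick k rows of G maximizing
    log det(lam * I + sum_{i in S} g_i g_i^T).

    G   : (n, d) array, one gradient feature per negative candidate.
    lam : ridge term keeping the design matrix invertible.
    """
    n, d = G.shape
    A_inv = np.eye(d) / lam              # inverse of the current design matrix
    selected = []
    for _ in range(min(k, n)):
        # marginal log-det gain of adding g_i is log(1 + g_i^T A^{-1} g_i),
        # so it suffices to rank candidates by the quadratic form itself
        gains = np.einsum('ij,jk,ik->i', G, A_inv, G)
        gains[selected] = -np.inf        # never re-pick a chosen candidate
        i = int(np.argmax(gains))
        selected.append(i)
        g = G[i]
        Ag = A_inv @ g
        # rank-one Sherman-Morrison update of the inverse
        A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)
    return selected
```

Near-duplicate gradients earn almost no marginal gain once one of them is selected, which is exactly the redundancy-removal behavior described above.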

If this is right

  • Training uses substantially fewer negative samples per prompt while matching or exceeding baseline accuracy.
  • Recall and NDCG metrics improve along with margin-based optimization dynamics.
  • Stronger alignment results hold across recommendation and multiple-choice QA tasks.
  • The gains appear consistently across three model families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The selection criterion may generalize to other preference models if a similar Fisher matrix can be derived.
  • Large-scale alignment pipelines could adopt this to lower memory and compute demands during preference tuning.
  • Similar active selection might reduce redundancy in positive sample pools or in other stages of RLHF.
  • Typical negative pools likely contain systematic gradient overlaps that this method removes.

Load-bearing premise

The Fisher-information objective selects negatives whose gradients cover complementary update directions while retaining the full pool's information for policy updates.

What would settle it

Training identical models on the same benchmarks using the full negative pool versus the MASS-DPO selected subsets and checking whether the subset version shows lower Recall, NDCG, or alignment scores.
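For the metric side of that comparison, a minimal binary-relevance NDCG@k helper (the standard definition, not the paper's evaluation code) makes the check concrete:

```python
import numpy as np

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k for a single query.

    ranked_ids   : item ids in the order the model ranks them.
    relevant_ids : ground-truth positive item ids.
    """
    relevant = set(relevant_ids)
    rel = np.array([1.0 if r in relevant else 0.0 for r in ranked_ids[:k]])
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(rank + 1)
    dcg = float(np.sum(rel / discounts))
    n_ideal = min(len(relevant), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, n_ideal + 2))))
    return dcg / idcg if idcg > 0 else 0.0
```

Running this per query for the full-pool twin and the subset twin, then comparing the averages, is one way to operationalize the settling test.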

Figures

Figures reproduced from arXiv: 2605.10784 by Chuhan Wang, Jingbo Shang, Julian McAuley, Junda Wu, Prithviraj Ammanabrolu, Rohan Surana, Sheldon Yu, Tong Yu, Xintong Li, Yiran Jenny Shen.

Figure 1
Figure 1. Overview of MASS-DPO’s D-optimal selection. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Training dynamics on LastFM and MedMCQA; MASS-DPO (solid) achieves larger margins. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Margin, accuracy, and chosen reward comparisons on MovieLens and QASC datasets. [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗
Original abstract

Multi-negative preference optimization under the Plackett--Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients due to their similar effects on policy updates. We introduce MASS-DPO, a multi-negative active sample selection method that derives a PL-specific Fisher-information objective for selecting compact, informative negative subsets within each prompt. The resulting log-determinant objective selects negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool's information while reducing redundancy. In practice, this favors negatives whose gradients cover different update directions, reducing redundant signal from near-duplicate candidates while preserving the most useful training information. Across four benchmarks spanning recommendation and multiple-choice QA and three model families, MASS-DPO consistently exceeds or matches existing methods in accuracy, improves Recall/NDCG and margin-based optimization dynamics, and delivers stronger alignment with substantially fewer negatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces MASS-DPO, a multi-negative active sample selection method for Direct Preference Optimization under the Plackett-Luce (PL) model. It derives a PL-specific Fisher-information log-determinant objective to select compact subsets of negative responses that contribute complementary information for policy updates, reducing redundancy while aiming to retain the information content of the full negative pool. The approach favors negatives whose gradients cover different update directions. Experiments across four benchmarks (recommendation and multiple-choice QA) and three model families show that MASS-DPO consistently matches or exceeds existing methods in accuracy, improves Recall/NDCG and margin-based dynamics, and achieves stronger alignment using substantially fewer negatives.

Significance. If the derivation of the log-determinant objective is sound and the selection mechanism indeed preserves full-pool information while eliminating redundant gradients, the method could meaningfully improve the scalability of multi-negative DPO by lowering computational cost without sacrificing alignment quality. The reported consistent empirical gains across tasks and model families provide supporting evidence for practical utility in LLM preference optimization.

minor comments (3)
  1. The abstract states that the log-determinant objective 'selects negatives that contribute complementary information' and 'favors negatives whose gradients cover different update directions,' but the precise connection between the Fisher matrix and gradient complementarity should be stated more explicitly with a short derivation sketch or reference to the relevant equation in the main text.
  2. The manuscript should clarify the exact criteria used for data exclusion or negative pool construction in the experiments, as these details affect reproducibility of the reported Recall/NDCG improvements.
  3. Figure captions and axis labels for the optimization dynamics plots could be expanded to indicate the exact margin metric and number of negatives used in each curve for easier comparison.
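For context on the first comment, the link the referee wants spelled out is usually the score-function identity: under regularity conditions the Fisher matrix equals the expected outer product of log-likelihood gradients, and a determinant criterion vanishes on collinear selections. In our notation (not necessarily the paper's), with $G_S$ stacking the selected gradients row-wise:

```latex
\mathcal{I}(\theta)
\;=\;
\mathbb{E}\!\left[\nabla_{\theta}\log p_{\theta}\,
\big(\nabla_{\theta}\log p_{\theta}\big)^{\!\top}\right],
\qquad
\det\!\big(G_S G_S^{\top}\big) = 0
\;\iff\;
\{g_i\}_{i \in S}\ \text{linearly dependent}.
```

So maximizing a (regularized) log-determinant built from these outer products directly rewards subsets whose gradients span complementary directions, which is the complementarity claim in the abstract.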

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of MASS-DPO, including the recognition of the Plackett-Luce-specific log-determinant Fisher information objective and the consistent empirical results across benchmarks and model families. The recommendation for minor revision is noted, and we are encouraged by the assessment of potential practical utility.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper frames MASS-DPO as deriving a PL-specific Fisher-information objective (log-determinant form) directly from the Plackett-Luce model to select informative negatives. This is a standard information-theoretic construction for active selection and D-optimal design, not a self-referential definition or a fitted parameter relabeled as a prediction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatz smuggling are indicated in the provided text. The central claims rest on the derivation plus empirical validation across benchmarks, keeping the chain self-contained and independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the Plackett-Luce preference model and the assumption that Fisher information captures complementary gradient directions for policy updates.

axioms (1)
  • domain assumption Plackett-Luce model governs comparative preference signals across one preferred and multiple rejected responses
    Invoked to extend standard DPO to multi-negative setting and derive the selection objective.
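For concreteness, the standard Plackett-Luce likelihood of a full ranking over one preferred and $K-1$ rejected responses is a product of sequential softmax choices; here $r_\theta$ stands for the (implicit) reward, and the notation is ours rather than the paper's:

```latex
P_\theta\big(y_1 \succ y_2 \succ \cdots \succ y_K \mid x\big)
\;=\;
\prod_{k=1}^{K}
\frac{\exp\!\big(r_\theta(x, y_k)\big)}{\sum_{j=k}^{K} \exp\!\big(r_\theta(x, y_j)\big)}.
```

With $K = 2$ this reduces to the Bradley-Terry model underlying standard pairwise DPO.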

pith-pipeline@v0.9.0 · 5508 in / 1109 out tokens · 58826 ms · 2026-05-12T04:11:09.631097+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 2 internal anchors
