pith. machine review for the scientific record.

arxiv: 2605.10784 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords preference optimization · direct preference optimization · active sample selection · multi-negative DPO · Plackett-Luce model · Fisher information · negative sampling · policy alignment

The pith

MASS-DPO selects compact negative subsets for multi-negative DPO using a Plackett-Luce Fisher-information log-determinant objective that preserves full-pool information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MASS-DPO to address the high cost of optimizing over large pools of negative responses in multi-negative preference optimization. It derives a specific objective from the Fisher information matrix under the Plackett-Luce model to choose small subsets of negatives whose gradients provide complementary directions for policy updates. This selection reduces redundancy while keeping the training signal of the entire pool intact. Experiments across recommendation and QA benchmarks with multiple model families show the approach matches or exceeds full-pool performance in accuracy and alignment metrics while using far fewer negatives.

Core claim

MASS-DPO derives a log-determinant objective from the Plackett-Luce Fisher information to select negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool's information while reducing redundancy from similar candidates.
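Read literally, a selection objective of this shape can be written down in one line; the symbols below are ours, not the paper's ($g_i$ for the per-candidate gradient feature, $\lambda$ for a ridge term that keeps the matrix invertible when $|S|$ is smaller than the parameter dimension):

```latex
\mathcal{I}(S) \;\approx\; \sum_{i \in S} g_i g_i^{\top},
\qquad
S^{\star} \;=\; \operatorname*{arg\,max}_{|S| = k}\;
\log\det\!\Big(\lambda I + \sum_{i \in S} g_i g_i^{\top}\Big).
```

Objectives of this log-determinant form are monotone submodular, so a greedy selector carries the classical $(1 - 1/e)$ approximation guarantee; whether the paper leans on that guarantee is not stated in the text above.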

What carries the argument

The PL-specific Fisher-information log-determinant objective for active negative selection, which identifies subsets whose gradients span diverse update directions.
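As a concrete illustration (not the paper's code), greedy maximization of a regularized log-determinant over per-candidate gradient features can be done with rank-one Sherman-Morrison updates; the feature matrix `G` and ridge `lam` are our assumptions:

```python
import numpy as np

def greedy_dopt_select(G, k, lam=1e-3):
    """Greedily pick k rows of G maximizing
    log det(lam * I + sum_{i in S} g_i g_i^T).

    G   : (n, d) array, one gradient feature per negative candidate.
    lam : ridge term keeping the design matrix invertible.
    """
    n, d = G.shape
    A_inv = np.eye(d) / lam              # inverse of the current design matrix
    selected = []
    for _ in range(min(k, n)):
        # marginal log-det gain of adding g_i is log(1 + g_i^T A^{-1} g_i),
        # so it suffices to rank candidates by the quadratic form itself
        gains = np.einsum('ij,jk,ik->i', G, A_inv, G)
        gains[selected] = -np.inf        # never re-pick a chosen candidate
        i = int(np.argmax(gains))
        selected.append(i)
        g = G[i]
        Ag = A_inv @ g
        # rank-one Sherman-Morrison update of the inverse
        A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)
    return selected
```

Near-duplicate gradients earn almost no marginal gain once one of them is selected, which is exactly the redundancy-removal behavior described above.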

If this is right

  • Training uses substantially fewer negative samples per prompt while matching or exceeding baseline accuracy.
  • Recall and NDCG metrics improve along with margin-based optimization dynamics.
  • Stronger alignment results hold across recommendation and multiple-choice QA tasks.
  • The gains appear consistently across three model families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The selection criterion may generalize to other preference models if a similar Fisher matrix can be derived.
  • Large-scale alignment pipelines could adopt this to lower memory and compute demands during preference tuning.
  • Similar active selection might reduce redundancy in positive sample pools or in other stages of RLHF.
  • Typical negative pools likely contain systematic gradient overlaps that this method removes.

Load-bearing premise

The Fisher-information objective selects negatives whose gradients cover complementary update directions while retaining the full pool's information for policy updates.

What would settle it

Training identical models on the same benchmarks using the full negative pool versus the MASS-DPO selected subsets and checking whether the subset version shows lower Recall, NDCG, or alignment scores.
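For the metric side of that comparison, a minimal binary-relevance NDCG@k helper (the standard definition, not the paper's evaluation code) makes the check concrete:

```python
import numpy as np

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k for a single query.

    ranked_ids   : item ids in the order the model ranks them.
    relevant_ids : ground-truth positive item ids.
    """
    relevant = set(relevant_ids)
    rel = np.array([1.0 if r in relevant else 0.0 for r in ranked_ids[:k]])
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(rank + 1)
    dcg = float(np.sum(rel / discounts))
    n_ideal = min(len(relevant), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, n_ideal + 2))))
    return dcg / idcg if idcg > 0 else 0.0
```

Running this per query for the full-pool twin and the subset twin, then comparing the averages, is one way to operationalize the settling test.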

Figures

Figures reproduced from arXiv: 2605.10784 by Chuhan Wang, Jingbo Shang, Julian McAuley, Junda Wu, Prithviraj Ammanabrolu, Rohan Surana, Sheldon Yu, Tong Yu, Xintong Li, Yiran Jenny Shen.

Figure 1
Figure 1. Overview of MASS-DPO’s D-optimal selection. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Training dynamics on LastFM and MedMCQA; MASS-DPO (solid) achieves larger margins. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Margin, accuracy, and chosen reward comparisons on MovieLens and QASC datasets. [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗
Original abstract

Multi-negative preference optimization under the Plackett--Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients due to their similar effects on policy updates. We introduce MASS-DPO, a multi-negative active sample selection method that derives a PL-specific Fisher-information objective for selecting compact, informative negative subsets within each prompt. The resulting log-determinant objective selects negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool's information while reducing redundancy. In practice, this favors negatives whose gradients cover different update directions, reducing redundant signal from near-duplicate candidates while preserving the most useful training information. Across four benchmarks spanning recommendation and multiple-choice QA and three model families, MASS-DPO consistently exceeds or matches existing methods in accuracy, improves Recall/NDCG and margin-based optimization dynamics, and delivers stronger alignment with substantially fewer negatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces MASS-DPO, a multi-negative active sample selection method for Direct Preference Optimization under the Plackett-Luce (PL) model. It derives a PL-specific Fisher-information log-determinant objective to select compact subsets of negative responses that contribute complementary information for policy updates, reducing redundancy while aiming to retain the information content of the full negative pool. The approach favors negatives whose gradients cover different update directions. Experiments across four benchmarks (recommendation and multiple-choice QA) and three model families show that MASS-DPO consistently matches or exceeds existing methods in accuracy, improves Recall/NDCG and margin-based dynamics, and achieves stronger alignment using substantially fewer negatives.

Significance. If the derivation of the log-determinant objective is sound and the selection mechanism indeed preserves full-pool information while eliminating redundant gradients, the method could meaningfully improve the scalability of multi-negative DPO by lowering computational cost without sacrificing alignment quality. The reported consistent empirical gains across tasks and model families provide supporting evidence for practical utility in LLM preference optimization.

minor comments (3)
  1. The abstract states that the log-determinant objective 'selects negatives that contribute complementary information' and 'favors negatives whose gradients cover different update directions,' but the precise connection between the Fisher matrix and gradient complementarity should be stated more explicitly with a short derivation sketch or reference to the relevant equation in the main text.
  2. The manuscript should clarify the exact criteria used for data exclusion or negative pool construction in the experiments, as these details affect reproducibility of the reported Recall/NDCG improvements.
  3. Figure captions and axis labels for the optimization dynamics plots could be expanded to indicate the exact margin metric and number of negatives used in each curve for easier comparison.
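For context on the first comment, the link the referee wants spelled out is usually the score-function identity: under regularity conditions the Fisher matrix equals the expected outer product of log-likelihood gradients, and a determinant criterion vanishes on collinear selections. In our notation (not necessarily the paper's), with $G_S$ stacking the selected gradients row-wise:

```latex
\mathcal{I}(\theta)
\;=\;
\mathbb{E}\!\left[\nabla_{\theta}\log p_{\theta}\,
\big(\nabla_{\theta}\log p_{\theta}\big)^{\!\top}\right],
\qquad
\det\!\big(G_S G_S^{\top}\big) = 0
\;\iff\;
\{g_i\}_{i \in S}\ \text{linearly dependent}.
```

So maximizing a (regularized) log-determinant built from these outer products directly rewards subsets whose gradients span complementary directions, which is the complementarity claim in the abstract.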

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of MASS-DPO, including the recognition of the Plackett-Luce-specific log-determinant Fisher information objective and the consistent empirical results across benchmarks and model families. The recommendation for minor revision is noted, and we are encouraged by the assessment of potential practical utility.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper frames MASS-DPO as deriving a PL-specific Fisher-information objective (log-determinant form) directly from the Plackett-Luce model to select informative negatives. This is a standard information-theoretic construction for active selection and D-optimal design, not a self-referential definition or a fitted parameter relabeled as a prediction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatz smuggling are indicated in the provided text. The central claims rest on the derivation plus empirical validation across benchmarks, keeping the chain self-contained and independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the Plackett-Luce preference model and the assumption that Fisher information captures complementary gradient directions for policy updates.

axioms (1)
  • domain assumption Plackett-Luce model governs comparative preference signals across one preferred and multiple rejected responses
    Invoked to extend standard DPO to multi-negative setting and derive the selection objective.
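For concreteness, the standard Plackett-Luce likelihood of a full ranking over one preferred and $K-1$ rejected responses is a product of sequential softmax choices; here $r_\theta$ stands for the (implicit) reward, and the notation is ours rather than the paper's:

```latex
P_\theta\big(y_1 \succ y_2 \succ \cdots \succ y_K \mid x\big)
\;=\;
\prod_{k=1}^{K}
\frac{\exp\!\big(r_\theta(x, y_k)\big)}{\sum_{j=k}^{K} \exp\!\big(r_\theta(x, y_j)\big)}.
```

With $K = 2$ this reduces to the Bradley-Terry model underlying standard pairwise DPO.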

pith-pipeline@v0.9.0 · 5508 in / 1109 out tokens · 58826 ms · 2026-05-12T04:11:09.631097+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 2 internal anchors
