PolyAlign: Conditional Human-Distribution Alignment

L. D. M. S. Sai Teja; Muhammad Haris Khan; Sathira Silva; Ufaq Khan; Xiao Wu

arxiv: 2606.13227 · v1 · pith:LCKXNUXCnew · submitted 2026-06-11 · 💻 cs.CL

Pith reviewed 2026-06-27 06:37 UTC · model grok-4.3

keywords conditional alignment · human distribution alignment · preference optimization · bilingual language models · distributional faithfulness · naturalness · PolyAlign

The pith

Language models should match context-specific human response distributions rather than a single global style.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard post-training aligns models to one uniform assistant behavior, which reduces natural variation in responses across languages, tasks, and dialogue turns. The paper frames the alternative as conditional human-distribution alignment, where the model learns to reproduce the human response distribution appropriate to the current context. PolyAlign implements this by partitioning bilingual data into buckets defined by language, interaction track, response family, and length, then applying bucket-aware supervised fine-tuning and preference optimization regularized by distance to the bucket-specific human support. Bilingual evaluations show gains in conditional naturalness and distributional faithfulness with no loss in task utility. A reader would care because the work indicates that alignment objectives can be made interaction-aware instead of globally uniform.
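
The bucketing step described above can be illustrated with a minimal sketch. The field names, the example label values, and the length thresholds are all assumptions made for illustration; the summary does not state how the paper defines or bins them.

```python
from dataclasses import dataclass

# Hypothetical length bins in whitespace tokens; the paper's thresholds are not given here.
LENGTH_BINS = [(0, 32, "short"), (32, 128, "medium"), (128, None, "long")]


@dataclass(frozen=True)
class Example:
    language: str         # e.g. "en" or "zh"
    track: str            # e.g. "single_turn" or "multi_turn"
    response_family: str  # e.g. "qa" or "chitchat" (illustrative labels)
    response: str


def length_bin(n_tokens: int) -> str:
    """Map a token count onto one of the assumed length bins."""
    for low, high, name in LENGTH_BINS:
        if n_tokens >= low and (high is None or n_tokens < high):
            return name
    raise ValueError(f"negative token count: {n_tokens}")


def bucket_key(ex: Example) -> tuple[str, str, str, str]:
    """Return the (language, track, family, length) bucket of an example."""
    # Whitespace splitting is a crude stand-in; Chinese text would need a real tokenizer.
    n_tokens = len(ex.response.split())
    return (ex.language, ex.track, ex.response_family, length_bin(n_tokens))
```

Grouping human responses by `bucket_key` would then give one reference set per context, which is the object the rest of the method conditions on.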

Core claim

PolyAlign organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. It combines Bucket-Aware SFT, which balances optimization across buckets, with Human-Distribution Preference Optimization that regularizes preference learning using critic-estimated distance to the bucket-specific human support. This produces models that improve conditional naturalness and distributional faithfulness across English and Chinese single- and multi-turn settings while preserving competitive task utility.
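
One plausible reading of "balances optimization across buckets" is inverse-frequency reweighting of the SFT loss. The sketch below assumes that reading; the function names and the `alpha` exponent are hypothetical and are not taken from the paper.

```python
from collections import Counter


def bucket_weights(bucket_ids, alpha=1.0):
    """Per-example weights proportional to count(bucket) ** -alpha, normalised to mean 1.

    alpha=1.0 gives every bucket the same total weight; alpha=0.0 recovers
    uniform per-example weighting.
    """
    counts = Counter(bucket_ids)
    raw = [counts[b] ** (-alpha) for b in bucket_ids]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_sft_loss(per_example_nll, weights):
    """Weighted mean of per-example negative log-likelihoods."""
    return sum(w * l for w, l in zip(weights, per_example_nll)) / sum(weights)
```

With `alpha=1.0`, a bucket holding a single example contributes as much to the loss as a bucket holding thousands, which is the balancing effect but also a reason thin buckets deserve scrutiny.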

What carries the argument

Bucket-specific human reference distributions (defined by language, interaction track, response family, and length) together with Human-Distribution Preference Optimization (HDPO) that uses critic-estimated distance to those distributions as a regularization signal.
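
The summary does not give the HDPO objective, so the following is only a guess at its general shape: a standard DPO term plus a penalty proportional to a critic's distance estimate. The additive form, the `lam` coefficient, and the `critic_distance` input are all assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (Rafailov et al., 2023)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)), arranged to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def hdpo_style_loss(pair_logps, critic_distance, lam=0.5, beta=0.1):
    """DPO loss plus a penalty on critic-estimated distance to the bucket's human support.

    `critic_distance` is assumed to be a non-negative score for a policy sample,
    produced by a critic trained on that bucket's human responses (hypothetical).
    """
    if critic_distance < 0:
        raise ValueError("critic_distance must be non-negative")
    return dpo_loss(*pair_logps, beta=beta) + lam * critic_distance
```

This scalar version only shows how the two terms trade off. In actual training the penalty would need a gradient path back to the policy, which the sketch does not model.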

If this is right

Models trained this way reproduce human response variation across languages and interaction types instead of converging to one style.
Regularization toward bucket-specific human supports increases distributional faithfulness without harming average task performance.
Post-training objectives can be made conditional on observable context attributes rather than global.
Bilingual settings benefit when buckets explicitly separate English and Chinese distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bucketing logic could be applied to additional languages or modalities if comparable human reference data exist.
Explicit bucket definitions may reduce unwanted homogenization effects that appear in globally aligned models.
Extending the bucket criteria to include domain or user intent could further sharpen the alignment signal.

Load-bearing premise

That bucket-specific human reference distributions can be reliably built from data and that critic-estimated distance to those distributions supplies a stable regularization signal during preference optimization.

What would settle it

An evaluation in which PolyAlign fails to raise conditional naturalness or distributional faithfulness scores relative to standard SFT plus preference optimization on the same bilingual single- and multi-turn test suite, or in which task utility drops by more than a small margin.
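
A comparison of this kind is usually run on paired per-example scores. The sketch below uses a paired bootstrap as one reasonable choice; it is not the test the paper reports, and the function name and defaults are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference a - b, fraction of resamples in which a does not beat b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample the per-example differences with replacement.
        sample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if sample_mean <= 0:
            not_better += 1
    return sum(diffs) / n, not_better / n_resamples
```

Here `scores_a` would hold PolyAlign's per-example naturalness or faithfulness scores and `scores_b` the baseline's on the same test items. A large second return value would count against the claimed gain.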

Figures

Figures reproduced from arXiv: 2606.13227 by L. D. M. S. Sai Teja, Muhammad Haris Khan, Sathira Silva, Ufaq Khan, Xiao Wu.

**Figure 1.** PolyAlign vs. global alignment. Unlike standard RLHF/DPO-style post-training, which can collapse diverse contexts into a generic assistant style, PolyAlign aligns responses to human distributions for more natural, situation-appropriate generation.

**Figure 2.** PolyAlign pipeline. PolyAlign organizes bilingual interaction data into bucket-specific human distributions, then aligns models through bucket-weighted SFT, critic-based distribution training, and HDPO. The final model is evaluated for task utility and conditional naturalness using QA-F1, BNG-Macro, G-MAUVE, and NUF.

**Figure 3.** Combined benchmark quality and detector human-likeness scores across English and Chinese models. Bars show model-level scores, the black line reports the method mean, and the highlighted region marks the distribution-conditioned method in each comparison.

**Figure 4.** UMAP projections of Qwen2.5-1.5B answer embeddings across training methods. Each panel compares two response distributions, with stars denoting centroids and gray lines showing paired sample shifts; smaller centroid distances indicate closer embedding-level alignment.

Original abstract

Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PolyAlign names a bucketed conditional alignment setup but the abstract supplies no numbers or checks, so the gains remain unverified.

The core idea is to stop forcing one global response style and instead match the human distribution that fits the current language, track, response family, and length. They split the bilingual data into those buckets, run a balanced SFT across them, then add HDPO that uses a critic to pull the model toward the bucket-specific human support.

That motivation is solid. Global alignment really does flatten natural variation, and bilingual single- and multi-turn settings are a fair place to test whether context-sensitive objectives can recover some of it without killing task performance.

The problem is the evidence. The abstract states improvements in conditional naturalness and distributional faithfulness but lists no baselines, metrics, sample sizes, or ablations. The stress-test note is on target: everything rests on the buckets producing stable reference distributions and the critic distance being a trustworthy signal. If response-family labels are noisy or some buckets are thin, the reported gains could just reflect how the buckets were built. Without those checks visible, it is impossible to tell whether the method actually works or whether the critic was validated against fresh human judgments.

This is for alignment researchers who already think global objectives are too blunt. A reader who wants concrete numbers or a clear comparison to prior conditional or multi-objective work will not get much from the abstract alone. The paper deserves a referee because the framing is coherent and the limitation it targets is real, but any review would have to start by demanding the full experimental section and the bucket-construction details before the claims can be taken seriously.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PolyAlign as a distribution-aware post-training framework for conditional human-distribution alignment. It partitions bilingual (English/Chinese) interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length; combines Bucket-Aware SFT (to balance optimization across buckets) with Human-Distribution Preference Optimization (HDPO) that regularizes preference learning via critic-estimated distance to the bucket-specific human support; and reports gains in conditional naturalness and distributional faithfulness across single- and multi-turn settings while preserving task utility.

Significance. If the empirical claims hold after proper validation of the bucket construction and critic signal, the work would be significant for shifting post-training objectives from global single-style alignment toward context- and language-aware matching of human response distributions, potentially yielding more natural multilingual dialogue models.

major comments (2)

[HDPO and bucket construction sections] The load-bearing assumption is that bucket-specific human reference distributions (defined by language + track + response family + length) can be reliably estimated from data and that critic distance supplies a stable regularization signal in HDPO. The manuscript must demonstrate sufficient bucket occupancy, low label noise in response-family annotations, and independent validation of the critic against held-out human judgments; without these, reported gains in conditional naturalness could be artifacts of bucket construction rather than evidence of successful distributional alignment.
[Abstract and Evaluation sections] The abstract asserts performance improvements in conditional naturalness and distributional faithfulness but supplies no experimental details, baselines, metrics, statistical tests, or ablation results. The full manuscript must include these (including quantitative tables and significance tests) so that the central claim can be verified against the data.

minor comments (1)

[Introduction and Method] Notation for 'response family' and 'interaction track' should be defined explicitly on first use with an example to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address each major comment below with clarifications and commitments to strengthen the manuscript where needed.

Referee: [HDPO and bucket construction sections] The load-bearing assumption is that bucket-specific human reference distributions (defined by language + track + response family + length) can be reliably estimated from data and that critic distance supplies a stable regularization signal in HDPO. The manuscript must demonstrate sufficient bucket occupancy, low label noise in response-family annotations, and independent validation of the critic against held-out human judgments; without these, reported gains in conditional naturalness could be artifacts of bucket construction rather than evidence of successful distributional alignment.

Authors: We agree these validations are necessary to substantiate the core assumptions. The manuscript provides initial bucket occupancy statistics and annotation procedures, but does not include the full set of requested analyses (e.g., inter-annotator agreement metrics or critic correlation with held-out judgments). In the revised version we will add a dedicated subsection with occupancy tables, label noise quantification, and independent critic validation results against human ratings to rule out construction artifacts. revision: yes
Referee: [Abstract and Evaluation sections] The abstract asserts performance improvements in conditional naturalness and distributional faithfulness but supplies no experimental details, baselines, metrics, statistical tests, or ablation results. The full manuscript must include these (including quantitative tables and significance tests) so that the central claim can be verified against the data.

Authors: The Evaluation section of the full manuscript already presents the requested elements: quantitative tables comparing against baselines (SFT, DPO), metrics for conditional naturalness and distributional faithfulness, ablation studies, and statistical significance testing. The abstract is intentionally concise and does not enumerate these details. We will add a brief reference to the evaluation protocol in the abstract and ensure all tables and tests are prominently cross-referenced. revision: partial
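
The inter-annotator agreement metric mentioned in the first response is commonly reported as Cohen's kappa. The sketch below assumes two annotators assigning response-family labels to the same items; the rebuttal does not say which agreement statistic the authors will use.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 on response-family labels would support the bucket definitions, while a low value would mean the buckets themselves are noisy, which is the artifact the referee is worried about.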

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces PolyAlign as a framework combining Bucket-Aware SFT and HDPO regularization based on bucket-specific human reference distributions. No equations, derivations, or self-citations appear in the abstract or described content that reduce any claimed prediction or result to its inputs by construction. The central claims rest on empirical improvements in naturalness and faithfulness across bilingual settings rather than on self-definitional or fitted-input mechanisms. No load-bearing step is self-referential, so the claims can be checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities.

Reference graph

Works this paper leans on

73 extracted references · 9 linked inside Pith

[1]

Advances in Neural Information Processing Systems , volume=

Language Models are Few-Shot Learners , author=. Advances in Neural Information Processing Systems , volume=
[2]

International Conference on Learning Representations , year=

Finetuned Language Models are Zero-Shot Learners , author=. International Conference on Learning Representations , year=
[3]

International Conference on Learning Representations , year=

Multitask Prompted Training Enables Zero-Shot Task Generalization , author=. International Conference on Learning Representations , year=
[4]

Journal of Machine Learning Research , volume=

Scaling Instruction-Finetuned Language Models , author=. Journal of Machine Learning Research , volume=
[5]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[6]

Advances in Neural Information Processing Systems , volume=

Training Language Models to Follow Instructions with Human Feedback , author=. Advances in Neural Information Processing Systems , volume=
[7]

Constitutional

Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others , journal=. Constitutional
[9]

Advances in Neural Information Processing Systems , volume=

K. Advances in Neural Information Processing Systems , volume=
[10]

Zhou, Chunting and Liu, Pengfei and Xu, Puxin and Iyer, Srinivasan and Sun, Jiao and Mao, Yuning and Ma, Xuezhe and Efrat, Avia and Yu, Ping and Yu, Lili and others , journal=
[11]

Advances in Neural Information Processing Systems , volume=

Deep Reinforcement Learning from Human Preferences , author=. Advances in Neural Information Processing Systems , volume=
[12]

Advances in Neural Information Processing Systems , volume=

Learning to Summarize with Human Feedback , author=. Advances in Neural Information Processing Systems , volume=
[13]

Advances in Neural Information Processing Systems , volume=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Advances in Neural Information Processing Systems , volume=
[14]

Yuan, Zheng and Yuan, Hongyi and Tan, Chuanqi and Wang, Wei and Huang, Songfang and Huang, Fei , journal=
[15]

Hong, Jiwoo and Lee, Noah and Thorne, James , booktitle=
[16]

Meng, Yu and Xia, Mengzhou and Chen, Danqi , journal=
[17]

Ethayarajh, Kawin and Xu, Winnie and Muennighoff, Niklas and Jurafsky, Dan and Kiela, Douwe , journal=
[18]

International Conference on Machine Learning , pages=

Pretraining Language Models with Human Preferences , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[19]

Dong, Yi and Wang, Zhilin and Sreedhar, Makesh and Wu, Xianchao and Kuchaiev, Oleksii , booktitle=
[20]

Wang, Zhilin and Dong, Yi and Zeng, Jiaqi and Adams, Virginia and Sreedhar, Makesh Narsimhan and Egert, Daniel and Delalleau, Olivier and Scowcroft, Jane and Kant, Neel and Swope, Aidan and others , booktitle=
[21]

and Sreedhar, Makesh Narsimhan and Kuchaiev, Oleksii , journal=

Wang, Zhilin and Dong, Yi and Delalleau, Olivier and Zeng, Jiaqi and Shen, Gerald and Egert, Daniel and Zhang, Jimmy J. and Sreedhar, Makesh Narsimhan and Kuchaiev, Oleksii , journal=
[22]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

On Diversified Preferences of Large Language Model Alignment , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024
[23]

and Xiong, Caiming and Socher, Richard , journal=

Keskar, Nitish Shirish and McCann, Bryan and Varshney, Lav R. and Xiong, Caiming and Socher, Richard , journal=
[24]

International Conference on Learning Representations , year=

Plug and Play Language Models: A Simple Approach to Controlled Text Generation , author=. International Conference on Learning Representations , year=
[25]

Krause, Ben and Gotmare, Akhilesh Deepak and McCann, Bryan and Keskar, Nitish Shirish and Joty, Shafiq and Socher, Richard and Rajani, Nazneen Fatema , booktitle=
[26]

Yang, Kevin and Klein, Dan , booktitle=
[27]

Prefix-Tuning: Optimizing Continuous Prompts for Generation , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=
[28]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

The Power of Scale for Parameter-Efficient Prompt Tuning , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

2021
[29]

Entropic Distribution Matching in Supervised Fine-Tuning of

Li, Ziniu and Chen, Congliang and Xu, Tian and Qin, Zeyu and Xiao, Jiancong and Sun, Ruoyu and Luo, Zhi-Quan , booktitle=. Entropic Distribution Matching in Supervised Fine-Tuning of
[30]

Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others , journal=
[32]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , journal=. The
[33]

arXiv preprint arXiv:2502.02737 , year=

Pith/arXiv arXiv
[34]

Spotting

Hans, Abhimanyu and Schwarzschild, Avi and Cherepanova, Valeriia and Kazemi, Hamid and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , journal=. Spotting
[35]

Bao, Guangsheng and Zhao, Yanbin and Teng, Zhiyang and Yang, Linyi and Zhang, Yue , booktitle=. Fast-
[36]

and Finn, Chelsea , booktitle=

Mitchell, Eric and Lee, Yoonho and Khazatsky, Alexander and Manning, Christopher D. and Finn, Chelsea , booktitle=. 2023 , organization=

2023
[37]

Hu, Xiaomeng and Chen, Pin-Yu and Ho, Tsung-Yi , journal=
[38]

and others , journal=

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and others , journal=. Judging
[39]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, and 1 others. 2022 a . Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

Pith/arXiv arXiv 2022
[40]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, and 1 others. 2022 b . Constitutional AI : Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073

Pith/arXiv arXiv 2022
[41]

Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. 2024. Fast- DetectGPT : Efficient zero-shot detection of machine-generated text via conditional probability curvature. In International Conference on Learning Representations

2024
[42]

Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877--1901

2020
[43]

Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei

Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30

2017
[44]

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, and 1 others. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1--53

2024
[45]

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations

2020
[46]

Yi Dong, Zhilin Wang, Makesh Sreedhar, Xianchao Wu, and Oleksii Kuchaiev. 2023. SteerLM : Attribute conditioned SFT as an user-steerable alternative to RLHF . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11275--11288

2023
[47]

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO : Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306

Pith/arXiv arXiv 2024
[48]

Gemma Team , Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L \'e onard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram \'e , and 1 others. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118

Pith/arXiv arXiv 2024
[49]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024
[50]

Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2024. Spotting LLMs with binoculars: Zero-shot detection of machine-generated text. arXiv preprint arXiv:2401.12070

arXiv 2024
[51]

Jiwoo Hong, Noah Lee, and James Thorne. 2024. ORPO : Monolithic preference optimization without reference model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11170--11189

2024
[52]

Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. 2023. RADAR : Robust AI -text detection via adversarial learning. Advances in Neural Information Processing Systems, 36:15077--15095

2023
[53]

Varshney, Caiming Xiong, and Richard Socher

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL : A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858

Pith/arXiv arXiv 2019
[54]

o pf, Yannic Kilcher, Dimitri von R \

Andreas K \"o pf, Yannic Kilcher, Dimitri von R \"u tte , Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Rich \'a rd Nagyfi, and 1 others. 2023. OpenAssistant conversations: Democratizing large language model alignment. Advances in Neural Information Processing Systems, 36:47669--47681

2023
[55]

Bowman, and Ethan Perez

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R. Bowman, and Ethan Perez. 2023. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506--17533. PMLR

2023
[56]

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2021. GeDi : Generative discriminator guided sequence generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4929--4952

2021
[57]

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045--3059

2021
[58]

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582--4597

2021
[59]

Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Ruoyu Sun, and Zhi-Quan Luo. 2024. Entropic distribution matching in supervised fine-tuning of LLMs : Less overfitting and better diversity. In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability

2024
[60]

Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. SimPO : Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198--124235

2024
[61]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744

2022
[62]

Qwen Team , An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, and 24 others. 2024. https://arxiv.org/abs/2412.15115 Qwen2.5 technical report . Preprint, arXiv:2412.15115

Pith/arXiv arXiv 2024
[63]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728--53741

2023
[64]

Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao , Arun Raja, and 1 others

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao , Arun Raja, and 1 others. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations

2022
[65]

Christiano

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008--3021

2020
[66]

Smith, Daniel Khashabi, and Hannaneh Hajishirzi

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484--13508

2023
[67]

Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev

Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. 2024 a . HelpSteer2 : Open-source dataset for training top-performing reward models. arXiv preprint arXiv:2406.08673

arXiv 2024
[68]

Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Scowcroft, Neel Kant, Aidan Swope, and 1 others. 2024 b . HelpSteer : Multi-attribute helpfulness dataset for SteerLM . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Huma...

2024
[69]

Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations

2022
[70]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

Pith/arXiv arXiv 2025
[71]

Kevin Yang and Dan Klein. 2021. FUDGE : Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3511--3535

2021
[72]

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. RRHF : Rank responses to align language models with human feedback without tears. Advances in Neural Information Processing Systems, 36:10935--10950

2023
[73]

Dun Zeng, Yong Dai, Pengyu Cheng, Longyue Wang, Tianhao Hu, Wanshun Chen, Nan Du, and Zenglin Xu. 2024. On diversified preferences of large language model alignment. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9194--9210

[74]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, and 1 others. 2023. Judging LLM-as-a-judge with MT-Bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623

[75]

Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, and 1 others. 2023. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021


