SLiC-HF: Sequence likelihood calibration with human feedback
7 Pith papers cite this work.
citing papers
- Mind the Gap: Structure-Aware Consistency in Preference Learning
  Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance, and shows that heavy-tailed losses yield superior guarantees for capacity-bounded models via the Margin-Capacity Profile (a distance-adaptive margin is sketched after this list).
- KTO: Model Alignment as Prospect Theoretic Optimization
  KTO aligns LLMs by directly maximizing a prospect-theoretic utility over binary desirable/undesirable signals, and matches or exceeds preference-based methods such as DPO at scales from 1B to 30B parameters (loss sketched after this list).
- Self-Rewarding Language Models
  Iterative self-rewarding, with the model acting as its own LLM-as-a-Judge to build DPO preference pairs, improves Llama 2 70B's instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0 (training loop sketched after this list).
- CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
  CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
- Data-dependent Exploration for Online Reinforcement Learning from Human Feedback
  DEPO builds a data-dependent uncertainty bonus from historical data to drive exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines (bonus sketched after this list).
- Anomaly-Preference Image Generation
  Anomaly Preference Optimization reformulates anomalous image synthesis as preference learning, using implicit alignment signals from real anomalies and a time-aware capacity allocation module that lets diffusion models balance diversity and fidelity.
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning with representation subspaces and orthogonal regularization, outperforming prior methods on the forget-retain trade-off in LLM unlearning benchmarks (penalties sketched after this list).
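Several of the entries above describe objectives concretely enough to sketch. First, SA-DPO's distance-adaptive margin: a minimal sketch assuming the standard DPO implicit-reward formulation and a margin that grows linearly with a precomputed semantic distance. The function name, the `gamma` scale, and the linear schedule are illustrative assumptions, not the paper's actual Margin-Capacity Profile.

```python
import torch.nn.functional as F

def sa_dpo_loss(policy_logps_w, policy_logps_l,
                ref_logps_w, ref_logps_l,
                semantic_dist, beta=0.1, gamma=1.0):
    """Hypothetical structure-aware DPO loss: the required implicit-reward
    gap grows with the semantic distance between the chosen and rejected
    responses. The linear margin schedule is an assumption."""
    # Implicit reward difference, as in standard DPO.
    logits = (policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l)
    # Semantically distant pairs must be separated by a larger margin.
    margin = gamma * semantic_dist
    return -F.logsigmoid(beta * logits - margin).mean()
```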
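For KTO, a minimal sketch of a prospect-theoretic loss on binary desirable/undesirable signals, following the published formulation; the tensor layout, hyperparameter names, and how the batch KL estimate is passed in are assumptions.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             kl_estimate, beta=0.1, lam_d=1.0, lam_u=1.0):
    """Sketch of a KTO-style loss on binary feedback.
    policy_logps / ref_logps: per-example sequence log-probs.
    is_desirable: boolean tensor marking thumbs-up examples.
    kl_estimate: detached batch estimate of KL(policy || reference),
    serving as the prospect-theoretic reference point z0."""
    reward = policy_logps - ref_logps        # implicit reward r_theta
    z0 = kl_estimate.detach().clamp(min=0)   # reference point
    # Desirable outputs are pushed above z0, undesirable ones below it.
    value = torch.where(
        is_desirable,
        lam_d * torch.sigmoid(beta * (reward - z0)),
        lam_u * torch.sigmoid(beta * (z0 - reward)),
    )
    # The per-example weight lambda_y is constant w.r.t. the policy, so
    # maximizing utility reduces to minimizing -value.
    return -value.mean()
```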
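For Self-Rewarding Language Models, a sketch of the iterative LLM-as-a-Judge loop. Every callable here (`generate`, `judge`, `dpo_train`) is a hypothetical placeholder standing in for real generation, judging, and DPO training code, not an official API.

```python
def self_rewarding_rounds(model, prompts, judge_prompt,
                          generate, judge, dpo_train, rounds=3):
    """Hedged sketch of iterative self-rewarding training.
    generate(model, prompt, n) -> n candidate responses
    judge(model, judge_prompt, prompt, resp) -> scalar score (the model
        grades its own outputs, LLM-as-a-Judge style)
    dpo_train(model, pairs) -> model trained with DPO on
        (prompt, chosen, rejected) triples"""
    for _ in range(rounds):
        pairs = []
        for p in prompts:
            cands = generate(model, p, n=4)
            ranked = sorted(cands, key=lambda y: judge(model, judge_prompt, p, y))
            # Highest- vs lowest-scored candidates form a preference pair.
            pairs.append((p, ranked[-1], ranked[0]))
        # M_{t+1} is trained from M_t's own judgments, then judges next round.
        model = dpo_train(model, pairs)
    return model
```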
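For DEPO, a sketch of one plausible data-dependent exploration bonus: an elliptical, linear-bandit-style uncertainty term built from the covariance of historical features. Directions well covered by historical data get small bonuses, so exploration adapts to the data. The feature map and this particular bonus form are assumptions, not the paper's exact construction.

```python
import numpy as np

def data_dependent_bonus(phi_hist, phi_new, alpha=1.0, lam=1.0):
    """Hedged sketch of a data-dependent exploration bonus.
    phi_hist: (n, d) features of historical prompt-response pairs.
    phi_new:  (m, d) features of candidate responses to explore."""
    d = phi_hist.shape[1]
    # Historical covariance regularizes the uncertainty estimate.
    cov = phi_hist.T @ phi_hist + lam * np.eye(d)
    cov_inv = np.linalg.inv(cov)
    # Elliptical-potential bonus: sqrt(phi^T cov^{-1} phi) per candidate.
    return alpha * np.sqrt(np.einsum("md,de,me->m", phi_new, cov_inv, phi_new))
```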
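For REGLU, a sketch of two representation-guided penalties matching the summary's description: pushing forget-set hidden states out of a forget subspace while keeping LoRA update directions orthogonal to a retain subspace. The subspace construction and the exact penalties are assumptions.

```python
import torch

def unlearning_regularizers(hidden, forget_basis, retain_basis, lora_B):
    """Hedged sketch of representation-guided unlearning penalties.
    hidden:       (n, d) hidden states on forget-set inputs.
    forget_basis: (d, k) orthonormal basis of the forget subspace.
    retain_basis: (d, r) orthonormal basis of the retain subspace.
    lora_B:       (d, rank) trainable LoRA update directions."""
    # Penalize the component of forget-set representations that lies
    # inside the forget subspace, driving it toward zero ...
    forget_pen = (hidden @ forget_basis).pow(2).sum(dim=1).mean()
    # ... while keeping the LoRA update roughly orthogonal to the retain
    # subspace, so retained knowledge is disturbed as little as possible.
    ortho_pen = (retain_basis.T @ lora_B).pow(2).sum()
    return forget_pen, ortho_pen
```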