Title resolution pending
Canonical reference. 71% of citing Pith papers cite this work as background.
citing papers explorer
- Pretraining Exposure Explains Popularity Judgments in Large Language Models
  LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
- ORPO: Monolithic Preference Optimization without Reference Model
  ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
- MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
  MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.
- Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games
  Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.
- Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
  STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineering evaluations.
- Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
  DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
- Bayesian Preference Learning for Test-Time Steerable Reward Models
  ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
- Debiasing Reward Models via Causally Motivated Inference-Time Intervention
  Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
- Best Policy Learning from Trajectory Preference Feedback
  PSPL maintains posteriors over reward models and dynamics to deliver the first Bayesian simple regret guarantees for PbRL and outperforms baselines on simulation and image generation tasks.
- Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization
  Diversity-regularized DPO fine-tuning of ProteinMPNN improves structural similarity scores by at least 8% over base model and sequence diversity by up to 20% over standard DPO for peptide inverse folding on OpenFold structures.
- Learning What Evaluators Value: A Reliable Approach to Modeling Evaluator Preferences
  Presents a robust algorithm for learning any coordinate-wise non-decreasing evaluator preference function, with theoretical guarantees that it matches linear performance when linearity holds.
- Failure Modes of Maximum Entropy RLHF
  Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
- Reinforcement Learning from Human Feedback