MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.
The movielens datasets: History and context.Acm transactions on interactive intelligent systems (tiis), 5(4):1–19
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5verdicts
UNVERDICTED 5roles
dataset 3polarities
use dataset 3representative citing papers
UniRank unifies autoregressive and non-autoregressive list-wise reranking via bidirectional modeling in a confidence-ordered iterative denoising process, outperforming baselines on datasets and online tests.
Introduces OmniBehavior benchmark from real-world data and shows LLMs exhibit hyper-activity, persona homogenization, and utopian bias in behavior simulation.
RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.
ABPO combines group-relative policy optimization with anchored exposure correction and asymmetric feedback handling to enable effective continual updates for LLM recommenders under bandit feedback constraints.
citing papers explorer
-
MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.
-
UniRank: Unified List-wise Reranking via Confidence-Ordered Denoising
UniRank unifies autoregressive and non-autoregressive list-wise reranking via bidirectional modeling in a confidence-ordered iterative denoising process, outperforming baselines on datasets and online tests.
-
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Introduces OmniBehavior benchmark from real-world data and shows LLMs exhibit hyper-activity, persona homogenization, and utopian bias in behavior simulation.
-
RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation
RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.
-
Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target
ABPO combines group-relative policy optimization with anchored exposure correction and asymmetric feedback handling to enable effective continual updates for LLM recommenders under bandit feedback constraints.