AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
Personafeedback: A large-scale human-annotated benchmark for personalization. arXiv preprint arXiv:2506.12915,
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Representative citing papers
BehaviorBench reconstructs 2,000 real wallets into 141k belief and 1.4M trade prediction tasks to test if personalization from history improves model performance over non-personalized baselines.
PHF applies Bourdieu's Theory of Practice to create hierarchical user models for LLM personalization and reports consistent gains on the LaMP benchmark.
citing papers explorer
- BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces