HorizonBench: Long-Horizon Personalization with Evolving Preferences

2026 · cs.CL · arXiv 2604.17283

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

User preferences evolve across months of interaction, and tracking them requires inferring when a stated preference has been changed by a subsequent life event. We define this problem as long-horizon personalization and observe that progress on it is limited by data availability and measurement, with no existing resource providing both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, yielding ground-truth provenance for every preference change across 6-month timelines, and from it construct HorizonBench, a benchmark of 4,245 items from 360 simulated users with 6-month conversation histories averaging ~4,300 turns and ~163K tokens. HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. Across 25 frontier models, the best model reaches 52.8% and most score at or below the 20% chance baseline. When these models err on evolved preferences, over a third of the time they select the user's originally stated value without tracking the updated user state. This belief-update failure persists across context lengths and expression explicitness levels, identifying state-tracking capability as the primary bottleneck for long-horizon personalization.
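The two headline numbers in the abstract, overall accuracy and the share of errors that fall back to the originally stated value, can be illustrated with a minimal sketch. The `Item` fields, the `score` helper, and the toy data are hypothetical and are not taken from the HorizonBench release; the five-option format is an assumption inferred from the 20% chance baseline.

```python
from dataclasses import dataclass


@dataclass
class Item:
    gold: str       # option matching the user's current (evolved) preference
    original: str   # option matching the originally stated preference
    predicted: str  # option the model selected


def score(items):
    """Return (accuracy, share of errors that pick the original value)."""
    correct = sum(it.predicted == it.gold for it in items)
    errors = [it for it in items if it.predicted != it.gold]
    stale = sum(it.predicted == it.original for it in errors)
    accuracy = correct / len(items) if items else 0.0
    stale_share = stale / len(errors) if errors else 0.0
    return accuracy, stale_share


# Toy data for illustration only (not benchmark results).
items = [
    Item(gold="B", original="A", predicted="B"),  # correct
    Item(gold="C", original="A", predicted="A"),  # error: stale original value
    Item(gold="D", original="E", predicted="C"),  # error: other distractor
    Item(gold="A", original="B", predicted="B"),  # error: stale original value
]

accuracy, stale_share = score(items)
chance = 1 / 5  # assumption: five answer options per item

print(f"accuracy={accuracy:.2f} chance={chance:.2f} stale_share={stale_share:.2f}")
```

Under these assumptions, a stale share well above what uniform guessing among the wrong options would give points to a failure to update the tracked preference rather than to random error.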

representative citing papers

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

Introduces the EvoArena benchmark for dynamic LLM agent environments and the EvoMem memory paradigm, which improves accuracy by 1.5% on EvoArena and also yields gains on GAIA and LoCoMo.

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

MemGate is a 9M-parameter neural gate inserted between vector memory and LLM that converts similarity search into task-conditioned admission, reducing memory-induced threats across agent frameworks while preserving utility.

citing papers explorer

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

cs.AI · 2026-06-15 · unverdicted · none · ref 34 · internal anchor

MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

cs.AI · 2026-06-04 · unverdicted · none · ref 18 · internal anchor

MemGate is a 9M-parameter neural gate inserted between vector memory and LLM that converts similarity search into task-conditioned admission, reducing memory-induced threats across agent frameworks while preserving utility.
