- Variance-aware Reward Modeling with Anchor Guidance: Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
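The non-identifiability here is the usual affine freedom of preference-based rewards: under a Gaussian (Thurstone-style) comparison model, shifting or rescaling all rewards (with the noise scale absorbed) leaves every preference probability unchanged. Pinning two anchor responses to fixed reward values removes exactly that freedom. A minimal sketch of the anchoring step (the function name `anchor_calibrate` and the toy data are illustrative, not the paper's construction):

```python
import numpy as np

def anchor_calibrate(rewards, i0, i1, v0, v1):
    """Affinely re-map a reward vector so anchor responses i0, i1 take
    the pinned values v0, v1, removing the shift/scale freedom that
    makes Gaussian preference models non-identifiable (sketch)."""
    a = (v1 - v0) / (rewards[i1] - rewards[i0])  # recovered scale
    b = v0 - a * rewards[i0]                     # recovered shift
    return a * rewards + b

rng = np.random.default_rng(0)
true_r = rng.normal(size=6)
# Any positive affine transform of the rewards is indistinguishable
# from the original given only pairwise preference data:
scrambled = 3.0 * true_r + 7.0
# Pinning two anchors (here responses 0 and 1 at their true values)
# uniquely determines the affine map and recovers the rest:
recovered = anchor_calibrate(scrambled, 0, 1, true_r[0], true_r[1])
```

Two anchors are the minimum needed: one anchor fixes only the shift, while the scale (confounded with the comparison noise) stays free.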
- Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning: Softmax transformers with suitably chosen parameters implement iterative weighted softmax TD learning for in-context policy evaluation; the evaluation error decays across layers, and those parameters globally minimize the pretraining loss.
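The claimed mechanism, stripped to its core, is that each layer performs one TD(0) sweep in which every in-context transition's TD error is combined with softmax attention weights over source states. A rough numpy analogue under simplifying assumptions (tabular states, one "attention" pass per layer; this is an illustration of weighted softmax TD, not the paper's transformer construction):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_td_evaluation(transitions, n_states, n_layers=50,
                          gamma=0.5, lr=0.5, temp=0.1):
    """Each 'layer' applies one weighted TD(0) sweep: for every query
    state, the TD errors of all in-context transitions are combined
    with softmax weights based on source-state match (sketch)."""
    s = np.array([t[0] for t in transitions])   # source states
    r = np.array([t[1] for t in transitions])   # rewards
    s2 = np.array([t[2] for t in transitions])  # next states
    v = np.zeros(n_states)
    for _ in range(n_layers):
        td = r + gamma * v[s2] - v[s]           # per-transition TD errors
        # attention scores: does the transition start at the query state?
        sim = (s[None, :] == np.arange(n_states)[:, None]).astype(float)
        w = softmax(sim / temp, axis=1)         # (n_states, n_transitions)
        v = v + lr * (w @ td)                   # weighted TD update per state
    return v

# Toy 2-state loop: s0 -(r=0)-> s1, s1 -(r=1)-> s0.
# At gamma = 0.5 the true values are v0 = 2/3, v1 = 4/3.
ctx = [(0, 0.0, 1), (1, 1.0, 0)] * 8
values = softmax_td_evaluation(ctx, n_states=2)
```

With more layers the estimate contracts toward the Bellman fixed point, matching the "error decays over layers" claim in spirit.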
- Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition: A frozen average of the last two cycles matches or exceeds eight shape-learning alternatives on 97 GIFT-Eval configurations for periodic time series forecasting.
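The "frozen average of the last two cycles" baseline is simple enough to state in a few lines: take the elementwise mean of the last two full periods as a fixed shape template and tile it over the horizon, with no fitting at all. A minimal sketch assuming the period is known (function name and toy data are mine, not from the paper):

```python
import numpy as np

def forecast_last_two_cycle_average(series, period, horizon):
    """Forecast a periodic series by tiling the elementwise mean of its
    last two full cycles: a 'frozen' shape template, no learning."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full cycles of history")
    last = np.asarray(series[-period:], dtype=float)
    prev = np.asarray(series[-2 * period:-period], dtype=float)
    template = (last + prev) / 2.0          # frozen shape estimate
    reps = int(np.ceil(horizon / period))
    return np.tile(template, reps)[:horizon]

# Toy usage: a noiseless sine with period 12, forecast two cycles ahead.
t = np.arange(48)
y = np.sin(2 * np.pi * t / 12)
pred = forecast_last_two_cycle_average(y, period=12, horizon=24)
```

On noiseless periodic data the template reproduces the series exactly; averaging two cycles rather than copying one gives some robustness to noise in practice.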