First shuffle-DP and joint-DP algorithms for GLM contextual bandits achieve near non-private regret without strong spectral assumptions on contexts.
In: Proceedings of the 19th international conference on World Wide Web
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
UCB-AA is a screening-enhanced UCB algorithm for bandits with arriving arms that delivers arrival-dependent regret bounds and sublinear dynamic regret under gap regularity conditions.
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
A generic conversion turns offline local search algorithms into online stochastic combinatorial bandit algorithms with O(log^3 T) approximate regret.
Constructs a time-indexed set S_t retaining the true optimal policy uniformly over time with high probability, enabling early stopping with sample complexity O((log |Π| + log log(1/Δ_min))/Δ_min²) when the optimum is unique.
A distributional framework for optimizing Lipschitz risk functionals in offline contextual bandits yields data-dependent suboptimality bounds of Õ(1/√n) that match risk-neutral rates and are minimax optimal.
RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
Augmenting model-based RL agents with calibrated predictive uncertainties improves planning, sample efficiency, and exploration on continuous control tasks.
CSTS learns context-dependent weights for multiple objectives in a multi-objective contextual bandit and outperforms fixed-weight and standard contextual bandit baselines on Swiss public broadcaster programming data.
citing papers explorer
-
OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.