In: Proceedings of the 19th International Conference on World Wide Web
11 Pith papers cite this work.
citing papers explorer
- Shuffle and Joint Differential Privacy for Generalized Linear Contextual Bandits
  First shuffle-DP and joint-DP algorithms for GLM contextual bandits achieve near non-private regret without strong spectral assumptions on contexts.
- Multi-Armed Bandits with Arriving Arms: Sequential Screening, Dynamic Regret, and Sublinear Guarantees
  UCB-AA is a screening-enhanced UCB algorithm for bandits with arriving arms that delivers arrival-dependent regret bounds and sublinear dynamic regret under gap regularity conditions.
- OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
  OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
- Offline Local Search for Online Stochastic Bandits
  A generic conversion turns offline local search algorithms into online stochastic combinatorial bandit algorithms with O(log^3 T) approximate regret.
- Anytime-valid Optimal Policy Identification
  Constructs a time-indexed set S_t retaining the true optimal policy uniformly over time with high probability, enabling early stopping with sample complexity O((log |Π| + log log(1/Δ_min))/Δ_min²) when the optimum is unique.
- Pessimistic Risk-Aware Policy Learning in Contextual Bandits
  A distributional framework for optimizing Lipschitz risk functionals in offline contextual bandits yields data-dependent suboptimality bounds of Õ(1/√n) that match risk-neutral rates and are minimax optimal.
- Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
  RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.
- PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
  An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
- Calibrated Model-Based Deep Reinforcement Learning
  Augmenting model-based RL agents with calibrated predictive uncertainties improves planning, sample efficiency, and exploration on continuous control tasks.
- Contextual Scalarisation Thompson Sampling for multi-objective decisions in public media
  CSTS learns context-dependent weights for multiple objectives in a multi-objective contextual bandit and outperforms fixed-weight and standard contextual bandit baselines on Swiss public broadcaster programming data.
- The EDGE Language: Extended General Einsums for Graph Algorithms