Statistical Inference for Misspecified Contextual Bandits

Yongyi Guo; Ziping Xu

arxiv: 2606.22639 · v1 · pith:BH5NAA2Mnew · submitted 2026-06-21 · 📊 stat.ML · cs.LG

Statistical Inference for Misspecified Contextual Bandits

Yongyi Guo , Ziping Xu This is my paper

Pith reviewed 2026-06-26 09:29 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords contextual banditsmisspecificationinverse probability weightingZ-estimationasymptotic normalityadaptive experimentsstability conditionsstatistical inference

0 comments

The pith

Scaled inverse-propensity convergence ensures consistent and asymptotically normal inference for misspecified contextual bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard inference methods break down with contextual-bandit data when outcome models are misspecified, as adaptive algorithms like LinUCB can fail to stabilize and produce non-Gaussian estimator behavior. It develops an inverse-probability-weighted Z-estimation framework for marginal moment targets such as projection parameters, structural parameters with noisy contexts, and off-policy values. The central stability condition of scaled inverse-propensity convergence is shown to deliver consistency, asymptotic normality, and a consistent sandwich variance estimator. Sufficient conditions for this convergence are derived for multi-armed bandit algorithms and smooth contextual allocation policies. Simulations and a HeartSteps V1 application confirm reliable coverage across targets.

Core claim

Under the scaled inverse-propensity convergence condition, the IPW-Z estimator for a broad class of marginal moment targets is consistent and asymptotically normal with a consistent sandwich variance estimator, even without a well-specified outcome model.

What carries the argument

The IPW-Z estimator, which applies inverse probability weighting to Z-estimating equations, stabilized by the scaled inverse-propensity convergence condition.

If this is right

The IPW-Z estimator applies to projection parameters, structural parameters with noisy contexts, and off-policy values.
Sufficient conditions establish scaled inverse-propensity convergence for multi-armed bandit algorithms and smooth contextual allocation policies.
Simulations and real-data application demonstrate reliable coverage and competitive performance.
Misspecification can cause non-Gaussian behavior in standard algorithms without the stability condition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of adaptive experiments may need to verify or enforce the convergence condition to support later inference.
The framework could apply to other adaptive designs if analogous stability conditions can be checked.
Misspecification effects suggest prioritizing stability alongside reward in online policy selection.

Load-bearing premise

The scaled inverse-propensity convergence condition holds for the policy classes under consideration.

What would settle it

An example where scaled inverse-propensity convergence holds yet the IPW-Z estimator fails to be asymptotically normal or lacks consistent sandwich variance would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.22639 by Yongyi Guo, Ziping Xu.

**Figure 2.** Figure 2: (a) Empirical coverages of 95%, 90%, 80%, 70%, 60%, and 50% confidence intervals vs. the target coverage for Target 1. (b) Empirical coverages of 95% confidence interval over 10,000 steps. Results averaged across 2,500 Monte Carlo simulations. Error bars/shaded bands denote ±2 Monte Carlo standard errors. w.r.t. Ridge regression estimator in five environments introduced above. In [PITH_FULL_IMAGE:figures/… view at source ↗

**Figure 3.** Figure 3: (a) Empirical coverages of 95%, 90%, 80%, 70%, 60%, and 50% confidence intervals vs. the target coverage for Target 2. (b) Empirical coverages of 95% confidence interval over 10,000 steps under three noisy context environments. Error bars/shaded bands denote ±2 Monte Carlo standard errors. into a single bank and drawing one record with replacement at each time t. The potential outcome is linear in the true… view at source ↗

**Figure 4.** Figure 4: (a) Empirical coverage for confidence intervals based on CADR and AW through StableVar under different prediction models and our proposed inference method across five environments. (b) Monte Carlo estimates (based on 2,500 samples) of the variance of OPE target estimator based on CADR and AW through StableVar under different prediction models and our proposed inference method across five environments over… view at source ↗

**Figure 5.** Figure 5: (a) Empirical coverage versus target level for the dosage-by-treatment interaction in the HeartSteps V1 environment, at the final horizon, across four behavior policies. (b) Empirical 90% coverage over T. Results averaged across 2,000 Monte Carlo simulations; shaded bands denote ±2 Monte Carlo standard errors. 5.4 Acknowledgment Yongyi Guo acknowledges support from the National Science Foundation under Gra… view at source ↗

**Figure 6.** Figure 6: Variance of the IPW-Z estimator over 10,000 steps for Target [PITH_FULL_IMAGE:figures/full_fig_p068_6.png] view at source ↗

read the original abstract

Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment. Yet these advantages create challenges for statistical inference due to adaptivity. We study inference with contextual-bandit data without assuming a well-specified outcome model. In this setting, we show a previously overlooked issue: standard algorithms such as LinUCB may fail to stabilize under misspecified working models, leading to non-Gaussian estimator behavior and invalid inference. This issue is practically important, as misspecified working models -- such as approximations of complex dynamical systems -- are often employed by online agents in real-world adaptive experiments to balance reward, computational tractability, and robustness. We develop an inverse-probability-weighted Z-estimation framework for a broad class of marginal moment targets, including projection parameters, structural parameters with noisy contexts, and off-policy values. We identify a stability condition tailored to this framework, scaled inverse-propensity convergence, under which the IPW-Z estimator is consistent and asymptotically normal with a consistent sandwich variance estimator. We further establish sufficient conditions for scaled inverse-propensity convergence for several policy classes, including multi-armed bandit algorithms and smooth contextual allocation policies. Simulations and a HeartSteps V1 real-data-calibrated application show reliable coverage and competitive performance across multiple targets. Overall, our results highlight the importance of stability-aware adaptive design for valid post-experiment inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags LinUCB instability under misspecification and gives a workable IPW-Z estimator plus a scaled inverse-propensity stability condition that delivers consistency and asymptotic normality.

read the letter

The core contribution is identifying that standard LinUCB can produce non-Gaussian behavior when the working model is misspecified, then building an inverse-probability-weighted Z-estimator for marginal moment targets that recovers valid inference once a tailored stability condition holds.

What stands out is the derivation of sufficient conditions for that stability condition in multi-armed bandits and smooth contextual policies. The simulations and HeartSteps calibration show the estimator achieves reliable coverage where naive approaches do not. That is concrete progress on a problem that shows up in deployed adaptive experiments.

The main limitation is that the stability condition is stated as an assumption rather than something automatically satisfied by the algorithm; users still need to verify it for their policy class. The paper supplies checks for the cases it treats, so the gap is narrower than it first appears, but it remains a practical hurdle.

This work is aimed at statisticians and experimenters who analyze data from contextual bandit deployments and cannot assume correct outcome models. The framing is direct, the targets are relevant, and the technical steps look internally consistent from the abstract and stress-test notes.

I would send it to peer review. The problem is real, the proposed fix is specific, and the evidence presented is enough to merit referee scrutiny even if revisions are needed on the scope of the stability condition.

Referee Report

1 major / 2 minor

Summary. The paper develops an inverse-probability-weighted Z-estimation (IPW-Z) framework for inference on marginal moment targets (including projections, structural parameters, and off-policy values) in contextual bandits under outcome model misspecification. It introduces a stability condition called scaled inverse-propensity convergence, under which the IPW-Z estimator is consistent, asymptotically normal, and admits a consistent sandwich variance estimator. Sufficient conditions for this stability are derived for multi-armed bandit algorithms and smooth contextual allocation policies. Simulations and a HeartSteps V1 real-data application are used to illustrate reliable coverage.

Significance. If the central results hold, the work provides a practical route to valid post-experiment inference in adaptive designs without requiring a correctly specified outcome model, which is common in real-world applications. The derivation of sufficient conditions for standard policy classes (multi-armed bandits and smooth contextual policies) and the emphasis on stability-aware design are concrete strengths that could influence both theory and experimental practice.

major comments (1)

[Abstract / main theoretical section] The central claim rests on the scaled inverse-propensity convergence condition being sufficient for consistency, asymptotic normality, and sandwich variance consistency of the IPW-Z estimator. While the abstract states that sufficient conditions are established for the policy classes considered, the manuscript should explicitly verify (via the paper's own equations) that these conditions are not circular with the estimator definition itself and that they are checkable from observable quantities.

minor comments (2)

[Introduction] Clarify the precise definition of the IPW-Z estimator and the marginal moment targets at the first use in the introduction, as these are central to the framework.
[Simulations] The simulations section should report the exact policy classes, misspecification levels, and sample sizes used, to allow readers to assess how well the sufficient conditions are tested.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive comment on clarifying the scaled inverse-propensity convergence condition. We address the point below and will incorporate the requested verification in the revision.

read point-by-point responses

Referee: [Abstract / main theoretical section] The central claim rests on the scaled inverse-propensity convergence condition being sufficient for consistency, asymptotic normality, and sandwich variance consistency of the IPW-Z estimator. While the abstract states that sufficient conditions are established for the policy classes considered, the manuscript should explicitly verify (via the paper's own equations) that these conditions are not circular with the estimator definition itself and that they are checkable from observable quantities.

Authors: We agree that an explicit verification strengthens the presentation. The scaled inverse-propensity convergence condition (Definition 3.1) is formulated exclusively in terms of the policy-generated propensity sequence {π_t(a|x)} and the target moment function m(θ; x, a, y), with no dependence on the outcome model or the IPW-Z estimating equation itself; hence it is non-circular by construction. In the revision we will add a short paragraph immediately after Definition 3.1 that quotes the relevant equations and states this independence explicitly. Regarding checkability, the sufficient conditions in Theorems 4.1 (multi-armed bandits) and 4.2 (smooth contextual policies) are expressed in terms of observable or design quantities: the minimum per-arm allocation probability (verifiable from the realized counts N_t(a)) and the policy Lipschitz constant (a known design parameter). We will insert a new Remark 4.3 that lists these quantities and notes how they can be inspected from the experimental log without reference to the outcome data. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces the scaled inverse-propensity convergence condition as an external stability requirement and separately derives sufficient conditions for it under multi-armed bandit and smooth contextual policies. These steps rely on standard asymptotic arguments for inverse-probability weighting and Z-estimation rather than any self-definition, fitted-parameter renaming, or load-bearing self-citation chain. The consistency, asymptotic normality, and sandwich variance results are direct consequences of the stated condition plus the derived sufficient conditions, with no reduction of outputs to inputs by construction visible in the provided material.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard Z-estimation assumptions plus the newly introduced stability condition; no free parameters or invented entities are evident from the abstract.

axioms (2)

domain assumption Outcome model may be misspecified
Core premise allowing the framework to apply beyond well-specified cases.
domain assumption Scaled inverse-propensity convergence holds for the policies
Load-bearing condition identified and claimed to be sufficient for the asymptotic results.

pith-pipeline@v0.9.1-grok · 5762 in / 1227 out tokens · 22303 ms · 2026-06-26T09:29:23.362945+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

95 extracted references · 1 canonical work pages

[1]

Inbal Nahum-Shani, Shawna N Smith, Bonnie J Spring, Linda M Collins, Katie Witkiewitz, Ambuj Tewari, and Susan A Murphy. Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support.Annals of behavioral medicine, pages 1–17, 2016

2016
[2]

A systematic review of just-in-time adaptive interventions (jitais) to promote physical activity

Wendy Hardeman, Julie Houghton, Kathleen Lane, Andy Jones, and Felix Naughton. A systematic review of just-in-time adaptive interventions (jitais) to promote physical activity. International Journal of Behavioral Nutrition and Physical Activity, 16(1):31, 2019

2019
[3]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

2010
[4]

Automatic ad format selec- tion via contextual bandits

Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selec- tion via contextual bandits. InProceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1587–1594, 2013

2013
[5]

Scaling up behavioral science interventions in online education.Proceedings of the National Academy of Sciences, 117(26): 14900–14905, 2020

Ren´ e F Kizilcec, Justin Reich, Michael Yeomans, Christoph Dann, Emma Brunskill, Glenn Lopez, Selen Turkay, Joseph Jay Williams, and Dustin Tingley. Scaling up behavioral science interventions in online education.Proceedings of the National Academy of Sciences, 117(26): 14900–14905, 2020

2020
[6]

Mining big data in education: Affordances and challenges.Review of research in education, 44(1):130–160, 2020

Christian Fischer, Zachary A Pardos, Ryan Shaun Baker, Joseph Jay Williams, Padhraic Smyth, Renzhe Yu, Stefan Slater, Rachel Baker, and Mark Warschauer. Mining big data in education: Affordances and challenges.Review of research in education, 44(1):130–160, 2020

2020
[7]

Adaptive experimental design: Prospects and applications in political science.American Journal of Political Science, 65(4): 826–844, 2021

Molly Offer-Westort, Alexander Coppock, and Donald P Green. Adaptive experimental design: Prospects and applications in political science.American Journal of Political Science, 65(4): 826–844, 2021

2021
[8]

A mobile health intervention for emerging adults with regular cannabis use: A micro-randomized pilot trial design protocol.Contemporary Clinical Trials, 145:107667, 2024

Lara N Coughlin, Maya Campbell, Tiffany Wheeler, Chavez Rodriguez, Autumn Rae Florim- bio, Susobhan Ghosh, Yongyi Guo, Pei-Yao Hung, Mark W Newman, Huijie Pan, et al. A mobile health intervention for emerging adults with regular cannabis use: A micro-randomized pilot trial design protocol.Contemporary Clinical Trials, 145:107667, 2024

2024
[9]

The impact of using reinforcement learning to personalize communication 75 on medication adherence: findings from the reinforce trial.npj Digital Medicine, 7(1):39, 2024

Julie C Lauffenburger, Elad Yom-Tov, Punam A Keller, Marie E McDonnell, Katherine L Crum, Gauri Bhatkhande, Ellen S Sears, Kaitlin Hanken, Lily G Bessette, Constance P Fontanet, et al. The impact of using reinforcement learning to personalize communication 75 on medication adherence: findings from the reinforce trial.npj Digital Medicine, 7(1):39, 2024

2024
[10]

Optimizing an adaptive digital oral health intervention for promoting oral self-care behaviors: Micro-randomized trial protocol.Contemporary clinical trials, 139:107464, 2024

Inbal Nahum-Shani, Zara M Greer, Anna L Trella, Kelly W Zhang, Stephanie M Carpenter, Dennis Ruenger, David Elashoff, Susan A Murphy, and Vivek Shetty. Optimizing an adaptive digital oral health intervention for promoting oral self-care behaviors: Micro-randomized trial protocol.Contemporary clinical trials, 139:107464, 2024

2024
[11]

Learning to optimize via posterior sampling.Mathe- matics of Operations Research, 39(4):1221–1243, 2014

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling.Mathe- matics of Operations Research, 39(4):1221–1243, 2014

2014
[12]

Estimation consid- erations in contextual bandits.arXiv preprint arXiv:1711.07077, 2017

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Estimation consid- erations in contextual bandits.arXiv preprint arXiv:1711.07077, 2017

Pith/arXiv arXiv 2017
[13]

A deployed online reinforcement learning algorithm in an oral health clinical trial

Anna L Trella, Kelly W Zhang, Hinal Jajal, Inbal Nahum-Shani, Vivek Shetty, Finale Doshi- Velez, and Susan A Murphy. A deployed online reinforcement learning algorithm in an oral health clinical trial. InProceedings of the AAAI Conference on Artificial Intelligence, vol- ume 39, pages 28792–28800, 2025

2025
[14]

Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning.arXiv preprint arXiv:2211.12004, 2022

Susan Athey, Undral Byambadalai, Vitor Hadad, Sanath Kumar Krishnamurthy, Weiwen Le- ung, and Joseph Jay Williams. Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning.arXiv preprint arXiv:2211.12004, 2022

arXiv 2022
[15]

From ads to interventions: Contextual bandits in mobile health

Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. InMobile health: sensors, analytic methods, and applications, pages 495–517. Springer, 2017

2017
[16]

Misspecified linear bandits

Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

2017
[17]

Adapting to mis- specification in contextual bandits.Advances in Neural Information Processing Systems, 33: 11478–11489, 2020

Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to mis- specification in contextual bandits.Advances in Neural Information Processing Systems, 33: 11478–11489, 2020

2020
[18]

Learning with good feature representa- tions in bandits and in rl with a generative model

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representa- tions in bandits and in rl with a generative model. InInternational conference on machine learning, pages 5662–5670. PMLR, 2020

2020
[19]

Tractable contextual bandits beyond realizability

Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey. Tractable contextual bandits beyond realizability. InInternational Conference on Artificial Intelligence and Statistics, pages 1423–1431. PMLR, 2021

2021
[20]

Adapting to misspecification in contextual bandits with offline regression oracles

Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey. Adapting to misspecification in contextual bandits with offline regression oracles. InInternational Conference on Machine Learning, pages 5805–5814. PMLR, 2021

2021
[21]

Accurate inference for adaptive linear models

Yash Deshpande, Lester Mackey, Vasilis Syrgkanis, and Matt Taddy. Accurate inference for adaptive linear models. InInternational Conference on Machine Learning, pages 1194–1203. PMLR, 2018. 76

2018
[22]

Assessing time- varying causal effect moderation in mobile health.Journal of the American Statistical Asso- ciation, 113(523):1112–1121, 2018

Audrey Boruvka, Daniel Almirall, Katie Witkiewitz, and Susan A Murphy. Assessing time- varying causal effect moderation in mobile health.Journal of the American Statistical Asso- ciation, 113(523):1112–1121, 2018

2018
[23]

Estimat- ing time-varying causal excursion effects in mobile health with binary outcomes.Biometrika, 108(3):507–527, 2021

Tianchen Qian, Hyesun Yoo, Predrag Klasnja, Daniel Almirall, and Susan A Murphy. Estimat- ing time-varying causal excursion effects in mobile health with binary outcomes.Biometrika, 108(3):507–527, 2021

2021
[24]

Semi-parametric inference based on adaptively collected data.arXiv preprint arXiv:2303.02534, 2023

Licong Lin, Koulik Khamaru, and Martin J Wainwright. Semi-parametric inference based on adaptively collected data.arXiv preprint arXiv:2303.02534, 2023

arXiv 2023
[25]

Post-episodic reinforcement learning inference.arXiv e-prints, pages arXiv–2302, 2023

Vasilis Syrgkanis and Ruohan Zhan. Post-episodic reinforcement learning inference.arXiv e-prints, pages arXiv–2302, 2023

2023
[26]

On conditional least squares estimation for stochastic processes.The Annals of statistics, pages 629–642, 1978

Lawrence A Klimko and Paul I Nelson. On conditional least squares estimation for stochastic processes.The Annals of statistics, pages 629–642, 1978

1978
[27]

Asymptotic properties of nonlinear least squares estimates in stochastic re- gression models.The Annals of Statistics, pages 1917–1930, 1994

Tze Leung Lai. Asymptotic properties of nonlinear least squares estimates in stochastic re- gression models.The Annals of Statistics, pages 1917–1930, 1994

1917
[28]

Statistical inference with m-estimators on adaptively collected data.Advances in neural information processing systems, 34:7460–7471, 2021

Kelly Zhang, Lucas Janson, and Susan Murphy. Statistical inference with m-estimators on adaptively collected data.Advances in neural information processing systems, 34:7460–7471, 2021

2021
[29]

Statistical inference for online decision making: In a contextual bandit setting.Journal of the American Statistical Association, 116(533):240–255, 2021

Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making: In a contextual bandit setting.Journal of the American Statistical Association, 116(533):240–255, 2021

2021
[30]

Off-policy evaluation via adaptive weighting with data from contextual bandits

Ruohan Zhan, Vitor Hadad, David A Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2125–2135, 2021

2021
[31]

Post-contextual-bandit inference.Advances in neural information processing sys- tems, 34:28548–28559, 2021

Aur´ elien Bibaut, Maria Dimakopoulou, Nathan Kallus, Antoine Chambaz, and Mark van Der Laan. Post-contextual-bandit inference.Advances in neural information processing sys- tems, 34:28548–28559, 2021

2021
[32]

Statistical inference after adaptive sampling for longitudinal data.arXiv preprint arXiv:2202.07098, 2022

Kelly W Zhang, Lucas Janson, and Susan A Murphy. Statistical inference after adaptive sampling for longitudinal data.arXiv preprint arXiv:2202.07098, 2022

arXiv 2022
[33]

Replicable bandits for digital health interventions.arXiv preprint arXiv:2407.15377, 2024

Kelly W Zhang, Nowell Closser, Anna L Trella, and Susan A Murphy. Replicable bandits for digital health interventions.arXiv preprint arXiv:2407.15377, 2024

arXiv 2024
[34]

Inference with the upper confidence bound algorithm

Koulik Khamaru and Cun-Hui Zhang. Inference with the upper confidence bound algorithm. arXiv preprint arXiv:2408.04595, 2024

arXiv 2024
[35]

Stable thompson sampling: Valid inference via variance inflation.arXiv preprint arXiv:2505.23260, 2025

Budhaditya Halder, Shubhayan Pan, and Koulik Khamaru. Stable thompson sampling: Valid inference via variance inflation.arXiv preprint arXiv:2505.23260, 2025. 77

arXiv 2025
[36]

Why adaptively collected data have negative bias and how to correct for it

Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. InInternational Conference on Artificial Intelligence and Statistics, pages 1261–1269. PMLR, 2018

2018
[37]

Inference for batched bandits.Advances in neural information processing systems, 33:9818–9829, 2020

Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits.Advances in neural information processing systems, 33:9818–9829, 2020

2020
[38]

Strong consistency of least squares estimates in dynamic models.The annals of Statistics, 7(3):484–489, 1979

TW Anderson and John B Taylor. Strong consistency of least squares estimates in dynamic models.The annals of Statistics, 7(3):484–489, 1979

1979
[39]

Strong consistency of least squares estimators in linear regression models.The Annals of Statistics, 8(4):778–788, 1980

Norbert Christopeit and Kurt Helmes. Strong consistency of least squares estimators in linear regression models.The Annals of Statistics, 8(4):778–788, 1980

1980
[40]

Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems.The Annals of Statistics, pages 154–166, 1982

Tze Leung Lai and Ching Zong Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems.The Annals of Statistics, pages 154–166, 1982

1982
[41]

Strong consistency of maximum quasi-likelihood esti- mators in generalized linear models with fixed and adaptive designs.The Annals of Statistics, 27(4):1155–1163, 1999

Kani Chen, Inchi Hu, and Zhiliang Ying. Strong consistency of maximum quasi-likelihood esti- mators in generalized linear models with fixed and adaptive designs.The Annals of Statistics, 27(4):1155–1163, 1999

1999
[42]

Near-optimal inference in adaptive linear regression.arXiv preprint arXiv:2107.02266, 2021

Koulik Khamaru, Yash Deshpande, Tor Lattimore, Lester Mackey, and Martin J Wainwright. Near-optimal inference in adaptive linear regression.arXiv preprint arXiv:2107.02266, 2021

arXiv 2021
[43]

Statistical inference for online decision making via stochastic gradient descent.Journal of the American Statistical Association, 116(534):708–719, 2021

Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making via stochastic gradient descent.Journal of the American Statistical Association, 116(534):708–719, 2021

2021
[44]

Ucb algorithms for multi-armed bandits: Precise regret and adaptive inference.arXiv preprint arXiv:2412.06126, 2024

Qiyang Han, Koulik Khamaru, and Cun-Hui Zhang. Ucb algorithms for multi-armed bandits: Precise regret and adaptive inference.arXiv preprint arXiv:2412.06126, 2024. URLhttps: //arxiv.org/abs/2412.06126

arXiv 2024
[45]

Confidence intervals for policy evaluation in adaptive experiments.Proceedings of the national academy of sciences, 118(15):e2014602118, 2021

Vitor Hadad, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments.Proceedings of the national academy of sciences, 118(15):e2014602118, 2021

2021
[46]

Anytime-valid off-policy inference for contextual bandits.ACM/IMS Journal of Data Sci- ence, 1(3):1–42, 2024

Ian Waudby-Smith, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, and Paul Mineiro. Anytime-valid off-policy inference for contextual bandits.ACM/IMS Journal of Data Sci- ence, 1(3):1–42, 2024

2024
[47]

Off-policy estimation of long-term average outcomes with applications to mobile health.Journal of the American Statistical Association, 116(533):382–391, 2021

Peng Liao, Predrag Klasnja, and Susan Murphy. Off-policy estimation of long-term average outcomes with applications to mobile health.Journal of the American Statistical Association, 116(533):382–391, 2021

2021
[48]

Batch policy learning in average reward markov decision processes.Annals of statistics, 50(6):3364, 2022

Peng Liao, Zhengling Qi, Runzhe Wan, Predrag Klasnja, and Susan A Murphy. Batch policy learning in average reward markov decision processes.Annals of statistics, 50(6):3364, 2022

2022
[49]

Springer, 2003

Mark J Laan and James M Robins.Unified methods for censored longitudinal data and causal- ity. Springer, 2003. 78

2003
[50]

Springer, 2013

Bibhas Chakraborty and Erica EM Moodie.Statistical methods for dynamic treatment regimes, volume 2. Springer, 2013

2013
[51]

Adaptive off-policy inference for M- estimators under model misspecification.arXiv preprint arXiv:2509.14218, 2025

James Leiner, Robin Dunn, and Aaditya Ramdas. Adaptive off-policy inference for M- estimators under model misspecification.arXiv preprint arXiv:2509.14218, 2025. URL https://arxiv.org/abs/2509.14218

arXiv 2025
[52]

The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

2000
[53]

A unified switching system perspective and ode analysis of q-learning algorithms.arXiv preprint arXiv:1912.02270, 2019

Donghwan Lee and Niao He. A unified switching system perspective and ode analysis of q-learning algorithms.arXiv preprint arXiv:1912.02270, 2019

arXiv 1912
[54]

A new convergent variant of q-learning with linear function approximation.Advances in Neural Information Processing Systems, 33: 19412–19421, 2020

Diogo Carvalho, Francisco S Melo, and Pedro Santos. A new convergent variant of q-learning with linear function approximation.Advances in Neural Information Processing Systems, 33: 19412–19421, 2020

2020
[55]

The ode method for stochastic approximation and reinforcement learning with markovian noise.Journal of Machine Learning Research, 26(24):1–76, 2025

Shuze Daniel Liu, Shuhang Chen, and Shangtong Zhang. The ode method for stochastic approximation and reinforcement learning with markovian noise.Journal of Machine Learning Research, 26(24):1–76, 2025

2025
[56]

Reinforcement learning under model mismatch

Aurko Roy, Huan Xu, and Sebastian Pokutta. Reinforcement learning under model mismatch. Advances in neural information processing systems, 30, 2017

2017
[57]

Cambridge university press, 2015

Guido W Imbens and Donald B Rubin.Causal inference in statistics, social, and biomedical sciences. Cambridge university press, 2015

2015
[58]

Using least squares to approximate unknown regression functions.Interna- tional economic review, pages 149–170, 1980

Halbert White. Using least squares to approximate unknown regression functions.Interna- tional economic review, pages 149–170, 1980

1980
[59]

Models as approximations I: Consequences illustrated with linear regression.Statistical Science, 34(4):523–544, 2019

Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, and Linda Zhao. Models as approximations I: Consequences illustrated with linear regression.Statistical Science, 34(4):523–544, 2019. doi: 10.1214/18-STS693. URL https://doi.org/10.1214/18-STS693

work page doi:10.1214/18-sts693 2019
[60]

Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in india

Victor Chernozhukov, Mert Demirer, Esther Duflo, and Ivan Fernandez-Val. Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in india. Technical report, National Bureau of Economic Research, 2018

2018
[61]

Online learning in bandits with predicted context

Yongyi Guo, Ziping Xu, and Susan Murphy. Online learning in bandits with predicted context. InInternational Conference on Artificial Intelligence and Statistics, pages 2215–2223. PMLR, 2024

2024
[62]

CRC press, 1995

Raymond J Carroll, David Ruppert, and Leonard A Stefanski.Measurement error in nonlinear models, volume 105. CRC press, 1995

1995
[63]

John Wiley & Sons, 2009

Wayne A Fuller.Measurement error models. John Wiley & Sons, 2009. 79

2009
[64]

Reinforcement learning on dyads to enhance med- ication adherence

Ziping Xu, Hinal Jajal, Sung Won Choi, Inbal Nahum-Shani, Guy Shani, Alexandra M Psiho- gios, Pei-Yao Hung, and Susan A Murphy. Reinforcement learning on dyads to enhance med- ication adherence. InInternational Conference on Artificial Intelligence in Medicine, pages 490–499. Springer, 2025

2025
[65]

Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. InProceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011

2011
[66]

Cambridge university press, 2000

Aad W Van der Vaart.Asymptotic statistics, volume 3. Cambridge university press, 2000

2000
[67]

Reinforcement learning: an introduction mit press

Richard S Sutton and Andrew G Barto. Reinforcement learning: an introduction mit press. Cambridge, MA, 22447:10, 1998

1998
[68]

Finite-time analysis of the multiarmed bandit problem, 2002

P Auer. Finite-time analysis of the multiarmed bandit problem, 2002

2002
[69]

Contextual bandits with linear payoff functions

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011

2011
[70]

Analysis of thompson sampling for the multi-armed bandit problem

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. InConference on learning theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012

2012
[71]

A tutorial on thompson sampling.Foundations and Trends®in Machine Learning, 11(1):1–96, 2018

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling.Foundations and Trends®in Machine Learning, 11(1):1–96, 2018

2018
[72]

Peng Liao, Kristjan Greenewald, Predrag Klasnja, and Susan Murphy. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020

2020
[73]

Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011

Yasin Abbasi-Yadkori, D´ avid P´ al, and Csaba Szepesv´ ari. Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011

2011
[74]

Thompson sampling for contextual bandits with linear payoffs

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. InInternational conference on machine learning, pages 127–135. PMLR, 2013

2013
[75]

Boltzmann exploration done right.Advances in neural information processing systems, 30, 2017

Nicol` o Cesa-Bianchi, Claudio Gentile, G´ abor Lugosi, and Gergely Neu. Boltzmann exploration done right.Advances in neural information processing systems, 30, 2017

2017
[76]

Power constrained bandits

Jiayu Yao, Emma Brunskill, Weiwei Pan, Susan Murphy, and Finale Doshi-Velez. Power constrained bandits. InMachine Learning for Healthcare Conference, pages 209–259. PMLR, 2021

2021
[77]

Targeting for long-term outcomes.Management Science, 70(6):3841–3855, 2024

Jeremy Yang, Dean Eckles, Paramveer Dhillon, and Sinan Aral. Targeting for long-term outcomes.Management Science, 70(6):3841–3855, 2024

2024
[78]

Reinforcement learning: A survey.Journal of artificial intelligence research, 4:237–285, 1996

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey.Journal of artificial intelligence research, 4:237–285, 1996. 80

1996
[79]

Policy gradient methods for reinforcement learning with function approximation.Advances in neural infor- mation processing systems, 12, 1999

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation.Advances in neural infor- mation processing systems, 12, 1999

1999
[80]

Multi-armed bandit algorithms and empirical evalua- tion

Joannes Vermorel and Mehryar Mohri. Multi-armed bandit algorithms and empirical evalua- tion. InEuropean conference on machine learning, pages 437–448. Springer, 2005

2005

Showing first 80 references.

[1] [1]

Inbal Nahum-Shani, Shawna N Smith, Bonnie J Spring, Linda M Collins, Katie Witkiewitz, Ambuj Tewari, and Susan A Murphy. Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support.Annals of behavioral medicine, pages 1–17, 2016

2016

[2] [2]

A systematic review of just-in-time adaptive interventions (jitais) to promote physical activity

Wendy Hardeman, Julie Houghton, Kathleen Lane, Andy Jones, and Felix Naughton. A systematic review of just-in-time adaptive interventions (jitais) to promote physical activity. International Journal of Behavioral Nutrition and Physical Activity, 16(1):31, 2019

2019

[3] [3]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

2010

[4] [4]

Automatic ad format selec- tion via contextual bandits

Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selec- tion via contextual bandits. InProceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1587–1594, 2013

2013

[5] [5]

Scaling up behavioral science interventions in online education.Proceedings of the National Academy of Sciences, 117(26): 14900–14905, 2020

Ren´ e F Kizilcec, Justin Reich, Michael Yeomans, Christoph Dann, Emma Brunskill, Glenn Lopez, Selen Turkay, Joseph Jay Williams, and Dustin Tingley. Scaling up behavioral science interventions in online education.Proceedings of the National Academy of Sciences, 117(26): 14900–14905, 2020

2020

[6] [6]

Mining big data in education: Affordances and challenges.Review of research in education, 44(1):130–160, 2020

Christian Fischer, Zachary A Pardos, Ryan Shaun Baker, Joseph Jay Williams, Padhraic Smyth, Renzhe Yu, Stefan Slater, Rachel Baker, and Mark Warschauer. Mining big data in education: Affordances and challenges.Review of research in education, 44(1):130–160, 2020

2020

[7] [7]

Adaptive experimental design: Prospects and applications in political science.American Journal of Political Science, 65(4): 826–844, 2021

Molly Offer-Westort, Alexander Coppock, and Donald P Green. Adaptive experimental design: Prospects and applications in political science.American Journal of Political Science, 65(4): 826–844, 2021

2021

[8] [8]

A mobile health intervention for emerging adults with regular cannabis use: A micro-randomized pilot trial design protocol.Contemporary Clinical Trials, 145:107667, 2024

Lara N Coughlin, Maya Campbell, Tiffany Wheeler, Chavez Rodriguez, Autumn Rae Florim- bio, Susobhan Ghosh, Yongyi Guo, Pei-Yao Hung, Mark W Newman, Huijie Pan, et al. A mobile health intervention for emerging adults with regular cannabis use: A micro-randomized pilot trial design protocol.Contemporary Clinical Trials, 145:107667, 2024

2024

[9] [9]

The impact of using reinforcement learning to personalize communication 75 on medication adherence: findings from the reinforce trial.npj Digital Medicine, 7(1):39, 2024

Julie C Lauffenburger, Elad Yom-Tov, Punam A Keller, Marie E McDonnell, Katherine L Crum, Gauri Bhatkhande, Ellen S Sears, Kaitlin Hanken, Lily G Bessette, Constance P Fontanet, et al. The impact of using reinforcement learning to personalize communication 75 on medication adherence: findings from the reinforce trial.npj Digital Medicine, 7(1):39, 2024

2024

[10] [10]

Optimizing an adaptive digital oral health intervention for promoting oral self-care behaviors: Micro-randomized trial protocol.Contemporary clinical trials, 139:107464, 2024

Inbal Nahum-Shani, Zara M Greer, Anna L Trella, Kelly W Zhang, Stephanie M Carpenter, Dennis Ruenger, David Elashoff, Susan A Murphy, and Vivek Shetty. Optimizing an adaptive digital oral health intervention for promoting oral self-care behaviors: Micro-randomized trial protocol.Contemporary clinical trials, 139:107464, 2024

2024

[11] [11]

Learning to optimize via posterior sampling.Mathe- matics of Operations Research, 39(4):1221–1243, 2014

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling.Mathe- matics of Operations Research, 39(4):1221–1243, 2014

2014

[12] [12]

Estimation consid- erations in contextual bandits.arXiv preprint arXiv:1711.07077, 2017

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Estimation consid- erations in contextual bandits.arXiv preprint arXiv:1711.07077, 2017

Pith/arXiv arXiv 2017

[13] [13]

A deployed online reinforcement learning algorithm in an oral health clinical trial

Anna L Trella, Kelly W Zhang, Hinal Jajal, Inbal Nahum-Shani, Vivek Shetty, Finale Doshi- Velez, and Susan A Murphy. A deployed online reinforcement learning algorithm in an oral health clinical trial. InProceedings of the AAAI Conference on Artificial Intelligence, vol- ume 39, pages 28792–28800, 2025

2025

[14] [14]

Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning.arXiv preprint arXiv:2211.12004, 2022

Susan Athey, Undral Byambadalai, Vitor Hadad, Sanath Kumar Krishnamurthy, Weiwen Le- ung, and Joseph Jay Williams. Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning.arXiv preprint arXiv:2211.12004, 2022

arXiv 2022

[15] [15]

From ads to interventions: Contextual bandits in mobile health

Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. InMobile health: sensors, analytic methods, and applications, pages 495–517. Springer, 2017

2017

[16] [16]

Misspecified linear bandits

Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

2017

[17] [17]

Adapting to mis- specification in contextual bandits.Advances in Neural Information Processing Systems, 33: 11478–11489, 2020

Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to mis- specification in contextual bandits.Advances in Neural Information Processing Systems, 33: 11478–11489, 2020

2020

[18] [18]

Learning with good feature representa- tions in bandits and in rl with a generative model

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representa- tions in bandits and in rl with a generative model. InInternational conference on machine learning, pages 5662–5670. PMLR, 2020

2020

[19] [19]

Tractable contextual bandits beyond realizability

Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey. Tractable contextual bandits beyond realizability. InInternational Conference on Artificial Intelligence and Statistics, pages 1423–1431. PMLR, 2021

2021

[20] [20]

Adapting to misspecification in contextual bandits with offline regression oracles

Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey. Adapting to misspecification in contextual bandits with offline regression oracles. InInternational Conference on Machine Learning, pages 5805–5814. PMLR, 2021

2021

[21] [21]

Accurate inference for adaptive linear models

Yash Deshpande, Lester Mackey, Vasilis Syrgkanis, and Matt Taddy. Accurate inference for adaptive linear models. InInternational Conference on Machine Learning, pages 1194–1203. PMLR, 2018. 76

2018

[22] [22]

Assessing time- varying causal effect moderation in mobile health.Journal of the American Statistical Asso- ciation, 113(523):1112–1121, 2018

Audrey Boruvka, Daniel Almirall, Katie Witkiewitz, and Susan A Murphy. Assessing time- varying causal effect moderation in mobile health.Journal of the American Statistical Asso- ciation, 113(523):1112–1121, 2018

2018

[23] [23]

Estimat- ing time-varying causal excursion effects in mobile health with binary outcomes.Biometrika, 108(3):507–527, 2021

Tianchen Qian, Hyesun Yoo, Predrag Klasnja, Daniel Almirall, and Susan A Murphy. Estimat- ing time-varying causal excursion effects in mobile health with binary outcomes.Biometrika, 108(3):507–527, 2021

2021

[24] [24]

Semi-parametric inference based on adaptively collected data.arXiv preprint arXiv:2303.02534, 2023

Licong Lin, Koulik Khamaru, and Martin J Wainwright. Semi-parametric inference based on adaptively collected data.arXiv preprint arXiv:2303.02534, 2023

arXiv 2023

[25] [25]

Post-episodic reinforcement learning inference.arXiv e-prints, pages arXiv–2302, 2023

Vasilis Syrgkanis and Ruohan Zhan. Post-episodic reinforcement learning inference.arXiv e-prints, pages arXiv–2302, 2023

2023

[26] [26]

On conditional least squares estimation for stochastic processes.The Annals of statistics, pages 629–642, 1978

Lawrence A Klimko and Paul I Nelson. On conditional least squares estimation for stochastic processes.The Annals of statistics, pages 629–642, 1978

1978

[27] [27]

Asymptotic properties of nonlinear least squares estimates in stochastic re- gression models.The Annals of Statistics, pages 1917–1930, 1994

Tze Leung Lai. Asymptotic properties of nonlinear least squares estimates in stochastic re- gression models.The Annals of Statistics, pages 1917–1930, 1994

1917

[28] [28]

Statistical inference with m-estimators on adaptively collected data.Advances in neural information processing systems, 34:7460–7471, 2021

Kelly Zhang, Lucas Janson, and Susan Murphy. Statistical inference with m-estimators on adaptively collected data.Advances in neural information processing systems, 34:7460–7471, 2021

2021

[29] [29]

Statistical inference for online decision making: In a contextual bandit setting.Journal of the American Statistical Association, 116(533):240–255, 2021

Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making: In a contextual bandit setting.Journal of the American Statistical Association, 116(533):240–255, 2021

2021

[30] [30]

Off-policy evaluation via adaptive weighting with data from contextual bandits

Ruohan Zhan, Vitor Hadad, David A Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2125–2135, 2021

2021

[31] [31]

Post-contextual-bandit inference.Advances in neural information processing sys- tems, 34:28548–28559, 2021

Aur´ elien Bibaut, Maria Dimakopoulou, Nathan Kallus, Antoine Chambaz, and Mark van Der Laan. Post-contextual-bandit inference.Advances in neural information processing sys- tems, 34:28548–28559, 2021

2021

[32] [32]

Statistical inference after adaptive sampling for longitudinal data.arXiv preprint arXiv:2202.07098, 2022

Kelly W Zhang, Lucas Janson, and Susan A Murphy. Statistical inference after adaptive sampling for longitudinal data.arXiv preprint arXiv:2202.07098, 2022

arXiv 2022

[33] [33]

Replicable bandits for digital health interventions.arXiv preprint arXiv:2407.15377, 2024

Kelly W Zhang, Nowell Closser, Anna L Trella, and Susan A Murphy. Replicable bandits for digital health interventions.arXiv preprint arXiv:2407.15377, 2024

arXiv 2024

[34] [34]

Inference with the upper confidence bound algorithm

Koulik Khamaru and Cun-Hui Zhang. Inference with the upper confidence bound algorithm. arXiv preprint arXiv:2408.04595, 2024

arXiv 2024

[35] [35]

Stable thompson sampling: Valid inference via variance inflation.arXiv preprint arXiv:2505.23260, 2025

Budhaditya Halder, Shubhayan Pan, and Koulik Khamaru. Stable thompson sampling: Valid inference via variance inflation.arXiv preprint arXiv:2505.23260, 2025. 77

arXiv 2025

[36] [36]

Why adaptively collected data have negative bias and how to correct for it

Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. InInternational Conference on Artificial Intelligence and Statistics, pages 1261–1269. PMLR, 2018

2018

[37] [37]

Inference for batched bandits.Advances in neural information processing systems, 33:9818–9829, 2020

Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits.Advances in neural information processing systems, 33:9818–9829, 2020

2020

[38] [38]

Strong consistency of least squares estimates in dynamic models.The annals of Statistics, 7(3):484–489, 1979

TW Anderson and John B Taylor. Strong consistency of least squares estimates in dynamic models.The annals of Statistics, 7(3):484–489, 1979

1979

[39] [39]

Strong consistency of least squares estimators in linear regression models.The Annals of Statistics, 8(4):778–788, 1980

Norbert Christopeit and Kurt Helmes. Strong consistency of least squares estimators in linear regression models.The Annals of Statistics, 8(4):778–788, 1980

1980

[40] [40]

Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems.The Annals of Statistics, pages 154–166, 1982

Tze Leung Lai and Ching Zong Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems.The Annals of Statistics, pages 154–166, 1982

1982

[41] [41]

Strong consistency of maximum quasi-likelihood esti- mators in generalized linear models with fixed and adaptive designs.The Annals of Statistics, 27(4):1155–1163, 1999

Kani Chen, Inchi Hu, and Zhiliang Ying. Strong consistency of maximum quasi-likelihood esti- mators in generalized linear models with fixed and adaptive designs.The Annals of Statistics, 27(4):1155–1163, 1999

1999

[42] [42]

Near-optimal inference in adaptive linear regression.arXiv preprint arXiv:2107.02266, 2021

Koulik Khamaru, Yash Deshpande, Tor Lattimore, Lester Mackey, and Martin J Wainwright. Near-optimal inference in adaptive linear regression.arXiv preprint arXiv:2107.02266, 2021

arXiv 2021

[43] [43]

Statistical inference for online decision making via stochastic gradient descent.Journal of the American Statistical Association, 116(534):708–719, 2021

Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making via stochastic gradient descent.Journal of the American Statistical Association, 116(534):708–719, 2021

2021

[44] [44]

Ucb algorithms for multi-armed bandits: Precise regret and adaptive inference.arXiv preprint arXiv:2412.06126, 2024

Qiyang Han, Koulik Khamaru, and Cun-Hui Zhang. Ucb algorithms for multi-armed bandits: Precise regret and adaptive inference.arXiv preprint arXiv:2412.06126, 2024. URLhttps: //arxiv.org/abs/2412.06126

arXiv 2024

[45] [45]

Confidence intervals for policy evaluation in adaptive experiments.Proceedings of the national academy of sciences, 118(15):e2014602118, 2021

Vitor Hadad, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments.Proceedings of the national academy of sciences, 118(15):e2014602118, 2021

2021

[46] [46]

Anytime-valid off-policy inference for contextual bandits.ACM/IMS Journal of Data Sci- ence, 1(3):1–42, 2024

Ian Waudby-Smith, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, and Paul Mineiro. Anytime-valid off-policy inference for contextual bandits.ACM/IMS Journal of Data Sci- ence, 1(3):1–42, 2024

2024

[47] [47]

Off-policy estimation of long-term average outcomes with applications to mobile health.Journal of the American Statistical Association, 116(533):382–391, 2021

Peng Liao, Predrag Klasnja, and Susan Murphy. Off-policy estimation of long-term average outcomes with applications to mobile health.Journal of the American Statistical Association, 116(533):382–391, 2021

2021

[48] [48]

Batch policy learning in average reward markov decision processes.Annals of statistics, 50(6):3364, 2022

Peng Liao, Zhengling Qi, Runzhe Wan, Predrag Klasnja, and Susan A Murphy. Batch policy learning in average reward markov decision processes.Annals of statistics, 50(6):3364, 2022

2022

[49] [49]

Springer, 2003

Mark J Laan and James M Robins.Unified methods for censored longitudinal data and causal- ity. Springer, 2003. 78

2003

[50] [50]

Springer, 2013

Bibhas Chakraborty and Erica EM Moodie.Statistical methods for dynamic treatment regimes, volume 2. Springer, 2013

2013

[51] [51]

Adaptive off-policy inference for M- estimators under model misspecification.arXiv preprint arXiv:2509.14218, 2025

James Leiner, Robin Dunn, and Aaditya Ramdas. Adaptive off-policy inference for M- estimators under model misspecification.arXiv preprint arXiv:2509.14218, 2025. URL https://arxiv.org/abs/2509.14218

arXiv 2025

[52] [52]

The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

2000

[53] [53]

A unified switching system perspective and ode analysis of q-learning algorithms.arXiv preprint arXiv:1912.02270, 2019

Donghwan Lee and Niao He. A unified switching system perspective and ode analysis of q-learning algorithms.arXiv preprint arXiv:1912.02270, 2019

arXiv 1912

[54] [54]

A new convergent variant of q-learning with linear function approximation.Advances in Neural Information Processing Systems, 33: 19412–19421, 2020

Diogo Carvalho, Francisco S Melo, and Pedro Santos. A new convergent variant of q-learning with linear function approximation.Advances in Neural Information Processing Systems, 33: 19412–19421, 2020

2020

[55] [55]

The ode method for stochastic approximation and reinforcement learning with markovian noise.Journal of Machine Learning Research, 26(24):1–76, 2025

Shuze Daniel Liu, Shuhang Chen, and Shangtong Zhang. The ode method for stochastic approximation and reinforcement learning with markovian noise.Journal of Machine Learning Research, 26(24):1–76, 2025

2025

[56] [56]

Reinforcement learning under model mismatch

Aurko Roy, Huan Xu, and Sebastian Pokutta. Reinforcement learning under model mismatch. Advances in neural information processing systems, 30, 2017

2017

[57] [57]

Cambridge university press, 2015

Guido W Imbens and Donald B Rubin.Causal inference in statistics, social, and biomedical sciences. Cambridge university press, 2015

2015

[58] [58]

Using least squares to approximate unknown regression functions.Interna- tional economic review, pages 149–170, 1980

Halbert White. Using least squares to approximate unknown regression functions.Interna- tional economic review, pages 149–170, 1980

1980

[59] [59]

Models as approximations I: Consequences illustrated with linear regression.Statistical Science, 34(4):523–544, 2019

Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, and Linda Zhao. Models as approximations I: Consequences illustrated with linear regression.Statistical Science, 34(4):523–544, 2019. doi: 10.1214/18-STS693. URL https://doi.org/10.1214/18-STS693

work page doi:10.1214/18-sts693 2019

[60] [60]

Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in india

Victor Chernozhukov, Mert Demirer, Esther Duflo, and Ivan Fernandez-Val. Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in india. Technical report, National Bureau of Economic Research, 2018

2018

[61] [61]

Online learning in bandits with predicted context

Yongyi Guo, Ziping Xu, and Susan Murphy. Online learning in bandits with predicted context. InInternational Conference on Artificial Intelligence and Statistics, pages 2215–2223. PMLR, 2024

2024

[62] [62]

CRC press, 1995

Raymond J Carroll, David Ruppert, and Leonard A Stefanski.Measurement error in nonlinear models, volume 105. CRC press, 1995

1995

[63] [63]

John Wiley & Sons, 2009

Wayne A Fuller.Measurement error models. John Wiley & Sons, 2009. 79

2009

[64] [64]

Reinforcement learning on dyads to enhance med- ication adherence

Ziping Xu, Hinal Jajal, Sung Won Choi, Inbal Nahum-Shani, Guy Shani, Alexandra M Psiho- gios, Pei-Yao Hung, and Susan A Murphy. Reinforcement learning on dyads to enhance med- ication adherence. InInternational Conference on Artificial Intelligence in Medicine, pages 490–499. Springer, 2025

2025

[65] [65]

Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. InProceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011

2011

[66] [66]

Cambridge university press, 2000

Aad W Van der Vaart.Asymptotic statistics, volume 3. Cambridge university press, 2000

2000

[67] [67]

Reinforcement learning: an introduction mit press

Richard S Sutton and Andrew G Barto. Reinforcement learning: an introduction mit press. Cambridge, MA, 22447:10, 1998

1998

[68] [68]

Finite-time analysis of the multiarmed bandit problem, 2002

P Auer. Finite-time analysis of the multiarmed bandit problem, 2002

2002

[69] [69]

Contextual bandits with linear payoff functions

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011

2011

[70] [70]

Analysis of thompson sampling for the multi-armed bandit problem

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. InConference on learning theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012

2012

[71] [71]

A tutorial on thompson sampling.Foundations and Trends®in Machine Learning, 11(1):1–96, 2018

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling.Foundations and Trends®in Machine Learning, 11(1):1–96, 2018

2018

[72] [72]

Peng Liao, Kristjan Greenewald, Predrag Klasnja, and Susan Murphy. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020

2020

[73] [73]

Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011

Yasin Abbasi-Yadkori, D´ avid P´ al, and Csaba Szepesv´ ari. Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011

2011

[74] [74]

Thompson sampling for contextual bandits with linear payoffs

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. InInternational conference on machine learning, pages 127–135. PMLR, 2013

2013

[75] [75]

Boltzmann exploration done right.Advances in neural information processing systems, 30, 2017

Nicol` o Cesa-Bianchi, Claudio Gentile, G´ abor Lugosi, and Gergely Neu. Boltzmann exploration done right.Advances in neural information processing systems, 30, 2017

2017

[76] [76]

Power constrained bandits

Jiayu Yao, Emma Brunskill, Weiwei Pan, Susan Murphy, and Finale Doshi-Velez. Power constrained bandits. InMachine Learning for Healthcare Conference, pages 209–259. PMLR, 2021

2021

[77] [77]

Targeting for long-term outcomes.Management Science, 70(6):3841–3855, 2024

Jeremy Yang, Dean Eckles, Paramveer Dhillon, and Sinan Aral. Targeting for long-term outcomes.Management Science, 70(6):3841–3855, 2024

2024

[78] [78]

Reinforcement learning: A survey.Journal of artificial intelligence research, 4:237–285, 1996

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey.Journal of artificial intelligence research, 4:237–285, 1996. 80

1996

[79] [79]

Policy gradient methods for reinforcement learning with function approximation.Advances in neural infor- mation processing systems, 12, 1999

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation.Advances in neural infor- mation processing systems, 12, 1999

1999

[80] [80]

Multi-armed bandit algorithms and empirical evalua- tion

Joannes Vermorel and Mehryar Mohri. Multi-armed bandit algorithms and empirical evalua- tion. InEuropean conference on machine learning, pages 437–448. Springer, 2005

2005