A Tale of Two Cities: Pessimism and Opportunism in Offline Dynamic Pricing

Lan Wang; Zeyu Bian; Zhengling Qi

arXiv: 2411.08126 · v2 · pith:2FYNN2G2new · submitted 2024-11-12 · stat.ML · cs.LG

Pith reviewed 2026-05-23 17:35 UTC · model grok-4.3

keywords: offline dynamic pricing · partial identification · demand monotonicity · pessimistic policy · opportunistic policy · regret bounds · no-coverage setting

The pith

When historical pricing data leaves some prices unobserved, including the optimum, a monotonicity-based partial identification framework produces pessimistic and opportunistic policies with finite-sample regret bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a nonparametric framework for offline dynamic pricing that uses the fact that demand falls as price rises to place bounds on the revenue of prices never seen in the data. From this framework it derives two sequential decision rules: one that selects prices to maximize the worst-case revenue guarantee, and one that selects prices to minimize the worst-case regret relative to the best possible policy. Finite-sample regret bounds are proved for both rules; the bounds match the usual offline rate when the optimal price appears in the data and add an explicit penalty term when it does not. Algorithms implementing the rules are given and shown to outperform standard offline reinforcement-learning baselines on simulated and airline-ticket data.

Core claim

The central claim is that the monotonicity of demand supplies enough structure to partially identify the value of unobserved prices. This allows the construction of two dynamic policies, one pessimistic and one opportunistic, whose finite-sample regret can be bounded even in a sequential no-coverage environment. The bounds recover the standard rate whenever the optimal price is observed.

What carries the argument

Nonparametric partial identification framework that exploits monotonicity of demand in price to bound the value of unobserved prices.
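
As a minimal one-period sketch of how monotonicity can yield such bounds, consider the display below. The notation is illustrative and assumed, not taken from the paper: $D(p)$ is mean demand at price $p \ge 0$, $\mathcal{P}_{\mathrm{obs}}$ is the set of prices seen in the data, and demand at observed prices is treated as known exactly. An unobserved price $p$ then inherits bounds from its nearest observed neighbors:

```latex
\underline{D}(p) \;=\; \max_{p' \in \mathcal{P}_{\mathrm{obs}},\; p' \ge p} D(p'),
\qquad
\overline{D}(p) \;=\; \min_{p' \in \mathcal{P}_{\mathrm{obs}},\; p' \le p} D(p'),
\qquad
p\,\underline{D}(p) \;\le\; p\,D(p) \;\le\; p\,\overline{D}(p).
```

Because $D$ is non-increasing, the maximum is attained at the smallest observed price above $p$ and the minimum at the largest observed price below $p$. When no observed price lies on one side, that bound has to come from outside information, for example $D \ge 0$ or an assumed cap on demand. The paper's framework is dynamic and works with estimated quantities, so this static display conveys only the idea.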

If this is right

The pessimistic policy delivers a revenue guarantee that protects against the worst possible completion of the unobserved prices.
The opportunistic policy delivers a regret bound that limits the loss relative to the best feasible policy even when the optimum is missing.
Both bounds recover the usual offline rate when the optimal price is covered and add a quantifiable extra term when it is not.
Efficient algorithms exist that implement the two policies and outperform standard offline RL baselines in no-coverage regimes.
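
The difference between the two rules can be sketched in a static, single-period toy setting. The code below is an illustration under stated assumptions, not the paper's algorithm: the function names, the demand cap `d_max`, and the example numbers are hypothetical, mean demand at observed prices is treated as exact, and the worst-case regret is computed with a conservative relaxation that treats each price's revenue interval independently.

```python
from bisect import bisect_left, bisect_right


def demand_bounds(price, observed, d_max):
    """Bound mean demand at `price` using monotonicity.

    `observed` maps each observed price to its mean demand and is assumed
    to be non-increasing in price. `d_max` is an assumed cap on demand,
    used only when no observed price lies at or below `price`.
    """
    prices = sorted(observed)
    # Largest observed price <= price gives the upper bound on demand.
    i = bisect_right(prices, price)
    upper = observed[prices[i - 1]] if i > 0 else d_max
    # Smallest observed price >= price gives the lower bound on demand.
    j = bisect_left(prices, price)
    lower = observed[prices[j]] if j < len(prices) else 0.0
    return lower, upper


def revenue_bounds(candidates, observed, d_max):
    """Map each candidate price to (lower, upper) bounds on revenue."""
    return {
        p: tuple(p * d for d in demand_bounds(p, observed, d_max))
        for p in candidates
    }


def pessimistic_price(bounds):
    """Pick the price with the largest guaranteed (lower-bound) revenue."""
    return max(bounds, key=lambda p: bounds[p][0])


def opportunistic_price(bounds):
    """Pick the price with the smallest worst-case regret.

    Relaxation: the rival price attains its upper bound while the chosen
    price attains its lower bound, ignoring joint monotonicity constraints.
    Requires at least two candidate prices.
    """
    def worst_regret(p):
        best_other = max(bounds[q][1] for q in bounds if q != p)
        return max(0.0, best_other - bounds[p][0])

    return min(bounds, key=worst_regret)


# Hypothetical data: prices 4 and 10 were observed, 6 and 8 were not.
observed = {4.0: 10.0, 10.0: 6.0}
bounds = revenue_bounds([4.0, 6.0, 8.0, 10.0], observed, d_max=12.0)
print(pessimistic_price(bounds), opportunistic_price(bounds))
```

In this toy example the pessimistic rule stays at the observed price 10.0, whose revenue of 60 is guaranteed, while the opportunistic rule moves to the unobserved price 8.0, whose revenue interval [48, 80] leaves the smallest worst-case regret.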

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The explicit mapping from a firm’s risk posture to policy choice could be applied to other sequential decision settings that face partial data coverage.
Replacing monotonicity with other shape restrictions on demand would produce analogous frameworks for different economic environments.
The same bounding technique might be used to derive policies in offline inventory or assortment problems where certain actions are missing from historical data.

Load-bearing premise

Demand is monotonically decreasing in price.

What would settle it

A data set in which the observed demand function is not monotonically decreasing and the derived policies produce regret larger than the stated bounds.
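
A first diagnostic for the monotonicity half of this test can be sketched as follows. This is a hypothetical helper, not something the paper provides: it compares sample mean demand at adjacent observed prices and reports pairs where demand rises with price. Sample means are noisy, so a flagged pair is a prompt for closer inspection rather than proof of a violation; the tolerance `tol` is an assumed knob for that noise.

```python
from collections import defaultdict


def monotonicity_violations(records, tol=0.0):
    """Return adjacent price pairs whose mean demand increases with price.

    `records` is an iterable of (price, demand) observations. Each result
    is (lower_price, higher_price, mean_at_lower, mean_at_higher), reported
    when mean_at_higher exceeds mean_at_lower by more than `tol`.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for price, demand in records:
        sums[price] += demand
        counts[price] += 1

    prices = sorted(sums)
    means = [sums[p] / counts[p] for p in prices]

    # Checking adjacent pairs suffices: if every adjacent pair is
    # non-increasing, the whole sequence of means is non-increasing.
    return [
        (lo, hi, m_lo, m_hi)
        for lo, hi, m_lo, m_hi in zip(prices, prices[1:], means, means[1:])
        if m_hi > m_lo + tol
    ]


# Hypothetical observations: mean demand is 11 at price 5, 8 at 7, 9 at 9.
records = [(5.0, 10), (5.0, 12), (7.0, 8), (9.0, 9)]
print(monotonicity_violations(records))
```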

Figures

Figures reproduced from arXiv: 2411.08126 by Lan Wang, Zeyu Bian, Zhengling Qi.

**Figure 2.** Value functions for the optimal policy and three types of suboptimal policies: Type …

**Figure 3.** Comparison of empirical value functions among vanilla pessimistic, refined pes…

Original abstract

We study offline dynamic pricing when historical data provide incomplete coverage of the price space such that some candidate prices, including the optimal one, may be entirely unobserved. This setting is common in practice and is especially difficult in dynamic environments. Existing offline reinforcement learning methods typically rely on full or partial coverage and can therefore perform poorly in such settings. We develop a nonparametric partial identification framework for offline dynamic pricing that exploits the monotonicity of demand in price to bound the value of unobserved prices. Within this framework, we formulate two dynamic decision rules: a pessimistic policy that maximizes worst-case revenue and an opportunistic policy that minimizes worst-case regret. These rules are tailored to a sequential no-coverage environment and are not direct extensions of existing pessimistic offline RL or static opportunistic approaches. We establish finite-sample regret bounds for both policies, recovering the standard rate when the optimal price is covered and quantifying the additional cost when it is not. We also develop efficient algorithms and show, through simulations and an airline ticket application, that our methods outperform standard offline RL baselines in no-coverage settings. Managerially, the framework provides a practical mapping from a firm's risk posture to its pricing policy: firms seeking revenue stability and downside protection should prefer the pessimistic policy, whereas firms willing to bear measured risk for potential gains from underexplored prices should prefer the opportunistic policy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives finite-sample regret bounds for offline dynamic pricing when some prices including the optimum are missing from the data, by bounding them with demand monotonicity and then running either a pessimistic or opportunistic policy.

The main point is a nonparametric partial-identification setup that uses the standard monotonicity of demand to produce interval bounds on the value of unobserved prices, then optimizes a worst-case revenue policy or a worst-case regret policy on those intervals. The regret analysis splits into covered and uncovered cases and recovers the usual rate when the optimum is seen while adding an explicit extra term otherwise. The two policies are written for the sequential setting rather than lifted from static or fully covered offline RL work, and the simulations plus the airline ticket example show they beat the usual baselines when coverage is incomplete. The risk-posture mapping is also a clear practical output: one policy for downside protection, the other for measured upside from unexplored prices.

The argument looks internally consistent on the high-level description; the monotonicity assumption is explicit and the recursion does not appear to break. The main soft spot is that everything rests on that monotonicity holding exactly, so the bounds can degrade if it is violated even mildly. The extra regret term is stated to be quantified, but without the full proof it is hard to judge how sharp it is in finite samples. The empirical section is only summarized, so the data-exclusion rules and hyper-parameter choices are not visible yet.

This is aimed at people working on revenue management or offline RL for pricing problems with partial coverage. A reader who already knows the standard pessimistic RL literature will see the incremental step clearly. It is worth sending to a serious referee because the central claim is motivated by a real gap and the construction does not contain obvious circularity or hidden assumptions beyond what is stated.

Referee Report

2 major / 2 minor

Summary. The paper develops a nonparametric partial identification framework for offline dynamic pricing under incomplete price coverage. Exploiting demand monotonicity to produce interval bounds on unobserved prices, it defines a pessimistic policy (maximizing worst-case revenue) and an opportunistic policy (minimizing worst-case regret). Finite-sample regret bounds are derived for both, recovering the standard rate when the optimal price is covered and quantifying the extra cost otherwise; efficient algorithms are provided and the methods are tested on simulations plus an airline-ticket application, where they outperform standard offline RL baselines.

Significance. If the regret bounds hold, the work supplies the first finite-sample guarantees for dynamic pricing in sequential no-coverage regimes and gives a direct mapping from a firm’s risk posture to policy choice. The nonparametric use of monotonicity, the clean split between covered and uncovered cases, and the reproducible simulation results constitute concrete strengths.

major comments (2)

[§4] §4 (regret analysis): the finite-sample bound for the opportunistic policy when the optimum is uncovered is stated to be O(√(T log T) + extra term); the extra term’s dependence on the width of the partial-identification interval must be shown explicitly (e.g., via the length of the demand interval at the unobserved price) to confirm it is not an artifact of the proof technique.
[§3.2] §3.2 (policy definitions): the opportunistic policy minimizes worst-case regret over the identified set; it is not immediate that this coincides with the static opportunistic rule of the literature, yet the text claims the rules are “not direct extensions.” An explicit side-by-side derivation or counter-example showing the difference in the dynamic recursion is needed.

minor comments (2)

[§5] The simulation section should report the precise data-exclusion rule used to create the no-coverage regime and the number of Monte-Carlo replications.
[§2] Notation for the identified demand interval (e.g., [D̲(p), D̄(p)]) should be introduced once and used consistently; several passages still write “bounds on demand” without the interval symbols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, the positive assessment of the contribution, and the recommendation for minor revision. We address each major comment below and will incorporate the suggested clarifications in the revised manuscript.

Referee: [§4] §4 (regret analysis): the finite-sample bound for the opportunistic policy when the optimum is uncovered is stated to be O(√(T log T) + extra term); the extra term’s dependence on the width of the partial-identification interval must be shown explicitly (e.g., via the length of the demand interval at the unobserved price) to confirm it is not an artifact of the proof technique.

Authors: We agree that an explicit dependence would strengthen the result. Re-inspecting the proof of Theorem 4.2, the extra term arises directly from the diameter of the demand interval at the unobserved price (via the partial-identification bounds on the value function). In the revision we will add a short corollary that isolates this dependence, expressing the additive term as a function of the interval length at the optimal price. This confirms the term is intrinsic to the partial-identification setting rather than an artifact. revision: yes

Referee: [§3.2] §3.2 (policy definitions): the opportunistic policy minimizes worst-case regret over the identified set; it is not immediate that this coincides with the static opportunistic rule of the literature, yet the text claims the rules are “not direct extensions.” An explicit side-by-side derivation or counter-example showing the difference in the dynamic recursion is needed.

Authors: We will supply the requested comparison. The dynamic opportunistic policy differs from the static rule because the identified set and the worst-case regret are updated recursively across periods; the static rule treats each period independently. In the revision we will add an appendix subsection containing (i) a side-by-side derivation of the two Bellman operators and (ii) a two-period counter-example in which the dynamic policy selects a different first-period action than the static rule when coverage is incomplete. revision: yes
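
To see why an interval-width term of the kind discussed in the first response is natural, consider a static, noise-free simplification. This is an illustration under assumed notation, not the paper's Theorem 4.2. Let $V(p) = p\,D(p)$ be the true revenue of price $p$, let $[\underline{D}(p), \overline{D}(p)]$ be the identified demand interval, and write $L(p) = p\,\underline{D}(p)$ and $U(p) = p\,\overline{D}(p)$ for the induced revenue bounds. Let $p^*$ maximize $V$ and let $\hat p$ maximize the lower bound $L$:

```latex
V(p^*) - V(\hat p)
\;\le\; U(p^*) - L(\hat p)
\;\le\; U(p^*) - L(p^*)
\;=\; p^*\bigl(\overline{D}(p^*) - \underline{D}(p^*)\bigr).
```

The first step uses $V(p^*) \le U(p^*)$ and $V(\hat p) \ge L(\hat p)$; the second uses $L(\hat p) \ge L(p^*)$, which holds by the choice of $\hat p$. If $p^*$ is observed, the interval collapses and the bound is zero; otherwise the loss is at most the width of the revenue interval at $p^*$. A finite-sample, multi-period statement would add estimation error and propagate the intervals through the recursion, which this sketch does not attempt.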

Circularity Check

0 steps flagged

No significant circularity identified

The paper develops a nonparametric partial-identification framework that invokes the standard monotonicity of demand in price to produce interval bounds on unobserved prices; the pessimistic and opportunistic policies are then defined directly on those intervals, and finite-sample regret bounds are obtained by case analysis on coverage of the optimal price. No equation or claim reduces by construction to a fitted parameter, self-citation, or renamed input; the monotonicity assumption is external and the regret derivation splits cleanly into covered/uncovered regimes without circular dependence on the policies themselves. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of monotonic demand to enable partial identification when coverage is absent; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: demand is monotonic (non-increasing) in price.
Invoked to bound the revenue of unobserved prices in the partial identification framework.
