Learning to Bid in Repeated Second-Price Auctions with Dynamic Values and Aggregated Feedback

Benjamin Heymann; Otmane Sakhi

arxiv: 2605.28133 · v1 · pith:CYGY2OPInew · submitted 2026-05-27 · 💻 cs.LG · stat.ML

Learning to Bid in Repeated Second-Price Auctions with Dynamic Values and Aggregated Feedback

Benjamin Heymann , Otmane Sakhi This is my paper

Pith reviewed 2026-06-29 14:25 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords online learningsecond-price auctionsdynamic valuesregret boundsaggregated feedbackconfidence boundsbidding strategies

0 comments

The pith

A confidence-bound algorithm learns the optimal bidding policy with near-optimal regret using only aggregated feedback in repeated second-price auctions with dynamic values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a bidder in repeated second-price auctions whose value for winning depends on the time elapsed since the last successful bid. Auctions occur in continuous time and the bidder sees only aggregated feedback at the horizon's end. The bidder must trade off immediate wins against their effect on future values while learning unknown parameters. The authors derive regret bounds for methods that combine plug-in estimators with a differential-equation characterization of the optimal policy. A specific confidence-bound algorithm attains ilde O(log N) regret for piecewise-linear primitives and ilde O(N^{1/3}) regret for general smooth primitives without explicit randomization.

Core claim

The paper claims that a confidence-bound algorithm using plug-in estimators for unknown parameters together with the differential-equation characterization of the optimal bidding policy learns the optimal policy with regret ilde O(log N) when the primitives are piecewise linear and ilde O(N^{1/3}) when they are general and smooth.

What carries the argument

The differential-equation characterization of the optimal policy combined with plug-in estimators from aggregated feedback, embedded in a confidence-bound algorithm.

If this is right

The algorithm balances the immediate benefit of winning against future value impact while learning from aggregated feedback alone.
The same regret rates hold for both piecewise-linear and smooth value functions.
No explicit randomization is required to achieve the bounds.
Numerical experiments confirm the theoretical rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar plug-in plus differential-equation methods might apply to other dynamic bidding or resource allocation problems with delayed feedback.
The absence of randomization could simplify deployment in practical bidding systems.
The approach may connect to other online learning settings where optimal policies satisfy differential equations.

Load-bearing premise

The bidder's value is a known function of time since the last successful bid with unknown parameters, and the optimal policy admits a differential-equation characterization.

What would settle it

An experiment or calculation in which the observed regret for the described primitives exceeds the stated ilde O(log N) or ilde O(N^{1/3}) bounds would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.28133 by Benjamin Heymann, Otmane Sakhi.

**Figure 1.** Figure 1: Regret as a function of N. In this section, we design a simple experimental setting to validate our algorithms. We consider a single environment with k(t) = 1 − exp(−θt) and q(v) = v α, where α = 2 and θ = 0.1. We use a family of piecewise linear functions with 10 uniform breakpoints as an approximation. We set µ = 0.5 and γ = 0.1, and assume that these parameters are known. We evaluate the no-regret algor… view at source ↗

read the original abstract

We study the problem of learning to bid when the bidder's value is dynamic, i.e., when the current value depends on past outcomes. Specifically, we consider a bidder participating in repeated second-price auctions whose value depends on the time elapsed since their last successful bid, with auctions arriving in continuous time and only aggregated feedback revealed at the end of the horizon. Such a bidder must (1) balance the immediate benefit of winning the current auction against its impact on future values and (2) learn unknown environmental parameters. We derive regret bounds for a class of learning methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, and show that a specific confidence bound algorithm learns the optimal policy with a near optimal regret of $\widetilde{O}(\log N)$ for piecewise linear primitives, and $\widetilde{O}(N^{1/3})$ for general, smooth primitives, achieving these regrets without explicit randomization. These theoretical results are supported by numerical experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper models repeated second-price bidding with values that reset based on time since last win plus only end-of-horizon aggregates, then claims Õ(log N) and Õ(N^{1/3}) regret via DE characterization plus plug-in estimation.

read the letter

The core contribution is a new repeated-auction setting where the bidder's value is a function of time elapsed since the last win, arrivals are continuous-time, and feedback arrives only as aggregates at the horizon end. They characterize the optimal policy through a differential equation and combine it with plug-in parameter estimates to get explicit regret rates without needing explicit randomization.

What works is the modeling. Prior repeated-auction work usually assumes stationary values or per-round feedback; tying value to time-since-win and restricting to aggregates captures some practical bidding environments more closely. The claimed rates (log N for piecewise-linear primitives, N^{1/3} for smooth ones) are concrete and the numerical experiments are mentioned as support.

The soft spot is the stability of the DE-to-policy map under plug-in estimation from aggregates. The regret bounds require that estimation error from the limited feedback translates into policy error at the right rate. If the DE solution is not uniformly Lipschitz in the parameters, or if the aggregates do not yield consistent estimators fast enough, the log-N claim in particular could slip. The abstract gives no detail on boundary conditions, numerical DE solving, or identifiability under aggregation, so this needs checking in the proofs.

This is for people already working on online learning in auctions with partial feedback. It is specific enough and has enough formal claims to deserve a serious referee, though the referee will likely focus on the error propagation step between the DE and the estimators.

Referee Report

2 major / 2 minor

Summary. The paper studies learning to bid in repeated second-price auctions where the bidder's value depends on time elapsed since the last successful bid. Auctions arrive in continuous time with only aggregated feedback at the end of the horizon. The authors derive regret bounds for methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, claiming Õ(log N) regret for piecewise-linear primitives and Õ(N^{1/3}) for general smooth primitives, achieved without explicit randomization. Numerical experiments are provided to support the results.

Significance. If the central claims hold, the work would advance online learning for auctions with dynamic values and limited feedback by showing how a DE-based policy characterization can be paired with plug-in estimation from aggregates to obtain near-optimal regret rates without randomization. This addresses a practically relevant setting where explicit randomization may be undesirable.

major comments (2)

[main theorems / §4-5 (DE characterization and plug-in analysis)] The central regret claims (abstract and main theorems) rest on the DE-to-policy map being uniformly Lipschitz in the unknown parameters so that plug-in estimates from aggregated feedback yield the stated rates. No explicit lemma or proposition is referenced establishing this Lipschitz property or the required parametric convergence rate of the estimators under aggregation; without it the Õ(log N) bound for piecewise-linear primitives is not guaranteed.
[§3 (model and DE characterization)] The abstract states that the optimal policy admits a differential-equation characterization that combines with plug-in estimators, but boundary conditions, identifiability under end-of-horizon aggregation, and numerical stability of the DE solution are not detailed. These are load-bearing for controlling sub-optimality of the resulting policy and thus for both regret bounds.

minor comments (2)

[§2] Notation for the primitives (piecewise-linear vs. smooth) and the exact form of the aggregated feedback should be clarified in the model section to avoid ambiguity when reading the regret statements.
[experiments] The numerical experiments section would benefit from reporting the precise parameter values used for the DE solver and the number of independent runs to allow reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [main theorems / §4-5 (DE characterization and plug-in analysis)] The central regret claims (abstract and main theorems) rest on the DE-to-policy map being uniformly Lipschitz in the unknown parameters so that plug-in estimates from aggregated feedback yield the stated rates. No explicit lemma or proposition is referenced establishing this Lipschitz property or the required parametric convergence rate of the estimators under aggregation; without it the Õ(log N) bound for piecewise-linear primitives is not guaranteed.

Authors: We agree that the connection between the Lipschitz property of the DE-to-policy map and the plug-in estimator convergence rate should be stated more explicitly in the main text. Lemma 4.2 in the appendix establishes uniform Lipschitz continuity of the map with respect to the parameters, and Proposition 3.3 provides the parametric convergence rate under end-of-horizon aggregation. We will add forward references to these results in the statements of Theorems 4.1 and 5.1, along with a short paragraph in Section 4 summarizing how the Lipschitz constant enters the regret analysis. This clarification does not change the claims but improves readability. revision: partial
Referee: [§3 (model and DE characterization)] The abstract states that the optimal policy admits a differential-equation characterization that combines with plug-in estimators, but boundary conditions, identifiability under end-of-horizon aggregation, and numerical stability of the DE solution are not detailed. These are load-bearing for controlling sub-optimality of the resulting policy and thus for both regret bounds.

Authors: We acknowledge that a more self-contained treatment of these aspects in the main body would strengthen the exposition. Boundary conditions appear in Equation (3), identifiability is formalized in Assumption 2.1, and numerical stability is ensured by the uniform discretization scheme described in Section 6. We will expand Section 3 with a new subsection that collects these elements, includes a short proof sketch for identifiability from aggregated observations, and states the stability bound used in the regret analysis. These additions address the referee's concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation combines external DE characterization with plug-in estimation

full rationale

The abstract states that regret bounds are derived for methods combining plug-in estimators with a differential-equation characterization of the optimal policy. No load-bearing step reduces by construction to a fitted quantity or self-citation; the DE characterization is presented as an independent modeling step whose solution is then estimated from aggregates. The provided text contains no equations or claims that equate the regret bound to an input fit or rename a known result. This matches the default expectation of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard assumptions from online learning and continuous-time optimal control (existence of an optimal policy via differential equations, learnability of parameters from aggregated feedback) plus domain assumptions on the value function primitives; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption The optimal bidding policy admits a differential-equation characterization that remains valid under the dynamic value model.
Invoked to justify combining plug-in estimators with the DE characterization.
domain assumption Environmental parameters are identifiable from aggregated end-of-horizon feedback.
Required for the plug-in estimator to converge to the quantities needed for the regret analysis.

pith-pipeline@v0.9.1-grok · 5696 in / 1403 out tokens · 35975 ms · 2026-06-29T14:25:16.204542+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 4 canonical work pages

[1]

Academic press, 2009

Vijay Krishna.Auction theory. Academic press, 2009

2009
[2]

Causal models for real time bidding with repeated user interactions

Martin Bompaire, Alexandre Gilotte, and Benjamin Heymann. Causal models for real time bidding with repeated user interactions. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 75–85, New York, NY , USA, 2021. Association for Computing Machinery

2021
[3]

Attribution modeling increases efficiency of bidding in display advertising

Eustache Diemert, Julien Meynet, Pierre Galland, and Damien Lefortier. Attribution modeling increases efficiency of bidding in display advertising. InProceedings of the ADKDD’17, pages 1–6. 2017

2017
[4]

A pragmatic policy learning approach to account for users’ fatigue in repeated auctions.arXiv preprint arXiv:2407.10504, 2024

Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. A pragmatic policy learning approach to account for users’ fatigue in repeated auctions.arXiv preprint arXiv:2407.10504, 2024

work page arXiv 2024
[5]

Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024

Martin Bompaire, Antoine Désir, and Benjamin Heymann. Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024

2024
[6]

Repeated bidding with dynamic value

Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. Repeated bidding with dynamic value. In Marios Mavronicolas, Qi Qi, and Grant Schoenebeck, editors,Web and Internet Economics, pages 548–563, Cham, 2026. Springer Nature Switzerland

2026
[7]

Efficient algorithms for stochastic repeated second-price auctions

Juliette Achddou, Olivier Cappé, and Aurélien Garivier. Efficient algorithms for stochastic repeated second-price auctions. InAlgorithmic Learning Theory, pages 99–150. PMLR, 2021

2021
[8]

Online learning in repeated auctions

Jonathan Weed, Vianney Perchet, and Philippe Rigollet. Online learning in repeated auctions. InConference on Learning Theory, pages 1562–1583. PMLR, 2016

2016
[9]

Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019

Santiago R Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019

2019
[10]

Dual mirror descent for online allocation problems

Santiago Balseiro, Haihao Lu, and Vahab Mirrokni. Dual mirror descent for online allocation problems. InInternational Conference on Machine Learning, pages 613–628. PMLR, 2020

2020
[11]

A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022

Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Giulia Romano, and Nicola Gatti. A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022

2022
[12]

Maximizing the success probability of policy allocations in online systems

Artem Betlei, Mariia Vladimirova, Mehdi Sebbar, Nicolas Urien, Thibaud Rahier, and Benjamin Heymann. Maximizing the success probability of policy allocations in online systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11061–11068, 2024

2024
[13]

Non-linear counterfactual aggregate optimization

Benjamin Heymann and Otmane Sakhi. Non-linear counterfactual aggregate optimization. arXiv preprint arXiv:2509.03438, 2025. Presented at Recsys 2025 Consequences workshop

work page arXiv 2025
[14]

Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023

Xin Guo, Anran Hu, and Yufei Zhang. Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023

2023
[15]

Continuous-time model-based reinforcement learning

Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforcement learning. InInternational Conference on Machine Learning, pages 12009–12018. PMLR, 2021

2021
[16]

Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

2020
[17]

Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022

Ashwinkumar Badanidiyuru Varadaraja, Zhe Feng, Tianxi Li, and Haifeng Xu. Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022. 10

2022
[18]

arXiv preprint arXiv:2208.12809 , eprint =

Randall Lewis and Jeffrey Wong. Incrementality bidding and attribution.arXiv preprint arXiv:2208.12809, 2022

work page arXiv 2022
[19]

Ads that stick: Near-optimal ad optimization through psychological behavior models.arXiv preprint arXiv:2509.20304, 2025

Kailash Gopal Darmasubramanian, Akash Pareek, Arindam Khan, and Arpit Agarwal. Ads that stick: Near-optimal ad optimization through psychological behavior models.arXiv preprint arXiv:2509.20304, 2025

work page arXiv 2025
[20]

SIAM, 2002

Philip Hartman.Ordinary differential equations. SIAM, 2002

2002
[21]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors,Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011

2011
[22]

Oxford University Press, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013

2013
[23]

Joel A. Tropp. Freedman’s inequality for matrix martingales, 2011

2011
[24]

Cambridge University Press, March 2004

Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, March 2004

2004
[25]

Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences.The Annals of Statistics, 49(2):1055 – 1080, 2021

2021
[26]

exp − Z T 0 µq(πn(t))dt !# ≥1−E T∼Exp(γ)

Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020. 11 Appendix Outline A Concentration Lemmata B Proofs on the Theoretical Solution C Proofs and remarks for Section 4.3 (concentration ofK) D Proofs and remarks for Section 4.4 (concentration ofQ) E Analysis of the algorithms F Analysis of the lower bound A Concentration...

2020
[27]

Compute ˆk=K(D k N1) andˆq=Q(D q N1)

Exploration Phase ( N1 episodes):Use a fixed exploratory policy πx that bids k(∞) (the maximal value) to collect data for estimating k and q. Compute ˆk=K(D k N1) andˆq=Q(D q N1). 2.Exploitation Phase (N 2 =N−N 1 episodes):Use policyπ=S( ˆk,ˆq). The algorithm explores withN 1 episodes and exploits the rest of theN 2 sessions. Let δ∈(0,1) . There exists C0...
[28]

Compute ˆk=K(D k N1)

Exploration Phase for k (N1 episodes):Use a fixed exploratory policy πx that bids b0 >0to collect data for estimatingk. Compute ˆk=K(D k N1)
[29]

Computeˆq=Q(D q N2)

Exploration Phase for q (N2 episodes):Use the learned πk = ˆk to collect data for estimatingq. Computeˆq=Q(D q N2). 3.Exploitation Phase (N 3 =N−N 1 −N 2 episodes):Use policyπ=S( ˆk,ˆq). The algorithm learns ˆk, uses ˆk to learn ˆqthen exploits with π=S( ˆk,ˆq)the rest of the N sessions. Let δ∈(0,1) . There exists C0 such that once N≥C 0, setting N1 =N 2 ...

[1] [1]

Academic press, 2009

Vijay Krishna.Auction theory. Academic press, 2009

2009

[2] [2]

Causal models for real time bidding with repeated user interactions

Martin Bompaire, Alexandre Gilotte, and Benjamin Heymann. Causal models for real time bidding with repeated user interactions. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 75–85, New York, NY , USA, 2021. Association for Computing Machinery

2021

[3] [3]

Attribution modeling increases efficiency of bidding in display advertising

Eustache Diemert, Julien Meynet, Pierre Galland, and Damien Lefortier. Attribution modeling increases efficiency of bidding in display advertising. InProceedings of the ADKDD’17, pages 1–6. 2017

2017

[4] [4]

A pragmatic policy learning approach to account for users’ fatigue in repeated auctions.arXiv preprint arXiv:2407.10504, 2024

Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. A pragmatic policy learning approach to account for users’ fatigue in repeated auctions.arXiv preprint arXiv:2407.10504, 2024

work page arXiv 2024

[5] [5]

Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024

Martin Bompaire, Antoine Désir, and Benjamin Heymann. Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024

2024

[6] [6]

Repeated bidding with dynamic value

Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. Repeated bidding with dynamic value. In Marios Mavronicolas, Qi Qi, and Grant Schoenebeck, editors,Web and Internet Economics, pages 548–563, Cham, 2026. Springer Nature Switzerland

2026

[7] [7]

Efficient algorithms for stochastic repeated second-price auctions

Juliette Achddou, Olivier Cappé, and Aurélien Garivier. Efficient algorithms for stochastic repeated second-price auctions. InAlgorithmic Learning Theory, pages 99–150. PMLR, 2021

2021

[8] [8]

Online learning in repeated auctions

Jonathan Weed, Vianney Perchet, and Philippe Rigollet. Online learning in repeated auctions. InConference on Learning Theory, pages 1562–1583. PMLR, 2016

2016

[9] [9]

Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019

Santiago R Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019

2019

[10] [10]

Dual mirror descent for online allocation problems

Santiago Balseiro, Haihao Lu, and Vahab Mirrokni. Dual mirror descent for online allocation problems. InInternational Conference on Machine Learning, pages 613–628. PMLR, 2020

2020

[11] [11]

A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022

Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Giulia Romano, and Nicola Gatti. A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022

2022

[12] [12]

Maximizing the success probability of policy allocations in online systems

Artem Betlei, Mariia Vladimirova, Mehdi Sebbar, Nicolas Urien, Thibaud Rahier, and Benjamin Heymann. Maximizing the success probability of policy allocations in online systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11061–11068, 2024

2024

[13] [13]

Non-linear counterfactual aggregate optimization

Benjamin Heymann and Otmane Sakhi. Non-linear counterfactual aggregate optimization. arXiv preprint arXiv:2509.03438, 2025. Presented at Recsys 2025 Consequences workshop

work page arXiv 2025

[14] [14]

Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023

Xin Guo, Anran Hu, and Yufei Zhang. Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023

2023

[15] [15]

Continuous-time model-based reinforcement learning

Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforcement learning. InInternational Conference on Machine Learning, pages 12009–12018. PMLR, 2021

2021

[16] [16]

Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

2020

[17] [17]

Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022

Ashwinkumar Badanidiyuru Varadaraja, Zhe Feng, Tianxi Li, and Haifeng Xu. Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022. 10

2022

[18] [18]

arXiv preprint arXiv:2208.12809 , eprint =

Randall Lewis and Jeffrey Wong. Incrementality bidding and attribution.arXiv preprint arXiv:2208.12809, 2022

work page arXiv 2022

[19] [19]

Ads that stick: Near-optimal ad optimization through psychological behavior models.arXiv preprint arXiv:2509.20304, 2025

Kailash Gopal Darmasubramanian, Akash Pareek, Arindam Khan, and Arpit Agarwal. Ads that stick: Near-optimal ad optimization through psychological behavior models.arXiv preprint arXiv:2509.20304, 2025

work page arXiv 2025

[20] [20]

SIAM, 2002

Philip Hartman.Ordinary differential equations. SIAM, 2002

2002

[21] [21]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors,Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011

2011

[22] [22]

Oxford University Press, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013

2013

[23] [23]

Joel A. Tropp. Freedman’s inequality for matrix martingales, 2011

2011

[24] [24]

Cambridge University Press, March 2004

Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, March 2004

2004

[25] [25]

Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences.The Annals of Statistics, 49(2):1055 – 1080, 2021

2021

[26] [26]

exp − Z T 0 µq(πn(t))dt !# ≥1−E T∼Exp(γ)

Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020. 11 Appendix Outline A Concentration Lemmata B Proofs on the Theoretical Solution C Proofs and remarks for Section 4.3 (concentration ofK) D Proofs and remarks for Section 4.4 (concentration ofQ) E Analysis of the algorithms F Analysis of the lower bound A Concentration...

2020

[27] [27]

Compute ˆk=K(D k N1) andˆq=Q(D q N1)

Exploration Phase ( N1 episodes):Use a fixed exploratory policy πx that bids k(∞) (the maximal value) to collect data for estimating k and q. Compute ˆk=K(D k N1) andˆq=Q(D q N1). 2.Exploitation Phase (N 2 =N−N 1 episodes):Use policyπ=S( ˆk,ˆq). The algorithm explores withN 1 episodes and exploits the rest of theN 2 sessions. Let δ∈(0,1) . There exists C0...

[28] [28]

Compute ˆk=K(D k N1)

Exploration Phase for k (N1 episodes):Use a fixed exploratory policy πx that bids b0 >0to collect data for estimatingk. Compute ˆk=K(D k N1)

[29] [29]

Computeˆq=Q(D q N2)

Exploration Phase for q (N2 episodes):Use the learned πk = ˆk to collect data for estimatingq. Computeˆq=Q(D q N2). 3.Exploitation Phase (N 3 =N−N 1 −N 2 episodes):Use policyπ=S( ˆk,ˆq). The algorithm learns ˆk, uses ˆk to learn ˆqthen exploits with π=S( ˆk,ˆq)the rest of the N sessions. Let δ∈(0,1) . There exists C0 such that once N≥C 0, setting N1 =N 2 ...