Learning to Bid in Repeated Second-Price Auctions with Dynamic Values and Aggregated Feedback
Pith reviewed 2026-06-29 14:25 UTC · model grok-4.3
The pith
A confidence-bound algorithm learns the optimal bidding policy with near-optimal regret using only aggregated feedback in repeated second-price auctions with dynamic values.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a confidence-bound algorithm using plug-in estimators for unknown parameters together with the differential-equation characterization of the optimal bidding policy learns the optimal policy with regret ilde O(log N) when the primitives are piecewise linear and ilde O(N^{1/3}) when they are general and smooth.
What carries the argument
The differential-equation characterization of the optimal policy combined with plug-in estimators from aggregated feedback, embedded in a confidence-bound algorithm.
If this is right
- The algorithm balances the immediate benefit of winning against future value impact while learning from aggregated feedback alone.
- The same regret rates hold for both piecewise-linear and smooth value functions.
- No explicit randomization is required to achieve the bounds.
- Numerical experiments confirm the theoretical rates.
Where Pith is reading between the lines
- Similar plug-in plus differential-equation methods might apply to other dynamic bidding or resource allocation problems with delayed feedback.
- The absence of randomization could simplify deployment in practical bidding systems.
- The approach may connect to other online learning settings where optimal policies satisfy differential equations.
Load-bearing premise
The bidder's value is a known function of time since the last successful bid with unknown parameters, and the optimal policy admits a differential-equation characterization.
What would settle it
An experiment or calculation in which the observed regret for the described primitives exceeds the stated ilde O(log N) or ilde O(N^{1/3}) bounds would falsify the claim.
Figures
read the original abstract
We study the problem of learning to bid when the bidder's value is dynamic, i.e., when the current value depends on past outcomes. Specifically, we consider a bidder participating in repeated second-price auctions whose value depends on the time elapsed since their last successful bid, with auctions arriving in continuous time and only aggregated feedback revealed at the end of the horizon. Such a bidder must (1) balance the immediate benefit of winning the current auction against its impact on future values and (2) learn unknown environmental parameters. We derive regret bounds for a class of learning methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, and show that a specific confidence bound algorithm learns the optimal policy with a near optimal regret of $\widetilde{O}(\log N)$ for piecewise linear primitives, and $\widetilde{O}(N^{1/3})$ for general, smooth primitives, achieving these regrets without explicit randomization. These theoretical results are supported by numerical experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies learning to bid in repeated second-price auctions where the bidder's value depends on time elapsed since the last successful bid. Auctions arrive in continuous time with only aggregated feedback at the end of the horizon. The authors derive regret bounds for methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, claiming Õ(log N) regret for piecewise-linear primitives and Õ(N^{1/3}) for general smooth primitives, achieved without explicit randomization. Numerical experiments are provided to support the results.
Significance. If the central claims hold, the work would advance online learning for auctions with dynamic values and limited feedback by showing how a DE-based policy characterization can be paired with plug-in estimation from aggregates to obtain near-optimal regret rates without randomization. This addresses a practically relevant setting where explicit randomization may be undesirable.
major comments (2)
- [main theorems / §4-5 (DE characterization and plug-in analysis)] The central regret claims (abstract and main theorems) rest on the DE-to-policy map being uniformly Lipschitz in the unknown parameters so that plug-in estimates from aggregated feedback yield the stated rates. No explicit lemma or proposition is referenced establishing this Lipschitz property or the required parametric convergence rate of the estimators under aggregation; without it the Õ(log N) bound for piecewise-linear primitives is not guaranteed.
- [§3 (model and DE characterization)] The abstract states that the optimal policy admits a differential-equation characterization that combines with plug-in estimators, but boundary conditions, identifiability under end-of-horizon aggregation, and numerical stability of the DE solution are not detailed. These are load-bearing for controlling sub-optimality of the resulting policy and thus for both regret bounds.
minor comments (2)
- [§2] Notation for the primitives (piecewise-linear vs. smooth) and the exact form of the aggregated feedback should be clarified in the model section to avoid ambiguity when reading the regret statements.
- [experiments] The numerical experiments section would benefit from reporting the precise parameter values used for the DE solver and the number of independent runs to allow reproducibility assessment.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate planned revisions.
read point-by-point responses
-
Referee: [main theorems / §4-5 (DE characterization and plug-in analysis)] The central regret claims (abstract and main theorems) rest on the DE-to-policy map being uniformly Lipschitz in the unknown parameters so that plug-in estimates from aggregated feedback yield the stated rates. No explicit lemma or proposition is referenced establishing this Lipschitz property or the required parametric convergence rate of the estimators under aggregation; without it the Õ(log N) bound for piecewise-linear primitives is not guaranteed.
Authors: We agree that the connection between the Lipschitz property of the DE-to-policy map and the plug-in estimator convergence rate should be stated more explicitly in the main text. Lemma 4.2 in the appendix establishes uniform Lipschitz continuity of the map with respect to the parameters, and Proposition 3.3 provides the parametric convergence rate under end-of-horizon aggregation. We will add forward references to these results in the statements of Theorems 4.1 and 5.1, along with a short paragraph in Section 4 summarizing how the Lipschitz constant enters the regret analysis. This clarification does not change the claims but improves readability. revision: partial
-
Referee: [§3 (model and DE characterization)] The abstract states that the optimal policy admits a differential-equation characterization that combines with plug-in estimators, but boundary conditions, identifiability under end-of-horizon aggregation, and numerical stability of the DE solution are not detailed. These are load-bearing for controlling sub-optimality of the resulting policy and thus for both regret bounds.
Authors: We acknowledge that a more self-contained treatment of these aspects in the main body would strengthen the exposition. Boundary conditions appear in Equation (3), identifiability is formalized in Assumption 2.1, and numerical stability is ensured by the uniform discretization scheme described in Section 6. We will expand Section 3 with a new subsection that collects these elements, includes a short proof sketch for identifiability from aggregated observations, and states the stability bound used in the regret analysis. These additions address the referee's concern directly. revision: yes
Circularity Check
No circularity: derivation combines external DE characterization with plug-in estimation
full rationale
The abstract states that regret bounds are derived for methods combining plug-in estimators with a differential-equation characterization of the optimal policy. No load-bearing step reduces by construction to a fitted quantity or self-citation; the DE characterization is presented as an independent modeling step whose solution is then estimated from aggregates. The provided text contains no equations or claims that equate the regret bound to an input fit or rename a known result. This matches the default expectation of a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The optimal bidding policy admits a differential-equation characterization that remains valid under the dynamic value model.
- domain assumption Environmental parameters are identifiable from aggregated end-of-horizon feedback.
Reference graph
Works this paper leans on
-
[1]
Academic press, 2009
Vijay Krishna.Auction theory. Academic press, 2009
2009
-
[2]
Causal models for real time bidding with repeated user interactions
Martin Bompaire, Alexandre Gilotte, and Benjamin Heymann. Causal models for real time bidding with repeated user interactions. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 75–85, New York, NY , USA, 2021. Association for Computing Machinery
2021
-
[3]
Attribution modeling increases efficiency of bidding in display advertising
Eustache Diemert, Julien Meynet, Pierre Galland, and Damien Lefortier. Attribution modeling increases efficiency of bidding in display advertising. InProceedings of the ADKDD’17, pages 1–6. 2017
2017
-
[4]
Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. A pragmatic policy learning approach to account for users’ fatigue in repeated auctions.arXiv preprint arXiv:2407.10504, 2024
-
[5]
Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024
Martin Bompaire, Antoine Désir, and Benjamin Heymann. Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024
2024
-
[6]
Repeated bidding with dynamic value
Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. Repeated bidding with dynamic value. In Marios Mavronicolas, Qi Qi, and Grant Schoenebeck, editors,Web and Internet Economics, pages 548–563, Cham, 2026. Springer Nature Switzerland
2026
-
[7]
Efficient algorithms for stochastic repeated second-price auctions
Juliette Achddou, Olivier Cappé, and Aurélien Garivier. Efficient algorithms for stochastic repeated second-price auctions. InAlgorithmic Learning Theory, pages 99–150. PMLR, 2021
2021
-
[8]
Online learning in repeated auctions
Jonathan Weed, Vianney Perchet, and Philippe Rigollet. Online learning in repeated auctions. InConference on Learning Theory, pages 1562–1583. PMLR, 2016
2016
-
[9]
Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019
Santiago R Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019
2019
-
[10]
Dual mirror descent for online allocation problems
Santiago Balseiro, Haihao Lu, and Vahab Mirrokni. Dual mirror descent for online allocation problems. InInternational Conference on Machine Learning, pages 613–628. PMLR, 2020
2020
-
[11]
A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022
Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Giulia Romano, and Nicola Gatti. A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022
2022
-
[12]
Maximizing the success probability of policy allocations in online systems
Artem Betlei, Mariia Vladimirova, Mehdi Sebbar, Nicolas Urien, Thibaud Rahier, and Benjamin Heymann. Maximizing the success probability of policy allocations in online systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11061–11068, 2024
2024
-
[13]
Non-linear counterfactual aggregate optimization
Benjamin Heymann and Otmane Sakhi. Non-linear counterfactual aggregate optimization. arXiv preprint arXiv:2509.03438, 2025. Presented at Recsys 2025 Consequences workshop
-
[14]
Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023
Xin Guo, Anran Hu, and Yufei Zhang. Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023
2023
-
[15]
Continuous-time model-based reinforcement learning
Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforcement learning. InInternational Conference on Machine Learning, pages 12009–12018. PMLR, 2021
2021
-
[16]
Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020
Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020
2020
-
[17]
Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022
Ashwinkumar Badanidiyuru Varadaraja, Zhe Feng, Tianxi Li, and Haifeng Xu. Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022. 10
2022
-
[18]
arXiv preprint arXiv:2208.12809 , eprint =
Randall Lewis and Jeffrey Wong. Incrementality bidding and attribution.arXiv preprint arXiv:2208.12809, 2022
-
[19]
Kailash Gopal Darmasubramanian, Akash Pareek, Arindam Khan, and Arpit Agarwal. Ads that stick: Near-optimal ad optimization through psychological behavior models.arXiv preprint arXiv:2509.20304, 2025
-
[20]
SIAM, 2002
Philip Hartman.Ordinary differential equations. SIAM, 2002
2002
-
[21]
Improved algorithms for linear stochastic bandits
Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors,Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011
2011
-
[22]
Oxford University Press, 2013
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013
2013
-
[23]
Joel A. Tropp. Freedman’s inequality for matrix martingales, 2011
2011
-
[24]
Cambridge University Press, March 2004
Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, March 2004
2004
-
[25]
Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon
Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences.The Annals of Statistics, 49(2):1055 – 1080, 2021
2021
-
[26]
exp − Z T 0 µq(πn(t))dt !# ≥1−E T∼Exp(γ)
Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020. 11 Appendix Outline A Concentration Lemmata B Proofs on the Theoretical Solution C Proofs and remarks for Section 4.3 (concentration ofK) D Proofs and remarks for Section 4.4 (concentration ofQ) E Analysis of the algorithms F Analysis of the lower bound A Concentration...
2020
-
[27]
Compute ˆk=K(D k N1) andˆq=Q(D q N1)
Exploration Phase ( N1 episodes):Use a fixed exploratory policy πx that bids k(∞) (the maximal value) to collect data for estimating k and q. Compute ˆk=K(D k N1) andˆq=Q(D q N1). 2.Exploitation Phase (N 2 =N−N 1 episodes):Use policyπ=S( ˆk,ˆq). The algorithm explores withN 1 episodes and exploits the rest of theN 2 sessions. Let δ∈(0,1) . There exists C0...
-
[28]
Compute ˆk=K(D k N1)
Exploration Phase for k (N1 episodes):Use a fixed exploratory policy πx that bids b0 >0to collect data for estimatingk. Compute ˆk=K(D k N1)
-
[29]
Computeˆq=Q(D q N2)
Exploration Phase for q (N2 episodes):Use the learned πk = ˆk to collect data for estimatingq. Computeˆq=Q(D q N2). 3.Exploitation Phase (N 3 =N−N 1 −N 2 episodes):Use policyπ=S( ˆk,ˆq). The algorithm learns ˆk, uses ˆk to learn ˆqthen exploits with π=S( ˆk,ˆq)the rest of the N sessions. Let δ∈(0,1) . There exists C0 such that once N≥C 0, setting N1 =N 2 ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.