pith. sign in

arxiv: 2605.28133 · v1 · pith:CYGY2OPInew · submitted 2026-05-27 · 💻 cs.LG · stat.ML

Learning to Bid in Repeated Second-Price Auctions with Dynamic Values and Aggregated Feedback

Pith reviewed 2026-06-29 14:25 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords online learningsecond-price auctionsdynamic valuesregret boundsaggregated feedbackconfidence boundsbidding strategies
0
0 comments X

The pith

A confidence-bound algorithm learns the optimal bidding policy with near-optimal regret using only aggregated feedback in repeated second-price auctions with dynamic values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a bidder in repeated second-price auctions whose value for winning depends on the time elapsed since the last successful bid. Auctions occur in continuous time and the bidder sees only aggregated feedback at the horizon's end. The bidder must trade off immediate wins against their effect on future values while learning unknown parameters. The authors derive regret bounds for methods that combine plug-in estimators with a differential-equation characterization of the optimal policy. A specific confidence-bound algorithm attains ilde O(log N) regret for piecewise-linear primitives and ilde O(N^{1/3}) regret for general smooth primitives without explicit randomization.

Core claim

The paper claims that a confidence-bound algorithm using plug-in estimators for unknown parameters together with the differential-equation characterization of the optimal bidding policy learns the optimal policy with regret ilde O(log N) when the primitives are piecewise linear and ilde O(N^{1/3}) when they are general and smooth.

What carries the argument

The differential-equation characterization of the optimal policy combined with plug-in estimators from aggregated feedback, embedded in a confidence-bound algorithm.

If this is right

  • The algorithm balances the immediate benefit of winning against future value impact while learning from aggregated feedback alone.
  • The same regret rates hold for both piecewise-linear and smooth value functions.
  • No explicit randomization is required to achieve the bounds.
  • Numerical experiments confirm the theoretical rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar plug-in plus differential-equation methods might apply to other dynamic bidding or resource allocation problems with delayed feedback.
  • The absence of randomization could simplify deployment in practical bidding systems.
  • The approach may connect to other online learning settings where optimal policies satisfy differential equations.

Load-bearing premise

The bidder's value is a known function of time since the last successful bid with unknown parameters, and the optimal policy admits a differential-equation characterization.

What would settle it

An experiment or calculation in which the observed regret for the described primitives exceeds the stated ilde O(log N) or ilde O(N^{1/3}) bounds would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.28133 by Benjamin Heymann, Otmane Sakhi.

Figure 1
Figure 1. Figure 1: Regret as a function of N. In this section, we design a simple experimental setting to validate our algorithms. We consider a single environment with k(t) = 1 − exp(−θt) and q(v) = v α, where α = 2 and θ = 0.1. We use a family of piecewise linear functions with 10 uniform breakpoints as an approximation. We set µ = 0.5 and γ = 0.1, and assume that these parameters are known. We evaluate the no-regret algor… view at source ↗
read the original abstract

We study the problem of learning to bid when the bidder's value is dynamic, i.e., when the current value depends on past outcomes. Specifically, we consider a bidder participating in repeated second-price auctions whose value depends on the time elapsed since their last successful bid, with auctions arriving in continuous time and only aggregated feedback revealed at the end of the horizon. Such a bidder must (1) balance the immediate benefit of winning the current auction against its impact on future values and (2) learn unknown environmental parameters. We derive regret bounds for a class of learning methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, and show that a specific confidence bound algorithm learns the optimal policy with a near optimal regret of $\widetilde{O}(\log N)$ for piecewise linear primitives, and $\widetilde{O}(N^{1/3})$ for general, smooth primitives, achieving these regrets without explicit randomization. These theoretical results are supported by numerical experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies learning to bid in repeated second-price auctions where the bidder's value depends on time elapsed since the last successful bid. Auctions arrive in continuous time with only aggregated feedback at the end of the horizon. The authors derive regret bounds for methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, claiming Õ(log N) regret for piecewise-linear primitives and Õ(N^{1/3}) for general smooth primitives, achieved without explicit randomization. Numerical experiments are provided to support the results.

Significance. If the central claims hold, the work would advance online learning for auctions with dynamic values and limited feedback by showing how a DE-based policy characterization can be paired with plug-in estimation from aggregates to obtain near-optimal regret rates without randomization. This addresses a practically relevant setting where explicit randomization may be undesirable.

major comments (2)
  1. [main theorems / §4-5 (DE characterization and plug-in analysis)] The central regret claims (abstract and main theorems) rest on the DE-to-policy map being uniformly Lipschitz in the unknown parameters so that plug-in estimates from aggregated feedback yield the stated rates. No explicit lemma or proposition is referenced establishing this Lipschitz property or the required parametric convergence rate of the estimators under aggregation; without it the Õ(log N) bound for piecewise-linear primitives is not guaranteed.
  2. [§3 (model and DE characterization)] The abstract states that the optimal policy admits a differential-equation characterization that combines with plug-in estimators, but boundary conditions, identifiability under end-of-horizon aggregation, and numerical stability of the DE solution are not detailed. These are load-bearing for controlling sub-optimality of the resulting policy and thus for both regret bounds.
minor comments (2)
  1. [§2] Notation for the primitives (piecewise-linear vs. smooth) and the exact form of the aggregated feedback should be clarified in the model section to avoid ambiguity when reading the regret statements.
  2. [experiments] The numerical experiments section would benefit from reporting the precise parameter values used for the DE solver and the number of independent runs to allow reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [main theorems / §4-5 (DE characterization and plug-in analysis)] The central regret claims (abstract and main theorems) rest on the DE-to-policy map being uniformly Lipschitz in the unknown parameters so that plug-in estimates from aggregated feedback yield the stated rates. No explicit lemma or proposition is referenced establishing this Lipschitz property or the required parametric convergence rate of the estimators under aggregation; without it the Õ(log N) bound for piecewise-linear primitives is not guaranteed.

    Authors: We agree that the connection between the Lipschitz property of the DE-to-policy map and the plug-in estimator convergence rate should be stated more explicitly in the main text. Lemma 4.2 in the appendix establishes uniform Lipschitz continuity of the map with respect to the parameters, and Proposition 3.3 provides the parametric convergence rate under end-of-horizon aggregation. We will add forward references to these results in the statements of Theorems 4.1 and 5.1, along with a short paragraph in Section 4 summarizing how the Lipschitz constant enters the regret analysis. This clarification does not change the claims but improves readability. revision: partial

  2. Referee: [§3 (model and DE characterization)] The abstract states that the optimal policy admits a differential-equation characterization that combines with plug-in estimators, but boundary conditions, identifiability under end-of-horizon aggregation, and numerical stability of the DE solution are not detailed. These are load-bearing for controlling sub-optimality of the resulting policy and thus for both regret bounds.

    Authors: We acknowledge that a more self-contained treatment of these aspects in the main body would strengthen the exposition. Boundary conditions appear in Equation (3), identifiability is formalized in Assumption 2.1, and numerical stability is ensured by the uniform discretization scheme described in Section 6. We will expand Section 3 with a new subsection that collects these elements, includes a short proof sketch for identifiability from aggregated observations, and states the stability bound used in the regret analysis. These additions address the referee's concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation combines external DE characterization with plug-in estimation

full rationale

The abstract states that regret bounds are derived for methods combining plug-in estimators with a differential-equation characterization of the optimal policy. No load-bearing step reduces by construction to a fitted quantity or self-citation; the DE characterization is presented as an independent modeling step whose solution is then estimated from aggregates. The provided text contains no equations or claims that equate the regret bound to an input fit or rename a known result. This matches the default expectation of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard assumptions from online learning and continuous-time optimal control (existence of an optimal policy via differential equations, learnability of parameters from aggregated feedback) plus domain assumptions on the value function primitives; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption The optimal bidding policy admits a differential-equation characterization that remains valid under the dynamic value model.
    Invoked to justify combining plug-in estimators with the DE characterization.
  • domain assumption Environmental parameters are identifiable from aggregated end-of-horizon feedback.
    Required for the plug-in estimator to converge to the quantities needed for the regret analysis.

pith-pipeline@v0.9.1-grok · 5696 in / 1403 out tokens · 35975 ms · 2026-06-29T14:25:16.204542+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 4 canonical work pages

  1. [1]

    Academic press, 2009

    Vijay Krishna.Auction theory. Academic press, 2009

  2. [2]

    Causal models for real time bidding with repeated user interactions

    Martin Bompaire, Alexandre Gilotte, and Benjamin Heymann. Causal models for real time bidding with repeated user interactions. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 75–85, New York, NY , USA, 2021. Association for Computing Machinery

  3. [3]

    Attribution modeling increases efficiency of bidding in display advertising

    Eustache Diemert, Julien Meynet, Pierre Galland, and Damien Lefortier. Attribution modeling increases efficiency of bidding in display advertising. InProceedings of the ADKDD’17, pages 1–6. 2017

  4. [4]

    A pragmatic policy learning approach to account for users’ fatigue in repeated auctions.arXiv preprint arXiv:2407.10504, 2024

    Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. A pragmatic policy learning approach to account for users’ fatigue in repeated auctions.arXiv preprint arXiv:2407.10504, 2024

  5. [5]

    Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024

    Martin Bompaire, Antoine Désir, and Benjamin Heymann. Fixed point label attribution for real-time bidding.Manufacturing & Service Operations Management, 26(3):1043–1061, 2024

  6. [6]

    Repeated bidding with dynamic value

    Benjamin Heymann, Alexandre Gilotte, and Rémi Chan-Renous. Repeated bidding with dynamic value. In Marios Mavronicolas, Qi Qi, and Grant Schoenebeck, editors,Web and Internet Economics, pages 548–563, Cham, 2026. Springer Nature Switzerland

  7. [7]

    Efficient algorithms for stochastic repeated second-price auctions

    Juliette Achddou, Olivier Cappé, and Aurélien Garivier. Efficient algorithms for stochastic repeated second-price auctions. InAlgorithmic Learning Theory, pages 99–150. PMLR, 2021

  8. [8]

    Online learning in repeated auctions

    Jonathan Weed, Vianney Perchet, and Philippe Rigollet. Online learning in repeated auctions. InConference on Learning Theory, pages 1562–1583. PMLR, 2016

  9. [9]

    Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019

    Santiago R Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium.Management Science, 65(9):3952–3968, 2019

  10. [10]

    Dual mirror descent for online allocation problems

    Santiago Balseiro, Haihao Lu, and Vahab Mirrokni. Dual mirror descent for online allocation problems. InInternational Conference on Machine Learning, pages 613–628. PMLR, 2020

  11. [11]

    A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022

    Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Giulia Romano, and Nicola Gatti. A unifying framework for online optimization with long-term constraints.Advances in Neural Information Processing Systems, 35:33589–33602, 2022

  12. [12]

    Maximizing the success probability of policy allocations in online systems

    Artem Betlei, Mariia Vladimirova, Mehdi Sebbar, Nicolas Urien, Thibaud Rahier, and Benjamin Heymann. Maximizing the success probability of policy allocations in online systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11061–11068, 2024

  13. [13]

    Non-linear counterfactual aggregate optimization

    Benjamin Heymann and Otmane Sakhi. Non-linear counterfactual aggregate optimization. arXiv preprint arXiv:2509.03438, 2025. Presented at Recsys 2025 Consequences workshop

  14. [14]

    Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023

    Xin Guo, Anran Hu, and Yufei Zhang. Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls.SIAM Journal on Control and Optimization, 61(2):755–787, 2023

  15. [15]

    Continuous-time model-based reinforcement learning

    Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforcement learning. InInternational Conference on Machine Learning, pages 12009–12018. PMLR, 2021

  16. [16]

    Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

    Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

  17. [17]

    Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022

    Ashwinkumar Badanidiyuru Varadaraja, Zhe Feng, Tianxi Li, and Haifeng Xu. Incrementality bidding via reinforcement learning under mixed and delayed rewards.Advances in Neural Information Processing Systems, 35:2142–2153, 2022. 10

  18. [18]

    arXiv preprint arXiv:2208.12809 , eprint =

    Randall Lewis and Jeffrey Wong. Incrementality bidding and attribution.arXiv preprint arXiv:2208.12809, 2022

  19. [19]

    Ads that stick: Near-optimal ad optimization through psychological behavior models.arXiv preprint arXiv:2509.20304, 2025

    Kailash Gopal Darmasubramanian, Akash Pareek, Arindam Khan, and Arpit Agarwal. Ads that stick: Near-optimal ad optimization through psychological behavior models.arXiv preprint arXiv:2509.20304, 2025

  20. [20]

    SIAM, 2002

    Philip Hartman.Ordinary differential equations. SIAM, 2002

  21. [21]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors,Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011

  22. [22]

    Oxford University Press, 2013

    Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013

  23. [23]

    Joel A. Tropp. Freedman’s inequality for matrix martingales, 2011

  24. [24]

    Cambridge University Press, March 2004

    Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, March 2004

  25. [25]

    Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon

    Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences.The Annals of Statistics, 49(2):1055 – 1080, 2021

  26. [26]

    exp − Z T 0 µq(πn(t))dt !# ≥1−E T∼Exp(γ)

    Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020. 11 Appendix Outline A Concentration Lemmata B Proofs on the Theoretical Solution C Proofs and remarks for Section 4.3 (concentration ofK) D Proofs and remarks for Section 4.4 (concentration ofQ) E Analysis of the algorithms F Analysis of the lower bound A Concentration...

  27. [27]

    Compute ˆk=K(D k N1) andˆq=Q(D q N1)

    Exploration Phase ( N1 episodes):Use a fixed exploratory policy πx that bids k(∞) (the maximal value) to collect data for estimating k and q. Compute ˆk=K(D k N1) andˆq=Q(D q N1). 2.Exploitation Phase (N 2 =N−N 1 episodes):Use policyπ=S( ˆk,ˆq). The algorithm explores withN 1 episodes and exploits the rest of theN 2 sessions. Let δ∈(0,1) . There exists C0...

  28. [28]

    Compute ˆk=K(D k N1)

    Exploration Phase for k (N1 episodes):Use a fixed exploratory policy πx that bids b0 >0to collect data for estimatingk. Compute ˆk=K(D k N1)

  29. [29]

    Computeˆq=Q(D q N2)

    Exploration Phase for q (N2 episodes):Use the learned πk = ˆk to collect data for estimatingq. Computeˆq=Q(D q N2). 3.Exploitation Phase (N 3 =N−N 1 −N 2 episodes):Use policyπ=S( ˆk,ˆq). The algorithm learns ˆk, uses ˆk to learn ˆqthen exploits with π=S( ˆk,ˆq)the rest of the N sessions. Let δ∈(0,1) . There exists C0 such that once N≥C 0, setting N1 =N 2 ...