Online Market Making and the Value of Observing the Order Book

Davide Maran; Marcello Restelli

arxiv: 2605.19584 · v1 · pith:S4PUUCTAnew · submitted 2026-05-19 · 💻 cs.LG · stat.ML

Online Market Making and the Value of Observing the Order Book

Davide Maran , Marcello Restelli This is my paper

Pith reviewed 2026-05-20 06:41 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords online market makingregret boundsaction-dependent feedbackorder bookelimination algorithmmean-reverting pricesstochastic and adversarial learning

0 comments

The pith

Action-dependent feedback from order book no-trades enables O(sqrt(T)) regret in online market making without smoothness assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies sequential bid and ask posting for a single asset when traders hold private valuations. It replaces the usual fully censored feedback with an action-dependent model in which a no-trade event reveals information about supply and demand while a trade conceals the valuation. This change is shown to make the problem substantially more learnable than standard bandit models. In the stochastic i.i.d. price setting an elimination algorithm attains high-probability square-root regret without any smoothness requirement on the valuation distribution. The same style of bound is proved for mean-reverting prices under local or global drift conditions and an explore-then-perturb method yields T to the two-thirds regret in the adversarial case.

Core claim

In the stochastic setting with i.i.d. market prices, an elimination-based algorithm achieves O(sqrt(T)) regret with high probability without requiring any smoothness assumptions on the distribution of trader valuations. The result rests on the action-dependent feedback model in which no-trade events reveal informative supply-and-demand signals while trades leave valuations hidden. The same O(sqrt(T)) high-probability bounds are obtained for broad classes of mean-reverting price processes by means of a new concentration inequality. In the adversarial setting with oblivious prices an explore-then-perturb algorithm guarantees O(T^{2/3}) regret in expectation.

What carries the argument

The action-dependent feedback model in which no-trade events reveal informative supply-and-demand signals while trades leave valuations hidden.

If this is right

Market makers obtain sublinear regret in stochastic environments even when trader valuations lack smoothness.
The same square-root bounds hold when prices follow local autoregressive dynamics or satisfy a global cumulative-deviation condition.
In adversarial oblivious-price environments the regret improves to T to the two-thirds instead of remaining linear.
The results directly quantify how limited order-book observations improve learning relative to fully censored feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar action-dependent feedback may improve regret in other sequential pricing tasks that involve partial observability.
Empirical tests on real order-book data could check whether no-trade events indeed correlate with valuation ranges as assumed.
The new concentration inequality may apply to other online problems that mix revealed and censored observations.

Load-bearing premise

That no-trade events supply useful information about the hidden valuations while trades supply none.

What would settle it

An experiment in which the proposed elimination algorithm is run on i.i.d. prices yet the observed regret remains linear in T rather than square-root.

read the original abstract

We study an online market-making problem in which a learner sequentially posts bid and ask prices for a single asset while interacting with traders holding private valuations. Unlike existing online learning formulations that assume fully censored feedback, we introduce an action-dependent feedback model inspired by real limit order books: when a trade occurs, the trader's valuation remains hidden, whereas when no trade occurs, informative feedback about supply and demand is revealed. We show that this additional information fundamentally changes the learnability of the problem. In the stochastic setting with i.i.d. market prices, we propose an elimination-based algorithm that achieves $O(\sqrt T)$ regret with high probability, without requiring any smoothness assumptions on the distribution of trader valuations. We then extend this result to a broad class of mean-reverting price processes by considering both local, autoregressive dynamics and a weaker global drift condition based on cumulative deviations from the mean. Under either assumption, we establish high-probability $O(\sqrt T)$ regret bounds, relying on a new concentration inequality of independent interest. Finally, in the adversarial setting with oblivious prices, we design an explore-then-perturb algorithm that guarantees $O(T^{2/3})$ regret in expectation. Our results quantify the value of observing the order book in online market making and demonstrate that even limited, action-dependent feedback can substantially improve regret guarantees compared to standard bandit feedback models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Action-dependent no-trade feedback lets them drop smoothness assumptions and hit O(sqrt(T)) regret in stochastic market making, though continuous prices could still need extra handling.

read the letter

The main thing to know is that this paper shows how modeling no-trade events as revealing supply-demand signals (while trades censor valuations) lets an elimination algorithm reach O(sqrt(T)) high-probability regret in the i.i.d. stochastic setting without any smoothness on trader valuations. They then extend the same rate to mean-reverting prices under local autoregressive or global drift conditions via a new concentration inequality, and get O(T^{2/3}) in the adversarial oblivious case with explore-then-perturb. This is a clear step up from standard fully censored feedback models in online market making. The modeling choice is grounded in real limit order books and the regret improvements are concrete, which is the part that actually moves the needle for algorithmic trading applications. The new concentration inequality is a small bonus that might travel to other problems. One soft spot worth checking is the price domain. If bid and ask prices live in a continuum rather than a finite discrete set, elimination needs either an explicit discretization argument or a way for the revealed signals to pin down valuation thresholds across all possible prices; without smoothness the discretization error could eat into the sqrt(T) rate. The abstract does not flag this, so the full paper should make the action space assumption explicit and show how the feedback controls it. The mean-reverting conditions are reasonable but could be sensitive in practice. This paper is for people working on partial-feedback bandits or online learning in finance. A reader who wants concrete regret rates that quantify the value of order-book information will get something useful out of the algorithms and bounds. It has enough structure and a distinct feedback model to deserve a serious referee who can verify the proofs and the handling of the action space.

Referee Report

2 major / 2 minor

Summary. The paper studies online market making with an action-dependent feedback model inspired by limit order books: no-trade events reveal supply/demand signals while trades censor trader valuations. In the stochastic i.i.d. setting it proposes an elimination algorithm achieving O(√T) high-probability regret without smoothness assumptions on valuations. Results extend to mean-reverting processes (local autoregressive and global drift) via a new concentration inequality, and an explore-then-perturb algorithm yields O(T^{2/3}) regret in the oblivious adversarial case.

Significance. If the derivations hold, the work shows that limited order-book feedback can improve learnability over fully censored bandit models, delivering √T regret in stochastic settings without regularity conditions on valuations and a new concentration tool for mean-reverting dynamics. These are concrete advances for online market-making literature.

major comments (2)

[§3] §3, Algorithm 1 and Theorem 1: the elimination procedure and O(√T) analysis are stated for a finite discrete price grid. When prices lie in a continuum (standard for market making), the regret bound requires an explicit covering or discretization argument whose error term remains o(√T) uniformly over arbitrary valuation distributions; the current no-smoothness claim does not address this and is therefore load-bearing for the central stochastic result.
[§5] §5, Theorem 3 and the new concentration inequality: the high-probability O(√T) bound under mean-reversion relies on this inequality. The proof sketch must be expanded to confirm that the deviation control holds under only the stated local autoregressive or cumulative-drift conditions and does not implicitly require stronger mixing or bounded moments that would narrow the claimed generality.

minor comments (2)

[Abstract] Abstract and §2: the phrasing 'i.i.d. market prices' is used while the model centers on trader valuations; add one clarifying sentence distinguishing the two processes and stating how market-price realizations enter the feedback model.
[§4] Notation in §4: the definition of the global drift condition uses cumulative deviations; ensure the constant factors and the precise form of the deviation threshold are stated explicitly so that the concentration inequality can be applied directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. The comments highlight important points regarding the scope of our assumptions and the completeness of our proofs. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [§3] §3, Algorithm 1 and Theorem 1: the elimination procedure and O(√T) analysis are stated for a finite discrete price grid. When prices lie in a continuum (standard for market making), the regret bound requires an explicit covering or discretization argument whose error term remains o(√T) uniformly over arbitrary valuation distributions; the current no-smoothness claim does not address this and is therefore load-bearing for the central stochastic result.

Authors: We thank the referee for this observation. The analysis in Section 3, Algorithm 1, and Theorem 1 is developed explicitly for a finite discrete price grid. This modeling choice focuses on the core learning challenge induced by the action-dependent feedback while avoiding continuity issues. The no-smoothness claim refers to the trader valuation distributions over the discrete grid, which permits arbitrary distributions and yields the O(√T) high-probability regret. We agree that a continuous price space would require an additional discretization argument, and without smoothness on valuations it is difficult to guarantee the approximation error is o(√T) uniformly. We will revise the manuscript to state the discrete price assumption explicitly in the problem setup and add a discussion of the challenges for continuous extensions. revision: yes
Referee: [§5] §5, Theorem 3 and the new concentration inequality: the high-probability O(√T) bound under mean-reversion relies on this inequality. The proof sketch must be expanded to confirm that the deviation control holds under only the stated local autoregressive or cumulative-drift conditions and does not implicitly require stronger mixing or bounded moments that would narrow the claimed generality.

Authors: We appreciate the referee's request to strengthen the presentation of the concentration inequality. The inequality is constructed to apply under the local autoregressive dynamics or the weaker global cumulative-drift condition, relying only on the stated assumptions without invoking stronger mixing rates or extra moment bounds. We will expand the current proof sketch into a complete, self-contained proof in the appendix of the revised version, with explicit steps verifying that the high-probability deviation bounds follow directly from the given conditions. This will confirm the claimed generality. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent algorithmic analysis and new concentration inequality.

full rationale

The paper introduces an action-dependent feedback model and derives O(√T) regret bounds via an elimination algorithm without smoothness assumptions, plus extensions using a new concentration inequality of independent interest for mean-reverting processes. No quoted steps reduce the claimed regret bounds or learnability results to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on explicit algorithmic construction and probabilistic analysis that do not presuppose the target bounds by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; the paper appears to rest on standard online-learning concentration tools plus the new inequality mentioned, with the action-dependent feedback model serving as the key modeling assumption.

axioms (1)

standard math Standard concentration inequalities for i.i.d. and mean-reverting processes
Invoked to obtain high-probability O(sqrt(T)) bounds.

pith-pipeline@v0.9.0 · 5772 in / 1174 out tokens · 38387 ms · 2026-05-20T06:41:56.066503+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1]

arXiv preprint arXiv:2411.13993 , year=

Market Making without Regret , author=. arXiv preprint arXiv:2411.13993 , year=

work page arXiv
[2]

Quantitative Finance , volume=

A learning market-maker in the Glosten--Milgrom model , author=. Quantitative Finance , volume=. 2005 , publisher=

work page 2005
[3]

Journal of Statistical Mechanics: Theory and Experiment , volume=

Information thermodynamics of financial markets: The Glosten--Milgrom model , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=

work page 2021
[4]

2000 , publisher=

Asymptotic statistics , author=. 2000 , publisher=

work page 2000
[5]

2020 , publisher=

Bandit algorithms , author=. 2020 , publisher=

work page 2020
[6]

, author=

Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. , author=. Journal of machine learning research , volume=

work page
[7]

, author=

X-Armed Bandits. , author=. Journal of Machine Learning Research , volume=

work page
[8]

Advances in Neural Information Processing Systems , volume=

Nearly tight bounds for the continuum-armed bandit problem , author=. Advances in Neural Information Processing Systems , volume=

work page
[9]

International Conference on Artificial Intelligence and Statistics , pages=

Smooth bandit optimization: generalization to holder space , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

work page 2021
[10]

Journal of financial economics , volume=

An equilibrium characterization of the term structure , author=. Journal of financial economics , volume=. 1977 , publisher=

work page 1977
[11]

2003 , publisher=

Theory of financial risk and derivative pricing: from statistical physics to risk management , author=. 2003 , publisher=

work page 2003
[12]

What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

What doubling tricks can and can't do for multi-armed bandits , author=. arXiv preprint arXiv:1803.06971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Advances in Neural Information Processing Systems , volume=

On explore-then-commit strategies , author=. Advances in Neural Information Processing Systems , volume=

work page
[14]

Journal of Computer and System Sciences , volume=

Efficient algorithms for online decision problems , author=. Journal of Computer and System Sciences , volume=. 2005 , publisher=

work page 2005
[15]

Journal of financial Economics , volume=

Asset pricing and the bid-ask spread , author=. Journal of financial Economics , volume=. 1986 , publisher=

work page 1986
[16]

Journal of financial Economics , volume=

Estimating the components of the bid/ask spread , author=. Journal of financial Economics , volume=. 1988 , publisher=

work page 1988
[17]

Journal of financial markets , volume=

Market microstructure: A survey , author=. Journal of financial markets , volume=. 2000 , publisher=

work page 2000
[18]

2025 , url =

Exchange Global Share and Segment Sizing 2025 , institution =. 2025 , url =

work page 2025
[19]

2020 , publisher=

Probability and random processes , author=. 2020 , publisher=

work page 2020

[1] [1]

arXiv preprint arXiv:2411.13993 , year=

Market Making without Regret , author=. arXiv preprint arXiv:2411.13993 , year=

work page arXiv

[2] [2]

Quantitative Finance , volume=

A learning market-maker in the Glosten--Milgrom model , author=. Quantitative Finance , volume=. 2005 , publisher=

work page 2005

[3] [3]

Journal of Statistical Mechanics: Theory and Experiment , volume=

Information thermodynamics of financial markets: The Glosten--Milgrom model , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=

work page 2021

[4] [4]

2000 , publisher=

Asymptotic statistics , author=. 2000 , publisher=

work page 2000

[5] [5]

2020 , publisher=

Bandit algorithms , author=. 2020 , publisher=

work page 2020

[6] [6]

, author=

Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. , author=. Journal of machine learning research , volume=

work page

[7] [7]

, author=

X-Armed Bandits. , author=. Journal of Machine Learning Research , volume=

work page

[8] [8]

Advances in Neural Information Processing Systems , volume=

Nearly tight bounds for the continuum-armed bandit problem , author=. Advances in Neural Information Processing Systems , volume=

work page

[9] [9]

International Conference on Artificial Intelligence and Statistics , pages=

Smooth bandit optimization: generalization to holder space , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

work page 2021

[10] [10]

Journal of financial economics , volume=

An equilibrium characterization of the term structure , author=. Journal of financial economics , volume=. 1977 , publisher=

work page 1977

[11] [11]

2003 , publisher=

Theory of financial risk and derivative pricing: from statistical physics to risk management , author=. 2003 , publisher=

work page 2003

[12] [12]

What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

What doubling tricks can and can't do for multi-armed bandits , author=. arXiv preprint arXiv:1803.06971 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Advances in Neural Information Processing Systems , volume=

On explore-then-commit strategies , author=. Advances in Neural Information Processing Systems , volume=

work page

[14] [14]

Journal of Computer and System Sciences , volume=

Efficient algorithms for online decision problems , author=. Journal of Computer and System Sciences , volume=. 2005 , publisher=

work page 2005

[15] [15]

Journal of financial Economics , volume=

Asset pricing and the bid-ask spread , author=. Journal of financial Economics , volume=. 1986 , publisher=

work page 1986

[16] [16]

Journal of financial Economics , volume=

Estimating the components of the bid/ask spread , author=. Journal of financial Economics , volume=. 1988 , publisher=

work page 1988

[17] [17]

Journal of financial markets , volume=

Market microstructure: A survey , author=. Journal of financial markets , volume=. 2000 , publisher=

work page 2000

[18] [18]

2025 , url =

Exchange Global Share and Segment Sizing 2025 , institution =. 2025 , url =

work page 2025

[19] [19]

2020 , publisher=

Probability and random processes , author=. 2020 , publisher=

work page 2020