Convergence to collusion in algorithmic pricing
Pith reviewed 2026-05-10 07:37 UTC · model grok-4.3
The pith
Modern deep reinforcement learning pricing algorithms converge to collusion in repeated oligopoly games on a timescale that matches empirical observations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The author shows that a modern deep reinforcement learning model, when deployed to set prices in a repeated oligopolistic competition game featuring continuous prices, converges to a collusive outcome. The time required for this convergence aligns with empirical observations under reasonable assumptions regarding the duration of each time step. The resulting behavior includes cooperative pricing backed by reward-punishment schemes designed to deter deviations from the collusive prices.
What carries the argument
The modern deep reinforcement learning model used as an autonomous pricing agent in the repeated oligopoly game, which discovers and sustains collusive strategies through repeated interaction and learning.
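To make the setting concrete, here is a minimal sketch of such an environment, assuming a symmetric logit-demand duopoly, a standard specification in the algorithmic-pricing literature; the class name and every parameter value are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch of a repeated duopoly with continuous prices.
# The logit demand system and all parameter values are illustrative
# assumptions, not specifications from the paper.
class ContinuousPriceDuopoly:
    def __init__(self, cost=1.0, quality=2.0, mu=0.25, outside_utility=0.0):
        self.cost = cost                      # marginal cost, symmetric firms
        self.quality = quality                # product quality index
        self.mu = mu                          # horizontal differentiation
        self.outside_utility = outside_utility

    def step(self, prices):
        """Take both firms' continuous prices; return per-period profits."""
        prices = np.asarray(prices, dtype=float)
        weights = np.exp((self.quality - prices) / self.mu)
        denom = weights.sum() + np.exp(self.outside_utility / self.mu)
        shares = weights / denom              # logit market shares
        return (prices - self.cost) * shares  # each agent's reward
```

Each learning agent observes recent prices, chooses its own continuous price, and receives its per-period profit as the reward; collusion corresponds to both prices settling durably above the static Nash level.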
If this is right
- Firms using independent deep RL pricing systems can reach collusive equilibria without explicit agreements.
- The collusion emerges rapidly enough to be relevant for current market practices.
- Reward and punishment dynamics support the stability of the higher price levels.
- Antitrust concerns may need to account for algorithmic learning processes in pricing.
Where Pith is reading between the lines
- Varying the time step length in follow-up simulations could test how sensitive the convergence speed is to modeling assumptions.
- Extending the setup to more firms or asymmetric demand could reveal how general the collusion result is.
- If real firms adopt similar algorithms, regulators might see faster collusion emerge in concentrated markets.
- The same learning approach could be used to examine algorithmic coordination in other repeated market interactions beyond pricing.
Load-bearing premise
The length of each time step in the model is a reasonable approximation of how often firms can adjust prices in actual markets.
What would settle it
A simulation run with the same model but a different time step length where convergence either fails to occur or takes substantially longer or shorter than the empirical benchmarks.
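A back-of-the-envelope version of this test holds the learned step count fixed and sweeps the assumed step length, as in the sketch below; the step count and repricing frequencies are hypothetical placeholders, not numbers reported in the paper.

```python
# Illustrative only: how the implied calendar time to collusion scales
# with the assumed length of one model time step. The step count and
# repricing frequencies are hypothetical placeholders.
STEPS_TO_COLLUSION = 100_000  # hypothetical number of learning steps

for label, minutes_per_step in [("5-minute repricing", 5),
                                ("hourly repricing", 60),
                                ("daily repricing", 1_440)]:
    years = STEPS_TO_COLLUSION * minutes_per_step / (60 * 24 * 365)
    print(f"{label:>18}: ~{years:,.1f} years to convergence")
```

Even this crude sweep shows why the time-step assumption is load-bearing: across plausible repricing frequencies the implied calendar time spans more than two orders of magnitude.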
Original abstract
Artificial intelligence algorithms are increasingly used by firms to set prices. Previous research shows that they can exhibit collusive behaviour, but how quickly they can do so has so far remained an open question. I show that a modern deep reinforcement learning model deployed to price goods in a repeated oligopolistic competition game with continuous prices converges to a collusive outcome in an amount of time that matches empirical observations, under reasonable assumptions on the length of a time step. This model shows cooperative behaviour supported by reward-punishment schemes that discourage deviations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a modern deep reinforcement learning agent, when deployed to set continuous prices in a repeated oligopolistic competition game, converges to a collusive outcome on a timescale that matches empirical observations, provided one adopts reasonable assumptions about the real-world duration of each discrete model time step. The collusion is further described as being sustained by reward-punishment mechanisms that deter unilateral deviations.
Significance. If the quantitative mapping to empirical convergence times can be shown to be robust rather than tuned, the result would supply a concrete, simulation-based estimate of how quickly algorithmic collusion can emerge under continuous pricing, thereby tightening the link between theoretical RL models and observed market behavior with potential relevance for antitrust enforcement.
Major comments (2)
- [Abstract] The headline claim that convergence occurs 'in an amount of time that matches empirical observations' is load-bearing yet unsupported by any reported number of steps to collusion, any explicit value for the assumed time-step length, or any sensitivity table varying that length; without these, it is impossible to verify that the match is not an artifact of post-hoc calibration.
- [Simulation setup and results] The abstract asserts that the observed collusion is supported by reward-punishment schemes, but no equations, state-action definitions, reward-scaling parameters, or discount-factor values are supplied, leaving open whether the cooperative behavior is a general property of the DRL architecture or specific to unstated hyperparameter choices.
Minor comments (2)
- Notation for the continuous price space and the precise definition of the per-period profit function should be stated explicitly rather than left implicit.
- A short table reporting the exact hyperparameter vector used for the reported runs would improve reproducibility.
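For illustration, the requested table could be as compact as the dictionary below; the algorithm label and every value are placeholder assumptions, not figures reported in the paper.

```python
# Hypothetical shape of the requested hyperparameter table; the
# algorithm label and all values are placeholders, not figures
# reported in the paper.
HYPERPARAMS = {
    "algorithm": "soft actor-critic",  # illustrative continuous-control choice
    "discount_factor": 0.99,
    "learning_rate": 3e-4,
    "replay_buffer_size": 1_000_000,
    "batch_size": 256,
    "hidden_layers": (256, 256),
    "reward_scaling": 1.0,
    "random_seeds": (0, 1, 2, 3, 4),
}
```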
Simulated Author's Rebuttal
We thank the referee for the careful and constructive report. The two major comments identify areas where greater transparency will strengthen the manuscript, and we will revise accordingly.
Point-by-point responses
- Referee: [Abstract] The headline claim that convergence occurs 'in an amount of time that matches empirical observations' is load-bearing yet unsupported by any reported number of steps to collusion, any explicit value for the assumed time-step length, or any sensitivity table varying that length; without these, it is impossible to verify that the match is not an artifact of post-hoc calibration.
  Authors: We agree that the abstract claim requires explicit supporting numbers to be verifiable. The manuscript states that convergence occurs under reasonable assumptions on time-step length but does not report the precise episode count, the baseline time-step duration, or sensitivity results in the abstract or main results summary. In revision we will add the observed number of steps to collusion, state the assumed time-step length with justification, and include a sensitivity table (varying the length across plausible market frequencies) either in the main text or as a new figure. This will allow direct assessment of robustness rather than post-hoc calibration.
  Revision: yes
- Referee: [Simulation setup and results] The abstract asserts that the observed collusion is supported by reward-punishment schemes, but no equations, state-action definitions, reward-scaling parameters, or discount-factor values are supplied, leaving open whether the cooperative behavior is a general property of the DRL architecture or specific to unstated hyperparameter choices.
  Authors: The referee is correct that the abstract alone does not supply these details. The manuscript describes the continuous-price oligopoly environment and notes that collusion is sustained by reward-punishment schemes, but the explicit state vector, action space, reward-function equation, scaling, and discount factor appear only in the methods or appendix. To resolve the concern, we will move the state-action definitions, the reward equation, and the hyperparameter table (including the discount factor and any scaling) into the main simulation-setup section. This will make clear which features of the DRL setup produce the observed behavior.
  Revision: yes
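As a sketch of what the moved material might pin down, the snippet below fixes one possible state-action-reward convention for the duopoly environment sketched earlier; the paper's actual definitions may differ.

```python
import numpy as np

# One possible state-action-reward convention for a continuous-price
# duopoly (using the illustrative ContinuousPriceDuopoly from above);
# a sketch of the specification the referee requests, not the paper's
# actual definitions.
def transition(env, last_prices, own_price, rival_price):
    state = np.asarray(last_prices, dtype=float)   # both firms' last prices
    action = np.array([own_price])                 # own continuous price
    profits = env.step([own_price, rival_price])
    reward = float(profits[0])                     # own per-period profit
    next_state = np.array([own_price, rival_price])
    return state, action, reward, next_state
```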
Circularity Check
No circularity: simulation result stands on computational experiment, not self-referential reduction
Full rationale
The paper reports a simulation outcome in which a deep RL pricing agent reaches collusion after a number of discrete steps that, when multiplied by an assumed real-world duration per step, falls within observed empirical ranges. No equations, parameter-fitting procedures, or self-citations are quoted that would make the reported convergence time equivalent to its own inputs by construction. The time-step assumption is presented as an external modeling choice rather than a fitted quantity whose value is then re-used to generate the headline match. Because the central claim rests on an independent computational run whose output is compared to external data, the derivation chain does not collapse.
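The one-way structure described above can be captured in a few lines: the run's output is converted to calendar time and compared against an external benchmark, with nothing fed back into the simulation. All numbers below are hypothetical placeholders.

```python
# Sketch of the one-way check described above; every number is a
# hypothetical placeholder, not a value from the paper or its data.
steps_to_collusion = 20_000          # output of the computational run
hours_per_step = 1.0                 # externally assumed time-step length
empirical_window_years = (0.5, 3.0)  # hypothetical external benchmark

implied_years = steps_to_collusion * hours_per_step / (24 * 365)
lo, hi = empirical_window_years
print(f"Implied time: ~{implied_years:.1f} years; "
      f"within benchmark: {lo <= implied_years <= hi}")
```

If the benchmark window itself had been used to choose the time-step length, the check would collapse; the rationale above argues it was not.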