arxiv: 2604.06522 · v1 · submitted 2026-04-07 · 💻 cs.GT · math.DS· math.OC

Recognition: 2 theorem links

· Lean Theorem

Constrained Policy Optimization for Provably Fair Order Matching

Zehua Cheng , Zhipeng Wang , Wei Dai , Wenhu Zhang , Vadzim Mahilny , David Shi , Elena Jia , Jiahao Sun

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:48 UTC · model grok-4.3

classification 💻 cs.GT math.DSmath.OC

keywords order matchingfairness constraintsconstrained policy optimizationPID controlBIBO stabilityMarkov decision processtrading systemsconstraint violations

0 comments

The pith

Constrained policy optimization with PID feedback proves BIBO stability and drives fairness violations to zero in order matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors model automated order matching as a constrained decision process to counter persistent disparities from latency and access differences that undermine trust in trading systems. They introduce CPO-FOAM, which pairs an inner analytic optimization step with an outer feedback loop that adjusts constraint margins dynamically. The work proves the overall system stays bounded-input bounded-output stable and that the integral action forces violations toward zero in steady state. Experiments across multiple market conditions show the method preserves most of the original performance level while meeting fairness targets.

Core claim

We formulate provably fair order matching as a Constrained Markov Decision Process and propose CPO-FOAM. An inner loop computes an analytic trust-region step on the Fisher information manifold; a PID-controlled outer loop dynamically tightens safety margins, suppressing the sawtooth oscillations endemic to Lagrangian methods under non-stationary dynamics. We prove BIBO stability and that the integral term drives steady-state violations to zero. On LOBSTER NASDAQ data across six market regimes, CPO-FOAM recovers 95.9% of unconstrained throughput at 2.5% constraint violation frequency.

What carries the argument

The PID-controlled outer loop that dynamically tightens safety margins around the inner trust-region policy step in the constrained decision process.

If this is right

On NASDAQ data the method recovers 95.9 percent of unconstrained throughput at 2.5 percent constraint violation frequency across six regimes.
On crypto order-book data under MEV injection it captures 98.4 percent of the reward envelope at 3.2 percent violation frequency.
The algorithm scales sub-linearly with up to eight constraints and completes on-chain settlement inside one Ethereum block.
The same structure produces a 2.1 times reward gain on Safety-Gymnasium benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same outer-loop stabilization could be applied to other constrained sequential decision tasks that face drifting conditions.
More precise ways to translate fairness definitions into costs might shrink the small residual violation rates further.
On-chain execution capability suggests the method could support transparent decentralized matching services.
Live-market trials would test whether real-world non-stationarity exceeds the controller's ability to reach zero violations.

Load-bearing premise

Market conditions change slowly enough that the feedback loop can steadily reduce fairness violations without the system becoming unstable.

What would settle it

Extended runs on real market data that show constraint violations failing to approach zero over time or that exhibit growing oscillations would disprove the stability and zero-steady-state-violation claims.

Figures

Figures reproduced from arXiv: 2604.06522 by David Shi, Elena Jia, Jiahao Sun, Vadzim Mahilny, Wei Dai, Wenhu Zhang, Zehua Cheng, Zhipeng Wang.

**Figure 1.** Figure 1: Overview of the CPO-FOAM (Constrained Policy Optimization with Feedback-Optimized Adaptive Margins) framework. Top: The fair order-matching problem is formulated as a Constrained Markov Decision Process (CMDP). The policy network (πθ) structurally enforces individual (Lipschitz) fairness via spectral normalization, effectively decoupling it from the probabilistic group fairness constraints. The Limit Order… view at source ↗

**Figure 2.** Figure 2: Performance trade-offs, training dynamics, and transfer efficiency of CPO-FOAM. (A) Pareto frontier illustrating the trade-off between Market Quality Score and Aggregate Fairness Violation (∆DP). Our method (CPO-FOAM) effectively pushes the efficiency frontier, achieving near-optimal market quality while maintaining the lowest fairness violation compared to standard reinforcement learning baselines. (B) Co… view at source ↗

**Figure 3.** Figure 3: Regime-specific robustness and on-chain verification cost scaling. (A) Constraint Violation Frequency (CVF) across extreme market regimes. CPO-FOAM maintains safety compliance during flash events and liquidity cascades, whereas Vanilla CPO and Lagrangian PPO suffer severe constraint degradation. (B) Amortized challenge-response verification costs per transaction across execution layers (Ethereum L1, Arbitr… view at source ↗

read the original abstract

Automated matching engines execute millions of orders per session, yet systematic asymmetries in latency, order size, and market access compound into persistent execution disparities that erode participant trust. We formulate provably fair order matching as a Constrained Markov Decision Process and propose CPO-FOAM (Constrained Policy Optimization with Feedback-Optimized Adaptive Margins). An inner loop computes an analytic trust-region step on the Fisher information manifold; a PID-controlled outer loop dynamically tightens safety margins, suppressing the sawtooth oscillations endemic to Lagrangian methods under non-stationary dynamics. Group fairness (demographic parity, equalized odds) enters the CMDP cost vector while individual Lipschitz fairness is enforced deterministically via spectral normalization. We prove BIBO stability and that the integral term drives steady-state violations to zero. On LOBSTER NASDAQ data across six market regimes, CPO-FOAM recovers 95.9% of unconstrained throughput at 2.5% constraint violation frequency; on crypto-asset LOB data under MEV injection it captures 98.4% of the reward envelope at 3.2% CVF. The method scales sub-linearly to M=8 constraints, settles on-chain within one Ethereum block, and yields a 2.1X reward improvement on Safety-Gymnasium, confirming domain-agnostic generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CPO-FOAM layers a PID outer loop on analytic trust-region CPO for fair order matching and claims BIBO stability plus strong empirical recovery on LOB data, but the stability guarantee under abrupt regime shifts is the part that needs checking.

read the letter

The main thing to know is that this paper takes standard constrained policy optimization, adds a PID controller to adapt safety margins on the fly, and applies the result to order matching with a mix of group and individual fairness constraints. They report recovering 95.9% of unconstrained throughput at 2.5% violation rate on LOBSTER NASDAQ data across regimes, plus similar numbers on crypto data, and they claim a proof that the integral term drives steady-state violations to zero while keeping the closed loop BIBO stable.

Referee Report

3 major / 2 minor

Summary. The paper formulates fair order matching as a Constrained Markov Decision Process and introduces CPO-FOAM, which combines an inner analytic trust-region policy optimization step on the Fisher information manifold with a PID-controlled outer loop that dynamically adjusts safety margins. It claims to prove BIBO stability of the closed-loop system and that the integral action eliminates steady-state constraint violations for group fairness constraints (demographic parity, equalized odds) and individual Lipschitz fairness. Empirical results on LOBSTER NASDAQ data across six market regimes report 95.9% recovery of unconstrained throughput at 2.5% constraint violation frequency, with additional results on crypto LOB data and Safety-Gymnasium.

Significance. If the stability proof and empirical robustness hold, the work would offer a principled way to enforce fairness in non-stationary high-frequency matching without the oscillations typical of Lagrangian methods, with potential impact on automated trading systems and broader constrained RL applications. The sub-linear scaling to multiple constraints and on-chain feasibility are practical strengths; the cross-domain results on Safety-Gymnasium provide some evidence of generalization.

major comments (3)

[Abstract] The abstract asserts a proof of BIBO stability and that the integral term drives steady-state violations to zero, but no derivation, assumptions on disturbance rates, or closed-loop dynamics are provided. This directly bears on the central claim, as the skeptic's concern about discontinuous jumps in transition probabilities during LOB regime shifts (news, liquidity shocks) could cause integral wind-up if the proof only covers Lipschitz-continuous or slowly varying plants.
[Experiments section] The LOBSTER experiments (six regimes) report 95.9% throughput at 2.5% CVF but provide no error bars, baseline implementation details, data exclusion rules, or explicit stress tests isolating abrupt regime discontinuities. This leaves the empirical support for the theoretical guarantee unverifiable and weakens the claim that the PID outer loop handles non-stationarity.
[Method formulation] The encoding of fairness constraints (demographic parity, equalized odds, Lipschitz) as CMDP costs is stated without showing that the resulting cost functions preserve the underlying reward structure or that the trust-region step remains well-defined under the spectral normalization for individual fairness.

minor comments (2)

[Introduction] The term 'Feedback-Optimized Adaptive Margins' is introduced without comparison to prior adaptive margin or safety-filter literature in constrained RL.
[Notation] Notation for the PID gains and initial safety margins is not clearly distinguished from the CMDP parameters in the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments identify important areas where additional detail will strengthen the presentation of the stability results, empirical support, and method formulation. We address each major comment below and indicate the planned revisions.

read point-by-point responses

Referee: [Abstract] The abstract asserts a proof of BIBO stability and that the integral term drives steady-state violations to zero, but no derivation, assumptions on disturbance rates, or closed-loop dynamics are provided. This directly bears on the central claim, as the skeptic's concern about discontinuous jumps in transition probabilities during LOB regime shifts (news, liquidity shocks) could cause integral wind-up if the proof only covers Lipschitz-continuous or slowly varying plants.

Authors: We agree that the abstract is concise and omits the full derivation. We will add a dedicated subsection in the revised manuscript that provides the complete BIBO stability proof, explicitly stating the assumptions on bounded disturbance rates and the closed-loop dynamics of the PID-augmented system. We will also include analysis showing that the adaptive safety margins bound the integral term and prevent wind-up even under bounded discontinuous jumps in the transition probabilities that arise during LOB regime shifts. revision: yes
Referee: [Experiments section] The LOBSTER experiments (six regimes) report 95.9% throughput at 2.5% CVF but provide no error bars, baseline implementation details, data exclusion rules, or explicit stress tests isolating abrupt regime discontinuities. This leaves the empirical support for the theoretical guarantee unverifiable and weakens the claim that the PID outer loop handles non-stationarity.

Authors: We agree that these details are required for verifiability. In the revision we will report error bars over multiple random seeds for the LOBSTER results, provide full baseline implementation details in an appendix, specify the data exclusion criteria (e.g., minimum liquidity thresholds), and add explicit stress-test experiments that inject abrupt regime discontinuities via simulated news and liquidity shocks. These additions will directly support the claim that the PID outer loop maintains performance under non-stationarity. revision: yes
Referee: [Method formulation] The encoding of fairness constraints (demographic parity, equalized odds, Lipschitz) as CMDP costs is stated without showing that the resulting cost functions preserve the underlying reward structure or that the trust-region step remains well-defined under the spectral normalization for individual fairness.

Authors: We will expand the method section to include explicit arguments demonstrating that the CMDP cost functions are linear in the occupancy measure and therefore preserve the original reward structure. We will also add a short proof that the post-update spectral normalization for Lipschitz fairness keeps the Fisher information matrix positive definite, ensuring the analytic trust-region step on the manifold remains well-defined. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds on standard CMDP/CPO with independent PID stability claim

full rationale

The abstract and claims present CPO-FOAM as an inner trust-region step on the Fisher manifold plus an outer PID loop for margin adaptation, with a stated proof of BIBO stability and integral action eliminating steady-state violations. No equations are supplied that would allow a quoted reduction of any prediction or stability result to a fitted parameter or self-definition. The method is explicitly positioned as extending existing constrained optimization machinery rather than deriving its guarantees from its own outputs or prior self-citations. Empirical recovery percentages are reported as separate validation, not as the basis for the theoretical claims. This satisfies the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

The central claim rests on standard CMDP modeling of markets plus a novel adaptive-margin controller whose parameters and stability assumptions are not derived from first principles in the abstract.

free parameters (2)

PID controller gains
Proportional, integral, and derivative coefficients for the outer loop that tighten safety margins; values are not stated as fixed or derived.
Initial safety margins
Starting constraint boundaries that the PID loop adjusts.

axioms (2)

domain assumption Order matching can be faithfully represented as a CMDP whose cost vector exactly captures demographic parity and equalized odds.
Invoked when fairness enters the CMDP cost vector.
standard math The Fisher information manifold permits an analytic trust-region step that preserves feasibility.
Used for the inner-loop policy update.

invented entities (1)

Feedback-Optimized Adaptive Margins no independent evidence
purpose: Dynamically adjusted constraint boundaries controlled by PID to suppress oscillations.
New component introduced to handle non-stationary dynamics.

pith-pipeline@v0.9.0 · 5560 in / 1672 out tokens · 65319 ms · 2026-05-10T17:48:02.511983+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove BIBO stability and that the integral term drives steady-state violations to zero... closed-loop transfer function H(z) = z(z−1)/(z² + (αP + αI − 1)z − αP)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

An inner loop computes an analytic trust-region step on the Fisher information manifold

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 1 canonical work pages

[1]

Reward Constrained Policy Optimization

PMLR, 2020. Tessler, C., Mankowitz, D. J., and Mannor, S. Re- ward constrained policy optimization.arXiv preprint arXiv:1805.11074, 2018. Teutsch, J. and Reitwiesner, C. A scalable verification solu- tion for blockchains. InTrueBit Protocol, 2019. Avail- able at: https://people.cs.uchicago.edu/ ˜teutsch/papers/truebit.pdf. Wen, M., Bastani, O., and Topcu,...

work page Pith review arXiv 2020
[2]

17 Constrained Policy Optimization for Provably Fair Order Matching Williams, S., Diordiiev, V ., Berman, L., Russ, I., and Uem- lianin, I

PMLR, 2021. 17 Constrained Policy Optimization for Provably Fair Order Matching Williams, S., Diordiiev, V ., Berman, L., Russ, I., and Uem- lianin, I. Arweave: A protocol for economically sustain- able information permanence. Arweave Yellow Paper,

2021
[3]

Available at: https://www.arweave.org/ yellow-paper.pdf. Yu, B. Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994. 18

1994