Resolution-Aware Perpetual Futures on Binary Prediction Markets: An Empirical Risk-Design Framework Using Polymarket Data
Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3
The pith
A six-component resolution-aware framework for perpetual futures on binary prediction markets fails to validate for deployment but separates halt and margin roles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Standard basis-only funding paired with continuous-volatility static margin is non-portable to bounded-event underlyings. The PIRAP framework, built from an index estimator, jump-aware tiered margin, leverage compression schedule, resolution-aware funding rule, multi-stage halt protocol, and eligibility framework, when counterfactually evaluated on Polymarket PMXT v2 data, passes pre-registered stylized-fact floors for boundary depth asymmetry and terminal-jump magnitude but fails the majority of welfare-side and materiality floors. This establishes that the framework does not validate for deployment while confirming a halt-versus-margin scope distinction and a pre-emption trade-off that nar
What carries the argument
The six-component PIRAP risk-design framework, which supplies an index estimator, jump-aware tiered margin, leverage compression, resolution-aware funding, multi-stage halt protocol, and eligibility rules to address resolution-specific risks in perpetual contracts on binary-event underlyings.
Load-bearing premise
The chosen pre-registered stylized-fact floors, welfare floors, and materiality floors constitute an adequate and unbiased test of whether the six-component framework improves risk outcomes on bounded-event underlyings.
What would settle it
Live deployment of the six components on Polymarket or an equivalent venue, measuring whether final-hour liquidation rates fall by approximately 80 percent or whether terminal bad-debt frequency drops relative to baseline perpetual designs, would settle the risk-improvement claim.
We develop and counterfactually evaluate a resolution-aware risk-design framework (PIRAP) for perpetual futures whose underlying tracks a single binary prediction-market probability through resolution. The framework specifies six components: an index estimator combining mid-price, depth-weighted mid, and time-decayed VWAP; jump-aware tiered margin sized against bounded-event terminal-collapse magnitude; a leverage compression schedule contracting toward resolution; a resolution-aware funding rule with boundary-aware correction; a multi-stage halt protocol; and an eligibility framework. Two formal non-portability propositions establish that standard basis-only funding paired with continuous-vol static margin fails on bounded-event underlyings. Empirical evaluation uses Polymarket's PMXT v2 archive for 2026-04-21 to 2026-04-27 (a 13,298-market analysis sample passing adequacy gates out of 61,087 ingested; 13,115 resolved within the empirical window for E3). E1 evaluates two pre-registered stylized facts; E2 conducts counterfactual replay across three engine configurations; E3 isolates the resolution-zone protocol's contribution. Results are mixed. Five pre-registered floors: stylized-fact floors (boundary depth asymmetry, terminal-jump magnitude) PASS; welfare-side directional floors (final-hour liquidation -6%, drawdown -5.1% pooled, median PnL +14%): two FAIL, one PASS; E3 mechanic floors (final-hour liquidation -80% by halt construction: PASS; bad-debt frequency +2.4%: FAIL). Three of five materiality floors fail: the framework as specified does not validate for deployment, but the empirical record establishes a halt-versus-margin scope distinction (halt addresses execution-channel risk; terminal-jump bad debt remains margin-side) and documents a pre-emption trade-off constraining the dynamic-margin component. The paper concludes with structural recommendations and an explicit non-deployable status.
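The abstract names the three inputs to the index estimator but not their functional forms or blend weights. The following is a minimal Python sketch under stated assumptions, not the authors' estimator: the depth-weighted mid is taken to be the common microprice form, the VWAP decays exponentially with trade age (the decay argument stands in for the unreported time-decay weight), and the three inputs are blended with equal placeholder weights. All function names and defaults are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Trade:
    price: float  # execution price on the [0, 1] probability scale
    size: float   # executed quantity
    ts: float     # execution time in seconds


def mid_price(best_bid: float, best_ask: float) -> float:
    return 0.5 * (best_bid + best_ask)


def depth_weighted_mid(best_bid: float, bid_depth: float,
                       best_ask: float, ask_depth: float) -> float:
    # Assumed microprice form: each quote is weighted by the depth on the
    # opposite side, so the estimate leans toward the thinner side's price.
    total = bid_depth + ask_depth
    if total <= 0:
        return mid_price(best_bid, best_ask)
    return (best_bid * ask_depth + best_ask * bid_depth) / total


def time_decayed_vwap(trades: list[Trade], now: float, decay: float) -> float | None:
    # Hypothetical exponential decay: weight = size * exp(-decay * age).
    num = den = 0.0
    for t in trades:
        w = t.size * math.exp(-decay * (now - t.ts))
        num += w * t.price
        den += w
    return num / den if den > 0 else None


def index_price(best_bid: float, bid_depth: float, best_ask: float,
                ask_depth: float, trades: list[Trade], now: float,
                decay: float = 0.01,
                weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    components = [mid_price(best_bid, best_ask),
                  depth_weighted_mid(best_bid, bid_depth, best_ask, ask_depth)]
    active = list(weights[:2])
    vwap = time_decayed_vwap(trades, now, decay)
    # Include the VWAP only when there is trade flow; otherwise blend the two book inputs.
    if vwap is not None:
        components.append(vwap)
        active.append(weights[2])
    estimate = sum(w * c for w, c in zip(active, components)) / sum(active)
    # Bounded-event underlying: keep the index inside [0, 1].
    return min(1.0, max(0.0, estimate))
```

For example, a 0.60/0.62 book with 100 bid depth and 300 ask depth plus one fresh trade at 0.61 gives an index of about 0.608; the final clamp reflects that the underlying is a probability.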
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a resolution-aware risk-design framework (PIRAP) for perpetual futures on binary prediction-market probabilities, specifying six components (index estimator with mid-price/depth-weighted mid/time-decayed VWAP, jump-aware tiered margin, leverage compression, resolution-aware funding, multi-stage halt protocol, eligibility framework) plus two formal non-portability propositions. It counterfactually replays the framework on a 13,298-market Polymarket PMXT v2 sample (2026-04-21 to 2026-04-27), pre-registering five materiality floors across stylized-fact, welfare, and mechanic evaluations (E1–E3); results are mixed with three floors failing, leading to a non-deployable conclusion while claiming a halt-versus-margin scope distinction and pre-emption trade-off.
Significance. If the counterfactual results and pre-registered design hold under scrutiny, the work is significant for documenting why standard basis-only funding and static margin fail on bounded-event underlyings, for isolating the distinct roles of halt protocols (execution-channel risk) versus margin (terminal-jump bad debt), and for supplying explicit structural recommendations plus an empirical record from real prediction-market data. The pre-registered floors and formal propositions are strengths that support falsifiability.
major comments (2)
- [Abstract / E3] The conclusion that the framework 'does not validate deployment' rests on three of five pre-registered materiality floors failing (welfare directional floors at -6% liquidation, -5.1% drawdown, +14% median PnL; E3 bad-debt frequency at +2.4%). No derivation, sensitivity analysis, or justification is supplied for why these specific thresholds constitute an adequate and unbiased test of risk improvement on bounded-event underlyings, which is load-bearing for the non-deployability claim and the halt-versus-margin distinction.
- [E2] E2 counterfactual replay: The evaluation reports outcomes across three engine configurations but supplies no implementation detail on the free parameters (time-decay weight in the VWAP index estimator; jump-magnitude bound for tiered margin sizing). Without these, it is impossible to determine whether the mixed welfare and bad-debt results are robust or artifacts of the chosen parameter values.
minor comments (1)
- [Methods / Data] The abstract states that 13,298 markets passed adequacy gates from 61,087 ingested and that 13,115 resolved within the window; the methods section should expand on the exact gate criteria and any selection bias this introduces.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify areas where the manuscript can be strengthened. We respond to each major comment below and commit to revisions that address the concerns while preserving the core contributions of the pre-registered evaluation and the halt-versus-margin distinction.
- Referee: [Abstract / E3] The conclusion that the framework 'does not validate deployment' rests on three of five pre-registered materiality floors failing (welfare directional floors at -6% liquidation, -5.1% drawdown, +14% median PnL; E3 bad-debt frequency at +2.4%). No derivation, sensitivity analysis, or justification is supplied for why these specific thresholds constitute an adequate and unbiased test of risk improvement on bounded-event underlyings, which is load-bearing for the non-deployability claim and the halt-versus-margin distinction.
Authors: The materiality floors were pre-registered prior to any data analysis precisely to ensure an unbiased and falsifiable test of the framework. This pre-registration itself supplies the primary justification for treating the thresholds as an adequate benchmark, as they were fixed ex ante to capture economically material improvements in risk metrics for bounded-event underlyings. We nevertheless agree that the manuscript would be improved by an explicit derivation and sensitivity analysis. In revision we will add a dedicated subsection (and appendix) that (i) links each threshold to stylized facts from prediction-market literature and observed participation effects, and (ii) reports sensitivity checks under alternative threshold values (e.g., ±20% shifts). These additions will reinforce rather than alter the non-deployability conclusion and the documented halt-versus-margin scope distinction. revision: yes
- Referee: [E2] E2 counterfactual replay: The evaluation reports outcomes across three engine configurations but supplies no implementation detail on the free parameters (time-decay weight in the VWAP index estimator; jump-magnitude bound for tiered margin sizing). Without these, it is impossible to determine whether the mixed welfare and bad-debt results are robust or artifacts of the chosen parameter values.
Authors: We accept that the absence of explicit numerical values for the free parameters in the E2 section limits reproducibility and the ability to judge robustness. The manuscript describes the functional forms of the index estimator and tiered-margin rule but does not tabulate the concrete parameter settings used in the replay. In the revised version we will insert a complete parameter table in the empirical methods section, state the exact values employed for the time-decay weight and jump-magnitude bound, and supply pseudocode for both components. This will enable readers to replicate the three engine configurations and to assess whether the mixed welfare and bad-debt outcomes are sensitive to reasonable parameter perturbations. revision: yes
Circularity Check
Empirical evaluation on external archive with pre-registered floors shows no load-bearing circularity
full rationale
The paper's derivation chain consists of formal non-portability propositions followed by counterfactual replay on an external Polymarket archive (13,298-market sample) and pre-registered stylized-fact, welfare, and materiality floors. No equations reduce claimed performance metrics, the halt-versus-margin distinction, or the non-deployability conclusion to parameters fitted inside the paper or to self-citations whose validity depends on the present work. The evaluation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- time-decay weight in VWAP index estimator
- jump-magnitude bound used for tiered margin sizing
axioms (1)
- domain assumption: standard basis-only funding paired with continuous-vol static margin fails on bounded-event underlyings
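The ledger's two free parameters are easiest to see in a toy margin rule. The Python sketch below is hypothetical and is not the paper's specification: it treats the jump-magnitude bound as a cap on the terminal jump the margin must cover, takes notional as price times size for both sides, and uses placeholder leverage tiers to stand in for the leverage compression schedule. The required margin is the larger of the jump-aware leg and the leverage leg.

```python
from __future__ import annotations

# Hypothetical leverage-compression schedule: (max hours to resolution, max leverage).
# Tiers are checked from nearest-to-resolution outward.
LEVERAGE_TIERS: list[tuple[float, float]] = [(1.0, 1.0), (24.0, 2.0), (168.0, 3.0)]
FAR_LEVERAGE = 5.0


def max_leverage(hours_to_resolution: float) -> float:
    for horizon, leverage in LEVERAGE_TIERS:
        if hours_to_resolution <= horizon:
            return leverage
    return FAR_LEVERAGE


def terminal_jump(price: float, side: str) -> float:
    # Per-contract loss if the binary event resolves against the position:
    # a long collapses from price to 0, a short is squeezed from price to 1.
    if side == "long":
        return price
    if side == "short":
        return 1.0 - price
    raise ValueError(f"unknown side: {side!r}")


def required_margin(price: float, side: str, size: float,
                    hours_to_resolution: float, jump_bound: float = 0.5) -> float:
    if not 0.0 <= price <= 1.0:
        raise ValueError("price must lie in [0, 1]")
    # Jump-aware leg: cover the terminal jump, capped at the hypothetical bound.
    jump_margin = min(jump_bound, terminal_jump(price, side)) * size
    # Leverage leg: notional (assumed price * size) over the time-dependent cap.
    leverage_margin = price * size / max_leverage(hours_to_resolution)
    return max(jump_margin, leverage_margin)
```

With these placeholder tiers, a 100-contract long at 0.80 needs 50 of margin twelve hours out (the jump leg binds) but 80 within the final hour, when leverage is compressed to 1x. Varying jump_bound and the tiers is the kind of sensitivity the referee asks for.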
Reference graph
Works this paper leans on
- [1] Zenodo. doi: 10.5281/zenodo.20107449. URL: https://doi.org/10.5281/zenodo.20107449. (2026g). “The Signal Credibility Index for Prediction Markets: A Microstructure-Grounded Diagnostic with Weighted and Time-Varying Extensions”. arXiv:2604.27041; SSRN:6676179. doi: 10.48550/arXiv.2604.27041. URL: https://arxiv.org/abs/2604.27041. Nimmaga...
- [2] Take the most recent book snapshot at or before the target timestamp.
- [3] Apply every price_change between that snapshot and the target timestamp, in timestamp_received order.
- [4] Result: the order book as the venue knew it at the target timestamp. A sanity-check spot test of this procedure on five markets is part of the G5 sample adequacy gate (Section 5.4); reconstructed best bid/ask must agree with the feed's best-bid/best-ask fields on at least 99% of price_change events. B.3 Polymarket Gamma API: market metadata. Public REST API at gamm...
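Items [2]–[4] describe the B.2 point-in-time order-book reconstruction and its G5 spot test. The Python sketch below illustrates that procedure under assumptions this page does not state: snapshots are dicts with timestamp_received, bids, and asks (price to size); each price_change carries side ("BUY" or "SELL"), price, the new aggregate size at that level (0 deletes the level), and the feed's best_bid and best_ask. These field names are hypothetical, not the PMXT v2 schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Book:
    # price level -> resting size; field names and side labels are assumptions.
    bids: dict[float, float] = field(default_factory=dict)
    asks: dict[float, float] = field(default_factory=dict)

    def apply(self, side: str, price: float, size: float) -> None:
        # Assumes price_change carries the new aggregate size at that level;
        # a size of 0 removes the level.
        levels = self.bids if side == "BUY" else self.asks
        if size == 0:
            levels.pop(price, None)
        else:
            levels[price] = size

    def best_bid(self) -> float | None:
        return max(self.bids) if self.bids else None

    def best_ask(self) -> float | None:
        return min(self.asks) if self.asks else None


def reconstruct(snapshots: list[dict], changes: list[dict], target_ts: float) -> Book | None:
    # Step 1: latest book snapshot at or before the target timestamp.
    prior = [s for s in snapshots if s["timestamp_received"] <= target_ts]
    if not prior:
        return None
    base = max(prior, key=lambda s: s["timestamp_received"])
    book = Book(dict(base["bids"]), dict(base["asks"]))
    # Step 2: replay price_change events after the snapshot, up to the target,
    # in timestamp_received order (sorted() keeps feed order for equal timestamps).
    window = [c for c in changes
              if base["timestamp_received"] < c["timestamp_received"] <= target_ts]
    for c in sorted(window, key=lambda c: c["timestamp_received"]):
        book.apply(c["side"], c["price"], c["size"])
    # Step 3: the book as the venue knew it at target_ts.
    return book


def agreement_rate(snapshot: dict, changes: list[dict]) -> float:
    # Spot test: replay incrementally and compare reconstructed best bid/ask
    # with the feed's own best_bid/best_ask fields on every price_change.
    book = Book(dict(snapshot["bids"]), dict(snapshot["asks"]))
    ordered = sorted(changes, key=lambda c: c["timestamp_received"])
    matches = 0
    for c in ordered:
        book.apply(c["side"], c["price"], c["size"])
        if (book.best_bid(), book.best_ask()) == (c["best_bid"], c["best_ask"]):
            matches += 1
    return matches / len(ordered) if ordered else 1.0
```

Under this sketch, a market passes the spot test when agreement_rate(...) >= 0.99.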
- [5] PMXT events are ingested by asset_id.
- [6] Gamma metadata is fetched for the unique asset_id set in the empirical window. Join is by exact-string match asset_id = clobTokenIds[i] for some outcome index i.
- [7] For each Gamma-resolved market, UMA OO records are fetched. OOv2 adapters are queried via Goldsky; MOOV2 adapters are queried via Gamma outcomePrices.
- [8] The combined record is persisted with asset_id as the primary key and an oracle_source field tagging which adapter produced the resolution. B.7 Cleaning steps
- [9] Sort all events by (market, timestamp_received, event_seq), where event_seq is a within-file sequence number used for tie-breaking.
- [10] Drop exact duplicate events (same market, timestamp, sequence, and payload).
- [11] Drop events with malformed numeric fields (rare; logged at WARNING level).
- [12] rank distinct markets by min(first_event_timestamp) and take the first N
For book events with empty bids and asks, treat them as market dormancy markers, not as data-quality failures. The percentage of events dropped under each cleaning rule is reported per file in the production G5 output JSON. B.8 Diagnostic chronological-prefix subsamples. The originally pre-registered analysis-sample selection rule (plan v1.2 §5.4) was: “rank dis...
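Items [9]–[12] list the B.7 cleaning rules and the chronological-prefix selection rule. The Python sketch below applies them in order under assumed event fields (market, timestamp_received, event_seq, event_type, payload) and assumed numeric payload fields (price, size); the logger name and the tie-break by market id are likewise hypothetical.

```python
from __future__ import annotations

import json
import logging
import math

log = logging.getLogger("pmxt.cleaning")  # hypothetical logger name

NUMERIC_FIELDS = ("price", "size")  # assumed numeric payload fields


def _is_finite_number(value) -> bool:
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False


def clean(events: list[dict]) -> tuple[list[dict], dict[str, int]]:
    # Rule [9]: deterministic order; event_seq breaks timestamp ties.
    ordered = sorted(events, key=lambda e: (e["market"], e["timestamp_received"], e["event_seq"]))
    seen: set[tuple] = set()
    kept: list[dict] = []
    dropped = {"duplicate": 0, "malformed": 0}
    for e in ordered:
        payload = e["payload"]
        key = (e["market"], e["timestamp_received"], e["event_seq"],
               json.dumps(payload, sort_keys=True))
        if key in seen:  # Rule [10]: exact duplicate
            dropped["duplicate"] += 1
            continue
        seen.add(key)
        # Rule [11]: malformed numeric fields are dropped and logged.
        if any(f in payload and not _is_finite_number(payload[f]) for f in NUMERIC_FIELDS):
            log.warning("malformed numeric field: %s", key[:3])
            dropped["malformed"] += 1
            continue
        # Empty book events are kept and flagged as dormancy markers.
        dormant = (e.get("event_type") == "book"
                   and not payload.get("bids") and not payload.get("asks"))
        kept.append({**e, "dormant": dormant})
    return kept, dropped


def chronological_prefix(events: list[dict], n: int) -> list[str]:
    # Rank markets by their earliest event timestamp and keep the first n
    # (ties broken by market id, an assumption).
    first_seen: dict[str, float] = {}
    for e in events:
        m, ts = e["market"], e["timestamp_received"]
        if m not in first_seen or ts < first_seen[m]:
            first_seen[m] = ts
    return sorted(first_seen, key=lambda m: (first_seen[m], m))[:n]
```

Dividing each count in the returned dropped dict by the input length gives the per-rule drop percentages that the page says are reported per file.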
- [13] Download the PMXT v2 archive for the target empirical window (default 2026-04-21 to 2026-04-27, configurable via the TARGET_WEEK_START and TARGET_WEEK_END environment variables). Uses direct R2 endpoint enumeration per the rule in Section 5.3. Verified against MANIFEST.sha256 in stage 2.
- [14] Verify archive integrity by SHA-256 hash comparison against the committed manifest. Failure aborts the pipeline.
- [15] Run the Sample Adequacy Gate evaluation (Section 5.5). Produces table_t_g5_*.json with class counts, gate floor evaluations, and consequence-rule status. The same script runs against any subsample size and rule (chronological or stratified-by-day) via CLI arguments.
- [16] Run the E1 stylized-facts evaluation on the analysis sample. Produces SF1–SF9 (Section 5.6) as JSON time series and as paper tables.
- [17] Run the E2 counterfactual replay (E2a position-agnostic, E2b deterministic grid, E2c synthetic-trader robustness). Produces engine-comparison metrics and falsifiability-test outputs.
- [18] Run the E3 resolution-zone protocol comparison (Section 8.3).
- [19] Perps are Prediction Markets. Prediction Markets are Perps
Build the paper PDF from LaTeX source, with all empirical outputs from stages 3–6 referenced. Stages can be run individually with bash scripts/reproduce.sh --stage <N>. A --check-only mode verifies that expected artifacts exist without re-running.
Footnote 1: Please verify the canonical URL on the ForesightFlow organization page; the path may have been renamed since this pa...
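Items [13] and [14] describe the first two reproduction stages: reading the empirical window from TARGET_WEEK_START and TARGET_WEEK_END, and aborting if any archive file fails SHA-256 verification against MANIFEST.sha256. The Python sketch below mirrors those two steps under one stated assumption, that the manifest uses sha256sum-style lines ("<hex digest>  <relative path>"). It is not the repository's scripts/reproduce.sh, and the function names are hypothetical.

```python
from __future__ import annotations

import hashlib
import os
from datetime import date
from pathlib import Path


def target_window() -> tuple[date, date]:
    # Defaults mirror the paper's empirical window; env vars override them.
    start = date.fromisoformat(os.environ.get("TARGET_WEEK_START", "2026-04-21"))
    end = date.fromisoformat(os.environ.get("TARGET_WEEK_END", "2026-04-27"))
    if end < start:
        raise ValueError("TARGET_WEEK_END precedes TARGET_WEEK_START")
    return start, end


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file so large archive shards need not fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: Path) -> list[str]:
    # Assumes sha256sum-style lines: "<hex digest>  <relative path>".
    failures: list[str] = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(maxsplit=1)
        rel = rel.lstrip("*")  # sha256sum marks binary-mode entries with '*'
        path = root / rel
        if not path.is_file():
            failures.append(f"missing: {rel}")
        elif sha256_file(path) != expected.lower():
            failures.append(f"hash mismatch: {rel}")
    return failures


def run_integrity_stage(root: Path, manifest: Path) -> None:
    # Any failure aborts the pipeline, as the stage description requires.
    failures = verify_manifest(root, manifest)
    if failures:
        raise SystemExit("integrity check failed:\n" + "\n".join(failures))
```

Calling run_integrity_stage(Path("data"), Path("MANIFEST.sha256")) would stop the run on the first stage-2 failure report; the data directory name here is a placeholder.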