Recognition: unknown
Information Leakage at Population Scale: An Evaluation of the Polymarket Insider-Relevant Subpopulation, 2020-2026
Pith reviewed 2026-05-09 15:27 UTC · model grok-4.3
The pith
The deadline-resolved Information Leakage Score (ILS-dl) can be computed for only 0.7% of Polymarket markets; the principal obstacle is resolution semantics, not score computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 12,708 markets, only 88 yield computable ILS-dl values. Only 1 of the 32 markets in the ForesightFlow Insider Cases (FFIC) inventory is in scope, and 14 are flagged unclassifiable because of genuine resolution-criterion ambiguity. Raw ILS-dl medians are negative in all six sub-bucket-by-period cells. A hazard-decay baseline correction produces heterogeneous results: the regulatory_formal post-2024 cell moves to near zero, while the regulatory_announcement post-2024 cell retains a 95% bootstrap confidence interval entirely below zero. The constant-hazard exponential is rejected in favor of a Weibull on the pooled post-2024 cell. Per-subcategory checks, however, show that this preference arises from category mixture rather than within-cell duration dependence.
What carries the argument
The deadline-resolved Information Leakage Score (ILS-dl), which quantifies potential informed trading by comparing price paths to market resolution deadlines.
If this is right
- Detection of informed flow on prediction platforms requires prior classification of resolution semantics.
- Most markets cannot be analyzed for leakage using current methods.
- Baseline corrections such as hazard-decay adjustment are necessary to interpret ILS-dl values.
- Duration-dependence models must separate category mixture effects from true within-cell patterns.
- Methodological effort should extend to resolution typology and score baselines, not only to score computation.
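The last two items depend on the mixture argument in the core claim, which synthetic data can illustrate. The Python sketch below uses only the standard library. It fits a constant-hazard exponential and a Weibull by maximum likelihood and compares them with a likelihood-ratio test and AIC. The exponential is the Weibull with shape k = 1, so the test has one degree of freedom.

This page does not say whether the paper used an LRT, AIC, or another criterion. The data are invented, and the function names are illustrative. Each hypothetical subcategory has its own constant hazard. The pooled sample nevertheless prefers a Weibull with k < 1, because a mixture of exponentials with different rates has a decreasing pooled hazard.

```python
import math
import random


def exp_loglik(times):
    """Maximised log-likelihood of a constant-hazard exponential (MLE rate = n / sum(t))."""
    n, total = len(times), sum(times)
    rate = n / total
    return n * math.log(rate) - rate * total


def weibull_fit(times, lo=0.01, hi=50.0, iters=100):
    """Weibull MLE (shape k, scale lam, log-likelihood) via bisection on the profile score."""
    n = len(times)
    tmax = max(times)
    u = [t / tmax for t in times]  # rescale to (0, 1] so u ** k cannot overflow
    logs = [math.log(x) for x in u]
    mean_log = sum(logs) / n

    def score(k):
        # Profile score for k; it is strictly decreasing in k, so bisection finds its root.
        w = [x ** k for x in u]
        return 1.0 / k + mean_log - sum(wi * li for wi, li in zip(w, logs)) / sum(w)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    lam = tmax * (sum(x ** k for x in u) / n) ** (1.0 / k)
    sum_log_t = sum(math.log(t) for t in times)
    # At the MLE, sum((t / lam) ** k) == n, which yields the final "- n" term.
    loglik = n * math.log(k) - n * k * math.log(lam) + (k - 1) * sum_log_t - n
    return k, lam, loglik


def compare(times):
    """Exponential (k = 1) vs Weibull: likelihood-ratio test (1 df) and AIC."""
    ll_exp = exp_loglik(times)
    k, _, ll_wei = weibull_fit(times)
    lr = max(0.0, 2.0 * (ll_wei - ll_exp))
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square(1) survival function
    return {"shape": k, "lr": lr, "p": p_value,
            "aic_exp": 2 - 2 * ll_exp, "aic_weibull": 4 - 2 * ll_wei}


random.seed(0)
cell_a = [random.expovariate(1.0) for _ in range(2000)]   # hypothetical subcategory A
cell_b = [random.expovariate(10.0) for _ in range(2000)]  # hypothetical subcategory B
results = {name: compare(data) for name, data in
           [("A", cell_a), ("B", cell_b), ("pooled", cell_a + cell_b)]}
for name, r in results.items():
    print(f"{name}: shape={r['shape']:.3f} p={r['p']:.3g}")
```

Within each hypothetical cell, the fitted shape stays close to 1. The pooled sample yields k well below 1 and a decisive test, which is the pattern the per-subcategory check is described as separating out.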
Where Pith is reading between the lines
- Prediction market platforms may need standardized resolution language to support broader monitoring of informed activity.
- Automated tools for flagging resolution ambiguity could enlarge the set of analyzable markets.
- Single-case studies of leakage may overstate how widely detectable informed flow is at population scale.
- Analogous scope restrictions are likely when the same framework is applied to other prediction or betting platforms.
Load-bearing premise
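Polymarket resolution criteria can be classified consistently as either clear enough or too ambiguous for ILS-dl computation. Under that premise, the computable subpopulation reflects the markets themselves rather than one coder's reading of them.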
What would settle it
A systematic reclassification of the 12,708 markets that identifies a substantially larger subpopulation with resolution criteria clear enough for independent evaluators to produce matching ILS-dl values would refute the scope limitation.
Original abstract
We carry the deadline-resolved Information Leakage Score (ILS-dl) framework of Nechepurenko (2026a, 2026b) from a single-case proof of concept to a population-scale evaluation across 12,708 Polymarket markets, October 2020 to April 2026. We frame the paper as a scope-discovery study: scaling reveals that the framework's effective domain is materially narrower than initial framing suggested, and the principal obstacle is not score computation but resolution semantics. We report four findings. First, only 88 of 12,708 candidate markets (0.7%) yield computable ILS-dl values; only 1 of 32 markets in the ForesightFlow Insider Cases (FFIC) inventory is in scope, and 14 of 32 FFIC markets are flagged unclassifiable due to genuine resolution-criterion ambiguity. Second, only 12 of the 88 computed markets (13.6%) satisfy anchor-sensitivity, and an independent-second-pass T_event validation reaches 57.8% exact-date agreement, below the 90% ex-ante criterion. Third, raw ILS-dl medians are negative across all six (sub-bucket by period) cells, but a hazard-decay baseline correction we introduce yields a heterogeneous result: regulatory_formal post-2024 shifts to near-zero (-0.21 to -0.02), while regulatory_announcement post-2024 retains a 95% bootstrap CI entirely below zero. Fourth, the constant-hazard exponential of Nechepurenko (2026b) is rejected in favor of Weibull on the pooled post-2024 cell, but a per-subcategory check confirms the preference reflects category mixture rather than within-cell duration dependence. The implication is that detection of informed flow requires methodological refinement on the resolution-typology and score-baseline axes, not only on the score-computation axis where prior work concentrated.
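The abstract's headline fractions follow directly from its counts. The short Python check below recomputes them. As an illustrative chain that the abstract does not itself report, it also computes the share of all candidate markets that are both computable and pass anchor-sensitivity, about 0.09%.

```python
# Counts as stated in the abstract: (numerator, denominator).
counts = {
    "computable / candidate markets": (88, 12708),
    "anchor-sensitivity passes / computable": (12, 88),
    "FFIC markets in scope": (1, 32),
    "FFIC markets unclassifiable": (14, 32),
    "anchor-robust / candidate markets": (12, 12708),  # derived chain, not reported
}
shares = {label: 100 * num / den for label, (num, den) in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```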
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the deadline-resolved Information Leakage Score (ILS-dl) framework from prior single-case work to a population-scale analysis of 12,708 Polymarket markets (October 2020–April 2026). It reports that only 88 markets (0.7%) produce computable ILS-dl values. Just 1 of 32 ForesightFlow Insider Cases (FFIC) markets is in scope, and 14 are flagged unclassifiable due to resolution-criterion ambiguity. The other results are:
- only 12 of the 88 computable markets (13.6%) satisfy anchor-sensitivity;
- an independent second pass reaches 57.8% exact-date agreement on T_event, below the 90% ex-ante criterion;
- raw ILS-dl medians are negative, and a new hazard-decay baseline correction yields heterogeneous post-2024 signals;
- a Weibull is preferred over the exponential duration model, which the authors attribute to category mixture rather than within-cell dependence.
The central implication is that refinements are needed on resolution typology and score baselines.
Significance. If the manual classifications and post-hoc corrections prove robust, the work is significant as a scope-discovery study that empirically demonstrates the narrow effective domain of ILS-dl at population scale and identifies resolution semantics as the dominant barrier rather than computational feasibility. The concrete counts, bootstrap CIs, and cross-checks against the FFIC inventory provide useful data for researchers working on prediction-market information flow and insider detection. The introduction of the hazard-decay correction and the Weibull finding add methodological elements, though their ad-hoc character limits immediate generalizability.
major comments (3)
- [the population-scale evaluation and results on computable markets (abstract and corresponding results section)] The central scope claim—that only 88 of 12,708 markets (0.7%) yield computable ILS-dl values and that resolution semantics are the binding constraint—depends entirely on the authors' binary manual classification of markets as computable versus ambiguous or unclassifiable. No explicit decision protocol, criteria for 'genuine resolution-criterion ambiguity,' or inter-rater reliability statistic is supplied. Because resolution wording is interpretive, even modest reclassification rates would materially change the reported effective domain and the paper's primary conclusion.
- [the section on ILS-dl medians, baseline correction, and post-2024 sub-bucket results] The hazard-decay baseline correction is introduced to address negative raw ILS-dl medians and produces the reported heterogeneous findings (regulatory_formal post-2024 near zero; regulatory_announcement post-2024 with 95% CI entirely below zero). This adjustment is presented without an external benchmark, pre-specified validation, or full specification of its free parameter, making the post-correction signals dependent on an ad-hoc choice whose impact on the leakage interpretation is not quantified.
- [the anchor-sensitivity and T_event validation paragraph] The independent-second-pass T_event validation is reported at 57.8% exact-date agreement against a 90% ex-ante criterion, and only 12 of 88 markets satisfy anchor-sensitivity. Without details on the exact rules applied in the second pass, error propagation, or how disagreements were resolved, it is difficult to evaluate whether this result undermines the framework's reliability claims.
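The third major comment turns on how exact-date agreement is defined. The Python sketch below assumes each market carries one T_event date per pass. The dates and the function name are hypothetical. Exact agreement is the share of markets with identical dates. Reporting agreement within a tolerance alongside it would show whether misses are near-misses or substantive disagreements. The 90% threshold is the paper's ex-ante criterion, and the tolerance values are illustrative only.

```python
from datetime import date


def date_agreement(first_pass, second_pass, tolerance_days=0):
    """Share of markets whose two independently coded T_event dates differ by at most tolerance_days."""
    if len(first_pass) != len(second_pass) or not first_pass:
        raise ValueError("need two non-empty passes of equal length")
    hits = sum(abs((a - b).days) <= tolerance_days
               for a, b in zip(first_pass, second_pass))
    return hits / len(first_pass)


# Hypothetical T_event codings for five markets (not the paper's data).
pass_1 = [date(2024, 3, 1), date(2024, 6, 10), date(2025, 1, 15), date(2025, 2, 2), date(2025, 9, 30)]
pass_2 = [date(2024, 3, 1), date(2024, 6, 11), date(2025, 1, 15), date(2025, 3, 2), date(2025, 9, 29)]

CRITERION = 0.90  # ex-ante exact-date criterion stated in the paper
for tol in (0, 1, 7):
    share = date_agreement(pass_1, pass_2, tol)
    print(f"tolerance {tol} d: {share:.0%} agreement, meets criterion: {share >= CRITERION}")
```

In this invented example, exact agreement is 40%. Every miss except one is a one-day offset, so agreement rises to 80% at a one-day tolerance.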
minor comments (3)
- [Abstract] The abstract refers to 'six (sub-bucket by period) cells' without briefly defining the periods or subcategories; a short table or footnote would improve clarity.
- [the Weibull versus exponential discussion] The per-subcategory check confirming that Weibull preference reflects category mixture could be strengthened by presenting the relevant statistics or counts in a table rather than narrative summary.
- [methods or results on corrected medians] Bootstrap CI computation details for the corrected ILS-dl values (e.g., number of replicates, resampling unit) are not stated, which affects reproducibility.
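The last minor comment concerns two settings: the number of replicates and the resampling unit. The Python sketch below computes a percentile-bootstrap CI for a cell median, assuming markets are the resampling unit. The scores are invented and the function name is hypothetical; this page does not describe the paper's actual procedure.

```python
import random
import statistics


def bootstrap_median_ci(scores, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the median, resampling markets with replacement."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    medians = sorted(statistics.median(rng.choices(scores, k=n)) for _ in range(n_boot))
    alpha = (1 - level) / 2
    lo = medians[int(alpha * n_boot)]
    hi = medians[int((1 - alpha) * n_boot) - 1]
    return statistics.median(scores), (lo, hi)


# Hypothetical corrected scores for one cell, one value per market (not the paper's data).
cell_scores = [-0.42, -0.31, -0.18, -0.27, -0.09, -0.35, -0.22, -0.14, -0.40, -0.05, -0.26, -0.19]
median, (lo, hi) = bootstrap_median_ci(cell_scores)
print(f"median {median:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], entirely below zero: {hi < 0}")
```

Resampling markets treats each market as one independent observation. If markets within a cell share events, resampling clusters of markets instead would be the more conservative choice.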
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report, which highlights important transparency issues in our scope-discovery analysis. We address each major comment below with clarifications and commitments to revision where feasible.
Point-by-point responses
Referee: The central scope claim—that only 88 of 12,708 markets (0.7%) yield computable ILS-dl values and that resolution semantics are the binding constraint—depends entirely on the authors' binary manual classification of markets as computable versus ambiguous or unclassifiable. No explicit decision protocol, criteria for 'genuine resolution-criterion ambiguity,' or inter-rater reliability statistic is supplied. Because resolution wording is interpretive, even modest reclassification rates would materially change the reported effective domain and the paper's primary conclusion.
Authors: We agree that an explicit decision protocol is needed to support the classification. The binary distinction was based on whether resolution criteria permitted unambiguous identification of both T_event and a valid anchor date without interpretive ambiguity in outcome determination. We will add the full decision criteria and protocol as a new appendix or subsection in the revision. The classification was performed by the lead author; we will explicitly discuss this as a limitation and note that inter-rater reliability could not be computed. We maintain that the 0.7% computable rate is unlikely to change materially under modest reclassification, given that most exclusions arise from clearly ambiguous resolution language rather than borderline cases. revision: partial
Referee: The hazard-decay baseline correction is introduced to address negative raw ILS-dl medians and produces the reported heterogeneous findings (regulatory_formal post-2024 near zero; regulatory_announcement post-2024 with 95% CI entirely below zero). This adjustment is presented without an external benchmark, pre-specified validation, or full specification of its free parameter, making the post-correction signals dependent on an ad-hoc choice whose impact on the leakage interpretation is not quantified.
Authors: The hazard-decay correction was introduced post-hoc after observing consistently negative raw medians, which we view as indicating baseline model misspecification rather than true negative leakage. We will fully specify the free parameter (decay rate) and its selection procedure in the revised text, along with a sensitivity analysis showing how the post-2024 heterogeneous signals vary with plausible parameter values. As an exploratory adjustment developed during analysis, no pre-specified validation or external benchmark exists; we will add explicit discussion of its ad-hoc character and limitations on generalizability. revision: yes
Referee: The independent-second-pass T_event validation is reported at 57.8% exact-date agreement against a 90% ex-ante criterion, and only 12 of 88 markets satisfy anchor-sensitivity. Without details on the exact rules applied in the second pass, error propagation, or how disagreements were resolved, it is difficult to evaluate whether this result undermines the framework's reliability claims.
Authors: The 57.8% agreement is presented as a substantive finding that the framework does not meet the pre-specified 90% target, reinforcing the paper's conclusion that resolution semantics and event dating remain barriers. We will expand the methods section to detail the exact rules used in the independent second pass, the approach to error propagation, and the resolution of any date disagreements. This will allow readers to assess the reliability implications directly. revision: yes
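A sensitivity analysis for the decay rate can be sketched even though this page does not give the paper's ILS-dl formula. The baseline below is a hypothetical stand-in, not the paper's correction. Suppose a market asks whether an event occurs by deadline T, and the event arrives at a constant hazard lambda. The no-information price at time t is then p(t) = 1 - exp(-lambda * (T - t)). This price falls mechanically as the deadline approaches, so a raw score built on price drift would be biased negative. The hypothetical functions below subtract the baseline's predicted drift from an observed drift and show how the corrected value moves with lambda.

```python
import math


def baseline_price(t, deadline, lam):
    """Hypothetical no-information price of 'event occurs by deadline' under constant hazard lam."""
    remaining = max(0.0, deadline - t)
    return 1.0 - math.exp(-lam * remaining)


def corrected_drift(times, prices, deadline, lam):
    """Observed price change minus the change the constant-hazard baseline predicts."""
    observed = prices[-1] - prices[0]
    expected = baseline_price(times[-1], deadline, lam) - baseline_price(times[0], deadline, lam)
    return observed - expected


# Hypothetical price path (days from listing) for a market whose deadline is day 30.
times = [0, 10, 20, 25]
prices = [0.40, 0.33, 0.22, 0.12]
print(f"raw drift: {prices[-1] - prices[0]:+.3f}")
for lam in (0.005, 0.01, 0.02, 0.04):  # candidate decay rates per day
    print(f"lambda={lam}: corrected drift {corrected_drift(times, prices, 30, lam):+.3f}")
```

For this invented path, the raw drift of -0.28 stays negative at lambda = 0.005 and 0.01 but turns positive at 0.02 and 0.04. The sign of the corrected value therefore hinges on the free parameter, which is why its selection procedure matters.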
Unresolved after the rebuttal: no inter-rater reliability statistic can be provided for the manual market classification, because a single researcher performed it.
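If a second coder were added, Cohen's kappa is one standard inter-rater statistic for categorical labels. It discounts the agreement that two raters would reach by chance, given how often each uses each label. The Python sketch below uses hypothetical labels modeled on the referee's computable / ambiguous / unclassifiable distinction.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters assigning one categorical label per item."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    c1, c2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement: product of each rater's marginal label frequencies, summed over labels.
    expected = sum(c1[label] * c2[label] for label in c1.keys() | c2.keys()) / n ** 2
    if expected == 1.0:  # both raters used one identical label throughout: kappa is undefined
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six markets (not the paper's data).
lead = ["computable", "ambiguous", "ambiguous", "unclassifiable", "ambiguous", "computable"]
second = ["computable", "ambiguous", "unclassifiable", "unclassifiable", "ambiguous", "ambiguous"]
print(f"kappa = {cohens_kappa(lead, second):.3f}")
```

Here raw agreement is 4 of 6 (67%), but kappa is 11/23 (about 0.48) once chance agreement is removed. The referee's concern follows from this gap: raw agreement alone overstates how reproducible an interpretive classification is.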
Circularity Check
Self-citation of the ILS-dl framework, plus an ad-hoc hazard-decay correction introduced to fix negative medians, reduces the paper's independent content
specific steps
- Self-citation load-bearing [Abstract]
"We carry the deadline-resolved Information Leakage Score (ILS-dl) framework of Nechepurenko (2026a, 2026b) from a single-case proof of concept to a population-scale evaluation across 12,708 Polymarket markets, October 2020 to April 2026. We frame the paper as a scope-discovery study: scaling reveals that the framework's effective domain is materially narrower than initial framing suggested, and the principal obstacle is not score computation but resolution semantics."
The headline claim that only 0.7% of markets yield computable ILS-dl values is produced by applying the ILS-dl definition and computability rules taken directly from the author's prior self-cited work; the conclusion that resolution semantics are the binding constraint therefore extends the prior framework rather than testing it against an independent standard.
- Fitted input called prediction [Abstract, third finding]
"raw ILS-dl medians are negative across all six (sub-bucket by period) cells, but a hazard-decay baseline correction we introduce yields a heterogeneous result: regulatory_formal post-2024 shifts to near-zero (-0.21 to -0.02), while regulatory_announcement post-2024 retains a 95% bootstrap CI entirely below zero."
The hazard-decay correction is introduced by the authors precisely to address the negative raw medians; the reported heterogeneous post-correction outcomes (near-zero vs. CI below zero) are therefore generated by the choice and application of this correction rather than by a pre-specified model independent of the observed negativity.
full rationale
The paper's scope findings rest on applying the ILS-dl metric defined in the author's own prior papers and on a newly introduced hazard-decay baseline correction chosen specifically because raw medians were negative; while the manual classification of markets and the Weibull mixture check add empirical content, the central metric and its post-hoc adjustment lack external benchmarks and reduce the reported results on informed trading to extensions of self-defined inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- decay rate of the hazard-decay baseline correction
axioms (1)
- domain assumption: Polymarket resolution criteria can be classified into unambiguous categories for ILS-dl applicability
Forward citations
Cited by 2 Pith papers
- Manipulation, Insider Information, and Regulation in Leveraged Event-Linked Markets
Leverage scales market-price manipulation linearly while shifting outcome-manipulation thresholds and multiplying informed-trading rents in three distinct ways, calling for re-allocated regulatory attack surfaces rath...
- A Taxonomy of Event-Linked Perpetual Futures: Variant Designs Beyond the Single-Market Binary Case
The paper organizes seven canonical variants of event-linked perpetual futures along four design axes, supplying payoff definitions, inheritance rules from prior work, and variant-specific constraints.
Reference graph
Works this paper leans on
[1] Maksym Nechepurenko. 2026.
[2] Joshua Mitts and Moran Ofir. 2026.
[3] Roberto Gómez-Cram, Yunhan Guo, Theis Ingerslev Jensen, and Howard Kung. 2026.
[4] Akaki Mamageishvili, Andrey Shcherbenko, and Pablo. Strategic Bidding Wars in. 2025.
[5] Federal Court Orders Default Judgment Against Insider Trader on. 2026.
[6] Maksym Nechepurenko. 2026.
[7] Kenneth P. Burnham and David R. Anderson. 2002.