pith. machine review for the scientific record.

arXiv: 2605.12151 · v1 · submitted 2026-05-12 · q-fin.TR · q-fin.CP · q-fin.ST


RED-2400: A Public Benchmark of Algorithmically-Rejected Trading Events with Outcome Labels

Arati U. Kamat

Pith reviewed 2026-05-13 03:31 UTC · model grok-4.3

classification: q-fin.TR · q-fin.CP · q-fin.ST
keywords: trading filters, benchmark dataset, Solana DEX, rejection events, outcome labels, algorithmic trading, decentralized exchange

The pith

RED-2400 supplies labeled data on 6,659 algorithmically rejected trades to test filter precision on the reject side.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RED-2400, a public benchmark dataset of trading events rejected by a live Solana DEX filter. It includes 6,659 rejections connected to 169,122 post-rejection price observations and labels each with one of five outcome tiers derived from 24-hour price ratios. Most existing datasets only track accepted trades, leaving the performance of rejection filters unmeasured and potentially biased. By releasing reject-side data with outcome labels, RED-2400 allows direct replication and validation of claims about filter accuracy.

Core claim

RED-2400 is a public benchmark of algorithmically-rejected trading events from a live Solana decentralized-exchange filter stack. It contains 6,659 rejection events linked to 169,122 post-rejection price and liquidity observations and 1,836 graveyard-tracker snapshots. Each event carries one of five outcome labels: saved (windowed), saved (early-death), missed, flat, or unclassifiable. The label thresholds use the trough-to-reference and peak-to-reference price ratios within a 24-hour window.

What carries the argument

The five-tier outcome classification system based on 24-hour trough-to-reference and peak-to-reference price ratios, which assigns each rejection to saved (windowed), saved (early-death), missed, flat, or unclassifiable.
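As a rough illustration of how such a ratio-based scheme could be applied, here is a minimal Python sketch. The numeric thresholds, the order in which the tiers are tested, and the rule separating "windowed" from "early-death" are all assumptions made for this example; the actual cut-offs are defined in Kamat (2026c) and are not reproduced on this page.

```python
from typing import Optional

# Hypothetical thresholds, NOT the values used in Kamat (2026c).
SAVED_TROUGH_MAX = 0.5   # assumed: trough/reference at or below this counts as "saved"
MISSED_PEAK_MIN = 2.0    # assumed: peak/reference at or above this counts as "missed"
WINDOW_MIN = 1440        # 24-hour window, in minutes


def classify(reference: Optional[float],
             trough: Optional[float],
             peak: Optional[float],
             last_age_min: Optional[float]) -> str:
    """Assign one of the five tier names from 24-hour price ratios."""
    # Without a usable reference price or observations, no ratio can be formed.
    if (reference is None or reference <= 0
            or trough is None or peak is None or last_age_min is None):
        return "unclassifiable"

    trough_ratio = trough / reference
    peak_ratio = peak / reference

    # Assumption: the "saved" test takes priority over the "missed" test.
    if trough_ratio <= SAVED_TROUGH_MAX:
        # Assumption: a series that stops before the full window is "early-death".
        if last_age_min >= WINDOW_MIN:
            return "saved (windowed)"
        return "saved (early-death)"
    if peak_ratio >= MISSED_PEAK_MIN:
        return "missed"
    return "flat"
```

With these assumed thresholds, `classify(1.0, 0.4, 1.1, 600)` yields the early-death tier, because the price collapsed below half the reference and tracking ended before 24 hours.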

If this is right

  • Filter designers can now measure how often rejections lead to missed gains or unnecessary blocks.
  • Validation of trading algorithms can include both accept and reject performance without bias.
  • Subsequent dataset windows will support analysis stratified by market regimes.
  • Researchers can replicate filter-precision claims using the provided linked observations.
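One way to turn the labels into a reject-side metric is sketched below. The definition used here (saved events divided by saved plus missed events, with flat and unclassifiable events excluded) is an assumption for illustration and may differ from the paper's own precision measure.

```python
from collections import Counter
from typing import Iterable, Optional


def reject_precision(labels: Iterable[str]) -> Optional[float]:
    """Share of decided rejections that avoided a loss.

    Assumed definition: saved / (saved + missed). Flat and unclassifiable
    events are left out of the denominator.
    """
    counts = Counter(labels)
    saved = counts["saved (windowed)"] + counts["saved (early-death)"]
    missed = counts["missed"]
    decided = saved + missed
    if decided == 0:
        return None  # no decided events, so the ratio is undefined
    return saved / decided
```

Grouping the labels by filter before calling the function would give a per-filter view of the same quantity.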

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved filter designs could emerge from identifying patterns in the missed or flat rejection outcomes.
  • This benchmark may encourage similar public datasets for other exchanges and time periods.
  • Combining this reject data with accept-side datasets could yield a more complete picture of overall filter effectiveness.

Load-bearing premise

The five-tier labels derived from 24-hour price ratios accurately capture the true impact of each rejection and the logged events represent the filter's typical behavior.

What would settle it

Finding that the outcome labels do not predict actual trading results after rejections or that the dataset events are not representative of the filter's decisions over time.

Figures

Figures reproduced from arXiv: 2605.12151 by Arati U. Kamat.

Figure 1: Distribution of rejection events across the eight anonymised filter labels. The top five filters (filter_1 to filter_5) account for 93.4 percent of all rejections; filter_6 to filter_8 carry the remaining 6.6 percent. Counts and within-cohort percentages are annotated to the right of each bar. Filter labels are anonymised; the active scanner configuration during the observation window comprised eight filte…
Figure 2: Distribution of post-rejection observation age (ageMin). The interquartile range is concentrated around the median of 686 minutes (≈ 11.4 hours). The outcome time series for a typical rejected token covers roughly half of the 24-hour observation horizon. Right-censoring at 1,440 minutes (24 hours) marks the dataset cutoff for early-death classification.
Figure 3: Liquidity and 24-hour trade-volume distribution for the rejected-token cohort. Both fields use a log scale. The gap between median and mean (red diamond) shows heavy-tailed structure. The bulk of rejected tokens trade with liquidity below USD 50,000. The filter is rejecting low-conviction venue activity. Experimental Design, Materials and Methods I logged decision events from public Solana DEX venues conti…
Original abstract

RED-2400 is a public benchmark of algorithmically-rejected trading events from a live Solana decentralized-exchange filter stack. I logged the data continuously between 2026-04-10 and 2026-05-02. The benchmark contains 6,659 rejection events linked to 169,122 post-rejection price and liquidity observations and 1,836 graveyard-tracker snapshots. Outcome labels follow the five-tier classification of Kamat (2026c): saved (windowed), saved (early-death), missed, flat, and unclassifiable. Thresholds use the trough-to-reference and peak-to-reference price ratios within a 24-hour window. Most filter-design datasets cover the accept side only. That gap leaves reject-side outcomes unmeasured and biases filter validation. RED-2400 lets researchers replicate filter-precision claims directly. RED-2400 is the first window in a planned dataset series; subsequent windows will extend the time horizon and enable regime-stratified analysis.
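The linkage between rejection events and their post-rejection observations can be sketched as a simple group-and-reduce step that produces the two ratios for each event. The record layout below (an event identifier, an observation age in minutes, and a price) is hypothetical; only the 24-hour window is taken from the abstract, and the released files may be organised differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

WINDOW_MIN = 1440  # 24-hour window, in minutes


def window_ratios(
    reference_prices: Dict[str, float],
    observations: Iterable[Tuple[str, float, float]],
) -> Dict[str, Optional[Tuple[float, float]]]:
    """Return (trough/reference, peak/reference) per event, or None.

    reference_prices: hypothetical mapping of event id to reference price.
    observations: hypothetical (event_id, age_min, price) tuples.
    """
    prices_by_event = defaultdict(list)
    for event_id, age_min, price in observations:
        # Keep only observations that fall inside the 24-hour window.
        if 0 <= age_min <= WINDOW_MIN:
            prices_by_event[event_id].append(price)

    ratios = {}
    for event_id, reference in reference_prices.items():
        prices = prices_by_event.get(event_id)
        if not prices or reference <= 0:
            ratios[event_id] = None  # nothing usable to compare against
        else:
            ratios[event_id] = (min(prices) / reference, max(prices) / reference)
    return ratios
```

Events that come back as `None` here would be natural candidates for the unclassifiable tier, although the paper's own rule for that tier is not stated on this page.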

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents RED-2400, a public benchmark of 6,659 algorithmically-rejected trading events logged from a live Solana DEX filter stack between 2026-04-10 and 2026-05-02. It includes 169,122 post-rejection price/liquidity observations and 1,836 graveyard-tracker snapshots, with events labeled into five tiers (saved windowed, saved early-death, missed, flat, unclassifiable) via 24-hour trough-to-reference and peak-to-reference price ratios following Kamat (2026c). The central claim is that existing filter-design datasets cover only the accept side, leaving reject-side outcomes unmeasured and biasing validation; RED-2400 fills this gap by enabling direct replication of filter-precision claims and is the first in a planned series for extended temporal and regime-stratified analysis.

Significance. If the logged events are representative and the five-tier labels validly proxy rejection outcomes, the benchmark would provide a novel public resource for evaluating the reject side of algorithmic trading filters on decentralized exchanges. This could reduce bias in precision assessments and support more reproducible validation of trading strategies. The public release of the dataset, linkage to post-event observations, and commitment to future windows are explicit strengths that enhance its utility for the q-fin.TR community.

major comments (2)
  1. [Abstract and Data Collection section] The manuscript states the collection period, event counts (6,659 rejections), and observation totals but supplies no description of the filter-stack implementation, logging procedures, data-validation steps, or error rates. This is load-bearing for the central claim that the benchmark enables unbiased replication of precision checks, as the representativeness of the logged rejection events cannot be assessed without these details.
  2. [Outcome Labels section] The five-tier scheme derives labels solely from 24-hour trough-to-reference and peak-to-reference price ratios without incorporating fees, slippage, position sizing, liquidity decay, or any counterfactual PnL on acceptance. In Solana DEX data, where external factors can drive post-rejection price action, this risks systematic misclassification of 'saved' versus 'missed' events and undermines the claim that the labels accurately measure rejection impact.
minor comments (3)
  1. [Abstract] The term 'graveyard-tracker snapshots' is introduced without a definition or reference to its implementation, which reduces clarity for readers outside the immediate context.
  2. The manuscript could include a table summarizing the distribution of the five label tiers across the 6,659 events to aid immediate assessment of class balance.
  3. [Outcome Labels section] The self-reference to Kamat (2026c) for the exact classification thresholds would benefit from a brief inline summary or appendix excerpt to improve standalone reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on RED-2400. We address each major comment below and commit to revisions that improve documentation and clarify scope without altering the benchmark's core contribution of providing reject-side labels.

  1. Referee: [Abstract and Data Collection section] The manuscript states the collection period, event counts (6,659 rejections), and observation totals but supplies no description of the filter-stack implementation, logging procedures, data-validation steps, or error rates. This is load-bearing for the central claim that the benchmark enables unbiased replication of precision checks, as the representativeness of the logged rejection events cannot be assessed without these details.

    Authors: We agree that the Data Collection section lacks necessary detail on the filter-stack implementation, logging procedures, data-validation steps, and error rates. This information is required to evaluate representativeness. In the revised manuscript we will expand this section with a description of the filter logic, the logging pipeline from the Solana DEX, and validation steps performed on the collected events. Specific quantitative error rates for rejection detection were not computed during the original collection window; we will add an explicit statement of this limitation and indicate that such metrics will be included in subsequent dataset releases. revision: yes

  2. Referee: [Outcome Labels section] The five-tier scheme derives labels solely from 24-hour trough-to-reference and peak-to-reference price ratios without incorporating fees, slippage, position sizing, liquidity decay, or any counterfactual PnL on acceptance. In Solana DEX data, where external factors can drive post-rejection price action, this risks systematic misclassification of 'saved' versus 'missed' events and undermines the claim that the labels accurately measure rejection impact.

    Authors: The five-tier labels are adopted verbatim from the methodology in Kamat (2026c) and function as a replicable price-ratio proxy rather than a full economic-impact calculation. We acknowledge that the scheme omits fees, slippage, position sizing, liquidity decay, and counterfactual PnL, which may produce misclassifications when external factors influence post-rejection price action on Solana DEX. We will revise the Outcome Labels section to state these limitations explicitly and to clarify that the tiers provide a standardized price-based proxy for rejection outcomes, not a net-PnL measure. This adjustment preserves the benchmark's intended use for replicating filter-precision claims while addressing the referee's concern. revision: partial
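To make the gap between a price-ratio proxy and a net-PnL measure concrete, the sketch below computes a round-trip return after costs. The fee and slippage rates are hypothetical placeholders, not figures from the paper, and the proportional cost model is a simplification that ignores position sizing and liquidity decay.

```python
def net_return(entry_price: float,
               exit_price: float,
               fee_rate: float = 0.003,   # hypothetical per-leg fee
               slippage: float = 0.01) -> float:  # hypothetical per-leg slippage
    """Round-trip return after proportional fee and slippage on each leg."""
    # Buying costs more than the quoted price; selling yields less.
    effective_buy = entry_price * (1 + slippage) * (1 + fee_rate)
    effective_sell = exit_price * (1 - slippage) * (1 - fee_rate)
    return effective_sell / effective_buy - 1
```

Under these assumed costs, a token whose price ratio reaches 1.02 shows a gross gain of 2 percent but a negative net return, so a label drawn from the raw ratio and a label drawn from net PnL could disagree near the thresholds.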

Circularity Check

0 steps flagged

Minor self-citation for label scheme; dataset release has no derivation chain


This is a data-release paper whose central contribution is the public logging and release of 6,659 rejection events plus associated observations. No equations, fitted parameters, predictions, or uniqueness theorems are claimed. The sole self-reference is the adoption of the five-tier label taxonomy from Kamat (2026c) to annotate the released data; that taxonomy is not derived or validated inside the present manuscript and does not serve as a load-bearing premise for any result. The work is therefore self-contained against external benchmarks and receives only the minimal score for a routine self-citation that is not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data-release paper containing no mathematical derivations, fitted parameters, or postulated entities; the only external reference is the five-tier classification from a prior self-citation.

pith-pipeline@v0.9.0 · 5475 in / 1204 out tokens · 85879 ms · 2026-05-13T03:31:31.048254+00:00 · methodology


Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

  1. Beygelzimer, A. and Langford, J. (2009). The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09).

  2. Crook, J. N. and Banasik, J. (2004). Does reject inference really improve the performance of application scoring models? Journal of Banking & Finance, 28(4): 857–874.

  3. Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1): 153–161.

  4. Kamat, A. (2026c). Outcome-Classified Precision Auditing of Filter Rules in Algorithmic DEX Trading: Evidence from 2,400 Rejection Events. SSRN Working Paper, abstract_id 6638259. https://ssrn.com/abstract=6638259. Preprint mirror: Zenodo 10.5281/zenodo.19720041. Companion dataset: Zenodo 10.5281/zenodo.19987697.

     López de Prado, M. (2018). Advances in Fin...