pith · machine review for the scientific record

arXiv: 2603.23055 · v3 · submitted 2026-03-24 · 📊 stat.ML · cs.IT · cs.LG · math.IT

Recognition: no theorem link

Post-Selection Distributional Model Evaluation

Amirmohammad Farzaneh, Osvaldo Simeone


Pith reviewed 2026-05-15 00:55 UTC · model grok-4.3

classification 📊 stat.ML · cs.IT · cs.LG · math.IT
keywords post-selection bias · e-values · distributional estimation · false coverage rate · model evaluation · sample efficiency · KPI distributions

The pith

PS-DME uses e-values to control post-selection false coverage rate for distributional KPI estimates after arbitrary pre-selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents post-selection distributional model evaluation (PS-DME) as a way to obtain statistically valid estimates of the full distributions of key performance indicators for models chosen using the same data that will be used for evaluation. Standard approaches suffer from post-selection bias when the selection step depends on the test data, but PS-DME builds on e-values to control the false coverage rate of the resulting distributional estimates. The method also identifies explicit conditions under which it requires fewer samples than the common practice of splitting data into separate selection and evaluation portions. Experiments on synthetic examples, text-to-SQL tasks with large language models, and telecom network metrics show that the approach supports reliable comparisons across different reliability levels.

Core claim

PS-DME is a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, it controls the post-selection false coverage rate (FCR) of the distributional KPI estimates and establishes explicit conditions under which it is provably more sample-efficient than a baseline method based on sample splitting.
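
In symbols, writing K for the data-dependent selected set, Fk for the true KPI CDF of candidate k, and Ck for its reported band, the controlled quantity is the FCR definition used in the paper (notation lightly normalized here, with δ a user-chosen level):

    \mathrm{FCR} \;=\; \mathbb{E}\!\left[\frac{1}{\max\{|\mathcal{K}|,\,1\}} \sum_{k \in \mathcal{K}} \mathbf{1}\{F_k \notin \mathcal{C}_k\}\right] \;\le\; \delta

i.e., the expected fraction of selected candidates whose true distribution escapes its band.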

What carries the argument

e-values constructed to remain valid under arbitrary data-dependent pre-selection, which are then used to bound the post-selection false coverage rate for the full distributional KPI estimates.
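
A minimal sketch of that machinery, assuming the DKW-based implementation referenced in Figure 7 together with a power-family e-calibrator f_τ(p) = τ·p^(τ−1): the DKW p-value for a hypothesized CDF is calibrated into an e-value, and the confidence band collects every CDF whose e-value stays below a threshold. The function names and the threshold `e_thresh` (which the paper's FCR calibration step would supply) are illustrative, not the authors' API.

    import numpy as np

    def power_calibrator(p, tau=0.5):
        # Power-family e-calibrator f_tau(p) = tau * p**(tau - 1), tau in (0, 1);
        # it maps any valid p-value into an e-value with expectation <= 1.
        return tau * p ** (tau - 1.0)

    def dkw_e_value(samples, F, tau=0.5):
        # DKW p-value for H0 "the true CDF is F" (F a vectorized callable),
        # p(F) = min(1, 2 exp(-2 n T^2)) with T = sup_x |F_hat(x) - F(x)|,
        # then calibrated into an e-value.
        x = np.sort(np.asarray(samples))
        n = len(x)
        Fx = F(x)
        right = np.arange(1, n + 1) / n            # empirical CDF at each sample
        left = np.arange(0, n) / n                 # empirical CDF just before it
        T = max((right - Fx).max(), (Fx - left).max())
        p = min(1.0, 2.0 * np.exp(-2.0 * n * T ** 2))
        return power_calibrator(p, tau)

    def dkw_band(samples, e_thresh, tau=0.5):
        # The band {F : e-value < e_thresh}, found by inverting
        # f_tau(2 exp(-2 n eps^2)) = e_thresh for the half-width eps;
        # assumes e_thresh > tau so that p_star < 1.
        n = len(samples)
        p_star = (e_thresh / tau) ** (1.0 / (tau - 1.0))
        eps = np.sqrt(np.log(2.0 / p_star) / (2.0 * n))
        x = np.sort(np.asarray(samples))
        F_hat = np.arange(1, n + 1) / n
        return x, np.clip(F_hat - eps, 0.0, 1.0), np.clip(F_hat + eps, 0.0, 1.0)

Because any e-value satisfies E[e] ≤ 1 under the null, Markov's inequality gives P(e ≥ 1/δ) ≤ δ no matter how the candidate was chosen; that selection-indifference is the property the FCR argument leans on.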

If this is right

  • Valid estimates of full test-time KPI distributions become available without reserving separate data for selection versus evaluation.
  • Users can compare candidate models across a range of reliability levels in a single procedure.
  • Fewer total samples suffice for the same guarantee when the stated efficiency conditions hold.
  • The same framework applies to model selection in language model decoding and network configuration tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The e-value construction may extend directly to other post-selection tasks such as confidence intervals for selected parameters rather than full distributions.
  • Integration into automated machine learning pipelines could reduce the fraction of data held out for validation.
  • Adaptive or sequential selection rules could be accommodated if the e-value update remains martingale-valid (a sketch follows this list).
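
A sketch of that last point, with an assumed generic update rule (a Gaussian likelihood ratio) standing in for the paper's construction: per-batch e-values multiply into a nonnegative test martingale, and Ville's inequality then permits stopping or selecting at any data-dependent time.

    import numpy as np

    rng = np.random.default_rng(0)
    mu0, mu1, alpha = 0.0, 0.5, 0.05   # null mean, fixed alternative, error level

    def batch_e_value(batch):
        # Likelihood ratio of N(mu1, 1) against the null N(mu0, 1); under the
        # null its expectation is exactly 1, so products over independent
        # batches form a nonnegative test martingale.
        return float(np.exp(np.sum((mu1 - mu0) * batch - (mu1**2 - mu0**2) / 2.0)))

    e_process = 1.0
    for t in range(50):
        batch = rng.normal(mu0, 1.0, size=10)   # data drawn from the null
        e_process *= batch_e_value(batch)
        # Ville: P(sup_t E_t >= 1/alpha) <= alpha, so this adaptive stopping
        # rule keeps the false-flag probability below alpha.
        if e_process >= 1.0 / alpha:
            print(f"flagged at batch {t}, e = {e_process:.2f}")
            break
    else:
        print(f"never crossed 1/alpha; final e = {e_process:.3f}")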

Load-bearing premise

e-values can be built that stay valid for any data-dependent pre-selection rule without extra restrictions on the KPI or the selection process.

What would settle it

A simulation with known ground-truth KPI distributions in which the empirical coverage of the estimated distributions drops below the nominal level after a data-dependent selection step would show that FCR control fails.
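
A minimal harness for exactly that check, under illustrative assumptions (Gaussian KPI samples, the naive in-sample DKW band from Figure 2's benchmarks, arbitrary constants): select the candidate with the best empirical mean on the same data used to build its band, then count how often the true CDF escapes.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(1)
    m, n, delta, trials = 50, 100, 0.10, 2000
    eps = sqrt(np.log(2.0 / delta) / (2.0 * n))    # naive DKW half-width at level delta
    miss = 0
    for _ in range(trials):
        data = rng.normal(0.0, 1.0, size=(m, n))   # ground truth: every KPI is N(0, 1)
        k = int(np.argmax(data.mean(axis=1)))      # selection on the SAME data
        x = np.sort(data[k])
        F_true = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])
        right = np.arange(1, n + 1) / n            # empirical CDF at each sample
        left = np.arange(0, n) / n                 # and just before it
        if (F_true > left + eps).any() or (F_true < right - eps).any():
            miss += 1                              # true CDF left the band somewhere
    print(f"post-selection miscoverage {miss / trials:.3f} vs nominal {delta:.2f}")

Miscoverage well above δ for the naive band, alongside miscoverage at or below δ once the band is widened by the PS-DME correction, is the pattern the paper's claim predicts.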

Figures

Figures reproduced from arXiv: 2603.23055 by Amirmohammad Farzaneh, Osvaldo Simeone.

Figure 1. (a) CDF of a negatively-oriented key performance indicator (KPI), such as the prefill latency…

Figure 2. For a candidate hyperparameter λk = 0.0231: the true CDF Fk(x), the empirical CDF F̂k(x), and the post-selection confidence band [Lk(x), Uk(x)] for the data-splitting, naive in-sample, and post-hoc in-sample benchmarks. [Adjacent panel: best guaranteed error vs. target probability 1 − γ for PS-DME and SS-DME with n_sel/n ∈ {0.1, 0.2, 0.3}; (a) calibration size n = 20…]

Figure 3. Best guaranteed KPI for the synthetic data experiment as a function of the target probability…

Figure 4. Best guaranteed KPI (loss) for the Spider text-to-SQL experiment as a function of the target…

Figure 5. Illustration of the miscoverage event Ok, as defined in (5). [Panels (a) and (b): true CDF, empirical CDF, and confidence band over x ∈ [−3, 3]; panel (b) marks a miscoverage.]

Figure 6. Comparison of the band widths produced by SS-DME and by PS-DME with the e-calibrator…

Figure 7. Comparison between the DKW-based and Berk–Jones-based implementations of PS-DME.

Figure 8. Representative post-selection CDF bands for the O-RAN buffer occupancy experiment for…

Figure 9. Best guaranteed buffer occupancy for the O-RAN experiment as a function of the target…

Figure 10. Effect of the SS-DME split ratio n_sel/n on the best guaranteed buffer occupancy in the O-RAN experiment, for two target probabilities 1 − γ ∈ {0.7, 0.8}. Each panel corresponds to a different calibration regime, with (a) 30% and (b) 100% of the available data used for calibration. Vertical error bars show the standard error of the mean across 50 random splits of the calibration dataset into pre-selec…

Figure 11. Representative post-selection CDF bands for a selected server configuration using three…

Original abstract

Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, requiring the reliable estimate of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls post-selection false coverage rate (FCR) for the distributional KPI estimates and we establish explicit conditions under which it is provably more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance-reliability trade-offs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces post-selection distributional model evaluation (PS-DME), a framework extending e-value machinery to construct valid distributional KPI estimates after arbitrary data-dependent model pre-selection. It claims that PS-DME controls the post-selection false coverage rate (FCR) for the full distributional estimates and establishes explicit conditions under which the method is provably more sample-efficient than a sample-splitting baseline. The approach is demonstrated on synthetic data, LLM text-to-SQL decoding, and telecom network performance evaluation.

Significance. If the validity and efficiency claims hold, the work enables statistically reliable exploration of performance-reliability trade-offs without data splitting, which is valuable for applications like LLM evaluation and network optimization where post-selection bias is common. The extension of e-values to full distributional control after arbitrary selection is a technically interesting contribution.

major comments (2)
  1. [Abstract] Abstract and theoretical development: the claim that e-values remain valid under arbitrary pre-selection and control FCR for the full distributional KPI estimate, without additional restrictions on the KPI or the selection rule, requires explicit verification that the supermartingale property holds under optional stopping. If the KPI is unbounded and selection correlates with tail behavior, uniform integrability may be needed to keep the FCR bound from failing (see the note after this list).
  2. [Theoretical results] Efficiency comparison: the explicit conditions under which PS-DME is provably more sample efficient than splitting are not fully detailed in terms of the specific assumptions on the selection rule and KPI moments; without these, the efficiency gain claim cannot be assessed as general.
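
Context for the note referenced in comment 1: the standard tool here is Ville's inequality. For a nonnegative supermartingale (E_t)_{t≥0} with E[E_0] ≤ 1,

    \mathbb{P}\Big(\sup_{t \ge 0} E_t \ge 1/\alpha\Big) \le \alpha, \qquad \mathbb{E}[E_\tau] \le 1 \;\text{ for every stopping time } \tau,

where the second bound follows from Fatou's lemma without uniform integrability. The live question is therefore whether the paper's e-value construction for unbounded KPIs genuinely yields such a supermartingale, not whether the stopping theorem applies once it does.
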
minor comments (2)
  1. [Experiments] Experiments: provide details on the number of independent runs, error-bar computation method, and exact KPI definitions for the telecom and LLM tasks to support reproducibility.
  2. [Method] Notation: clarify the filtration used for the e-value supermartingale to include both the selection step and the distributional statistics.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions we will make to clarify the theoretical foundations and strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract and theoretical development: the claim that e-values remain valid under arbitrary pre-selection and control FCR for the full distributional KPI estimate, without additional restrictions on the KPI or the selection rule, requires explicit verification that the supermartingale property holds under optional stopping. If the KPI is unbounded and selection correlates with tail behavior, uniform integrability may be needed to keep the FCR bound from failing.

    Authors: We appreciate the referee's observation on the technical conditions required for the supermartingale property under optional stopping. Our framework relies on e-values as non-negative supermartingales, with FCR control following from the optional stopping theorem. To address potential issues with unbounded KPIs and tail correlations, we will add an explicit uniform integrability assumption to the main theorem and provide a short verification argument in the appendix showing that the FCR bound continues to hold under this condition. revision: yes

  2. Referee: [Theoretical results] Efficiency comparison: the explicit conditions under which PS-DME is provably more sample efficient than splitting are not fully detailed in terms of the specific assumptions on the selection rule and KPI moments; without these, the efficiency gain claim cannot be assessed as general.

    Authors: We agree that the efficiency comparison would benefit from a more explicit statement of assumptions. The current result holds when the selection rule is measurable with respect to the training data and the KPI possesses finite second moments. We will revise the relevant theorem to enumerate these conditions clearly, including the precise measurability requirement on the selection rule and the moment conditions on the KPI, making the scope of the efficiency gain fully assessable (a width sketch follows this exchange). revision: yes
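
The width sketch promised above, an editorial reconstruction from the DKW-based implementation rather than the paper's stated theorem: if SS-DME evaluates on the n − n_sel held-out samples at nominal level δ while PS-DME uses all n samples but pays a calibration penalty through its post-selection e-threshold e* and calibrator parameter τ, the DKW band half-widths compare as

    \varepsilon_{\mathrm{SS}} = \sqrt{\frac{\log(2/\delta)}{2(n - n_{\mathrm{sel}})}}, \qquad \varepsilon_{\mathrm{PS}} = \sqrt{\frac{\log(2/p^{*})}{2n}}, \qquad p^{*} = \Big(\frac{e^{*}}{\tau}\Big)^{1/(\tau - 1)},

so PS-DME is tighter exactly when (n − n_sel)·log(2/p*) < n·log(2/δ): the samples recovered from the selection split must outweigh the calibration penalty.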

Circularity Check

0 steps flagged

No significant circularity: derivation extends e-value framework without self-referential reduction

full rationale

The paper's central claims rest on extending established e-value supermartingale properties to control post-selection FCR for distributional KPI estimates after arbitrary pre-selection, with explicit efficiency comparisons to sample splitting. No step in the abstract or described derivation reduces by construction to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation whose validity depends on the present work. The framework is presented as building on prior e-value machinery without smuggling ansatzes or renaming known results via internal definitions. The derivation chain is therefore grounded in external benchmarks rather than self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of e-value constructions for arbitrary pre-selection and the existence of explicit conditions guaranteeing FCR control and efficiency gains; these are treated as domain assumptions rather than derived from first principles in the abstract.

axioms (1)
  • domain assumption E-values can be constructed to control post-selection false coverage rate for distributional KPI estimates after arbitrary data-dependent model pre-selection
    Invoked when claiming general statistical validity of PS-DME

pith-pipeline@v0.9.0 · 5525 in / 1308 out tokens · 29616 ms · 2026-05-15T00:55:20.572534+00:00 · methodology

discussion (0)

