pith. machine review for the scientific record. sign in

arxiv: 2605.08027 · v1 · submitted 2026-05-08 · 📊 stat.ME · stat.AP

Recognition: no theorem link

Randomization Tests for Distributions of Individual Treatment Effects via Combined Rank Statistics

David Kim, Jake Bowers, Xinran Li, Yongchang Su

Pith reviewed 2026-05-11 02:35 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords randomization testsindividual treatment effectsrank statisticscausal inferenceadaptive combinationstratified experimentsfinite-sample validity
0
0 comments X

The pith

Adaptive combination of rank statistics allows valid tests for individual treatment effect distributions without power loss from choosing the wrong statistic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops methods to answer questions about the distribution of individual causal effects in randomized experiments, such as the proportion of units that benefit from treatment or the median effect size. Standard rank-based tests require choosing a statistic in advance, which can lead to low power if the choice does not match the true effect pattern. The proposed approach adaptively combines several such statistics while preserving the exact finite-sample validity guaranteed by randomization inference. For experiments with strata of different sizes, weighting schemes are introduced to combine evidence appropriately. This results in a test whose power is at least as good as the best single statistic, without needing to know which one is best beforehand, as shown in an application to a teacher training program.

Core claim

The central claim is that adaptive procedures for combining multiple rank-based statistics yield randomization tests for features of the individual treatment effect distribution, such as the share of beneficiaries, that maintain exact finite-sample validity under the randomization distribution. In stratified designs, the methods include weighting to aggregate across strata of varying sizes. The resulting tests achieve power comparable to or exceeding that of the strongest single statistic without requiring the analyst to select the optimal one in advance.

What carries the argument

Adaptive combination of multiple rank-based statistics, constructed so the overall test remains exactly valid under the randomization null while data-dependently emphasizing stronger evidence.

If this is right

  • The combined test can indicate that roughly half the treated units benefited when a single poorly chosen rank test would indicate only a small minority.
  • Weighting schemes permit valid evidence aggregation in stratified experiments even when strata sizes differ substantially.
  • Questions about the median individual treatment effect or the largest effect can be addressed without committing to one rank statistic beforehand.
  • Power loss from Bonferroni adjustments is avoided when exploring several possible rank statistics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analysts using the combined procedure may reach more reproducible conclusions than when each selects a different single statistic based on intuition.
  • The framework could be extended to settings with multiple treatments by redefining the rank statistics accordingly, though new validity arguments would be needed.
  • Policy evaluations that apply this method might detect program success on a broader scale than earlier single-statistic analyses suggested.

Load-bearing premise

The particular construction used to combine the rank statistics preserves the exact known distribution of the test statistic under the randomization null of no individual treatment effects.

What would settle it

A Monte Carlo simulation that draws many datasets under the exact null of no treatment effects and finds the combined test rejects at a rate higher than the nominal alpha level would show that finite-sample validity fails.

Figures

Figures reproduced from arXiv: 2605.08027 by David Kim, Jake Bowers, Xinran Li, Yongchang Su.

Figure 1
Figure 1. Figure 1: Lower confidence bounds for treatment effect quantiles in the education experiment [PITH_FULL_IMAGE:figures/full_fig_p021_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Confidence and prediction bounds for treatment effect quantiles in education experi [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗
read the original abstract

What proportion of treated units actually benefited from an experimental intervention? What is the median or the largest individual treatment effect? This paper develops methods for answering such questions about the distribution of individual causal effects in randomized experiments. Existing approaches require the analyst to select a rank-based test statistic before observing the data. A poor choice can substantially reduce power, while searching over multiple test statistics and adjusting for multiplicity using Bonferroni correction also incurs power loss. We propose inference procedures that adaptively combine multiple rank-based statistics while maintaining finite-sample validity. For stratified experiments, we further develop weighting schemes that effectively aggregate evidence across strata of heterogeneous sizes. The resulting combined test achieves power comparable to, or exceeding, that of the best individual test, without requiring prior knowledge of the optimal statistic. When applied to a randomized experiment evaluating a teacher training program, the combined test suggests that roughly half of treated teachers benefited, whereas a single rank-based test may indicate only a small minority. Thus, the choice of test determined whether the program appears broadly successful or narrowly effective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops randomization-based inference procedures for features of the distribution of individual treatment effects (ITEs) in randomized experiments. It proposes methods to adaptively combine multiple rank-based test statistics while claiming to preserve exact finite-sample validity under the randomization distribution, extends the approach with weighting schemes for stratified experiments of heterogeneous sizes, and illustrates the method on a teacher-training randomized experiment where the combined test indicates that roughly half of treated units benefited (in contrast to conclusions from any single rank statistic).

Significance. If the finite-sample validity claim holds after the adaptive combination step, the work would be a useful advance for causal inference: it removes the need to pre-specify a single rank statistic or to pay a Bonferroni penalty when exploring several, while still delivering an exact test. This could meaningfully increase power for detecting heterogeneity in ITE distributions without requiring asymptotic approximations or data splitting.

major comments (2)
  1. [§3] §3 (construction of the combined statistic): the central claim of exact finite-sample validity requires that the reference distribution of the adaptive combination fully incorporates the data-dependent choice or weighting of the component rank statistics. The manuscript must show explicitly (via algorithm or proof) that the p-value is obtained by enumerating or sampling the joint randomization distribution over all admissible treatment assignments, including the adaptation step; otherwise the test is only asymptotically valid. The abstract asserts validity but does not indicate whether this joint enumeration is performed.
  2. [§4] §4 (stratified weighting): the weighting scheme that aggregates evidence across strata of unequal sizes must be shown to preserve the exactness property under the stratified randomization distribution. If the weights are estimated from the observed outcomes, the null distribution must again condition on or include that estimation; the current description leaves open whether this is done or whether an additional adjustment is required.
minor comments (2)
  1. Notation for the combined statistic and its randomization distribution should be introduced earlier and used consistently; several symbols are defined only after first use.
  2. The teacher-training example would benefit from a table reporting the individual rank statistics, their p-values, the combined p-value, and the implied ITE distribution summary (e.g., proportion positive) so readers can directly compare.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report, which highlights important aspects of finite-sample validity that merit clearer exposition. We address each major comment below and have revised the manuscript to strengthen the explicit description of our procedures while preserving the original claims.

read point-by-point responses
  1. Referee: [§3] §3 (construction of the combined statistic): the central claim of exact finite-sample validity requires that the reference distribution of the adaptive combination fully incorporates the data-dependent choice or weighting of the component rank statistics. The manuscript must show explicitly (via algorithm or proof) that the p-value is obtained by enumerating or sampling the joint randomization distribution over all admissible treatment assignments, including the adaptation step; otherwise the test is only asymptotically valid. The abstract asserts validity but does not indicate whether this joint enumeration is performed.

    Authors: We agree that explicit demonstration of the joint randomization distribution is required to substantiate the exact finite-sample validity claim. Our procedure computes the p-value by enumerating (or Monte Carlo sampling) all admissible treatment assignments under the experimental design; for each such assignment the adaptive combination rule is re-applied identically to the observed data, so that the reference distribution fully incorporates the data-dependent choice of rank statistics. We have added a new Algorithm 1 in §3 that presents the complete procedure, including the adaptation step inside the randomization loop, together with a brief proof that the resulting p-value is exactly valid. The abstract has also been revised to note that validity is obtained by conditioning on the full randomization distribution that includes the adaptation. revision: yes

  2. Referee: [§4] §4 (stratified weighting): the weighting scheme that aggregates evidence across strata of unequal sizes must be shown to preserve the exactness property under the stratified randomization distribution. If the weights are estimated from the observed outcomes, the null distribution must again condition on or include that estimation; the current description leaves open whether this is done or whether an additional adjustment is required.

    Authors: The referee correctly notes that exactness requires the null distribution to account for any data-dependent elements. In the proposed weighting scheme the weights are functions exclusively of the fixed stratum sizes (which are known a priori and invariant to the randomization), not estimated from the observed outcomes. Consequently, the same fixed weights are applied to every stratified randomization when forming the reference distribution. We have revised §4 to state this explicitly, added a short proof that the resulting test remains exact under the stratified randomization null, and included a remark clarifying that outcome-dependent weights would require additional conditioning (an extension we do not pursue here). revision: yes

Circularity Check

0 steps flagged

No circularity: new adaptive combination procedures are constructed to preserve randomization validity

full rationale

The paper introduces explicit new procedures for adaptively combining rank statistics and weighting across strata, with finite-sample validity asserted under the standard randomization distribution induced by the experimental design. No quoted step reduces a claimed prediction or validity result to a fitted parameter, self-definition, or prior self-citation that itself assumes the target result. The central construction is presented as a novel aggregation that enumerates or samples the appropriate null distribution, independent of the paper's own outputs. This is the typical non-circular case for a methods paper extending randomization inference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is based on abstract only, so the ledger is necessarily incomplete. No free parameters or invented entities are mentioned. The approach rests on standard assumptions of randomized experiments.

axioms (2)
  • domain assumption The experiment follows a known randomization distribution that permits exact finite-sample inference.
    Core to all randomization tests; invoked implicitly throughout the abstract.
  • domain assumption Rank-based statistics can be defined on the observed outcomes to capture features of the individual treatment effect distribution.
    Stated in the abstract's description of the test statistics.

pith-pipeline@v0.9.0 · 5479 in / 1434 out tokens · 49688 ms · 2026-05-11T02:35:05.968539+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    and Wager, S

    Athey, S. and Wager, S. (2021). Policy learning with observational data. Econometrica , 89(1):133--161

  2. [2]

    Bickel, P. J. and Freedman, D. A. (1984). Asymptotic Normality and the Bootstrap in Stratified Sampling . The Annals of Statistics , 12:470 -- 482

  3. [3]

    Caughey, D., Dafoe, A., Li, X., and Miratrix, L. (2023). Randomization inference beyond the sharp null: Bounded null hypotheses and quantiles of individual treatment effects. Journal of the Royal Statistical Society, Series B (Statistical Methodology) , 85:1471--1491

  4. [4]

    and Li, X

    Chen, Z. and Li, X. (2026). Enhanced inference for distributions and quantiles of individual treatment effects in various experiments. Journal of the American Statistical Association , page inpress

  5. [5]

    Chen, Z., Li, X., and Zhang, B. (2024). The role of randomization inference in unraveling individual treatment effects in early phase vaccine trials. Statistical Communications in Infectious Diseases , 16:20240001

  6. [6]

    Fisher, R. A. (1935). The D esign of E xperiments, 1st Edition . Edinburgh, London: Oliver and Boyd

  7. [7]

    H \'a jek, J. (1960). Limiting distributions in simple random sampling from a finite population. Publications of the Mathematics Institute of the Hungarian Academy of Science , 5:361--374

  8. [8]

    J., Smith, J., and Clements, N

    Heckman, J. J., Smith, J., and Clements, N. (1997). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impaces. The Review of Economic Studies , 64(4):487--535

  9. [9]

    L., Shinohara, M., Miratrix, L., Hesketh, S

    Heller, J. L., Shinohara, M., Miratrix, L., Hesketh, S. R., and Daehler, K. R. (2010). Learning science for teaching: Effects of professional development on elementary teachers, classrooms, and students. Proceedings from Society for Research on Educational Effectiveness

  10. [10]

    Heng, S., Zhang, J., and Feng, Y. (2025). Design-based causal inference with missing outcomes: Missingness mechanisms, imputation-assisted randomization tests, and covariate adjustment. Journal of the American Statistical Association , in press

  11. [11]

    and Ratkovic, M

    Imai, K. and Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics , 7(1):443--470

  12. [12]

    Koenker, R. (2017). Quantile regression: 40 years on. Annual review of economics , 9(1):155--176

  13. [13]

    and Ding, P

    Li, X. and Ding, P. (2017). General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association , 112:1759--1769

  14. [14]

    Li, X., Sheng, P., and Yu, Z. (2025). Randomization inference with sample attrition. arXiv preprint arXiv:2507.00795

  15. [15]

    and Small, D

    Li, X. and Small, D. S. (2022). Randomization-based test for censored outcomes: A new look at the logrank test. Statistical Science , page To appear

  16. [16]

    and Yang, Y

    Liu, H. and Yang, Y. (2020). Regression-adjusted average treatment effect estimates in stratified randomized experiments . Biometrika , 107:935--948

  17. [17]

    Manski, C. F. (2004). Statistical treatment rules for heterogeneous populations. Econometrica , 72(4):1221--1246

  18. [18]

    Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles (with discussion). Section 9 (translated). reprinted ed. Statistical Science , 5:465--472

  19. [19]

    and Wager, S

    Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects . Biometrika , 108(2):299--319

  20. [20]

    Puri, M. L. (1965). On the combination of independent two sample tests of a general class. Revue de l'Institut International de Statistique , pages 229--241

  21. [21]

    Qu, T., Du, J., and Li, X. (2025). Randomization-based z-estimation for evaluating average and individual treatment effects. Biometrika , 112(2):1--9

  22. [22]

    Rosenbaum, P. R. (2002). Observational Studies . Springer, New York, 2 edition

  23. [23]

    Rosenbaum, P. R. (2007). Confidence intervals for uncommon but dramatic responses to treatment. Biometrics , 63:1164--1171

  24. [24]

    Rosenbaum, P. R. and Silber, J. H. (2008). Aberrant effects of treatment. Journal of the American Statistical Association , 103(481):240--247

  25. [25]

    Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology , 66:688--701

  26. [26]

    and Li, X

    Shi, L. and Li, X. (2024). Some theoretical foundations for the design and analysis of randomized experiments. Journal of Causal Inference , 12(1)

  27. [27]

    Stephenson, W. R. and Ghosh, M. (1985). Two sample nonparametric tests based on subsamples. Communications in Statistics - Theory and Methods , 14:1669--1684

  28. [28]

    and Li, X

    Su, Y. and Li, X. (2024). Treatment effect quantiles in stratified randomized experiments and matched observational studies . Biometrika , 111(1):235--254

  29. [29]

    A., Gentles, A

    Tian, L., Alizadeh, A. A., Gentles, A. J., and Tibshirani, R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association , 109:1517--1532

  30. [30]

    van Elteren, P. H. (1960). On the combination of independent two sample tests of wilcoxon. Bulletin of the Institute of International Statistics , 37:351--361

  31. [31]

    and Li, X

    Wu, D. and Li, X. (2025). Sensitivity analysis for quantiles of hidden biases in matched observational studies. Journal of the American Statistical Association , 120:1657--1668

  32. [32]

    A., Davidian, M., Zhang, M., and Laber, E

    Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., and Laber, E. (2012). Estimating optimal treatment regimes from a classification perspective. Stat , 1(1):103--114

  33. [33]

    Zhao, A., Ding, P., and Li, F. (2024). Covariate adjustment in randomized experiments with missing outcomes and covariates. Biometrika , 111:1413--1420