pith. sign in

arxiv: 2405.07860 · v4 · pith:5GBWKPA4new · submitted 2024-05-13 · 💰 econ.EM · math.ST· stat.ML· stat.TH

Order-Explicit Linearization of High-Dimensional U-Statistics

Pith reviewed 2026-05-24 00:57 UTC · model grok-4.3

classification 💰 econ.EM math.STstat.MLstat.TH
keywords U-statisticsHájek projectionlarge deviation boundshigh-dimensional statisticsconcentration inequalitiesGaussian approximationrandom forestsnonparametric regression
0
0 comments X

The pith

Any U-statistic of order b on n samples deviates from its Hájek projection by at most order O_p(φ b n^{-1} log²(dn)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an explicit bound on the difference between a high-dimensional U-statistic and its Hájek projection. For kernels of order b whose coordinates have ψ1-Orlicz norm bounded by φ, the supremum deviation scales as φ b over n times the square of the log of d n. The authors prove this rate is optimal up to the log power and use it to obtain new concentration and approximation theorems. These theorems in turn justify resampling-based simultaneous confidence intervals for a family of subsampled-kernel nonparametric regression estimators that includes several random forest procedures.

Core claim

Any U-statistic of order b on n observations, with a d-dimensional kernel whose coordinates have ψ1-Orlicz norm at most φ, has a maximum deviation from its Hájek projection of order O_p(φ b n^{-1} log²(dn)). The proof relies on new order-explicit moment inequalities for higher-order Hoeffding components, and the rate cannot be improved beyond a polynomial factor in the logarithm.

What carries the argument

Order-explicit moment inequalities for higher-order Hoeffding components that bound the tail of the full U-statistic.

If this is right

  • Bernstein-type concentration inequalities hold for high-dimensional U-statistics of arbitrary order.
  • Gaussian approximation results follow for the same class of statistics.
  • Resampling-based simultaneous confidence intervals are consistent for nonparametric regression estimators built from subsampled kernels.
  • The consistency result covers several forms of random forest regression, including Generalized Random Forests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linearization technique may apply directly to other degenerate statistics that admit a Hoeffding decomposition.
  • Finite-sample simulations with concrete kernels could reveal whether the log-squared factor is necessary in practice.
  • The bound supplies a quantitative route to uniform inference over many kernel-based estimators without requiring separate case-by-case analysis.

Load-bearing premise

The new moment inequalities for higher-order Hoeffding components hold and are strong enough to control the deviation probability of the entire U-statistic.

What would settle it

An explicit U-statistic of order b whose coordinate-wise deviation from the Hájek projection exceeds C φ b n^{-1} log²(dn) with probability bounded away from zero, for arbitrarily large C.

read the original abstract

We give an order-explicit large deviation bound for the difference between a high-dimensional $U$-statistic and its H\'{a}jek projection. In particular, we show that any $U$-statistic of order $b$ on $n$ observations, with a $d$-dimensional kernel whose coordinates have $\psi_1$-Orlicz norm at most $\phi$, has a maximum deviation from its H\'{a}jek projection of order $O_p(\phi b n^{-1}\log^2(dn))$. The proof relies on the development of novel order-explicit moment inequalities for higher-order Hoeffding components. We show that this rate is unimprovable, up to the polynomial factor on the logarithmic term. As corollaries, we obtain new Bernstein-type concentration and Gaussian approximation results for high-dimensional $U$-statistics. We apply these results to establish the consistency of a set of resampling-based simultaneous confidence intervals built around a class of nonparametric regression estimators constructed with subsampled kernels. This class encompasses several forms of random forest regression, including Generalized Random Forests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript derives an order-explicit large-deviation bound showing that any U-statistic of order b on n observations, with a d-dimensional kernel whose coordinates have ψ₁-Orlicz norm at most φ, deviates from its Hájek projection by at most O_p(φ b n^{-1} log²(dn)). The proof rests on newly developed moment inequalities for higher-order Hoeffding components; the rate is asserted to be sharp up to the polynomial log factor. Corollaries include Bernstein-type concentration and Gaussian approximation bounds for high-dimensional U-statistics, which are then used to establish consistency of resampling-based simultaneous confidence intervals for a class of subsampled-kernel nonparametric regression estimators that includes several random-forest variants.

Significance. If the order-explicit moment inequalities hold, the result supplies a technically useful linearization tool whose explicit dependence on the kernel order b is valuable for econometric and machine-learning settings that routinely employ U-statistics of order greater than two. The sharpness claim and the concrete application to random-forest inference add to the contribution; the work is therefore of interest to both theoretical econometricians and practitioners working with high-dimensional nonparametric estimators.

minor comments (3)
  1. §2, notation for the Hoeffding decomposition: the indexing of the projection terms P_k and the associated kernels h_k is introduced without an explicit recursive definition; adding a short display equation that relates h_k to the original kernel would improve readability for readers outside the U-statistic literature.
  2. Theorem 3.1 (the main bound): the statement lists the O_p rate but does not record the implicit constant’s dependence on b; a parenthetical remark clarifying that the constant is at most polynomial in b would make the order-explicit nature of the result fully transparent.
  3. §5, application to random forests: the mapping from the general U-statistic bound to the particular subsampled-kernel estimator is sketched rather than derived step-by-step; inserting one intermediate display that identifies the effective order b and the Orlicz norm φ for the forest kernel would strengthen the link between theory and application.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its significance for econometric and machine-learning applications, and recommendation of minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via novel inequalities

full rationale

The paper develops novel order-explicit moment inequalities for higher-order Hoeffding components and uses them to derive the stated large-deviation bound on the U-statistic deviation from its Hájek projection. This is a standard forward derivation from newly established results rather than any reduction by construction, self-definition, fitted-input renaming, or load-bearing self-citation chain. The provided abstract and reader's summary confirm the inequalities are asserted as independent content sufficient for the O_p bound, with no equations or steps shown to collapse back to the target quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of newly developed order-explicit moment inequalities for Hoeffding components, which are not detailed in the abstract; the result also invokes the standard Hoeffding decomposition of U-statistics.

axioms (1)
  • standard math U-statistics admit an orthogonal Hoeffding decomposition into components of increasing order
    This is a classical result in U-statistic theory invoked as the basis for controlling the deviation from the Hájek projection.

pith-pipeline@v0.9.0 · 5729 in / 1281 out tokens · 38213 ms · 2026-05-24T00:57:04.813591+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive discovery of effect modification in matched observational studies

    stat.ME 2026-05 unverdicted novelty 5.0

    A finite-sample valid method discovers and selects covariate-interpretable subgroups with effect modification in matched observational studies, exactly controlling subgroup-level FDR and incorporating sensitivity anal...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    and Szepesv´ari, C

    Abou-Moustafa, K. and Szepesv´ari, C. (2019). An exponential efron-stein inequality for l q stable learning rules. In Algorithmic Learning Theory, pages 31–63. PMLR. Adler, R. J. and Taylor, J. E. (2009). Random fields and geometry. Springer Science & Business Media. Ai, C. and Chen, X. (2003). Efficient estimation of models with conditional moment restri...

  2. [2]

    and Pouzo, D

    Chen, X. and Pouzo, D. (2012). Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321. Chernozhukov, V ., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters: Double/debiased ...

  3. [3]

    Chernozhukov, V ., Chetverikov, D., and Kato, K. (2017b). Detailed proof of nazarov’s inequality.arXiv preprint arXiv:1711.10696. Chernozhukov, V ., Chetverikov, D., Kato, K., and Koike, Y . (2022a). Improved central limit theorem and bootstrap approximations in high dimensions. The Annals of Statistics, 50(5):2562–2586. Chernozhukov, V ., Escanciano, J. ...

  4. [4]

    and Romano, J

    DiCiccio, C. and Romano, J. (2022). Clt for u-statistics with growing dimension. Statistica Sinica, 32(1). Efron, B. and Stein, C. (1981). The jackknife estimate of variance. The Annals of Statistics, pages 586–596. Fix, E. and Hodges, J. L. (1989). Discriminatory analysis. nonparametric discrimination: Consistency properties. International Statistical Re...

  5. [5]

    G¨otze, F., Sambale, H., and Sinulis, A. (2021). Concentration inequalities for polynomials in α-sub- exponential random variables. Electronic Journal of Probability, 26(none):1 –

  6. [6]

    Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, pages 315–331. H´ajek, J. (1968). Asymptotic normality of simple linear rank statistics under alternatives. The Annals of Mathematical Statistics, pages 325–346. Hanson, D. L. and Wright, F. T. (1971). A bound on tail pro...

  7. [7]

    and McKenzie, D

    Kraay, A. and McKenzie, D. (2014). Do poverty traps exist? assessing the evidence. Journal of Economic Perspectives, 28(3):127–148. Kumar, R., Lokshtanov, D., Vassilvitskii, S., and Vattani, A. (2013). Near-optimal bounds for cross-validation via loss stability. In International Conference on Machine Learning, pages 27–35. PMLR. Lewis, G. and Syrgkanis, V...

  8. [8]

    Peng, W., Coleman, T., and Mentch, L. (2022). Rates of convergence for random forests via generalized u-statistics. Electronic Journal of Statistics, 16(1):232–292. Politis, D. N. and Romano, J. P. (1994). Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, pages 2031–2050. Politis, D. N., Romano, J. P....

  9. [9]

    and Chernozhukov, V

    Semenova, V . and Chernozhukov, V . (2021). Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 24(2):264–289. Sexton, J. and Laake, P. (2009). Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis, 53(3):801–811. Sherman, R. P. (1994). Maximal...

  10. [10]

    Wager, S

    Cambridge university press. Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242. Wager, S., Hastie, T., and Efron, B. (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal o...

  11. [11]

    Suppose that the Neyman orthogonal moment function M(x; θ0, g0) satisfies Assumption A.1 and Part (i) of Assumption 3.3 and that the centered empirical moment function M n(x; θ0, g0) satisfies Assumption A.2. If Assumption A.3 and Assumption A.4 hold, then the confidence region defined in Definition 2.1 satisfies sup P ∈P P n θ0(x(d)) ∈ ˆC(x(d)) o − (1 − ...

  12. [12]

    nX i=1 Zi q #1/q ≲ E

    Lemma C.2 implies that E   1 Nb X s∈Sn,b Vsh(x; Ds) q | Dn   ≤ qbq/2   1 Nb X s∈Sn,b n b −1 (h(x; Ds))2   q/2 . (C.10) Consequently, Lemma C.1 implies that E   1 Nb X s∈Sn,b h(x; Ds) q   ≤ 2bqE  E   1 Nb X s∈Sn,b Vsh(x; Ds) q | Dn     ≤ 2bqqbq/2E     1 Nb X s∈Sn,b n b −1 (h(x; Ds))2   q/2  . (C.11) To simplify the expression ...

  13. [13]

    implies that P n ∥Λ−1/2Z∥∞ ≥ C p log dn o ≤ n−1 . Thus, Theorem D.1 implies that P n ∥Λ−1/2Rn(x(dn))∥∞ ≥ C p log dn o ≲ φ2 log5(dn) λ2n 1/4 + δn p log d + ρn (D.26) 13 and that P n ∥Λ−1/2R∗ n(x(d))∥∞ ≥ C p log dn | Dn o ≲ φ2 log5(dn) λ2n 1/4 + δn p log d (D.27) with probability greater than 1 − C(n−1/2φλ−1 log3/2(dn) + ρn). In turn, to bound the discrepan...

  14. [14]

    In the main text, we assume that the nuisance parameter estimator ˆgn is computed on a separate sample, independent of the dataDn

    ■ F.3 Stochastic Equicontinuity without Sample Splitting. In the main text, we assume that the nuisance parameter estimator ˆgn is computed on a separate sample, independent of the dataDn. In this section, we give additional conditions under which the nuisance parameter estimator ˆgn can be computed on the same data used to construct the estimator ˆθn(x(d...

  15. [15]

    double-centering trick,

    First, we require that moment m(Di, g) satisfies a pair of smoothness conditions. Assumption F.1. Define the higher-order variogram V (q)(x; g, g′) = E h m(Di, g) − m(Di, g′) 2q | Xi = x i for each g and g′ in G. For each integer q > 0, the Lipschitz condition |V (q)(x; g, g′) − V (q)(x′; g, g′)| ≲ ∥x − x′∥q ∞ (F.9) holds for each x and x′ in X and the me...

  16. [16]

    (F.38) For reasons of space, we give the details of the proof of the upper bound encoded in (4.8)

    Define the normalized quantity ˆu(1)(x(d); Di) = Σ−1/2˜u(1)(x(d); Di) . (F.38) For reasons of space, we give the details of the proof of the upper bound encoded in (4.8). The lower bound will follow from an analogous argument, and we note the differences where they occur. Observe that the decomposition (F.35) implies the upper bound P r n b2 Σ−1/2U n,b(x(...

  17. [17]

    G.1 Data

    Appendix G.3 discusses our simulation calibration. G.1 Data. The data from Banerjee et al. (2015) were acquired from https://dataverse.harvard. edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NHIXNT on September 10,

  18. [18]

    Here, we consider the data from the graduation program implemented in Ghana, as it has a larger sample size

    The data from the graduation program implemented in Pakistan are considered in Chen and Ritzwoller (2023) and Ritzwoller and Romano (2023). Here, we consider the data from the graduation program implemented in Ghana, as it has a larger sample size. The data record measurements of many pre-treatment and post-treatment outcomes for 2,606 individuals in the ...

  19. [19]

    In Figure 1 and Figure 2, we set the subsample proportion b/n equal to 0.05

    G.2 Parameter Choices. In Figure 1 and Figure 2, we set the subsample proportion b/n equal to 0.05. In constructing the nuisance parameter estimate, we use set b/n = 0.025. We use r = 200 bootstrap replicates throughout. We use 20,000 trees to construct Figure 1 and 2,000 trees in each bootstrap replicate to construct Figure 2 and throughout the simulatio...

  20. [20]

    As the data Dh are drawn independently and identically with distribution P in P and we have assumed that δn/2 ≲ δn and ρn/2 ≲ ρn, by setting t = Cδn, Assumption A.3 and the bound (H.23) imply that the event En(t) occurs with probability greater than 1 − Cρn. Thus, Lemma D.1 implies that P n√nR∗ n(x(d)) ∈ R | Dn o − P {Z ∈ R} (H.46) 37 ≤ P ( 2√n nX i∈h ˆu(...

  21. [21]

    To handle the term (H.51), we apply the following quantitative central limit theorem, stated as Lemma 4.6 in Chernozhukov et al

    Lemma D.1 implies that (H.52) is less than t p log(d). To handle the term (H.51), we apply the following quantitative central limit theorem, stated as Lemma 4.6 in Chernozhukov et al. (2022a). Lemma H.2 (Lemma 4.6, Chernozhukov et al., 2022a) . Consider the setting and assumptions of Lemma H.1. Let Xn = ( X1, . . . , Xn) collect the observed data and let ...

  22. [22]

    See Chapter 1 of De la Pena and Gin´e (1999) for a textbook treatment

    The second inequality is a L´evy type inequality for independent random vectors, due to Montgomery-Smith (1993). See Chapter 1 of De la Pena and Gin´e (1999) for a textbook treatment. Lemma H.4 (Theorem 1.1.5, De la Pena and Gin´e (1999)). Let Z1, . . . , Zn be independent random vectors in Rd. There exists a universal constants C1 and C2 such that P ( ma...

  23. [23]

    (H.62) Now, the choice δ = C r log n n (H.63) gives P Qn ≥ δn 2 ≲ 1 n (H.64) by (H.58)

    In particular, as ∥ 2√n ˆu(x(j); Di)∥ψ1 ≤ 2√n φ λ , and max j∈[d] δn/2X k=1 E 2√n ˆu(x(j); Di) = 2δ , (H.61) Lemma H.3 and Lemma H.4 imply that P ( max 1≤k≤δ n 2 2√n kX i=1 ˆu x(d); Di ∞ ≥ C δ log1/2(dn) + 2√n φ λ log2(dn) ) ≲ 1 n . (H.62) Now, the choice δ = C r log n n (H.63) gives P Qn ≥ δn 2 ≲ 1 n (H.64) by (H.58). Plugging this choice into (H.62) yie...

  24. [24]

    Observe that 2 n X i∈s ˆu(x(d), Di) − 1 n nX i=1 ˆu(x(d), Di) = 1 n nX i=1 ˜Viˆu(x(d), Di)

    Let ˜Vi be a random variable taking the value 1 with i is an element of the subset s and taking the value −1 otherwise. Observe that 2 n X i∈s ˆu(x(d), Di) − 1 n nX i=1 ˆu(x(d), Di) = 1 n nX i=1 ˜Viˆu(x(d), Di) . (J.7) and that the weights ˜Vi are independent and identically distributed Rademacher random variables. Thus, on the event E ′n(t, t0), we have ...

  25. [25]

    It will suffice to show that 1 n! X π∈Pn n b −1 ⌊n/b⌋X l=1 F1(x; sπ,l) 2q ≲ 2qn−1bγ(2q)b1/2q. (J.15) 55 In particular, if (J.15) holds, then we have that P    1 n! X π∈Pn n b −1 ⌊n/b⌋X l=1 F1(x; sπ,l) > 2√e qn−1bγ(2q)b1/2q    ≤ 1 exp(q) E " | 1 n! P π∈Pn n b −1P⌊n/b⌋ l=1 F1(x; sπ,l)|2q # 2qn−1bγ(2q)b1/2q 2q ≲ exp(−q), (J.16) by Markov’s inequality. ...

  26. [26]

    , Xb be independent centered random variables

    Let X1, . . . , Xb be independent centered random variables. For any integer q ≥ 2, we have that ∥ 1 n X Xi∥1 ≲ r Var(Xi)q b + q b ∥ max i∈[b] Xi∥q. In particular, Lemma J.3 implies that ∥ b′ n bX i=1 u(1)(Di) − u(1)(D′ i) ∥q = b′b n ∥1 b bX i=1 u(1)(Di) − u(1)(D′ i) ∥q ≲ b′b n r σ2 b′q b + q ϕ b ! ≲ √ b′b n qϕ, (J.47) where the second inequality follows ...

  27. [27]

    FIGURE J.2. CATE Estimates, 2-Fold Cross-Fitting 3.0 3.5 4.0 4.5 5.0 −0.6 −0.3 0.0 0.3 Baseline Assets Baseline Consumption 0.1 0.2 0.3 CATE Notes: Figure 1 displays a heat map giving CATE estimates for the intervention studied in Banerjee et al. (2015) on post-treatment assets, using the the 2-fold cross-split estimator (F.24). Other features of the disp...

  28. [28]

    65 FIGURE J.3. Performance, 2-Fold Cross-Fitting Mean Length Max Bias Mean Bias Coverage Lower Coverage Upper Coverage 1.0 2.5 5.0 7.5 1.0 2.5 5.0 7.5 1.0 2.5 5.0 7.5 0.88 0.90 0.92 0.94 0.0020 0.0025 0.0030 0.0035 0.0040 0.90 0.91 0.92 0.93 0.94 0.030 0.035 0.040 0.045 0.050 0.055 0.800 0.825 0.850 0.875 0.900 0.20 0.25 0.30 0.35 Sample Multiplier (h) b ...

  29. [29]

    FIGURE J.4. Half-Sample Confidence Region Lower Bound, 2-Fold Cross-Fitting 3.0 3.5 4.0 4.5 5.0 −0.6 −0.3 0.0 0.3 Baseline Assets Baseline Consumption −0.2 −0.1 0.0 0.1 CATE Notes: Figure 2 displays heat maps giving half-sample lower confidence bound for the CATE of the intervention studied in Banerjee et al. (2015) on post-treatment total assets. Other f...

  30. [30]

    66 FIGURE J.5. Calibrated CATEs 3.0 3.5 4.0 4.5 5.0 −0.6 −0.3 0.0 0.3 Baseline Assets Baseline Consumption CATE 0.05 0.10 0.15 0.20 0.25 Notes: Figure J.5 displays the “true” value of CATE used in our calibrated simulation. 67 FIGURE J.6. Validation X Y Mean 1e−02 1e+00 1e+02 1e+04 1e−02 1e+00 1e+02 1e+04 1e−02 1e+00 1e+02 1e+04Generated Data X Y Variance...