pith. sign in

arxiv: 2606.18011 · v2 · pith:A2ZWJ4K5new · submitted 2026-06-16 · 📊 stat.ML · cs.LG· stat.ME

Fast Nonparametric Conditional Independence Testing via Two-Stage Regression

Pith reviewed 2026-06-26 22:32 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.ME
keywords conditional independence testingnonparametric statisticscausal discoveryregression residualizationtwo-stage testingchi-square approximationtree regression
0
0 comments X

The pith

BLITZ achieves fast nonparametric conditional independence testing by first regressing out broad dependence with low-order polynomials and then using shallow trees on a small nonlinear feature map to residualize the rest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BLITZ as a method for conditional independence testing that avoids the calibration problems common in fast nonparametric alternatives. It removes smooth broad dependence on the conditioning set using low-order polynomial regression. A small nonlinear feature map is then applied and the resulting features are residualized with shallow tree regressions. The test uses the residual cross-covariance statistic with a moment-matched chi-square null approximation. This two-stage structure is shown to lower the effective complexity for the trees, allowing them to control bias without overfitting that would break the null distribution. Simulations confirm better calibration than kernel, random-feature, and other regression-based tests while staying among the fastest options, and the method improves edge orientation reliability in causal discovery on synthetic graphs and flow-cytometry data.

Core claim

BLITZ performs conditional independence testing by first removing broad smooth dependence on the conditioning set via low-order polynomial regression, applying a small nonlinear feature map, and then residualizing those features with shallow tree regressions before testing residual cross-covariance under a moment-matched chi-square approximation. The two-stage design reduces the complexity faced by the tree residualizers, so that shallow trees can control residual conditional-mean bias without excessive overfitting that would invalidate the null approximation.

What carries the argument

The two-stage residualization in BLITZ: low-order polynomials remove broad dependence, then a nonlinear feature map followed by shallow trees handles the local remainder before cross-covariance testing.

If this is right

  • The method supports thousands of conditional independence queries in constraint-based causal discovery without loss of calibration or speed.
  • Better null calibration produces more reliable retained adjacencies and endpoint orientations in recovered graphs.
  • Shallow trees suffice after the first-stage preprocessing, removing the need for deep models or post-hoc tuning.
  • The design remains competitive in runtime while outperforming fast kernel and random-feature competitors on calibration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same broad-to-local split could be applied to other residual-based nonparametric tests where smooth trends dominate the signal.
  • Adjusting polynomial degree or feature-map size might extend reliable performance to higher-dimensional conditioning sets.
  • The moment-matched chi-square approximation may transfer to related problems in high-dimensional regression residual testing.

Load-bearing premise

Low-order polynomial regression plus a small nonlinear feature map sufficiently captures broad dependence so that shallow trees can residualize the remainder without overfitting that invalidates the chi-square null approximation.

What would settle it

A simulation in which the conditioning set has strong nonlinear dependence not captured by low-order polynomials, where the empirical type I error rate of BLITZ deviates substantially from the nominal level.

Figures

Figures reproduced from arXiv: 2606.18011 by Eric V. Strobl.

Figure 1
Figure 1. Figure 1: Representative null p-value histograms across CI tests. All histograms share the same x- and y-axis labels. Calibration was evaluated under the nonlinear null settings described in the experiments. The histograms shown here correspond to a sample size of 5000 and a conditioning set size of 3. BLITZ produces an approximately uniform null p-value distribution, whereas RCIT, RCoT, CCI, and FastKCI show excess… view at source ↗
Figure 2
Figure 2. Figure 2: Null calibration. All heatmaps share the same x- and y-axis labels. Type I error calibration was measured by the Kolmogorov–Smirnov distance between each method’s empirical p-value distribution and the Unif(0, 1) reference distribution. Lower values (greener) indicate better calibration. BLITZ achieved the smallest deviations from the uniform null distribution across the evaluated sample sizes and conditio… view at source ↗
Figure 3
Figure 3. Figure 3: Runtime under the null. Mean wall-clock runtime for each CI test. BLITZ was the fastest method for s ∈ {0, 1, 2} and remained highly competitive at larger conditioning dimensions. Although RCIT and RCoT were faster in some higher-dimensional settings, their largest runtime advantage over BLITZ was less than 0.1 seconds. FastKCI was substantially slower and did not complete within the three-minute limit for… view at source ↗
Figure 4
Figure 4. Figure 4: Power across standalone CI benchmarks. Heatmaps report AUCPC values across sample sizes and conditioning dimensions, with larger values (greener) indicating greater size-calibrated power. BLITZ retained competitive power across the full experimental grid, including the more difficult conditional settings, whereas several competing methods showed less stable gains with increasing sample size or did not comp… view at source ↗
Figure 5
Figure 5. Figure 5: Type I error ablations. All heatmaps share the same x- and y-axis labels; likewise, all histograms use the same axis labels. Each column corresponds to one CI test. BLITZ provided the best overall Type I error control as measured by the KS statistic (first row). Although the no-polynomial ablation was close to BLITZ by the ordinary KS statistic, a more targeted analysis using the positive KS statistic, whi… view at source ↗
Figure 6
Figure 6. Figure 6: Power ablations. BLITZ achieved AUCPC competitive with both single-stage regression variants. As expected, using two residualization stages can modestly reduce power relative to single-stage variants that remove less Z-dependent structure. In this experiment, the no-polynomial ablation achieved higher AUCPC, but this came at the cost of poorer Type I error calibration ( [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗
Figure 7
Figure 7. Figure 7: Runtime under the alternative. Mean wall-clock runtime for each CI test under the post-nonlinear alternative models. Results mimic those observed in [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗
read the original abstract

Constraint-based causal discovery relies on repeated conditional independence tests, but fast nonparametric tests often sacrifice calibration, especially when variables depend on the conditioning set through nonlinear relationships. We introduce BLITZ (Broad-to-Local Independence Testing via residualiZation), a nonparametric conditional independence test designed to run well under a second while maintaining the accuracy needed for the thousands of queries performed by constraint-based causal discovery algorithms. BLITZ first removes broad smooth dependence on the conditioning set using low-order polynomial regression, then applies a small nonlinear feature map and residualizes those features with shallow tree regressions. The resulting statistic tests residual cross-covariance, with a moment-matched chi-square approximation to the null distribution. We show theoretically that the two-stage design reduces the effective complexity faced by the tree residualizers, allowing shallow trees to control residual conditional-mean bias while avoiding excessive overfitting. In simulations, BLITZ provides better null calibration than fast kernel, random-feature, and regression-based competitors while remaining among the fastest methods tested. In causal discovery experiments on synthetic graphs and flow-cytometry data, BLITZ yields more reliable endpoint orientations among retained adjacencies and competitive structural recovery. These results suggest that broad-to-local residualization is a practical route to calibrated, scalable nonparametric conditional independence testing for causal discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BLITZ, a nonparametric conditional independence test that first removes broad smooth dependence via low-order polynomial regression on the conditioning set, applies a small nonlinear feature map, and then residualizes with shallow tree regressions. The test statistic is the residual cross-covariance, with a moment-matched chi-square null approximation. The central claim is that the two-stage design reduces effective complexity for the tree residualizers, allowing shallow trees to control residual bias without excessive overfitting that would invalidate the null approximation. Simulations show improved null calibration over fast kernel, random-feature, and regression-based methods while remaining fast; causal discovery experiments on synthetic graphs and flow-cytometry data report more reliable orientations and competitive structural recovery.

Significance. If the complexity-reduction argument holds under the stated conditions and the simulation calibration results are robust to the choice of polynomial degree and tree depth, BLITZ would offer a practical advance for constraint-based causal discovery by improving the speed-accuracy trade-off in repeated nonparametric CI tests. The explicit two-stage residualization strategy and moment-matching approach are concrete contributions that could be adopted or extended in other scalable CI testing pipelines.

major comments (2)
  1. [Abstract / theoretical claim] Abstract and theoretical section: The claim that the two-stage design 'reduces the effective complexity faced by the tree residualizers' is load-bearing for both the theoretical guarantee and the chi-square approximation. No explicit bounds, conditions on the polynomial degree relative to the smoothness of the dependence, or derivation quantifying residual bias after the first stage are visible, leaving open whether the shallow-tree regime remains valid when dependence involves higher-order or non-polynomial terms not spanned by the chosen low-order polynomial and small feature map.
  2. [Simulation experiments] Simulation results (as summarized): The reported better null calibration is central to the practical claim, yet the abstract provides no details on error-bar reporting, number of Monte Carlo repetitions, data-exclusion rules, or how the polynomial degree and tree depth were selected across different dependence structures. This makes it impossible to assess whether the calibration advantage is robust or sensitive to the free parameters listed in the axiom ledger.
minor comments (1)
  1. [Methods] Notation for the nonlinear feature map and the precise definition of the residual cross-covariance statistic should be introduced with an equation number in the methods section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. Below we respond point-by-point to the major comments, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract / theoretical claim] Abstract and theoretical section: The claim that the two-stage design 'reduces the effective complexity faced by the tree residualizers' is load-bearing for both the theoretical guarantee and the chi-square approximation. No explicit bounds, conditions on the polynomial degree relative to the smoothness of the dependence, or derivation quantifying residual bias after the first stage are visible, leaving open whether the shallow-tree regime remains valid when dependence involves higher-order or non-polynomial terms not spanned by the chosen low-order polynomial and small feature map.

    Authors: We appreciate the referee highlighting the need for stronger theoretical grounding. The manuscript argues that the initial low-order polynomial regression removes broad smooth dependence, thereby lowering the complexity that must be handled by the subsequent shallow tree residualizers. However, we agree that explicit bounds on residual bias, conditions relating polynomial degree to the smoothness of the underlying dependence, and a derivation quantifying the bias after the first stage are not provided. In the revision we will add a dedicated subsection with these elements, including sufficient conditions under which the shallow-tree regime controls residual conditional-mean bias even for higher-order or non-polynomial terms outside the span of the chosen polynomial and feature map. This will also clarify the validity of the moment-matched chi-square approximation. revision: yes

  2. Referee: [Simulation experiments] Simulation results (as summarized): The reported better null calibration is central to the practical claim, yet the abstract provides no details on error-bar reporting, number of Monte Carlo repetitions, data-exclusion rules, or how the polynomial degree and tree depth were selected across different dependence structures. This makes it impossible to assess whether the calibration advantage is robust or sensitive to the free parameters listed in the axiom ledger.

    Authors: We agree that additional experimental details are required for readers to evaluate robustness. While the full simulation section reports the number of Monte Carlo repetitions and the procedure used to select polynomial degree and tree depth, the abstract indeed omits error-bar reporting, explicit data-exclusion rules, and a consolidated description of parameter selection across dependence structures. In the revision we will expand the simulation section (and add a short summary in the abstract) to include these elements, along with sensitivity checks on the free parameters, so that the calibration advantage can be assessed more transparently. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a two-stage residualization procedure (low-order polynomials followed by shallow trees on a small feature map) and claims a theoretical reduction in effective complexity for the tree stage, with a moment-matched chi-square null. No quoted equations or steps reduce a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. The chi-square approximation is presented as moment-based rather than data-fitted in a self-referential manner, and the derivation chain remains independent of its own outputs. This is the expected non-finding for a method paper whose central claims rest on design properties and external simulation benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard regression consistency properties and the validity of the moment-matched chi-square approximation under the two-stage residualization; no new entities are postulated.

free parameters (2)
  • polynomial degree
    Low-order polynomial chosen to capture broad dependence; specific degree is a modeling choice that affects residual bias.
  • tree depth / number of features
    Shallow trees and small feature map size are design parameters whose values control the bias-variance trade-off in the second stage.
axioms (2)
  • domain assumption Low-order polynomials plus a fixed nonlinear feature map are sufficient to capture all broad smooth dependence on the conditioning set.
    Invoked to justify why shallow trees suffice for the residual stage.
  • domain assumption The moment-matched chi-square distribution accurately approximates the null distribution of the residual cross-covariance statistic after two-stage regression.
    Required for the calibrated p-value claim.

pith-pipeline@v0.9.1-grok · 5751 in / 1460 out tokens · 30327 ms · 2026-06-26T22:32:24.219755+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    An approximation to the distribution of quadratic forms in normal random variables.Australian Journal of Statistics, 30(1):150–159, 1988

    Michael J Buckley and Geoffrey K Eagleson. An approximation to the distribution of quadratic forms in normal random variables.Australian Journal of Statistics, 30(1):150–159, 1988

  2. [2]

    Fast Conditional Independence Test for Vector Variables with Large Sample Sizes

    Krzysztof Chalupka, Pietro Perona, and Frederick Eberhardt. Fast conditional independence test for vector variables with large sample sizes.arXiv preprint arXiv:1804.02747, 2018

  3. [3]

    Optimal structure identification with greedy search.Journal of machine learning research, 3(Nov):507–554, 2002

    David Maxwell Chickering. Optimal structure identification with greedy search.Journal of machine learning research, 3(Nov):507–554, 2002

  4. [4]

    Conditional independence in statistical theory.Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(1):1–15, 1979

    A Philip Dawid. Conditional independence in statistical theory.Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(1):1–15, 1979

  5. [5]

    A permutation-based kernel conditional independence test

    Gary Doran, Krikamol Muandet, Kun Zhang, and Bernhard Schölkopf. A permutation-based kernel conditional independence test. InUAI, pages 132–141, 2014

  6. [6]

    Review of causal discovery methods based on graphical models

    Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10:524, 2019

  7. [7]

    Chi squared approximations to the distribution of a sum of independent random variables.The Annals of Probability, pages 1028–1036, 1983

    Peter Hall. Chi squared approximations to the distribution of a sum of independent random variables.The Annals of Probability, pages 1028–1036, 1983

  8. [8]

    Generalized score functions for causal discovery

    Biwei Huang, Kun Zhang, Yizhu Lin, Bernhard Schölkopf, and Clark Glymour. Generalized score functions for causal discovery. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1551–1560, 2018

  9. [9]

    Estimating high-dimensional directed acyclic graphs with the pc-algorithm.Journal of Machine Learning Research, 8(3), 2007

    Markus Kalisch and Peter Bühlman. Estimating high-dimensional directed acyclic graphs with the pc-algorithm.Journal of Machine Learning Research, 8(3), 2007

  10. [10]

    Moment-based approximations of distributions using mixtures: Theory and applications.Annals of the Institute of Statistical Mathematics, 52(2):215–230, 2000

    Bruce G Lindsay, Ramani S Pilla, and Prasanta Basak. Moment-based approximations of distributions using mixtures: Theory and applications.Annals of the Institute of Statistical Mathematics, 52(2):215–230, 2000

  11. [11]

    Searching for Activation Functions

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

  12. [12]

    FASK with Interventional Knowledge Recovers Edges from the Sachs Model

    Joseph Ramsey and Bryan Andrews. Fask with interventional knowledge recovers edges from the sachs model.arXiv preprint arXiv:1805.03108, 2018

  13. [13]

    Adjacency-faithfulness and conservative causal inference

    Joseph Ramsey, Peter Spirtes, and Jiji Zhang. Adjacency-faithfulness and conservative causal inference. InProceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 401–408, 2006. Eric V. Strobl, Two-stage regression for CI testing 19

  14. [14]

    A Scalable Conditional Independence Test for Nonlinear, Non-Gaussian Data

    Joseph D Ramsey. A scalable conditional independence test for nonlinear, non-gaussian data.arXiv preprint arXiv:1401.5031, 2014

  15. [15]

    Fast causal discovery by approximate kernel-based generalized score functions with linear computational complexity

    Yixin Ren, Haocheng Zhang, Yewei Xia, Hao Zhang, Jihong Guan, and Shuigeng Zhou. Fast causal discovery by approximate kernel-based generalized score functions with linear computational complexity. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1197–1208, 2025

  16. [16]

    Conditional independence testing based on a nearest-neighbor estimator of conditional mutual informa- tion

    Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual informa- tion. InInternational Conference on Artificial Intelligence and Statistics, pages 938–947. Pmlr, 2018

  17. [17]

    Causal protein-signaling networks derived from multiparameter single-cell data.Science, 308(5721):523–529, 2005

    Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data.Science, 308(5721):523–529, 2005

  18. [18]

    A fast kernel-based conditional independence test with application to causal discovery

    Oliver Schacht and Biwei Huang. A fast kernel-based conditional independence test with application to causal discovery. arXiv preprint arXiv:2505.11085, 2025

  19. [19]

    The hardness of conditional independence testing and the generalised covariance measure.The Annals of Statistics, 48(3):1514–1538, 2020

    Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure.The Annals of Statistics, 48(3):1514–1538, 2020

  20. [20]

    Causal discovery with fewer conditional independence tests

    Kirankumar Shiragur, Jiaqi Zhang, and Caroline Uhler. Causal discovery with fewer conditional independence tests. In Proceedings of the 41st International Conference on Machine Learning, pages 45060–45078, 2024

  21. [21]

    MIT press, 2000

    Peter Spirtes, Clark N Glymour, and Richard Scheines.Causation, prediction, and search. MIT press, 2000

  22. [22]

    Approximate kernel-based conditional independence tests for fast non-parametric causal discovery.Journal of Causal Inference, 7(1):20180017, 2019

    Eric V Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery.Journal of Causal Inference, 7(1):20180017, 2019

  23. [23]

    Kernel-based conditional independence test and application in causal discovery

    Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. InProceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011. 7 Supplementary Materials 7.1 Computational Complexity Let n denote the sample size and lets = dim(Z...