pith. machine review for the scientific record.

arxiv: 2605.05743 · v2 · submitted 2026-05-07 · 📊 stat.ML · cs.AI · cs.LG

Recognition: no theorem link

Fourier Feature Methods for Nonlinear Causal Discovery: FFML Scoring, TRFF Scoring, and FFCI Testing in Mixed Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:59 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords causal discovery · random Fourier features · Gaussian processes · nonlinear causal inference · mixed data · conditional independence testing · score-based methods · constraint-based methods

The pith

Random Fourier features approximate Gaussian process scores and tests to enable scalable nonlinear causal discovery on mixed data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces three complementary methods that use random Fourier features to approximate the expensive computations in Gaussian process-based causal discovery. FFML replaces the full kernel matrix in the GP marginal likelihood with a finite feature map to retain probabilistic scoring at much lower cost while handling mixed continuous and discrete variables. TRFF provides a robust regression-based alternative with a penalty term, and FFCI delivers a fast nonparametric conditional independence test using feature-space residuals. These form a toolkit that integrates into existing algorithms like BOSS and PC-Max, showing competitive structural accuracy on nonlinear benchmarks through complementary strengths in precision and recall.
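As one concrete rendering of the TRFF idea sketched above, the following is a minimal sketch of a BIC-style score from penalized Student-t regression on random Fourier features. The EM-style IRLS weighting, the fixed degrees of freedom, the ridge level, and the m·log n penalty are our assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np
from scipy import stats

def rff(x, m=100, gamma=1.0, seed=0):
    """Random Fourier features approximating an RBF kernel exp(-gamma ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], m))
    b = rng.uniform(0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(x @ W + b)

def trff_score(y, X, nu=5.0, ridge=1e-3, n_iter=50):
    """BIC-style score from penalized Student-t regression on RFF features.
    EM/IRLS: latent-scale weights downweight heavy-tailed residuals.
    nu, ridge, and the m*log(n) penalty are illustrative choices."""
    Phi = rff(X)
    n, m = Phi.shape
    w = np.ones(n)
    sigma2 = np.var(y)
    for _ in range(n_iter):
        A = Phi.T @ (Phi * w[:, None]) + ridge * np.eye(m)
        beta = np.linalg.solve(A, Phi.T @ (w * y))
        r = y - Phi @ beta
        sigma2 = np.sum(w * r**2) / n
        w = (nu + 1) / (nu + r**2 / sigma2)   # E-step weights for the t likelihood
    loglik = stats.t.logpdf(r / np.sqrt(sigma2), df=nu).sum() - 0.5 * n * np.log(sigma2)
    return loglik - 0.5 * m * np.log(n)      # BIC-style complexity penalty
```

A higher score for an informative parent set than for an irrelevant one is the behavior a score-based search such as BOSS relies on.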

Core claim

The central claim is that finite random Fourier feature representations can replace the n×n kernel Gram matrix in Gaussian process marginal likelihoods to produce a fast score-based method (FFML) that preserves the original probabilistic interpretation and automatic complexity penalty, with a product-kernel construction for mixed data. A complementary BIC-style score (TRFF) uses penalized Student-t regression on the features for robustness to heavy tails. A nonparametric CI test (FFCI) applies ridge residualization in feature space and approximates a Frobenius-norm statistic as a sum of chi-squared variables. When plugged into score-based and constraint-based pipelines, the methods yield competitive structural accuracy on nonlinear benchmarks, with FFML favoring recall and lower SHD and TRFF favoring precision.
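The FFML substitution at the heart of this claim can be sketched in a few lines. Assuming an RBF base kernel, the Woodbury identity and the matrix-determinant lemma reduce the GP marginal likelihood to m×m solves; function names, the fixed noise variance, and the defaults below are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def rff(x, m=200, gamma=0.5, seed=0):
    """Random Fourier features: E[phi(x) phi(x')^T] ~= exp(-gamma ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], m))
    b = rng.uniform(0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(x @ W + b)

def ffml(y, X, sigma2=0.1, m=200):
    """GP log marginal likelihood with K ~= Phi Phi^T, via the Woodbury identity
    and matrix-determinant lemma: only m x m solves, so O(n m^2 + m^3)."""
    Phi = rff(X, m=m)
    n = len(y)
    A = Phi.T @ Phi + sigma2 * np.eye(m)          # m x m system
    alpha = np.linalg.solve(A, Phi.T @ y)
    quad = (y @ y - y @ Phi @ alpha) / sigma2     # y^T (Phi Phi^T + s2 I)^-1 y
    logdet = np.linalg.slogdet(A)[1] + (n - m) * np.log(sigma2)
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))

def exact_gp(y, X, sigma2=0.1, gamma=0.5):
    """Exact RBF-GP log marginal likelihood for comparison, O(n^3)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2) + sigma2 * np.eye(len(y))
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + np.linalg.slogdet(K)[1] + len(y) * np.log(2 * np.pi))
```

The score inherits the exact score's functional form, so an informative parent set should still outscore an irrelevant one, which is the property parent-set search needs.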

What carries the argument

The random Fourier feature map, which projects inputs onto a finite set of random trigonometric basis functions to approximate kernel matrices and enable O(nm² + m³) computations while retaining GP semantics.
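A minimal sketch of that map (Rahimi-Recht cosine features; gamma and the sizes are illustrative): sampling frequencies from the kernel's spectral density makes the feature inner product an unbiased estimate of the RBF kernel, with Monte-Carlo error shrinking as m grows.

```python
import numpy as np

def rff(x, m, gamma=1.0, seed=0):
    """phi(x) = sqrt(2/m) cos(xW + b); E[phi(x) phi(x')^T] = exp(-gamma ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], m))  # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=m)                           # random phases
    return np.sqrt(2.0 / m) * np.cos(x @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-d2)                                  # exact RBF Gram matrix, gamma = 1
for m in (10, 100, 1000):
    Phi = rff(X, m)
    err = np.abs(Phi @ Phi.T - K).max()          # worst-case entrywise error
    print(f"m={m:5d}  max |K_hat - K| = {err:.3f}")
```

The entrywise error decays roughly like 1/sqrt(m), which is what lets a fixed, moderate m stand in for the full Gram matrix.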

If this is right

  • BOSS combined with FFML achieves the lowest overall structural Hamming distance on nonlinear benchmarks.
  • BOSS with TRFF delivers the highest precision among the tested configurations.
  • PC-Max with FFCI achieves better recall and substantially lower SHD than RCIT, though at roughly twice the runtime.
  • The three methods together support hybrid causal discovery workflows that trade off precision against recall according to the data characteristics.
  • Product-kernel constructions allow the scores and tests to handle mixed continuous-discrete parent sets without separate case handling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The complementary precision-recall profiles suggest that ensemble or adaptive selection among FFML, TRFF, and FFCI could improve recovery rates beyond any single method.
  • Because the approximations reduce cost from cubic in sample size to linear in n with fixed m, the approach could open causal discovery to sample sizes where exact GP methods become infeasible.
  • Similar Fourier-feature substitutions might be applied to other kernel-based causal procedures, such as kernel-based structural equation models or independence measures beyond the Frobenius norm used here.
  • Empirical validation on real-world mixed datasets with known ground-truth graphs would test whether the observed benchmark gains translate outside synthetic nonlinear settings.

Load-bearing premise

The finite random Fourier feature representation preserves enough of the exact GP marginal likelihood's automatic complexity penalty and probabilistic semantics to yield reliable causal graph scores on mixed data without systematic approximation bias.

What would settle it

On synthetic nonlinear mixed-data graphs where the exact GP marginal likelihood recovers the true structure, finding that FFML or TRFF consistently returns graphs with substantially higher structural Hamming distance would show the approximation fails to preserve the necessary scoring properties.

Figures

Figures reproduced from arXiv: 2605.05743 by Joseph D. Ramsey.

Figure 1. CPDAGs returned by BOSS + FFML on the Auto MPG dataset.
Figure 1. DAGs returned by BOSS + FFML (DAG mode) on the Auto MPG dataset.
original abstract

Gaussian process (GP) marginal likelihood scores and kernel conditional independence tests are theoretically appealing for nonlinear causal discovery but computationally prohibitive at scale. We present three complementary RFF-based methods forming a practical toolkit for score-based, constraint-based, and hybrid causal discovery. The Fourier Feature Marginal Likelihood (FFML) score approximates the exact GP marginal likelihood by replacing the $n \times n$ kernel Gram matrix with a finite-dimensional feature representation, reducing cost to $O(nm^2 + m^3)$ while retaining the probabilistic interpretation and automatic complexity penalty of the exact score. FFML extends to mixed (continuous and discrete) parent sets via a product-kernel construction, with a Kronecker path for small discrete parent sets and a Hadamard-product path otherwise. The Tetrad Random Fourier Feature (TRFF) score is a complementary BIC-style alternative using penalized Student-t regression with random Fourier features. TRFF offers robustness to heavy-tailed noise and faster runtime than FFML. Empirically, TRFF and FFML exhibit a complementary precision-recall profile: TRFF achieves higher precision while FFML achieves better recall and lower SHD overall. The Fourier Feature Conditional Independence (FFCI) test is a fast nonparametric CI test for mixed data, using ridge residualization in feature space and a Frobenius-norm cross-covariance statistic approximated as a weighted sum of chi-squared variables. Empirically, BOSS+FFML achieves the lowest SHD on nonlinear data, while BOSS+TRFF offers the highest precision. When run through PC-Max, FFCI and RCIT exhibit complementary precision-recall profiles: RCIT is more precise while FFCI achieves better recall and substantially lower SHD, at approximately twice the runtime.
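The FFCI construction described in the abstract can be sketched as follows. This is a simplified reading: a permutation null stands in for the paper's weighted chi-squared approximation, and every name, seed, and default below is our assumption.

```python
import numpy as np

def rff(x, m=50, gamma=1.0, seed=0):
    """Random Fourier features approximating an RBF kernel."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], m))
    b = rng.uniform(0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(x @ W + b)

def ffci_pvalue(x, y, z, n_perm=200, ridge=1e-2, seed=0):
    """Test x _||_ y | z: ridge-residualize the features of x and of y on the
    features of z, then measure the squared Frobenius norm of the residual
    cross-covariance. Null here is by permuting residual rows, not the paper's
    weighted chi-squared approximation."""
    Fx, Fy, Fz = rff(x, seed=1), rff(y, seed=2), rff(z, seed=3)
    A = Fz.T @ Fz + ridge * np.eye(Fz.shape[1])
    Rx = Fx - Fz @ np.linalg.solve(A, Fz.T @ Fx)   # residualize on z
    Ry = Fy - Fz @ np.linalg.solve(A, Fz.T @ Fy)
    n = len(Rx)
    stat = lambda R: np.sum((Rx.T @ R / n) ** 2)   # squared Frobenius norm
    t0 = stat(Ry)
    rng = np.random.default_rng(seed)
    null = [stat(Ry[rng.permutation(n)]) for _ in range(n_perm)]
    return (1 + sum(t >= t0 for t in null)) / (1 + n_perm)
```

On a chain x ← z → y the test should return a large p-value (independence given z), and a small one when y depends on x directly; that contrast is what PC-Max consumes.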

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces three complementary random Fourier feature (RFF) approximations for nonlinear causal discovery on mixed continuous-discrete data: the Fourier Feature Marginal Likelihood (FFML) score, which replaces the n×n GP Gram matrix with an m-dimensional feature map to approximate the marginal likelihood at O(nm² + m³) cost while claiming to retain the probabilistic interpretation and automatic complexity penalty; the Tetrad Random Fourier Feature (TRFF) score, a BIC-style penalized Student-t regression alternative; and the Fourier Feature Conditional Independence (FFCI) test, which uses ridge residualization and a Frobenius-norm statistic approximated via weighted chi-squared variables. The methods are positioned as a practical toolkit for score-based (BOSS+FFML/TRFF), constraint-based (PC-Max+FFCI), and hybrid discovery, with empirical claims of complementary precision-recall behavior and lower SHD than baselines on nonlinear simulations.

Significance. If the finite RFF approximations preserve sufficient fidelity to the exact GP marginal likelihood semantics and CI test properties, the work supplies a scalable toolkit that could extend nonlinear causal discovery to larger mixed-data regimes where exact GP methods are prohibitive. The explicit product-kernel construction for mixed parents and the reported complementary profiles (FFML better recall/lower SHD, TRFF higher precision) are concrete strengths; the absence of machine-checked proofs or parameter-free derivations is offset by the reproducible simulation framework implied by the empirical sections.

major comments (2)
  1. [Abstract / FFML Score] Abstract and FFML derivation: the central claim that the RFF map 'retains the probabilistic interpretation and automatic complexity penalty' of the exact GP marginal likelihood is load-bearing for the score-based component, yet no derivation or uniform bound is supplied showing that the bias in the log-det and quadratic terms vanishes over the model space used in parent-set search; the product-kernel (Kronecker/Hadamard) construction for mixed parents makes the nonlinearity of these functionals especially relevant, and finite-m Monte-Carlo error could systematically favor or penalize certain cardinalities or variable types.
  2. [Empirical Evaluation] Empirical sections: no ablation on feature dimension m is reported despite the approximation quality depending directly on m; without this, the reported SHD, precision, and recall advantages of BOSS+FFML and PC-Max+FFCI cannot be assessed for sensitivity to the hyperparameter that controls the central approximation.
minor comments (2)
  1. [Abstract] The complexity statement O(nm² + m³) is given without an explicit comparison to the exact GP O(n³) cost for representative m/n ratios, which would clarify the practical regime of applicability.
  2. [FFML Extension to Mixed Data] Notation for the product-kernel paths (Kronecker vs. Hadamard) could be introduced with a small table or diagram to avoid ambiguity when the discrete parent set size crosses the 'small' threshold.
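To make the referee's first minor point concrete, a back-of-envelope comparison of the leading cost terms (constants and memory traffic ignored; the fixed m = 500 is our arbitrary choice, not the paper's):

```python
# Leading-term flop comparison: exact GP score O(n^3) vs RFF score O(n m^2 + m^3).
# Constants are ignored, so the ratios are indicative only.
exact = lambda n: n ** 3
approx = lambda n, m: n * m ** 2 + m ** 3

m = 500  # representative fixed feature dimension (illustrative choice)
for n in (1_000, 10_000, 100_000):
    print(f"n={n:7d}  speedup ~ {exact(n) / approx(n, m):9.1f}x")
# -> roughly 2.7x at n=1000, 381x at n=10000, 39801x at n=100000
```

The crossover is quick: once n exceeds a few multiples of m, the cubic term dominates and the approximation's advantage grows linearly in n².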

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the presentation.

point-by-point responses
  1. Referee: [Abstract / FFML Score] Abstract and FFML derivation: the central claim that the RFF map 'retains the probabilistic interpretation and automatic complexity penalty' of the exact GP marginal likelihood is load-bearing for the score-based component, yet no derivation or uniform bound is supplied showing that the bias in the log-det and quadratic terms vanishes over the model space used in parent-set search; the product-kernel (Kronecker/Hadamard) construction for mixed parents makes the nonlinearity of these functionals especially relevant, and finite-m Monte-Carlo error could systematically favor or penalize certain cardinalities or variable types.

    Authors: We agree that the manuscript does not supply a derivation or uniform bound establishing that the approximation bias in the log-determinant and quadratic terms vanishes uniformly over the parent-set model space, and that the nonlinearity introduced by the product-kernel construction for mixed parents makes such analysis particularly relevant. The FFML score is obtained by direct substitution of the finite RFF feature map into the standard GP marginal likelihood expression, so it inherits the same functional form (and thus an approximate complexity penalty) but is not identical to the exact GP quantity. In the revised manuscript we will (i) qualify the abstract claim to state that FFML 'approximately retains' the probabilistic interpretation and complexity penalty, (ii) add a dedicated paragraph in the FFML derivation section that explicitly acknowledges the Monte-Carlo error and its potential differential effect on models of different cardinality or variable type, and (iii) cite existing RFF convergence results to contextualize the practical reliability of the approximation for ranking purposes. These changes will be made without altering the empirical claims. revision: yes

  2. Referee: [Empirical Evaluation] Empirical sections: no ablation on feature dimension m is reported despite the approximation quality depending directly on m; without this, the reported SHD, precision, and recall advantages of BOSS+FFML and PC-Max+FFCI cannot be assessed for sensitivity to the hyperparameter that controls the central approximation.

    Authors: We concur that the absence of an ablation on the feature dimension m prevents readers from evaluating the sensitivity of the reported performance advantages to this central hyperparameter. The current experiments employ a single fixed m selected for computational tractability after preliminary tuning; however, we recognize that this choice leaves open questions about robustness. In the revised version we will add a new ablation subsection (or appendix figure) that varies m over a representative range (e.g., 50–500) on the nonlinear mixed-data simulation suites and reports the resulting SHD, precision, and recall trajectories for BOSS+FFML, BOSS+TRFF, and PC-Max+FFCI. This addition will directly address the referee’s concern and allow assessment of stability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; RFF-based approximations are explicit constructions from established theory

full rationale

The paper defines FFML as a direct replacement of the n×n Gram matrix in the exact GP marginal likelihood with an m-dimensional RFF map, preserving the log-det and quadratic form by algebraic substitution rather than by redefining the score in terms of its own outputs. TRFF is introduced as a separate BIC-penalized Student-t regression using the same features, and FFCI as a ridge-residualized Frobenius statistic; none of these reduce to fitted parameters or self-citations by construction. The product-kernel extension for mixed data is a standard Kronecker/Hadamard construction, not an ansatz smuggled via self-citation. The derivations are therefore self-contained against external GP and RFF benchmarks, with no load-bearing uniqueness theorems or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard kernel-method assumptions rather than new postulates; no free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • domain assumption Random Fourier features yield a sufficiently accurate finite-dimensional approximation to the GP kernel for the purposes of marginal likelihood scoring and CI testing
    Invoked to replace the n×n Gram matrix while retaining probabilistic interpretation
  • domain assumption The product-kernel and Hadamard-product constructions correctly handle mixed continuous-discrete parent sets
    Required for the FFML extension to mixed data

pith-pipeline@v0.9.0 · 5628 in / 1414 out tokens · 67603 ms · 2026-05-12T01:59:59.745417+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. Generalized score functions for causal discovery. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  2. Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery. Journal of Causal Inference, 2019.
  3. Kernel-based Conditional Independence Test and Application in Causal Discovery. Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI).
  4. Fast scalable and accurate discovery of DAGs using the best order score search and grow shrink trees. Advances in Neural Information Processing Systems.
  5. Causation, Prediction, and Search. 2000.
  6. Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems.
  7. Orthogonal Random Features. Advances in Neural Information Processing Systems.
  8. A million variables and more: the fast greedy equivalence search algorithm for continuous variables and its extensions. Journal of Machine Learning Research.
  9. Causal-learn: Causal discovery in Python. Journal of Machine Learning Research.
  10. Quinlan, R. 1993.
  11. Py-tetrad and rpy-tetrad: A new Python interface with R support for Tetrad causal search. Causal Analysis Workshop Series, 2023.
  12. Optimal structure identification with greedy search. Journal of Machine Learning Research.
  13. Improving accuracy and scalability of the PC algorithm by maximizing p-value. arXiv preprint arXiv:1610.00378.
  14. Meek, Christopher. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.