We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.
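As a concrete illustration of the model-based bootstrap idea, here is a minimal sketch under simplifying assumptions (it conditions on the logged state-action sequence rather than re-rolling full trajectories, and all function names are hypothetical); it is not the authors' exact procedure:

```python
import numpy as np

def fit_kernel(states, actions, next_states, n_s, n_a):
    """Empirical transition probabilities P_hat[s, a, s'] from logged transitions;
    unvisited (s, a) pairs default to a uniform row."""
    counts = np.zeros((n_s, n_a, n_s))
    np.add.at(counts, (states, actions, next_states), 1.0)
    totals = counts.sum(axis=2, keepdims=True)
    return np.divide(counts, totals,
                     out=np.full_like(counts, 1.0 / n_s), where=totals > 0)

def bootstrap_kernel_ci(states, actions, next_states, n_s, n_a,
                        n_boot=1000, alpha=0.10, seed=0):
    """Percentile confidence limits for every entry P(s' | s, a): redraw the
    next states at the logged (s, a) pairs from the fitted kernel, refit,
    and take empirical quantiles across replicates."""
    rng = np.random.default_rng(seed)
    p_hat = fit_kernel(states, actions, next_states, n_s, n_a)
    reps = np.empty((n_boot, n_s, n_a, n_s))
    for b in range(n_boot):
        resampled = np.array([rng.choice(n_s, p=p_hat[s, a])
                              for s, a in zip(states, actions)])
        reps[b] = fit_kernel(states, actions, resampled, n_s, n_a)
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2], axis=0)
    return p_hat, lo, hi
```

A percentile interval for any entry $P(s' \mid s, a)$ is then read off from `lo` and `hi`, mirroring the percentile CIs evaluated in the abstract; intervals for value or $Q$-functions would additionally push each bootstrap kernel through the Bellman equations.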
Conformal prediction constructs prediction sets with finite-sample coverage guarantees, but its calibration stage is structurally constrained to a scalar score function and a single threshold variable - forcing shapes of prediction sets to be fixed before calibration, typically through data splitting. We introduce multi-variable conformal prediction (MCP), a framework that extends conformal prediction to vector-valued score functions with multiple simultaneous calibration variables. Building on scenario theory as a principled framework for certifying data-driven decisions, MCP unifies prediction set design and calibration into a single optimization problem, eliminating data splitting without sacrificing coverage guarantees. We propose two computationally efficient variants: RemMCP, grounded in constrained optimization with constraint removal, which admits a clean generalization of split conformal prediction; and RelMCP, based on iterative optimization with constraint relaxation, which supports non-convex score functions at the cost of possibly greater conservatism. Through numerical experiments on ellipsoidal and multi-modal prediction sets, we demonstrate that RemMCP and RelMCP consistently meet the target coverage with prediction set sizes smaller than or comparable to those of baselines with data split, while considerably reducing variance across calibration runs - a direct consequence of using all available data for shape optimization and calibration simultaneously.
Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distribution. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of $O((n+n_e)T^{2/3})$ in general and $O((n+n_e)\sqrt{T})$ under a low-noise condition, where $T$ is the time horizon, $n$ is the number of labels, and $n_e$ is the number of distinct experts observed across rounds. The analysis builds on novel $\mathcal{H}$-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.
Time-variant reliability analysis is a critical task for ensuring the safety of engineering dynamical systems subjected to stochastic excitations. However, assessing failure probability for realistic systems with Monte-Carlo simulation-based methods is often computationally intractable due to the high cost of the underlying models and the large number of simulations required. While surrogate models such as polynomial chaos expansions or Kriging are well-established for time-invariant reliability problems, their direct application to time-dependent systems remains challenging. This chapter introduces two advanced surrogate modeling frameworks designed specifically for dynamical systems: manifold-NARX (mNARX) and functional NARX (F-NARX). The mNARX approach constructs the surrogate on a reduced-order manifold of auxiliary state variables, enabling the efficient handling of high-dimensional inputs by embedding physical insight into a regression formulation. Conversely, the F-NARX framework exploits the functional nature of system trajectories, extracting principal component features from continuous time windows to mitigate issues associated with discrete lag selection and long-memory effects. We demonstrate the efficacy of these methods on two benchmark reliability problems: a stochastic quarter-car model and a hysteretic Bouc-Wen oscillator. The results highlight that, when combined with suitably biased experimental designs, both frameworks accurately capture the tail behavior of the system response, enabling precise and efficient estimation of first-passage probabilities.
We study optimal policy learning under combined budget and minimum coverage constraints. We show that the problem admits a knapsack-type structure and that the optimal policy can be characterized by an affine threshold rule involving both budget and coverage shadow prices. We establish that the linear programming relaxation of the combinatorial solution has an O(1) integrality gap, implying asymptotic equivalence with the optimal discrete allocation. Building on this result, we analyze two implementable approaches: a Greedy-Lagrangian (GLC) and a rank-and-cut (RC) algorithm. We show that the GLC closely approximates the optimal solution and achieves near-optimal performance in finite samples. By contrast, RC is approximately optimal whenever the coverage constraint is slack or costs are homogeneous, while misallocation arises only when cost heterogeneity interacts with a binding coverage constraint. Monte Carlo evidence supports these findings.
Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model's predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.
New supersample framework sums round-wise information terms to control the generalization gap under a row-wise exchangeability assumption.
Information-theoretic generalization bounds based on the supersample construction are a central tool for algorithm-dependent generalization analysis in the batch i.i.d.~setting. However, existing supersample conditional mutual information (CMI) bounds do not directly apply to sequential decision-making problems such as online learning, streaming active learning, and bandits, where data are revealed adaptively and the learner evolves along a causal trajectory. To address this limitation, we develop a sequential supersample framework that separates the learner filtration from a proof-side enlargement used for ghost-coordinate comparisons. Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI, a sum of roundwise selector--loss information terms. We also establish a Bernstein-type refinement that yields faster rates under suitable variance conditions. The selector-SCMI proof strategy applies to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.
Mixed-frequency data, where variables are observed at different temporal resolutions, commonly occur in economic and financial studies. Classical synthetic control methods (SCM) are ill-suited for such data, often necessitating aggregation or prefiltering that may discard valuable information. This paper proposes a novel Mixed-Frequency Synthetic Control Method (MF-SCM) to integrate mixed-frequency data into the synthetic control framework effectively. We develop a flexible estimation procedure to construct synthetic control weights under mixed-frequency settings and establish the theoretical properties of the MF-SCM estimator. Specifically, we first prove that the estimator achieves asymptotic optimality, in the sense that it attains the lowest possible squared prediction error among all potential treatment effect estimators formed by averaging outcomes of control units. We then derive the asymptotic distribution of the average treatment effect (ATE) estimator using projection theory and construct confidence intervals for the ATE estimator. The method's effectiveness is demonstrated through numerical simulations and two empirical applications concerning the 2017 Tax Cuts and Jobs Act in the US and air pollution alerts.
For stochastic process models, parameter inference is often severely bottlenecked by computationally expensive likelihood functions. Simulation-based inference (SBI) bypasses this restriction by constructing amortized surrogate likelihoods, but most SBI methods assume a black-box data generating process. While these surrogates are exact in the limit of infinite training data, practical scenarios force a strict tradeoff between model quality and simulation cost. In this work, we loosen the black-box assumption of SBI to improve this tradeoff for structured stochastic process models. Specifically, for neural network likelihood surrogates trained via probabilistic classification, we propose to augment the standard binary cross-entropy loss with exact score information $\nabla_\theta \log p(x \mid \theta)$ and adaptive weighting based on loss gradients. We evaluate our approach on case studies involving network dynamics and spatial processes, demonstrating that our method improves surrogate quality at a drastically lower computational cost than generating more training data. Notably, in some cases, our approach achieves downstream inference performance equivalent to a 10x increase in training data with less than a 1.1x increase in training time.
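A minimal sketch of the kind of augmented objective described, assuming an amortized likelihood-ratio classifier whose logit approximates $\log p(x\mid\theta) - \log p(x)$ (so its $\theta$-gradient approximates the score, since the evidence term is $\theta$-free); the fixed weight `lam` stands in for the paper's adaptive, gradient-based weighting, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def augmented_bce_loss(logit_fn, x, theta, labels, exact_score, lam=1.0):
    """Binary cross-entropy for likelihood-ratio classification plus a penalty
    matching the surrogate score (theta-gradient of the logit) to the exact
    simulator score grad_theta log p(x | theta). In practice the score term
    would typically be restricted to joint (label = 1) pairs."""
    theta = theta.clone().requires_grad_(True)
    logits = logit_fn(x, theta)              # ~ log p(x | theta) - log p(x)
    bce = F.binary_cross_entropy_with_logits(logits, labels)
    surrogate_score, = torch.autograd.grad(logits.sum(), theta, create_graph=True)
    score_penalty = ((surrogate_score - exact_score) ** 2).mean()
    return bce + lam * score_penalty
```

Here `logit_fn` is assumed to be a differentiable module taking `(x, theta)`, and `exact_score` the simulator-supplied $\nabla_\theta \log p(x \mid \theta)$ for those pairs.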
Test procedures for multiple hypotheses in a group sequential clinical trial that control the family-wise error rate are considered. Several graphical group sequential tests suggested in the literature, which are special cases of Bonferroni-closure tests, are discussed. The focus is on the question of whether, at the current stage, to consider only the evidence of the current repeated p-value or the evidence over all repeated p-values from the previous stages. A new test strategy controlling the family-wise error rate is introduced that consistently uses, across all hypotheses, the evidence (i.e., the repeated p-value) from the current stage. The strategy is more powerful than similar previously suggested test procedures; this is achieved by using the evidence from previous stages to increase the significance levels. For these test procedures, corresponding compatible simultaneous confidence intervals are presented, which, however, often provide no additional information on the treatment effects. For this reason, we extend previous work on informative simultaneous confidence intervals for one-stage graphical tests to graphical group sequential trials. Iterative algorithms are introduced that calculate these informative bounds, which incur only a small power loss compared with the original graphical group sequential test. The boundaries can be calculated after each stage. In addition, previous work is extended by a criterion to estimate the accuracy of the numerically calculated boundaries. The suggested informative bounds can be used to provide median-conservative, i.e., reliable, estimators of the treatment effects in a group sequential test with multiple hypotheses.
We present a new class of Bayesian dynamic models for bivariate price-realized volatility time series in financial forecasting. A novel dynamic gamma process model adopted for realized volatility is integrated with traditional Bayesian dynamic linear models (DLMs) for asset price series. This represents reduced-form volatility leverage and feedback effects through use of realized volatility proxies in conditional DLMs for prices or returns, coupled with the synthesis of higher frequency data to track and anticipate volatility fluctuations. Analysis is computationally straightforward, extending conjugate-form Bayesian analyses for sequential filtering and model monitoring with simple and direct simulation for forecasting. A main applied setting is equity return forecasting with daily prices and realized volatility from high-frequency, intraday data. Detailed empirical studies of multiple S&P sector ETFs highlight the improvements achievable in asset price forecasting relative to standard models and deliver contextual insights on the nature and practical relevance of volatility leverage and feedback effects. The analytic structure and negligible extra computational cost will enable scaling to higher dimensions for multivariate price series forecasting for decouple/recouple portfolio construction and risk management applications.
Shared frailty models have been proposed to accommodate unmeasured cluster-specific risk factors through the inclusion of a common latent frailty term. Among possible frailty distributions, the Gamma distribution is appealing due to its non-negativity, flexibility, and algebraic tractability leading to closed-form marginal survival or hazard function expressions. Under the Bayesian paradigm, the posterior distributions of model parameters are usually explored with computationally intensive procedures relying on Markov chain Monte Carlo sampling. As an alternative, Laplacian-P-splines (LPS) provide a flexible and sampling-free approach relying on Gaussian approximations of the posterior target distributions. In this model class, analytical formulas are obtained for the gradient and Hessian, yielding a computationally efficient inference scheme for estimation of model parameters with a natural way of quantifying uncertainty. This article extends the LPS toolbox to the inclusion of shared Gamma frailty models for clustered time-to-event data. We assess the finite-sample performance of the LPS estimation procedure through an extensive simulation study and compare estimates with those obtained using penalized partial likelihood estimation, without specification of the baseline hazard, and with the variance of the frailty term being estimated using profile likelihood. Finally, the proposed LPS estimation method is exemplified using three publicly available biomedical datasets on: (i) recurrent infections in children, (ii) cancer prevention, and (iii) kidney transplantation.
Simulations show every method fails in some cases, so a small complementary set is recommended instead.
We present the results of a large number of simulation studies regarding the power of various goodness-of-fit as well as non-parametric two-sample tests for multivariate data. In two dimensions this includes both continuous and discrete data; in higher dimensions, continuous data only. In general, no single method can be relied upon to provide good power: any one method may be quite good for some combination of null hypothesis and alternative and fail badly for another. Based on the results of these studies we propose a fairly small number of methods, chosen such that for any of the case studies included here at least one of the methods has good power. The studies were carried out using the R packages MD2sample and MDgof, available from CRAN.
High-dimensional health and surveillance studies often involve many collinear predictors, multiple correlated outcomes of different types, and latent heterogeneity across observational units. We propose a Bayesian latent-cluster reduced-rank regression model for multivariate mixed outcomes. The model is a finite mixture of regression surfaces: each latent cluster has a cluster-specific mean shift and a low-rank coefficient matrix, yielding simultaneous clustering, dimension reduction, and component-wise interpretability. Response coordinates may be Gaussian, Bernoulli, or negative binomial. Multiplicative gamma process shrinkage adapts the effective rank within each cluster, and a WAIC-based criterion is used to tune the number of clusters and the nominal maximal rank. We establish posterior contraction for the identifiable component-specific regression surfaces and mean shifts, up to label permutation, and derive corresponding contraction for predictor-side singular subspaces. We also analyze the default label-invariant reporting pipeline based on the posterior similarity matrix: an eigenspace embedding followed by mean shift is shown to consistently recover the latent partition under an additional strong separation margin. Simulation experiments spanning all-Gaussian, all-Bernoulli, all-negative-binomial, and mixed Gaussian--Bernoulli--negative-binomial regimes show accurate recovery of the number of clusters and competitive clustering performance against $K$-means, mclust, PCA-based clustering, and a Gaussian reduced-rank mixture benchmark. We illustrate the method in three applications that show how the model separates individual-level utilization groups and produces interpretable county- and state-level cluster maps together with response-specific posterior predictive maps.
Efficient irrigation management is crucial to agriculture, forestry and horticulture, especially under climate change. Developments in novel sensors and Internet of Things technology provide an opportunity to carry out real-time monitoring of tree sap flux density, which, when coupled with advanced modelling techniques, enables online prediction of tree water-use suitable for irrigation planning. This manuscript proposes one such pipeline that integrates tree sap flow sensors, weather station sensors, and statistical models to predict tree daily water-use. In particular, an ensemble prediction approach based on additive models has been developed, using weather data as the main predictors of sap flux density. The method simultaneously considers the non-linear relationships and interactions between sap flux density and its environmental drivers, as well as the variability among individual trees over different growing seasons. Using field data collected on nine species of trees over the 2022, 2023 and 2024 growing seasons, this manuscript demonstrates the ability of the proposed ensemble prediction method to produce reliable daily water-use forecasts. The challenge of predicting tree water-use under climate stress, such as heatwaves, and the impact of tree sizes on prediction have also been discussed. Despite the complexity of the problem, the proposed method provides a general framework which can be used in a variety of settings, from commercial tree growers to conservation work. The model can be integrated into an online monitoring platform, assisting real-time decision making on irrigation management.
Standard Bradley--Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but suffer from a fundamental non-identifiability from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method consistently improves reward modeling performance and downstream RLHF, including PPO training and best-of-$N$ selection.
Tree ensembles such as random forests (RFs) and gradient boosting machines (GBMs) are among the most widely used supervised learners, yet their theoretical properties remain incompletely understood. We adopt a spectral perspective on these algorithms, with two main contributions. First, we derive minimax-optimal convergence rates for RF regression, showing that, under mild regularity conditions on tree growth, the eigenvalue decay of the induced kernel operator governs the statistical rate. Second, we exploit this spectral viewpoint to develop compression schemes for tree ensembles. For RFs, leading eigenfunctions of the kernel operator capture the dominant predictive directions; for GBMs, leading singular vectors of the smoother matrix play an analogous role. Learning nonlinear maps for these spectral representations yields distilled models that are orders of magnitude smaller than the originals while maintaining competitive predictive performance. Our methods compare favorably to state-of-the-art algorithms for forest pruning and rule extraction, with applications to resource-constrained computing.
Attributing an observed outcome to its root cause is a central task in domains ranging from medical diagnosis to engineering fault diagnosis. Existing approaches either equate the root cause with a root node of the causal graph, as in causal-discovery-based root cause analysis, or target causes more broadly and thereby favour proximate ones, as with the probability of causation and posterior causal effects. We argue that this issue stems from the absence of a formal definition of a root cause, which has led to methods designed for other purposes being applied to root cause attribution by default. We address this by giving a formal, individual-level definition of a root cause within the potential outcomes framework, based on the notion of an individual cause and a counterfactual root condition motivated by mediation analysis. Building on this definition, we propose the probability of root cause (PRC), which quantifies how probable it is that a candidate variable set is the root cause of a given outcome, conditional on observed evidence. Under standard assumptions, we establish the identifiability of the PRC and derive an explicit identification formula. Two numerical examples illustrate the approach.
Studies of HPV vaccine efficacy usually record infections with vaccine-targeted and nontargeted strains. In contrast to blinded randomized controlled trials, observational studies are susceptible to confounding bias, and risk compensation may occur. Etievant et al. (Biometrics, 2023) proposed to use cervical infections with nontargeted HPV strains to reduce or remove confounding bias of estimates of vaccine efficacy on targeted strains. However, they assumed that vaccinated women could not change their behavior after vaccination. We consider a more plausible setting where unmeasured sexual behavior acts as both a confounder and a mediator, and investigate whether the quantity estimated in practice with their method has a clear causal meaning. We demonstrate that using nontargeted HPV infections can remove both confounding bias and the portion of the vaccine effect on the targeted HPV strains that is mediated through the change of behavior. In that case, the estimated quantity has a clear causal interpretation as it represents the direct immunological effect of the vaccine. However, it could be considered misleading from a public health perspective, as in the presence of risk compensation it would suggest higher protection than what women effectively experience. An unblinded randomized controlled trial would allow estimation of the total causal effect of the vaccine, and infections with nontargeted HPV strains could then be used to isolate the indirect behavioral effect of the vaccine.
Traditional analysis of marked spatial point processes often relies on global summary statistics, which tend to obscure local spatial heterogeneity by averaging dependencies across the entire observation window. To overcome this limitation, this paper introduces a framework for Local Indicators of Mark Association (LIMA) specifically designed for composition-valued marks. Such marks, characterized by their non-negative components and sum-to-constant constraint, require a specialized treatment within the Aitchison geometry. By employing log-ratio transformations, we project these constrained marks into a Euclidean space, enabling the point-specific decomposition of global mark characteristics. The efficacy of the proposed centered log-ratio (clr)-based LIMA functions is validated through extensive simulation studies. The results demonstrate a superior capacity to detect localized mark clusters, achieving detection accuracies consistently higher than their global counterparts.
The practical utility of this framework is demonstrated using an empirical dataset of economic sector compositions in Castile-La Mancha, Spain. The analysis uncovers latent economic clustering patterns and localized \textit{drainage} effects that are invisible to global metrics, providing granular insights into regional spatial dynamics. Our findings suggest that the extended LIMA framework serves as a vital diagnostic tool for high-dimensional, non-stationary marked point patterns.
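For reference, the centered log-ratio (clr) transform that underlies this construction is straightforward to compute; the sketch below shows the transform only (standard Aitchison-geometry material), not the paper's LIMA functions:

```python
import numpy as np

def clr(compositions, eps=1e-12):
    """Centered log-ratio transform: maps each row of a composition matrix
    (non-negative parts summing to a constant) into Euclidean coordinates
    by subtracting the log of the row's geometric mean."""
    x = np.asarray(compositions, dtype=float) + eps   # guard against zero parts
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# Example: three points carrying 3-part compositional marks
marks = np.array([[0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8],
                  [0.4, 0.4, 0.2]])
clr_marks = clr(marks)   # each row sums to ~0 in clr coordinates
```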
We study posterior contraction rates for sparse Bayesian Kolmogorov-Arnold networks (KANs) over anisotropic Besov spaces, providing a statistical foundation of KANs from a Bayesian point of view. We show that sparse Bayesian KANs equipped with spike-and-slab-type sparsity priors attain the near-minimax posterior contraction. In particular, the contraction rate depends on the intrinsic anisotropic smoothness of the underlying function. Moreover, by placing a hyperprior on a single model-size parameter, the resulting posterior adapts to unknown anisotropic smoothness and still achieves the corresponding near-minimax rate. A distinctive feature of our results, compared with those for standard sparse MLP-based models, is that the KAN depth can be kept fixed: owing to the flexibility of learnable spline edge functions, the required approximation complexity is controlled through the network width, spline-grid range and size, and parameter sparsity. Our analysis develops theoretical tools tailored to sparse spline-edge architectures, including approximation and complexity bounds for Bayesian KANs. We then extend to compositional Besov spaces and show that the contraction rates depend on layerwise smoothness and effective dimension of the underlying compositional structure, thereby effectively avoiding the curse of dimensionality. Together, the developed tools and findings advance the theoretical understanding of Bayesian neural networks and provide rigorous statistical foundations for KANs.
$U$-statistics play a central role in statistical inference. In many modern applications, however, acquiring the labels required for $U$-statistics is costly. Motivated by recent advances in active inference, we develop an active inference framework for $U$-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting $U$-statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to $U$-statistic-based empirical risk minimization. Experiments on real datasets demonstrate substantial gains in estimation efficiency over baseline methods, while maintaining target coverage.
Algorithmic systems now set prices across auto insurance, credit, and lending markets, and regulators increasingly require firms to demonstrate that these systems do not discriminate against protected groups. The standard audit regresses pricing output on a protected attribute and legitimate rating factors, then tests the resulting coefficient using ordinary least squares standard errors. We show that this approach is structurally invalid. Pricing algorithms are usually deterministic, so residuals reflect approximation error rather than sampling variability, rendering classical standard errors invalid in both direction and magnitude. We derive correct asymptotic variance estimators for OLS and GLM audit regressions and the correct cross-covariance formula for proxy discrimination testing. Applied to quoted premiums from 34 Illinois auto insurers, every insurer fails the conditional demographic parity test, with minority zip codes paying $34-$158 more per year than comparable-risk white zip codes. The standard proxy discrimination formula flags zero insurers. However, our corrected formula identifies all 34 as statistically significant, of which 16 exceed the substantive threshold. Our framework provides statistically valid audit tools for any deterministic algorithmic system subject to regression-based fairness testing.
Probabilistic partial least squares (PPLS) is a central likelihood-based model for two-view learning when one needs both interpretable latent factors and calibrated uncertainty. Building on the identifiable parameterization of Bouhaddani et al.\ (2018), existing fitting pipelines still face two practical bottlenecks: noise--signal coupling under joint EM/ECM updates and nontrivial handling of orthogonality constraints. Following the fixed-noise scalar-likelihood line of Hu et al.\ (2025), we develop an end-to-end framework that combines noise pre-estimation, constrained likelihood optimization, and prediction calibration in one pipeline. Relative to Hu et al.\ (2025), we replace full-spectrum noise averaging with noise-subspace estimation and replace interior-point penalty handling with exact Stiefel-manifold optimization. The noise-subspace estimator attains a signal-strength-independent leading finite-sample rate and matches a minimax lower bound, while the full-spectrum estimator is shown to be inconsistent under the same model. We further extend the framework to sub-Gaussian settings via optional Gaussianization and provide closed-form standard errors through a block-structured Fisher analysis. Across synthetic high-noise settings and two multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage without post-hoc recalibration, reaches Ridge-level point accuracy on TCGA-BRCA at rank $r=3$, matches or exceeds PO2PLS on cross-view prediction while providing native calibrated uncertainty, and improves stability of parameter recovery.
Conformal prediction provides finite-sample marginal validity, but many applications require coverage that adapts to heterogeneous test points or subpopulations. Existing methods for conditional coverage are largely analyzed case by case, leaving limited general theory for how asymptotic conditional validity arises, how different procedures should be compared, and how such guarantees extend to structured data. We develop a unified framework and theory for conformal methods targeting conditional coverage. Within this framework, we derive non-asymptotic bounds for conditional miscoverage through two complementary routes: a pointwise route for direct score control and an $L_p$ route for quantile-centered methods. The theory clarifies the error sources governing asymptotic conditional validity, yields a common interpretation of existing methods, and supports applications and extensions to conditional-coverage-oriented model selection, localization under covariate shift, structured-data settings through a weighted symmetry-based formulation and more. Numerical results support the theoretical conclusions.
Aligning layouts with the linear mixed model later used for analysis improves genetic gain from fixed plot resources.
Plant breeding programs use data obtained from multi-environment selection experiments to produce improved varieties with the ultimate aim of maintaining high levels of genetic gain. Selection accuracy can be improved with the use of advanced statistical analytical methods that use informative and parsimonious variance models for the set of genotype by environment interaction effects, include information on genetic relatedness and appropriately accommodate non-genetic sources of variation within the framework of a single-step estimation and prediction algorithm. Maximal gains from using these advanced techniques are more likely to be achieved if the designs used match the aims of the selection experiment and make full use of the available resources. In this paper we present an approach for constructing designs for selection experiments that are optimal or near-optimal against a robust and sensible linear mixed model, which reflects the models used for analysis. The approach is flexible and introduces an additional step to accommodate efficient resource allocation of replication status to genotypes, which is undertaken prior to the allocation of plots to genotypes. A motivating example is used to illustrate the approach; two illustrative examples are presented, one each for single- and multiple-environment selection experiments; and several in-silico simulation studies demonstrate the advantages of these approaches.
Causal graphs may inform covariate adjustment for estimating causal effects and improve estimation efficiency by exploiting the graphical structure. In many applications, however, the target causal parameter may not be point-identified due to the presence of unmeasured confounding. Sensitivity analysis methods address this challenge by characterizing bounds on the causal parameter under varying assumptions about the magnitude or form of unmeasured confounding. We focus on semiparametric efficient estimation of causal effects in non-identifiable settings, assuming a known (or hypothesized) causal graph. We propose an influence function projection approach that exploits the conditional independence constraints implied by the graph to improve the efficiency of semiparametric estimators of upper and lower bounds on the average causal effect under a given sensitivity analysis model. Our approach applies across multiple sensitivity analysis frameworks and causal estimands, thereby connecting knowledge of graphical structure with the sensitivity analysis literature. We illustrate our approach through simulations and real data examples thought to be affected by unmeasured confounding, including the effect of a labor training program on post-intervention earnings, and the effect of low ejection fraction on heart failure death.
The validity of statistical inference depends critically on how data are collected. When data gathered through active data collection (ADC) are reused for a post-hoc inferential task, conventional inference can fail because the sampling is adaptively biased toward regions favored by the collection strategy. This issue is especially pronounced in black-box optimization, where sequential model-based optimization (SMBO) methods such as the tree-structured Parzen estimator (TPE) and Gaussian process upper confidence bound (GP-UCB) preferentially concentrate evaluations in promising regions. We study statistical inference on actively collected data when the inferential target is constructed in a data-dependent manner after data collection. To enable valid inference in this setting, we propose post-ADC inference, a framework that accounts for the biases arising from both the active data collection process and the subsequent data-driven target construction. Our method builds on selective inference and provides valid $p$-values and confidence intervals that correct for both sources of bias. The framework applies to a broad class of ADC processes by imposing only assumptions on the observation noise, without requiring any assumptions on the underlying black-box function or the surrogate model used by the SMBO algorithm. Empirical results also show that post-ADC inference provides valid inference for data collected by GP-UCB and TPE.
In causal inference with ordinal outcomes, several interpretable estimands are functions of the probability that the potential outcome under one treatment is larger than that under another treatment for the same unit. This probability depends on the joint distribution of both potential outcomes and is generally not identifiable. Existing work has focused on sharp bounds of this probability based on partial identification, but bounds are often too wide to be informative. We propose a copula-based method that links the identifiable marginal distributions of the potential outcomes via a parametric copula, treating the copula association parameter as a sensitivity parameter. With a fixed copula parameter, the estimands become identified functionals of the observed data. Working under unconfoundedness, we derive the efficient influence function in the nonparametric model and construct one-step estimators that accommodate flexible nuisance estimation. The resulting procedure is rate-doubly-robust and attains the semiparametric efficiency bound under standard conditions. Varying the copula parameter yields a sensitivity curve with point-wise confidence bands that typically lie within the sharp bounds, providing an interpretable bridge between partial identification and point estimation. We further provide a comprehensive sensitivity analysis with respect to both the copula specification and the unconfoundedness assumption. We develop an associated R package \texttt{ordinalCI}.
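The following sketch illustrates the core sensitivity computation for one pair of marginal distributions, using a Gaussian copula with correlation `rho` as the sensitivity parameter; it is a plug-in toy version that omits covariates, influence-function corrections, and the details of the `ordinalCI` package:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def prob_y1_greater_y0(pmf1, pmf0, rho):
    """P(Y(1) > Y(0)) when the marginal pmfs over ordinal levels 0..K-1 are
    linked by a Gaussian copula with correlation rho (the sensitivity parameter)."""
    eps = 1e-12
    F1 = np.clip(np.cumsum(pmf1), eps, 1 - eps)   # marginal CDFs
    F0 = np.clip(np.cumsum(pmf0), eps, 1 - eps)
    biv = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

    def C(u, v):                                  # Gaussian copula CDF
        return biv.cdf([norm.ppf(u), norm.ppf(v)])

    K, total = len(pmf1), 0.0
    for j in range(K):                            # level of Y(1)
        for k in range(K):                        # level of Y(0)
            u_hi, u_lo = F1[j], (F1[j - 1] if j > 0 else eps)
            v_hi, v_lo = F0[k], (F0[k - 1] if k > 0 else eps)
            cell = C(u_hi, v_hi) - C(u_lo, v_hi) - C(u_hi, v_lo) + C(u_lo, v_lo)
            if j > k:                             # Y(1) strictly larger
                total += max(cell, 0.0)
    return total

# Tracing rho over a grid yields the sensitivity curve described in the abstract
pmf1, pmf0 = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
curve = {rho: prob_y1_greater_y0(pmf1, pmf0, rho) for rho in (-0.5, 0.0, 0.5)}
```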
We present the Spatial Adapter, a parameter-efficient post-hoc layer that equips any frozen first-stage predictor with a structured spatial representation of its residual field and an induced closed-form spatial covariance. The adapter operates as a cascade second stage on residuals, jointly learning a spatially regularized orthonormal basis and per-sample scores via a tractable mini-batch ADMM procedure, without modifying any first-stage parameter. Because the first-stage parameters are frozen, the adapter does not retrain the backbone; its role is to supply a compressed distributional summary of the residual field. Smoothness, sparsity, and orthogonality together turn a generic low-rank factorization into an identifiable spatial representation whose induced residual covariance admits a closed-form low-rank-plus-noise estimator; the effective rank is determined data-adaptively by spectral thresholding, while the nominal rank $K$ is an optimization-side upper bound only. This covariance enables kriging-style spatial prediction at unobserved locations, with plug-in uncertainty quantification as a secondary downstream use. Across synthetic data, Weather2K for spatial-holdout prediction, and GWHD patch grids as a basis-transferability diagnostic, the adapter recovers residual spatial structure when paired with frozen first stages from linear models to deep spatiotemporal and vision backbones; the added representation uses fewer than $K(N+T)$ parameters alongside a compact residual-trend network.
Closed-form decomposition separates repeatability from between-lab variance and tests whether labs differ in baseline or in slope.
This paper proposes a framework for evaluating the statistical precision of measurement methods from interlaboratory studies where the outcome is a dose-response relationship summarized by a regression line. For such measurement methods, where a linear mixed-effects model is applied that allows laboratories to differ in both baseline level and dose-response slope, we define the precision evaluation metrics specified in ISO 5725: repeatability and between-laboratory variances. These are method-level precision metrics, and the latter are constructed as design-averaged dose-specific between-laboratory variances over the dose levels and the participating laboratories. For fully balanced designs with common dose levels and equal replication, we obtain an exact decomposition of the total sum of squares, closed-form analysis of variance (ANOVA) estimators of the precision variances, and three associated $F$-tests targeting (i) the overall dose-response trend, (ii) homogeneity of intercepts, and (iii) homogeneity of slopes across laboratories. This formulation enables precision to be quantified and estimated directly and supports an evaluation of whether between-laboratory discrepancies are caused primarily by baseline shifts or by differences in sensitivity, in contrast to fixed-effect comparisons that only detect the presence of differences. Furthermore, we analyze data obtained from an interlaboratory study on observations in bronchoalveolar lavage fluid from experiments involving the intratracheal administration of nanomaterials to rats, using the proposed method as a case study.
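In notation suggested by the abstract (mine, not necessarily the paper's), the underlying model for laboratory $i$, dose level $x_j$, and replicate $k$ can be written as
\[
  y_{ijk} = (\beta_0 + a_i) + (\beta_1 + b_i)\,x_j + \varepsilon_{ijk},\qquad
  a_i \sim \mathcal{N}(0,\sigma_a^2),\;\; b_i \sim \mathcal{N}(0,\sigma_b^2),\;\;
  \varepsilon_{ijk} \sim \mathcal{N}(0,\sigma_r^2),
\]
so that $\sigma_r^2$ plays the role of the repeatability variance, the dose-specific between-laboratory variance at dose $x_j$ is $\sigma_a^2 + x_j^2\sigma_b^2$ (assuming independent intercept and slope effects), with the method-level metric obtained by averaging over the design's dose levels, and the three $F$-tests correspond to $\beta_1 = 0$, $\sigma_a^2 = 0$, and $\sigma_b^2 = 0$, respectively.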
Many real-world networks exhibit hierarchical, tree-like structure and heavy-tailed degree distributions, phenomena not readily captured by standard statistical models for network data. Extensions of the popular continuous latent space modeling framework have been proposed to accommodate such networks. Drawing on insights from statistical physics, continuous latent space models with underlying hyperbolic geometry have been proposed as a natural framework, probabilistically embedding nodes in a latent Riemannian manifold with constant negative curvature. Most statistical implementations, however, simplify the original physics-based model by omitting the ``temperature parameter,'' which controls the sharpness of the latent distance-to-probability mapping. We argue this omission is critical. We demonstrate that temperature is the fundamental parameter governing a network's tree-like topology, and that failing to infer it weakens model expressiveness. We formalize a Bayesian hyperbolic continuous latent space model with an unknown, learnable temperature parameter. We then develop two inferential procedures: a Hamiltonian Monte Carlo approach for rigorous posterior characterization and a scalable auto-encoding variational Bayes algorithm for large-scale networks. Through simulation and real data examples, we show that our model outperforms models with fixed temperature and misspecified Euclidean geometries in graph reconstruction tasks in most settings, confirming temperature is a crucial and inferable feature of complex networks.
We develop a unified operator framework for scalar, multivariate, and functional regression based on integral operators defined with respect to general measures. Within this framework, classical regression models, including scalar-on-function, function-on-scalar, function-on-function, and multivariate multiple regression, arise as special cases corresponding to different choices of input and output measures. We establish three main results. First, we show that the standard regression taxonomy can be expressed as a single operator under varying measures. Second, we demonstrate that discrete representations correspond to exact operator evaluations under discrete measures and converge to the continuous operator as the observation grid is refined. Third, we show that estimation under the discrete-measure formulation reduces to standard multivariate regression, with statistical properties governed by classical results. A simulation study illustrates these principles, highlighting the roles of discretization, conditioning, and estimation. Overall, the proposed framework clarifies the relationship between functional and multivariate regression and provides a meaningful interpretation of discretized modeling approaches as operator estimation under different measure specifications. This perspective also explains why vectorized multivariate regression is often competitive with functional methods in linear settings: it directly estimates the discrete-measure representation of the underlying operator.
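A worked version of the unifying equation (notation mine): writing the regression as an integral operator with respect to an input measure $\mu$,
\[
  Y(t) = \int X(s)\,\beta(s,t)\,d\mu(s) + \varepsilon(t),
\]
a discrete input measure $\mu = \sum_{j=1}^{p}\delta_{s_j}$ with outputs recorded at $t_1,\dots,t_q$ gives exactly
\[
  Y(t_i) = \sum_{j=1}^{p} X(s_j)\,\beta(s_j,t_i) + \varepsilon(t_i),\qquad i=1,\dots,q,
\]
i.e., multivariate multiple regression $Y = XB + E$ with $B_{ji} = \beta(s_j,t_i)$, while continuous or discrete choices of the input and output measures recover the scalar-on-function, function-on-scalar, and function-on-function special cases listed above.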
Autoencoders group external patients by alignment to development data and show when overall results hide subgroup successes or failures.
Background: External validation is essential for assessing the transportability of predictive models. However, its interpretation is often confounded by differences between external and development populations. This study introduces a framework to distinguish model deficiencies from case-mix effects.
Method: We propose a framework that quantifies each external patient's similarity to the development data and measures performance in subgroups with varying levels of alignment to the development distribution. We use generative models, specifically autoencoders, to estimate similarity, offering a more flexible alternative to traditional linear approaches and enabling validation without sharing the original development data. The utility of the autoencoder-based similarity measure is demonstrated using synthetic data, and the framework's application is illustrated using data from the Netherlands Heart Registration (NHR) to predict mortality after transcatheter aortic valve implantation.
Results: Our framework revealed substantial variation in model performance across similarity-defined subgroups, differences that remain hidden under conventional external validation yet can meaningfully alter conclusions. In several settings, conventional external validation suggested poor overall performance; however, after accounting for differences in patient characteristics, model performance in some subgroups was consistent with internal validation results. Conversely, apparently acceptable overall performance could mask clinically relevant performance deficits in specific subgroups.
Conclusion: The proposed framework enhances the interpretability of external validation by linking model performance to population alignment with the development data. This provides a more principled basis for deciding whether a model is transportable and to which patients it can be safely applied.
Prediction markets (e.g., Polymarket, Kalshi) allow participants to bet on future events, producing real-time forecasts based on collective judgment. In domains such as elections and finance, markets have been effective at aggregating information, often rivaling or outperforming expert forecasters or polls. Whether this performance extends to infectious disease dynamics is unclear. Participants are self-selected and typically lack epidemiological expertise. However, markets can respond in real time to emerging news and unstructured signals in ways that standard forecasting pipelines cannot. Also, substantial financial stakes encourage participants to make an effort to be accurate. We evaluate Polymarket forecasts during 2025 and 2026 for two settings: weekly cumulative influenza hospitalizations in the US, which have an established expert-curated forecasting ensemble (CDC FluSight), and monthly measles cases, which do not. Across both settings, prediction markets fail to outperform standard benchmarks. For influenza, markets are competitive with low-performing individual FluSight models but are dominated by the FluSight ensemble: even when we combine market forecasts with the ensemble, the best combination puts zero weight on the markets. For measles, markets are outperformed by simple statistical baselines. We diagnose two sources of market inefficiency: placement of probability mass on impossible outcomes (e.g., decreasing values in cumulative forecasts) and low trading volume. These results suggest that current prediction markets are not reliable forecasters of infectious disease dynamics on their own or useful as complementary features for existing forecasting systems.
Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in order to maximize a cumulative outcome of interest (e.g. revenue). Existing adaptive experimentation methods either assume the interference network is fully known or bypass the network by operating on coarse cluster-level randomizations. We develop a Thompson sampling algorithm that jointly learns the interference network and adaptively optimizes individual-level treatment allocations via a Gibbs sampler. The algorithm returns both an optimized treatment policy and an estimate of the interference network; the latter supports downstream causal analyses such as estimation of direct, indirect, and total treatment effects. For additive spillover models, we show that total reward is linear in the treatment vector with coefficients given by an $n$-dimensional latent score. We prove a Bayesian regret bound of order $\sqrt{nT \cdot B \log(en/B)}$ for exact posterior sampling; empirically, our Gibbs-based approximate sampler achieves regret consistent with this rate and remains sublinear when the additive spillovers assumption is violated. For general Neighborhood Interference, where this reduction is unavailable, we analyze an explore-then-commit variant with $O(n^2 \log T)$ graph-discovery cost. An information-theoretic $\Omega(n \log T)$ lower bound complements both results. Empirically, our method achieves more than an order-of-magnitude reduction in regret in head-to-head comparisons. On two real-world networks, the algorithm achieves sublinear regret and yields downstream effect estimates with small RMSE relative to the truth.
Many three-dimensional spatial fields are anisotropic, with directions of rapid and slow variation that need not align with the coordinate axes. Standard Gaussian process kernels with Automatic Relevance Determination (ARD) capture only axis-aligned anisotropy, while generic full symmetric positive definite (SPD) metrics can represent rotated anisotropy but do not parameterise principal length-scales and directions directly. We introduce an interpretable rotationally anisotropic GP kernel that parameterises a three-dimensional SPD covariance metric using three principal length-scales and an explicit SO(3) rotation. The rotation is represented by an axis-angle vector and mapped to SO(3) via the Lie-algebra exponential map, giving unconstrained Euclidean coordinates for inference while always inducing a valid SPD metric.
The construction spans the same family of three-dimensional SPD covariance metrics as a generic full-SPD parameterisation, but exposes the geometry differently: length-scales and orientation are explicit, interpretable, and directly available for prior specification and posterior summaries. We perform Bayesian inference on these quantities using Markov Chain Monte Carlo (MCMC), and characterise the resulting symmetries and weakly identified regimes.
On synthetic data with rotated anisotropy, the posterior recovers the generating metric and improves prediction relative to an axis-aligned ARD baseline, while matching the predictive performance of a generic full SPD baseline. When the ground truth is axis-aligned, posterior mass concentrates near the identity rotation and predictive performance matches ARD. On a material-density dataset from a laboratory-fabricated nano-brick, the inferred metric reveals rotated anisotropy that is not captured by axis-aligned kernels.
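A small sketch of the kernel family described above, using the standard Rodrigues (exponential-map) construction; this is illustrative code, not the authors' implementation, and all names are hypothetical:

```python
import numpy as np

def rotation_from_axis_angle(w):
    """Map an axis-angle vector w in R^3 to a rotation matrix via the
    Lie-algebra exponential map (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rotated_ard_kernel(X1, X2, lengthscales, axis_angle, variance=1.0):
    """Squared-exponential kernel with SPD metric M = R diag(1/l^2) R^T,
    i.e. principal length-scales along rotated axes; axis_angle = (0, 0, 0)
    recovers the axis-aligned ARD kernel."""
    R = rotation_from_axis_angle(np.asarray(axis_angle, dtype=float))
    M = R @ np.diag(1.0 / np.asarray(lengthscales, dtype=float) ** 2) @ R.T
    d = X1[:, None, :] - X2[None, :, :]                # pairwise differences
    sq = np.einsum('ijk,kl,ijl->ij', d, M, d)          # squared Mahalanobis distances
    return variance * np.exp(-0.5 * sq)
```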
Bayesian inference provides principled uncertainty quantification, but accurate posterior sampling with MCMC can be computationally prohibitive for modern applications. Variational inference (VI) offers a scalable alternative and often yields accurate predictive distributions, but cheap variational families such as mean-field (MF) can produce over-concentrated approximations that miss posterior dependence. We propose variational predictive resampling (VPR), a scalable posterior sampling method that exploits VI's predictive strength within a predictive-resampling framework to better approximate the Bayesian posterior. Given a prior-likelihood pair, VPR repeatedly imputes future observations from the current variational predictive, updates the variational approximation after each imputation, and records the parameter value implied by the completed sample. We establish conditions under which the law of the parameter returned by VPR is well defined and show that its finite-horizon approximation converges to this limit. In a tractable Gaussian location model, we show that VPR with MF variational predictives converges to the exact Bayesian posterior, whereas the optimal MF-VI approximation retains a non-vanishing asymptotic gap. Experiments on linear regression, logistic regression, and hierarchical linear mixed-effects models demonstrate that VPR substantially improves posterior uncertainty quantification and recovers posterior dependence missed by MF-VI, while remaining computationally competitive with, and often more efficient than, MCMC.
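To make the predictive-resampling loop concrete, here is a sketch in a conjugate Normal location model with known variance, where the exact predictive stands in for the variational predictive used by VPR; this is the generic predictive-resampling idea under those simplifying assumptions, not the authors' algorithm, and the recorded limits can be checked against the exact posterior:

```python
import numpy as np

def predictive_resampling_gaussian(y, sigma2=1.0, m0=0.0, tau2_0=10.0,
                                   n_forward=2000, n_draws=500, seed=0):
    """Repeatedly impute future observations from the current predictive,
    update the conjugate posterior, and record the limiting posterior mean;
    across independent forward passes these limits approximate draws from
    the posterior of the location parameter."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    draws = np.empty(n_draws)
    for b in range(n_draws):
        prec = 1.0 / tau2_0 + len(y) / sigma2            # posterior precision
        mean = (m0 / tau2_0 + y.sum() / sigma2) / prec   # posterior mean
        for _ in range(n_forward):
            y_star = rng.normal(mean, np.sqrt(1.0 / prec + sigma2))  # predictive draw
            prec += 1.0 / sigma2
            mean += (y_star - mean) / (sigma2 * prec)
        draws[b] = mean
    return draws
```

For instance, `predictive_resampling_gaussian(np.array([1.2, 0.8, 1.5]))` yields draws whose spread can be compared directly with the exact Normal posterior for this model.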
State-transition models are essential across epidemiology and ecology, but statistical inference remains challenging owing to high-dimensional latent state spaces, temporal dependence, and intractable likelihood functions. Bayesian inference via Markov Chain Monte Carlo (MCMC) enables joint estimation of model parameters and missing event times through data augmentation, but Metropolis-within-Gibbs (MWG) schemes that combine multiple specialised kernels are notoriously difficult to implement. Current probabilistic programming frameworks face a trade-off: automation sacrifices extensibility, whilst flexibility demands substantial implementation overhead. This divide has created a software landscape characterised by tightly coupled, model-specific implementations that resist reuse and extension. We introduce gemlib.mcmc, an MCMC module designed to bridge methodological and applied communities through principled, composable kernel abstractions. The framework employs writer monads from category theory to formalise kernel composition, enabling seamless integration of parameter-estimation and data-augmentation kernels without manual state management. Built on JAX and TensorFlow Probability for high-performance computation, gemlib.mcmc provides an ergonomic interface -- leveraging Python's right-shift operator for intuitive kernel chaining -- whilst maintaining statistical rigour and transparency. Developers can extend the library by implementing only two methods; composition and hardware acceleration are automated. We demonstrate the framework through parameter inference on partially observed epidemic models, showing how complex inference algorithms can be expressed concisely and reused across applications. By reducing implementation burden we provide access to sophisticated MCMC methods and enable applied researchers to employ state-of-the-art algorithms without reimplementation overhead.
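The kernel-chaining pattern the abstract describes can be illustrated with a toy composition operator; the class names and interface below are invented for exposition and are not the actual gemlib.mcmc API:

```python
# Hypothetical illustration of right-shift kernel chaining (not gemlib.mcmc itself).
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class Kernel:
    step: Callable[[Any, Any], Tuple[Any, Any]]   # (state, rng) -> (new_state, info)

    def __rshift__(self, other: "Kernel") -> "Kernel":
        """Chain two kernels: apply self, then other, threading the chain state.
        Accumulating the per-kernel `info` records plays the role that the
        writer-monad log plays in the framework described above."""
        def composed(state, rng):
            state, info_a = self.step(state, rng)
            state, info_b = other.step(state, rng)
            return state, (info_a, info_b)
        return Kernel(composed)

# Usage sketch (kernel constructors are placeholders):
#     sampler = parameter_update_kernel >> event_time_augmentation_kernel
#     state, info = sampler.step(initial_state, rng)
```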
The joint hidden-state and back-propagated dynamics approach a continuous forward-backward system whose error shrinks with depth and the number of heads.
We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.
Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input--output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding~$d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces~$p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a na\"ive Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.
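As a toy numerical illustration of the separation constraint (under one plausible reading, assumed here for the sketch: the mapped input must score its own target above every competing target), the following snippet checks how many associations a naive Hebbian weight matrix stores as the number of associations grows; it illustrates the constraint structure only, not the statistical-physics capacity result.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 200                                            # embedding dimension
for p in (100, 400, 1600):                         # number of stored associations
    X = rng.normal(size=(p, d)) / np.sqrt(d)       # input embeddings
    Y = rng.normal(size=(p, d)) / np.sqrt(d)       # target embeddings
    W = Y.T @ X                                    # naive Hebbian rule: sum_i y_i x_i^T
    scores = (X @ W.T) @ Y.T                       # scores[i, j] = <W x_i, y_j>
    stored = scores.argmax(axis=1) == np.arange(p) # correct target beats all competitors
    print(f"p={p}: fraction of associations stored = {stored.mean():.3f}")
```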
We study sparse recovery when observations come from mixed-quality sources: a small collection of high-quality measurements with small noise variance and a larger collection of lower-quality measurements with higher variance. For this heterogeneous-noise setting, we establish sample-size conditions for information-theoretic and algorithmic recovery. On the information-theoretic side, we show that it is sufficient for the numbers of high- and low-quality samples $(n_1, n_2)$ to satisfy a linear trade-off defining the Price of Quality: the number of low-quality samples needed to replace one high-quality sample. In the agnostic setting, where the decoder is completely agnostic to the quality of the data, this price is uniformly bounded; in particular, one high-quality sample is never worth more than two low-quality samples for this sufficient condition to hold. In the informed setting, where the decoder is informed of per-sample variances, the price of quality can grow arbitrarily large. On the algorithmic side, we analyze the LASSO in the agnostic setting and show that the recovery threshold matches the homogeneous-noise case and only depends on the average noise level, revealing a striking robustness of computational recovery to data heterogeneity. Together, these results give the first conditions for sparse recovery with mixed-quality data and expose a fundamental difference between how the information-theoretic and algorithmic thresholds adapt to changes in data quality.
Context: Indirect treatment comparisons (ITC) are essential when direct head-to-head evidence is unavailable. Their reliability depends on rigorous methodological choices and careful assessment of underlying assumptions. Appropriate methodological choices can help address challenges such as cross-country variations in treatment practices, ethical constraints, and evolving treatment landscapes during trial conduct. This opinion and perspective paper provides practical guidance to strengthen the quality, robustness and accuracy of ITCs in the context of health technology assessment (HTA) in France. Methods: A panel of experts in ITCs and French market access environment developed the present strategic guidance, informed by previous work reviewing HTA methodological guidelines and complemented by a systematic review of Transparency Committee opinions from the French National Authority for Health (HAS). Results: Key considerations include early anticipation of ITCs, justification of potential confounding factors, and rigorous assessment of similarity and transitivity in randomized trial-based comparisons. In network meta-analysis, the structure of the evidence network should be adapted to the specific decision context. Population-Adjusted Indirect Comparisons require careful reporting and interpretation of the effective sample size. When evidence relies on non-randomized clinical trials, comparisons between single-arm studies and external control arms may be appropriate under different scenarios, depending on the feasibility of conducting subsequent randomized studies. Conclusions: Robust and reliable ITCs require methods consistent with the validity of their assumptions and the strength of the available evidence. This practical guidance supports the development of rigorous ITCs to inform decision-making in complex medical contexts where direct comparisons are not feasible.
Causal sensitivity analysis aims to provide bounds for causal effect estimates in the presence of unobserved confounding. However, existing methods for causal sensitivity analysis are per-instance procedures, meaning that changes to the dataset, causal query, sensitivity level, or treatment require new computation. Here, we instead present an in-context learning approach. Specifically, we propose an amortized approach to causal sensitivity analysis based on prior-data fitted networks. A key challenge is that the sensitivity bounds are not directly available when sampling training data. To address this, we develop a general prior-data construction that is applicable across the class of generalized treatment sensitivity models. Our construction involves a Lagrangian scalarization of the objective to generate training labels for the bounds through a tradeoff between causal effect minimization or maximization and sensitivity model violation, which avoids model-specific analytical derivations. We further show that, under standard convexity and linearity conditions, our objective recovers the full Pareto frontier of solutions. Empirically, we demonstrate our amortized approach across various datasets, causal queries, and sensitivity levels, where our approach achieves a test-time computation that is orders of magnitude faster than per-instance methods. To the best of our knowledge, ours is the first foundation model for in-context learning for causal sensitivity analysis.
Probabilistic linear solvers (PLSs) return probability distributions that quantify uncertainty due to limited computation in the solution of linear systems. The literature has traditionally distinguished between Bayesian PLSs, which condition a prior on information obtained from projections of the linear system, and probabilistic iterative methods (PIMs), which lift classical iterative solvers to probability space. In this work we show this dichotomy to be false: Bayesian PLSs are a special case of non-stationary affine PIMs. In addition, we prove that any realistic affine PIM is calibrated. These results motivate a focus on (non-stationary) affine PIMs, but their practical adoption has been limited by the significant manual effort required to implement them. To address this, we introduce affine tracing, an algorithmic framework that automatically constructs a PIM from a standard implementation of an affine iterative method by passing symbolic tracers through the computation to build an affine computational graph. We show how this graph can be transformed to compute posterior covariances, and how equality saturation can be used to perform algebraic simplifications required for computation under specific prior choices. We demonstrate the framework by automatically generating a probabilistic multigrid solver and evaluate its performance in the context of Gaussian process approximation.
The goal of an experiment is to evaluate the profit, the loss, or the amount of a physical entity over a period of time. The measurements $X_t$ can be influenced by the values measured in the past; hence we describe the situation with an autoregression model whose autoregression coefficients are generally unknown. The variable of interest is the error term $Z_t$ of the model, which is the increment of $X_t$ with respect to the past but is itself unobservable. The problem is to estimate various quantile functions of $Z$, as the risk measure of the loss or the related economic indicators. We construct estimates of the quantile functions of $Z$ in the situation where inference is possible only by means of the observations $X$. The proposed estimates are based on the R-estimators of the autoregression coefficients, combined with the autoregression quantiles.
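A simplified numerical stand-in for this setting (using an ordinary least-squares coefficient estimate in place of the R-estimators and autoregression quantiles of the paper) illustrates the plug-in logic: estimate the autoregression coefficient, form residuals as proxies for the unobservable increments, and read off their tail quantiles.

```python
import numpy as np

rng = np.random.default_rng(0)
rho_true, n = 0.6, 2000
Z = rng.standard_t(df=5, size=n)                  # heavy-tailed increments of interest
X = np.zeros(n)
for t in range(1, n):                             # AR(1) observations X_t
    X[t] = rho_true * X[t - 1] + Z[t]

rho_hat = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)   # stand-in coefficient estimate
resid = X[1:] - rho_hat * X[:-1]                          # proxies for the unobserved Z_t
for q in (0.01, 0.05, 0.95, 0.99):                        # tail quantiles as risk measures
    print(q, np.quantile(resid, q), np.quantile(Z[1:], q))
```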
This paper develops a quantitative framework to assess the robustness of Bayes-optimal decisions in finite decision problems under model uncertainty. We introduce two complementary stability notions for acts: the robustness radius, measuring the largest perturbation of a reference prior under which an act remains Bayes-optimal, and the contamination need, quantifying the minimal perturbation required for an act to become Bayes-optimal under some nearby prior. Both concepts are characterized via linear programming formulations and computed efficiently using bisection methods exploiting monotonicity properties. Building on these stability measures, we propose a cost-adjusted stability criterion that integrates robustness considerations with act-specific selection costs, yielding a parametric family of decision rules indexed by a regularization parameter. We analyze how optimal act selection evolves along this parameter and derive selection paths that reveal structural transitions between stability-driven and cost-driven regimes. The framework is applied to a portfolio choice problem under uncertainty between different economic regimes. Concretely, using data on historical ETF returns, we compute robustness and contamination profiles for six portfolio strategies and analyze their behavior under heterogeneous belief specifications. The results illustrate that robustness-based selection refines classical expected utility by accounting for prior misspecification.
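A minimal sketch of the robustness-radius computation, under an epsilon-contamination neighbourhood of the reference prior (the paper's perturbation classes and linear programming formulations may differ; this only illustrates the bisection on a monotone feasibility check):

```python
import numpy as np

U = np.array([[4.0, 1.0, 0.0],     # utility of act 0 in states 0, 1, 2
              [2.0, 2.0, 2.0],     # act 1: a "safe" act
              [0.0, 1.0, 5.0]])    # act 2
p0 = np.array([0.5, 0.3, 0.2])     # reference prior over states

a_star = int(np.argmax(U @ p0))    # Bayes-optimal act under the reference prior

def still_optimal(eps):
    # a_star stays optimal for every prior (1 - eps) * p0 + eps * r iff, for each rival
    # act, the worst contaminating prior (all mass on the least favourable state)
    # cannot make the expected-utility difference negative.
    for a in range(U.shape[0]):
        if a == a_star:
            continue
        diff = U[a_star] - U[a]
        if (1.0 - eps) * diff @ p0 + eps * diff.min() < 0.0:
            return False
    return True

lo, hi = 0.0, 1.0                  # bisection exploits monotonicity of still_optimal in eps
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if still_optimal(mid) else (lo, mid)
print("Bayes-optimal act:", a_star, " robustness radius ~", round(lo, 4))
```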
High-fidelity (HF) data are often expensive to collect and therefore scarce, making conditional quantiles difficult to estimate accurately. We propose a two-stage, model-agnostic method for multi-fidelity quantile regression. The central idea is a local quantile link: at each covariate value, the HF quantile is represented as a low-fidelity (LF) quantile evaluated at a covariate-dependent level. This reformulation reduces the problem to estimating the level function, which can be smoother than the HF quantile itself when the LF and HF conditional distributions have similar shapes. We also study the complementary regime in which this advantage weakens and introduce a correction step to improve robustness. Our theory characterizes when the proposed estimator converges faster than direct quantile regression using HF data alone and when the correction step provides further improvement. Experiments on synthetic and real data show that our method yields more accurate quantile estimates and tighter conformal prediction intervals.
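In symbols (notation introduced here only for illustration): writing $Q_H(\tau \mid x)$ and $Q_L(\tau \mid x)$ for the high- and low-fidelity conditional quantile functions, the local quantile link posits a level function $\alpha$ with
$$
Q_H(\tau \mid x) \;=\; Q_L\bigl(\alpha(x;\tau) \mid x\bigr),
$$
so estimating the HF quantile reduces to estimating $\alpha$, which is the smoother object whenever the two conditional distributions have similar shapes.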
We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $ n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
Guided-diffusion black-box optimization (BO) has shown strong empirical performance on structured design problems such as molecules and crystals, but its regret behavior remains poorly understood. Existing BO regret analyses typically rely on maximum information gain, non-pretrained surrogate models, or exact acquisition maximization -- assumptions that break down in modern guided-diffusion BO pipelines, where pretrained diffusion models serve as powerful priors over valid structures and acquisition maximization is replaced by approximate sampling over astronomically large discrete spaces. We develop a first certificate-based expected simple-regret framework for guided-diffusion BO that avoids maximum-information-gain bounds, RKHS assumptions, and exact acquisition maximization. The central quantity in our analysis is the mass lift: the increase in probability mass assigned to near-optimal designs relative to the pretrained generator. This view explains how exponential-looking finite-budget convergence and polynomial acceleration can both arise from the same mechanism. We also give practical diagnostics for estimating search exponents from finite candidate pools and a proposal-corrected resampling construction that provides a fully certified sampler instance.
Solving nonlinear partial differential equations (PDEs) using kernel methods offers a compelling alternative to traditional numerical solvers. However, the performance of these methods strongly depends on the choice of kernel. In this work, as the available information is inherently multifidelity, we propose a kernel learning approach based on cokriging, leveraging empirical information from multifidelity simulations. In the first step, we fit a differentiable non-stationary kernel to an empirical kernel obtained from low-fidelity simulations. In the second step, we derive a high-fidelity kernel with estimated hyperparameters, and construct a corresponding high-fidelity mean using the multifidelity framework. These components can then be used within a Gaussian process framework for solving PDEs. Finally, we demonstrate the performance of the proposed physics-informed method on the Burgers' equation.
Reliable uncertainty quantification is essential for the use of machine learning in physics, where scientific discoveries depend on validated probabilistic statements. We provide a structured overview of uncertainty quantification in ML for physics, introducing a unified taxonomy of uncertainty and clarifying the interpretation of predictive and inference uncertainties across frequentist and Bayesian frameworks. We discuss principled validation tools, including coverage, calibration, bias tests, and proper scoring rules, and illustrate them with simple regression and classification examples.
We propose a novel adaptive Mixture-of-Experts (MoE) framework for time series forecasting that enhances expert specialization by incorporating expert-specific loss information directly into the training process. Notably, the overall objective comprises the base forecasting loss and expert-specific losses, allowing expert-level prediction errors to jointly shape training alongside the global forecasting loss. This framework is further combined with a partial online learning strategy, enabling incremental updates of both the gating mechanism and expert parameters. This approach significantly reduces computational cost by eliminating the need for repeated full model retraining. By integrating expert-level loss awareness with efficient online optimization, the proposed method achieves improved learning efficiency while maintaining strong predictive performance. Empirical results across economic, tourism, and energy datasets with varying frequencies demonstrate that the proposed approach generally outperforms both statistical methods and state-of-the-art neural network models, such as Transformers and WaveNet, in forecasting accuracy and computational efficiency. Furthermore, ablation studies confirm the effectiveness of the expert-specific loss integration strategy, highlighting its contribution to enhancing predictive performance.
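A minimal sketch of the combined objective, with linear experts, a softmax gate, and uniform weights on the expert-specific losses (the exact weighting and architecture are assumptions of this illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(64, 8)), rng.normal(size=64)        # toy batch of forecasts
W_experts = 0.1 * rng.normal(size=(3, 8))                   # three linear experts
W_gate = 0.1 * rng.normal(size=(3, 8))                      # linear gating network

def objective(W_experts, W_gate):
    logits = X @ W_gate.T
    gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax gate
    preds = X @ W_experts.T                                  # per-expert predictions
    mixture = (gates * preds).sum(axis=1)                    # MoE forecast
    base_loss = np.mean((mixture - y) ** 2)                  # global forecasting loss
    expert_losses = np.mean((preds - y[:, None]) ** 2, axis=0)  # expert-specific losses
    return base_loss + expert_losses.mean(), base_loss, expert_losses

total, base, per_expert = objective(W_experts, W_gate)
print("total:", total, "base:", base, "per-expert:", per_expert)
```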
This paper aims at analyzing the regularization effect that data augmentation induces on supervised regression methods in the proportional regime, where the number of covariates grows proportionally to the number of samples. We provide a tight characterization of the test error, measured in mean squared error, in terms only of the population quantities of the true data, as well as first and second order statistics of the augmentation scheme. Our results are valid under misspecified feature maps, and for any network architecture where only the last readout layer is trained, and the rest of the network is either frozen or randomly initialized. We specify our results in the case of Gaussian data, and show that our asymptotic characterization is tight in this setting.
We present a theoretically grounded Gaussian process framework that leverages neural feature maps to construct expressive kernels. We show that the learned feature map can be interpreted as an optimal low-rank approximation to a Gram matrix derived from an implied RKHS, from which we establish consistency of the GP posterior. We further analyse the spectral properties of the induced kernels and introduce product feature-map kernels to address oversmoothing. This simple yet powerful approach enables fast, scalable, and accurate exact GP inference with minimal upfront work. The flexibility of kernel design supports seamless application to both regression and classification tasks across diverse data modalities, including tabular inputs and structured domains such as images. On benchmark datasets, this approach surpasses pre-existing methods in terms of accuracy and training and prediction efficiency.
This paper challenges the prevailing practice of accepting standardized factor loadings as low as .50 in confirmatory factor analysis. Drawing on the logic of Average Variance Extracted (AVE) and communality, the author argues for a stricter item-level threshold: only indicators with loadings of $\lambda \geq .70$ (implying $\lambda^2 \geq .50$) should be retained in final measurement models. The rationale is that indicators with $\lambda < .70$ contain more error than explained variance, undermining both construct validity and the stability of factor solutions. The paper reviews theoretical foundations, simulation evidence, and implications for structural equation modeling, showing that weak loadings degrade measurement quality, factor score determinacy, and model fit. Adopting a minimum $\lambda \geq .70$ rule aligns item-level standards with established construct-level criteria and enhances the rigor and interpretability of latent variable models.
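For concreteness, the variance-accounting algebra behind the threshold (standard CFA arithmetic for a standardized indicator, restated here for illustration): the indicator explains $\lambda^2$ of its variance and leaves $1 - \lambda^2$ as error, so
$$
\lambda^2 \;\geq\; 1 - \lambda^2 \quad \Longleftrightarrow \quad \lambda \;\geq\; \tfrac{1}{\sqrt{2}} \approx .707,
$$
and the conventional $\lambda \geq .70$ cutoff is the rounded form of this condition.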
The simulation of physical phenomena with computer models relies on the estimation of physical and/or numerical parameters calibrated to fit experimental data. The approximations within the computer model and the errors in the measurements lead to uncertainties in the calibrated parameters. Bayesian calibration offers a well-studied framework to provide reliable uncertainty quantification on the calibrated parameters. When dealing with complex computer codes whose outputs are infinite-dimensional, Bayesian calibration may be extended by providing a relevant distance in the output space. In this paper, Bayesian calibration is performed using distances from the large deformation diffeomorphic metric matching (LDDMM) framework. LDDMM distances can provide a suitable metric for infinite-dimensional shapes such as scalar fields (i.e. images) or function graphs. This metric can be interpreted as the minimal energy deformation required to transform one shape into another. As such, it provides a readily interpretable metric for Bayesian calibration. On top of this, the representation of the diffeomorphism group as an exponential transformation of an RKHS is compatible with Bayesian inference and makes it possible to define a predictive posterior distribution on the infinite-dimensional shape space.
Recent work on causal abstraction, in particular graphical approaches focusing on causal structure between clusters of variables, aims to summarize a high-dimensional causal structure in terms of a low-dimensional one. Existing methods for learning such summaries from data assume that both the high- and low-dimensional structures are acyclic, which is helpful for causal effect identification and reasoning but excludes many high-dimensional models and thus limits applicability. We show that in the linear non-Gaussian (LiNG) setting, the high-dimensional acyclicity assumption can be relaxed while still allowing recovery of a low-dimensional causal directed acyclic graph (DAG). We further connect identifiability of this low-dimensional DAG to existing results: LiNG models with cycles are observationally identifiable only up to an equivalence class whose members differ by reversals of directed cycles; our low-dimensional DAG, which is invariant across all members of a given equivalence class, thus forms a natural representative of the class. While existing approaches for learning this observational equivalence class over high-dimensional variables have exponential time complexity, our low-dimensional summary is learned in worst-case cubic time and comes with explicit bounds on the sample complexity. We provide open source code and experiments on synthetic data to corroborate our theoretical results.
Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this purpose because they approximate Bayesian posterior predictive distributions in a single forward pass. However, PFNs predict noisy future rewards, while Thompson sampling requires uncertainty over the latent mean reward function. We propose PFN-TS, a Thompson sampling algorithm that converts PFN posterior predictives into mean-reward samples using a subsampled predictive central limit theorem. The method estimates posterior variance from a geometric grid of $O(\log n)$ dataset prefixes rather than the full $O(n)$ predictive sequence used in previous predictive-sequence approaches, and reuses TabICL's cached representations across rounds. We prove consistency of the subsampled variance estimator and give a Bayesian regret bound that decomposes PFN-TS regret into exact posterior-sampling regret under the PFN prior plus approximation terms. Empirically, PFN-TS achieves the best average rank across nonlinear synthetic and OpenML classification-to-bandit benchmarks, remains competitive on linear and BART-generated rewards, and attains the highest estimated policy value in an offline mobile-health evaluation. Code is available at https://anonymous.4open.science/r/PFN_TS-36ED/.
This paper develops power and sample size formulas for causal inference with time-to-event outcomes. The target estimand is the marginal hazard ratio: the coefficient of a marginal structural Cox proportional hazard model with treatment as the only predictor. We extend the robust sandwich variance theory and derive the analytical form of the asymptotic variance for the inverse probability weighted partial likelihood estimator. Building on this, we derive a new sample size formula valid at any prespecified effect size, applicable to both randomized trials and observational studies. For randomized trials, the formula requires only the canonical inputs of treatment proportion, effect size, and event rate. The new formula corrects the mischaracterization of classic log-rank-based formulas. For observational studies, one additional input suffices: an overlap coefficient summarizing covariate similarity between comparison groups. We further develop a variance inflation approach applicable to any propensity score balancing weights, anchored to the corrected baseline variance.
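For reference, the classical Schoenfeld-type log-rank calculation uses exactly these canonical inputs (this is the textbook baseline that the corrected formula is compared against, not the paper's new formula): with treatment proportion $p$, hazard ratio $\mathrm{HR}$, two-sided level $\alpha$, and power $1-\beta$, the required number of events and sample size are
$$
d \;=\; \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{p(1-p)\,(\log \mathrm{HR})^2},
\qquad
n \;=\; \frac{d}{\Pr(\text{event})}.
$$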
We propose a method for summarizing multiple solutions to SEIR-type compartmental models on a functional space by computing a constrained power Fr\'echet mean with functional registration to obtain consensus epidemic trajectories with partial mechanistic interpretability. In our method, we regard the pairs of exposed and infectious compartments as objects in a Hilbert space, and the consensus trajectory is defined as the solution to a constrained optimization problem. Differential equation constraints and population constraints are incorporated in the optimization to preserve a partially mechanistic interpretation regarding the infectious compartment. The full dynamics with additional susceptible and removed compartments can then be recovered from the estimated trajectories and parameters. We develop an efficient block-optimization algorithm based on functional data analysis and illustrate the method using simulated and literature-derived epidemiological parameters for COVID-19 in the early phase of the pandemic that began in 2020. The proposed approach provides a generalized trajectory-summarization framework that includes mean- and median-type estimators on a functional space and holds potential for model averaging and ensemble forecasting in infectious disease modeling.
In many real-world settings such as online recommendation or consumer choice modeling, individuals make repeated choices from a fixed set of options. Accurately estimating their underlying preferences is essential for generating personalized future recommendations. Probabilistic models for understanding user choice behavior from past decisions can serve as a valuable addition to existing recommender systems and choice prediction methods. To this end, in this article, we introduce a novel statistical framework for predicting user preferences based on their past choices, under a natural monotonicity assumption: options that were chosen more frequently or more intensely in the past are more likely to be chosen again in the future. Our approach builds on a parametric model proposed by Le Goff and Soulier (2017), originally used to describe how ants in an ant colony select a path among many pre-existing paths. We propose a non-parametric generalization of this model, drawing inspiration from the generalized elephant random walk introduced by Maulik et al. (2024). We develop a method of maximum likelihood estimation of the user preference probabilities under the above-mentioned monotonicity constraint. We also derive theoretical guarantees for our estimator and demonstrate the effectiveness of our method through both simulated experiments and real-world datasets.
In this paper, we study the problem of sampling from a distribution under the constraint of differential privacy (DP). Prior works measure the utility of DP sampling with density ratio-based measures such as KL divergence. However, such formulations suffer from two key limitations: 1) they fail to capture the geometric structure of the support, and 2) they are not applicable when the supports of the distributions differ. To deal with these issues, we develop a novel framework for DP sampling with Wasserstein distance as the utility measure. In this formulation, we propose Wasserstein Projection Mechanism (WPM), a minimax optimal mechanism based on Wasserstein projection. Furthermore, we develop efficient algorithms for computing the proposed mechanisms approximately and provide convergence guarantees.
Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and none makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
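Specializing the retrieval-bandwidth slack to the per-node-uniform case $v(B_i) = v$ for all $i$ makes the stated $K^{-1/2}$ shrinkage explicit:
$$
\Delta_{\mathrm{RAG}}
\;=\; f_{\max}\sqrt{K^{-2}\sum_{i=1}^{K} v(B_i)}
\;=\; f_{\max}\sqrt{K^{-2} \cdot K\,v}
\;=\; f_{\max}\sqrt{\frac{v}{K}}.
$$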
False discovery rate (FDR) is a cornerstone of modern multiple testing. However, it often fails to guarantee the reliability of "marginal" discoveries that lie at the boundary of the rejection set, which are often crucial in high-precision applications. While recent works (Soloff et al., 2024; Xiang et al., 2025) introduced the boundary false discovery rate (bFDR) to control the error probability at the marginal discovery, their method relies on restrictive assumptions such as independence or specific prior distributions. In this paper, we first propose $k$-bFDR, a novel generalization that controls the error probability of the $k$ least significant discoveries. We then provide a systematic investigation into the theoretical relationship between $k$-bFDR and existing error metrics. Furthermore, building upon the closure principle, we develop Domino, a unified framework that guarantees $k$-bFDR control under arbitrary dependence, applicable for both p-values and e-values. We prove the theoretical validity of the proposed Domino algorithm and demonstrate through extensive numerical experiments that it consistently achieves rigorous $k$-bFDR control while identifying trustworthy marginal discoveries. Analyses of real data reveal that $k$-bFDR control yields higher-quality rejection sets with greater practical significance.
Multicalibration requires predicted scores to agree with label probabilities across rich families of subgroups and score-dependent tests, but existing methods require clean input-label pairs for evaluation and post-processing. This assumption fails in weakly supervised learning (WSL) regimes -- including positive-unlabeled, unlabeled-unlabeled, and positive-confidence learning -- where clean labels are costly or unavailable even though reliable uncertainty estimates may be crucial. We address this gap by developing estimators of multicalibration error and post-hoc correction methods for WSL settings in which clean input-label pairs are unavailable. We propose a unified framework for estimating and correcting multicalibration under weak supervision by combining contamination-matrix risk rewrites with witness-based calibration constraints, yielding corrected multicalibration moments with finite-sample guarantees. We further propose weak-label multicalibration boost (WLMC), a generic post-hoc recalibration algorithm under weak supervision. Finally, we conduct experiments across multiple weak-supervision settings to evaluate multicalibration behavior and offer empirical insight into uncertainty estimation under weak supervision.
Methods that rely on proxies, without imposing strong parametric structure, are increasingly used to deal with unobserved variables in causal inference. One influential line of this work reconstructs latent distributions used to identify the target functional by exploiting eigenvalue-eigenvector structure. Within this framework, we first establish identification of the full data law in the presence of hidden outcomes, and then develop influence function based estimators for causal effects. To the best of our knowledge, this is the first work to develop influence function based estimators in this setting without relying on unbiased proxy measurements or partial observation, while achieving multiple robustness and desirable efficiency properties. We demonstrate the performance of our approach through simulation studies.
Modern predictive systems encode beliefs that can act as useful prior information for statistical inference in data-limited settings. Using them for prior construction introduces a tradeoff: an informative prior built from a predictive model can sharpen inference from limited data, but also risks propagating error from the model into the posterior. We propose a framework for AI-informed prior elicitation that mitigates this tension by rectifying the AI-induced law that generates synthetic data before using it to inform a prior. The rectified law can be embedded into synthetic data-driven prior elicitation techniques, including as a base measure in a Dirichlet process (DP) prior on the data-generating process. We refer to the resulting prior and corresponding posterior as the rectified AI prior and rectified AI posterior. We establish Gaussian asymptotics for the rectified AI posterior under non-vanishing prior strength and derive a first-order expression for its centering bias. Our rectified AI priors substantially reduce bias compared to standard approaches, improve the coverage of credible intervals, and make AI-powered prior information more reliable. We additionally apply the rectified AI prior to a real skin disease classification task and show that it can meaningfully boost predictive performance.
Understanding effect modification -- how treatment effects vary across subpopulations -- is practically important in observational studies, as it helps identify which subgroups are likely to benefit from a given treatment. In this paper, we study the discovery of effect modification in matched observational studies, where each treated unit may be matched to multiple controls. We develop a finite-sample valid procedure for identifying and selecting covariate-interpretable subgroups, with exact control of the subgroup-level false discovery rate (FDR). Our method explicitly accounts for unmeasured confounding via sensitivity models, and leverages multiple matched controls to improve statistical power. We demonstrate the favorable performance of our method relative to baseline methods through extensive simulation studies and a real-world application to the economic returns to college education.
Many systems in physics, engineering, and biology exhibit multiscale stochastic dynamics, where low-dimensional slow variables evolve under the influence of high-dimensional fast processes. In practice, observations are often limited to a single trajectory of the slow component, while the fast dynamics remain unobserved, making statistical learning challenging. Approaches based on partial differential equations (PDE), such as Fokker-Planck formulations, aim to characterize the evolution of probability densities, typically requiring dense space-time data or grid-based solvers. In contrast, we adopt a trajectory-based perspective and develop a data-driven framework for learning effective stochastic dynamics from a single observed path. We model the dynamics by coupled multiscale stochastic differential equations (SDEs) and first obtain a principled model reduction through stochastic averaging. Unlike generic model reduction techniques such as PCA, this respects the dynamical structure of the original system and explicitly incorporates the interaction between slow and fast scales. A central challenge, however, is that the reduced model depends on the invariant distribution of the fast process, which is a solution to an intractable and often unknown PDE. We introduce a novel learning framework that parameterizes the invariant distribution using normalizing flows, enabling expressive density modeling in the latent fast-variable space. The flow is trained end-to-end by optimizing a penalized likelihood objective induced by the reduced stochastic dynamics. Furthermore, we develop a Bayesian variational inference procedure for uncertainty quantification, employing a second normalizing flow to approximate the posterior distribution over model parameters. This yields a scalable approach to capturing epistemic uncertainty in multiscale systems.
Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilistic evaluation from a labeled calibration set. Holding the aggregation and calibration procedures fixed, we compare accuracy-ranked top-$k$ judge selection with using the full judge panel. Across four labeled pairwise-evaluation benchmarks spanning LLM-as-judge and reward-model settings, the calibrated full panel consistently outperforms accuracy-based selection. On RewardBench2, retaining all judges achieves negative log-likelihood (NLL) of $0.006$ versus $0.013$ under top-5 selection, halving the calibration error. This advantage persists after judge-family deduplication and against stronger same-pipeline subset search. We explain this reversal with oracle analyses showing that the optimal calibrated risk under proper scoring rules cannot increase when additional judge signals are made available, and that even below-chance judges can be useful when their biases are learnable and their signals are non-redundant. The resulting operating principle is simple: in multi-judge evaluation with labeled calibration data, do not discard weak judges by accuracy alone; keep them when they are parseable, non-redundant, and calibratable.
Although spatial models for areal data are widely used in multilevel settings, the conditions under which spatial and nonspatial random effects yield equivalent posterior inference for regression coefficients have never been formally characterized. We address this question within a hierarchical Bayesian framework for Gaussian outcomes, using the Leroux conditional autoregressive (CAR) prior distribution as a representative specification. We derive a closed-form sample size threshold, $m^*$, below which spatial modeling materially affects inference on regression coefficients and above which a simpler nonspatial model yields effectively equivalent results, and show that the absolute relative difference in posterior variances converges to zero at rate $O(m^{-1})$. The threshold depends on three interpretable quantities: the spatial correlation parameter, the ratio of between-area to within-area variance, and the alignment between the covariate and dominant spatial patterns in the data. Because each can often be estimated prior to model fitting, $m^*$ can serve as a practical study design tool. Simulation studies confirm that $m^*$ accurately identifies this threshold across a range of settings. However, when the covariate does not vary within a given location, spatial modeling remains necessary regardless of within-area sample size. These results offer formal guidance for practitioners deciding whether the added complexity of spatial modeling is warranted.
Sampling from score-based diffusion models incurs bias due to both time discretisation and the approximation of the score function. A common strategy for reducing this bias is to apply corrector steps based on the unadjusted Langevin algorithm (ULA) at each noise level within a predictor-corrector framework. However, ULA is itself a biased sampler, as it discretises a continuous diffusion process. In this work, we consider adjusted Langevin correctors that employ Metropolis--Hastings (MH) or Barker's accept-reject steps to correct for this bias. Since the target density ratio typically required by MH-based algorithms is unavailable, we propose methods that instead utilise the score function to compute the correct acceptance probability. We introduce the first exact method for adjusting Langevin corrections in diffusion models, based on a two-coin Bernoulli factory algorithm. We also propose an efficient approximation based on Simpson's rule that achieves accuracy of order $5/2$ in the step size at near-zero marginal cost. We demonstrate that these procedures improve sample quality on both synthetic and image datasets, yielding consistent gains in Fr\'echet Inception Distance (FID) on the latter.
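As a simplified illustration of computing an acceptance probability from the score alone, the sketch below runs a plain Metropolis-adjusted Langevin step on a Gaussian toy target and approximates the log-density difference with Simpson's rule along the proposal segment; the paper's correctors, step-size schedules, and Bernoulli-factory construction are more involved than this.

```python
import numpy as np

rng = np.random.default_rng(2)
score = lambda x: -x                        # score of a standard normal target (toy example)
h, x, samples = 0.5, np.zeros(2), []
for _ in range(5000):
    y = x + 0.5 * h * score(x) + np.sqrt(h) * rng.normal(size=2)   # Langevin proposal
    delta = y - x
    # log pi(y) - log pi(x) from the score only, via Simpson's rule on the segment x -> y
    log_target_ratio = (score(x) + 4.0 * score(0.5 * (x + y)) + score(y)) @ delta / 6.0
    # log q(x | y) - log q(y | x) for the Langevin proposal, again using only the score
    log_q_ratio = (np.sum((y - x - 0.5 * h * score(x)) ** 2)
                   - np.sum((x - y - 0.5 * h * score(y)) ** 2)) / (2.0 * h)
    if np.log(rng.uniform()) < log_target_ratio + log_q_ratio:     # adjusted accept/reject
        x = y
    samples.append(x)
samples = np.array(samples)
print(samples.mean(axis=0), samples.var(axis=0))                   # roughly 0 and 1
```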
Marked point process data arise when events occur in a space with event-level marks. We study clustering of replicated marked Poisson point processes and introduce Dirichlet process mixtures of marked Poisson point processes, a Bayesian nonparametric model that jointly infers latent cluster structure, the number of clusters, and continuous mark-specific intensity surfaces. We use a squared link intensity representation to obtain tractable continuous domain likelihood terms without gridding or thinning. For posterior inference, we develop an efficient variational Bayes algorithm with a constrained Laplace approximation for the nonconjugate basis-coefficient block. The resulting coefficient update is formulated as a constrained optimization problem, which avoids the sign ambiguity and nodal-line issue of squared-link models. We further establish theoretical guarantees for mode finding optimization. We demonstrate the performance of the proposed model and algorithm through synthetic experiments and real-data analysis.
When testing a number of statistical hypotheses using data from location families, it is often useful to control the false discovery rate (FDR) not just for hypotheses specifying the null values but also for those specifying other parameter values that are deemed practically insignificant. Here we consider the FDR as a curve indexed by the location parameter and suggest a simple generalization of the Benjamini-Hochberg procedure that controls the FDR curve below any user-specified level. As a corollary of our main result, we show that the standard Benjamini-Hochberg procedure -- designed to control the FDR at the null -- also provides simultaneous control of the whole FDR curve for free. We further demonstrate the implications of our results and some practical considerations with a numerical example.
The problem of predicting unobserved entries in a binary matrix, known as 1-bit matrix completion, has found diverse applications in fields such as recommendation systems. In this study, we develop an empirical Bayes method for 1-bit matrix completion motivated by the Efron--Morris estimator, a matrix generalization of the James--Stein estimator that shrinks singular values toward zero. The proposed method exploits the underlying low-rank structure of binary matrices, drawing parallels with multidimensional item response theory. Simulation studies and real-data applications demonstrate that the proposed method achieves a superior balance of predictive accuracy, calibration reliability (uncertainty quantification), and computational efficiency compared to existing methods.
In Bayesian phylogenetics, our goal is to estimate the posterior distribution over phylogenetic trees. Markov chain Monte Carlo methods are widely used to approximate the phylogenetic posterior distributions. For large-scale sequence data, repeated evaluation of the likelihood function incurs a high computational cost. In this article, we propose a machine-learning algorithm with over 35 topological and branch-length features to predict the changes in the likelihood function caused by tree moves (e.g., eSPR, stNNI) used in standard MCMC approaches. This algorithm is then used to design a delayed acceptance MCMC kernel, which uses the predicted surrogate function for preliminary rejection, to accelerate tree space searches. Furthermore, we integrate our proposed MCMC kernel into the sequential Monte Carlo sampler framework. We validate the proposed delayed-acceptance sequential Monte Carlo approach (DA-SMC) on simulated and real data sets. Our delayed acceptance kernel maintains robust estimation while significantly reducing the number of likelihood evaluations, yielding substantial computational time savings. We develop a Python package that is available at https://github.com/wentYu/DAphyloSMC.
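A generic delayed-acceptance Metropolis-Hastings sketch (in the style of Christen and Fox) illustrates the two-stage kernel design; it is not the phylogenetic tree-move kernel, and the cheap surrogate here is hand-written rather than the learned predictor of likelihood changes.

```python
import numpy as np

rng = np.random.default_rng(3)
log_target = lambda x: -0.5 * x ** 2 - 0.1 * x ** 4     # "expensive" exact log-density
log_surrogate = lambda x: -0.5 * x ** 2                  # cheap surrogate approximation

x, exact_evals, samples = 0.0, 1, []
log_pi_x = log_target(x)
for _ in range(20000):
    y = x + rng.normal(0.0, 1.0)                          # symmetric random-walk proposal
    # Stage 1: preliminary accept/reject using only the surrogate.
    if np.log(rng.uniform()) < log_surrogate(y) - log_surrogate(x):
        # Stage 2: correct with the exact density, so the chain still targets log_target.
        log_pi_y = log_target(y)
        exact_evals += 1
        if np.log(rng.uniform()) < (log_pi_y - log_pi_x) - (log_surrogate(y) - log_surrogate(x)):
            x, log_pi_x = y, log_pi_y
    samples.append(x)
print("exact log-density evaluations:", exact_evals, "for", 20000, "proposals;",
      "sample mean:", round(float(np.mean(samples)), 3))
```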
Causal mediation analysis has been extended to estimate path-specific effects with multiple intermediate variables, isolating treatment effects through a mediator of interest while excluding pathways through its ancestors. Such analyses address bias from recanting witnesses, i.e., treatment-induced mediator-outcome confounders. However, existing methods typically rely on stringent assumptions precluding general unmeasured confounding, which are often violated in practice. In this paper, we relax these restrictions by leveraging observed covariates as proxy variables to accommodate unmeasured confounding among the treatment, recanting witness, mediator, and outcome. Using proximal confounding bridge functions, we develop four nonparametric identification strategies for the path-specific effect. We further derive the efficient influence function and propose a quadruply robust, locally efficient estimator. To handle high-dimensional nuisance parameters, we propose a proximal debiased machine learning approach. We theoretically guarantee that our estimator achieves $\sqrt{n}$-consistency and asymptotic normality even when machine learning estimators for nuisance functions converge at slower rates. Our approaches are validated via semiparametric and nonparametric simulations and an application to the CDC WONDER Natality study, estimating the path-specific effect of prenatal care on preterm birth through preeclampsia, independent of maternal smoking during pregnancy.
Stein Variational Gradient Descent (SVGD) is a deterministic interacting-particle method for sampling from a target probability measure given access to its score function. In the mean-field and continuous-time limit, it is known that the flow converges weakly toward the target, but no quantitative rate is known for the last iterate. In this paper, we establish quantitative local convergence in strong norms for this dynamics, when the interaction kernel is of Riesz type on the $d$-dimensional torus. Specifically, assuming that the initial density and the target are smooth and close in $L^2$-norm, we obtain explicit polynomial convergence rates in $L^2$-norm that depend on the dimension and on the regularity parameters of the kernel, the initialization and the target. We further show that these rates are sharp in certain regimes, and support the theory with numerical experiments. In the edge case of kernels with a Coulomb singularity, we recover the global exponential convergence result established in prior work. Our analysis is inspired by recent results on Wasserstein gradient flows of kernel mean discrepancies.
We study the $\textit{single-index bandit}$ problem, where rewards depend on an unknown one-dimensional projection of high-dimensional contexts through an unknown reward function. This model extends linear and generalized linear bandits to a nonparametric setting, and is particularly relevant when the reward function is not known in advance. While optimal regret guarantees are known for monotone reward functions, the general non-monotone case remains poorly understood, with the best known bound being $\tilde{\mathcal{O}}(T^{3/4})$ (under standard boundedness and Lipschitz assumptions on the reward function [Kang et al., 2025]).
We close this gap by establishing the optimal regret for general single-index bandits. We propose a simple two-phase algorithm, namely, Zoomed Single Index Bandit with Upper Confidence Bound ($\texttt{ZoomSIB-UCB}$), that first estimates the projection direction via a normalized Stein estimator, then reduces the problem to a one-dimensional bandit via discretization, and finally applies UCB. This approach achieves a regret of $\tilde{\mathcal{O}}(T^{2/3})$ and improves significantly upon prior work without any additional assumptions. We also prove a matching minimax lower bound of $\tilde{\Omega}(T^{2/3})$, showing that the upper bound is essentially tight. Our upper and lower bounds together provide a sharp characterization of the regret in single-index bandits. Moreover, the empirical results further demonstrate the effectiveness and robustness of our approach.