pith. machine review for the scientific record.

stat.CO

Computation

Algorithms, Simulation, Visualization

stat.CO 2026-05-12 2 theorems

Writer monads automate MCMC kernel composition

gemlib.mcmc: composable kernels for Metropolis-within-Gibbs sampling schemes

Researchers chain parameter-estimation and data-augmentation steps for epidemic models with less manual coding while keeping statistical rigour.

Abstract:
State-transition models are essential across epidemiology and ecology, but statistical inference remains challenging owing to high-dimensional latent state spaces, temporal dependence, and intractable likelihood functions. Bayesian inference via Markov Chain Monte Carlo (MCMC) enables joint estimation of model parameters and missing event times through data augmentation, but Metropolis-within-Gibbs (MWG) schemes that combine multiple specialised kernels are notoriously difficult to implement. Current probabilistic programming frameworks face a trade-off: automation sacrifices extensibility, whilst flexibility demands substantial implementation overhead. This divide has created a software landscape characterised by tightly coupled, model-specific implementations that resist reuse and extension. We introduce gemlib.mcmc, an MCMC module designed to bridge methodological and applied communities through principled, composable kernel abstractions. The framework employs writer monads from category theory to formalise kernel composition, enabling seamless integration of parameter-estimation and data-augmentation kernels without manual state management. Built on JAX and TensorFlow Probability for high-performance computation, gemlib.mcmc provides an ergonomic interface -- leveraging Python's right-shift operator for intuitive kernel chaining -- whilst maintaining statistical rigour and transparency. Developers can extend the library by implementing only two methods; composition and hardware acceleration are automated. We demonstrate the framework through parameter inference on partially observed epidemic models, showing how complex inference algorithms can be expressed concisely and reused across applications. By reducing implementation burden we provide access to sophisticated MCMC methods and enable applied researchers to employ state-of-the-art algorithms without reimplementation overhead.
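
The composition pattern described here can be made concrete with a toy sketch. The Kernel class and >> overload below are illustrative only, not gemlib.mcmc's actual API; they show how writer-monad-style composition threads the chain state through two kernels while accumulating their diagnostic logs.

```python
# Illustrative sketch of writer-monad-style kernel composition, in the spirit
# of the abstract; the Kernel class and >> operator here are hypothetical and
# are NOT gemlib.mcmc's actual API.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Kernel:
    # step: (state, seed) -> (new_state, info); info is the "writer" log.
    step: Callable[[Any, int], tuple]

    def __rshift__(self, other: "Kernel") -> "Kernel":
        # Writer-monad composition: thread the state through both kernels
        # and accumulate their diagnostic logs into one tuple.
        def composed(state, seed):
            state, info1 = self.step(state, seed)
            state, info2 = other.step(state, seed + 1)
            return state, (*info1, *info2)
        return Kernel(composed)

# Two toy kernels: one updates parameters, one augments latent data.
update_params = Kernel(lambda s, r: ({**s, "theta": s["theta"] + 0.1}, ("params-ok",)))
augment_data = Kernel(lambda s, r: ({**s, "z": r}, ("aug-ok",)))

mwg = update_params >> augment_data          # one Metropolis-within-Gibbs sweep
state, log = mwg.step({"theta": 0.0, "z": None}, seed=42)
print(state, log)
```
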
stat.CO 2026-05-08

QUBO reformulation finds higher-quality splits for regression trees

QUBO-Based Calibration for Regression Trees

Trees match standard CART accuracy yet use superior splits on categorical predictors by converting the fractional selection problem into QUBO form.

Abstract:
Tree-based regression models are widely used in supervised learning, with the Classification and Regression Tree (CART) algorithm serving as a standard reference. CART construction involves solving a sequence of split-selection optimization problems. For categorical predictors, this problem can be formulated as a combinatorial fractional optimization problem. This structure makes the exact optimization computationally challenging and leads to standard implementations that rely on greedy heuristics, which may result in suboptimal splits. In this work, we reformulate this fractional problem and apply Dinkelbach's (1967) algorithm to convert it into a Quadratic Unconstrained Binary Optimization (QUBO) problem. Using state-of-the-art QUBO solvers, we obtain QUBO-based regression trees with predictive performance comparable to standard CART while yielding higher-quality split solutions. These results highlight the potential of QUBO formulations for improving tree-based learning methods and open perspectives for future hybrid classical--quantum implementations.
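
To make the reformulation step concrete: Dinkelbach's algorithm solves max f(x)/g(x) by iterating on the parametric problem max f(x) - λ g(x) and updating λ to the current ratio. A minimal sketch, with brute-force enumeration standing in for the QUBO solver and arbitrary quadratic f and g of our own choosing:

```python
# Minimal sketch of Dinkelbach's algorithm for max f(x)/g(x) over binary x.
# Brute-force enumeration stands in for the QUBO solver used in the paper;
# f and g below are arbitrary quadratic forms chosen for illustration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.normal(size=(n, n))
f = lambda x: float(x @ (A + A.T) @ x)            # numerator (quadratic)
g = lambda x: 1.0 + float(x.sum())                # denominator (always > 0)

lam = 0.0
for _ in range(50):
    # Parametric subproblem: maximise f(x) - lam * g(x); this is the step
    # that becomes a QUBO when f and g are quadratic in the binary x.
    best = max((np.array(x) for x in itertools.product([0, 1], repeat=n)),
               key=lambda x: f(x) - lam * g(x))
    if abs(f(best) - lam * g(best)) < 1e-12:      # converged: lam = optimal ratio
        break
    lam = f(best) / g(best)
print("optimal ratio:", lam, "argmax:", best)
```
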
stat.CO 2026-05-06 2 theorems

Power can fall when adding more permutations to Monte Carlo tests

More Permutations Do Not Always Increase Power: Non-monotonicity in Monte Carlo Permutation Tests

Discreteness of the permutation distribution produces saw-tooth patterns where extra samples sometimes reduce detection ability.

Abstract:
Monte Carlo permutation tests are a cornerstone of valid, model-free statistical inference. A widely held practical intuition is that increasing the number of sampled permutations improves test performance, in particular that statistical power tends to increase with the Monte Carlo budget. In this paper, we show that these intuitions are false in general. Leveraging the saw-toothed structure of power arising from distributional discreteness, we provide a simple structural explanation for why power can decrease as the number of sampled permutations increases, and we prove that such decreases occur infinitely often as the Monte Carlo budget grows.
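
The saw-tooth is easy to reproduce. Under the simplifying assumption (ours, for illustration) that each permuted statistic independently exceeds the observed one with probability q, the Monte Carlo p-value (1 + count)/(B + 1) rejects at level α exactly when a Binomial(B, q) count is at most ⌊(B+1)α⌋ - 1, so power can be tabulated as a function of B:

```python
# Exact power of a Monte Carlo permutation test as a function of the number of
# sampled permutations B, under the simplifying (illustrative) assumption that
# each permuted statistic exceeds the observed one independently with prob. q.
from scipy.stats import binom

alpha, q = 0.05, 0.02          # q < alpha: the alternative is detectable
prev = 0.0
for B in range(50, 400):
    k = int((B + 1) * alpha) - 1               # largest count still rejected
    power = binom.cdf(k, B, q) if k >= 0 else 0.0
    if power < prev:                           # extra permutations hurt here
        print(f"B={B}: power drops to {power:.4f}")
    prev = power
```
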
stat.CO 2026-05-06

Covariance decomposition scales multi-fidelity spatio-temporal GPs

A new framework for non-stationary spatio-temporal data fusion of multi-fidelity models

A split of the joint covariance lets Vecchia approximations evaluate likelihoods for fusing high- and low-fidelity data without building the full multi-fidelity covariance matrix.

Abstract:
We propose a new scalable framework for spatio-temporal data fusion with multi-fidelity Gaussian processes (MFGPs) that enables fully likelihood-based inference for both stationary and non-stationary fidelity integration. The framework is designed for environmental applications, where abundant but noisy low-fidelity data (e.g., satellite or reanalysis products) must be fused with sparse yet accurate high-fidelity in-situ observations to obtain high-resolution reconstructions. Our key methodological contribution is a decomposed multi-fidelity covariance formulation that allows the Vecchia approximation to be applied directly to the latent low-fidelity and discrepancy processes. Combined with a Woodbury-based reconstruction, this yields a numerically stable and computationally efficient evaluation of the joint marginal likelihood without ever forming the full multi-fidelity covariance matrix. In addition, we introduce a generalized least squares (GLS) mean-removal strategy with fidelity-specific offsets, preventing systematic biases from being absorbed into cross-fidelity dependence. We validate the proposed approach through extensive experiments on synthetic data and a large-scale real-world application to wind speed reconstruction in the Lombardy region of Italy. The results show that the proposed Vecchia-based MFGP closely matches exact multi-fidelity inference in controlled settings, while substantially outperforming standard single-fidelity spatio-temporal Gaussian processes in terms of predictive accuracy, correlation, and representation of local variability in realistic large-data scenarios.
stat.CO 2026-05-06

Bayesian recursions track if a process is currently in control

Sequential Bayesian Monitoring for Recoverable and Drifting Processes

New updates give the live probability of acceptable operation for recoverable and drifting processes, even after signals or fixes.

Abstract:
In many Phase II statistical process control (SPC) problems, the main concern is not whether a monitored process has ever changed, but whether it is currently operating at an acceptable level. This distinction is especially important when monitoring continues after a signal, or when corrective action may restore the process. We develop Bayesian monitoring procedures for this formulation of the Phase II task. For recoverable processes that may alternate between in-control and out-of-control states, we derive recursions for the posterior probability that the process is presently in control. For sequential tracking problems in which a latent parameter evolves over time, we monitor the posterior probability that the parameter lies inside an acceptable region of behavior. The methods are studied through calibrated time-between-failure experiments, Gaussian and Binomial tracking examples, and a held-out multivariate data illustration using white wine quality measurements.
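
For intuition, a "currently in control" probability can be tracked with a generic two-state hidden-Markov forward filter. The paper derives recursions tailored to its recoverable-process models, so the transition matrix and emission densities below are purely illustrative:

```python
# Illustrative two-state recursion for P(process currently in control | data);
# a generic hidden-Markov forward filter, not the paper's specific derivation.
import numpy as np
from scipy.stats import norm

P = np.array([[0.98, 0.02],    # rows: (in control, out of control)
              [0.10, 0.90]])   # 0.10 = chance a fix restores control
like = lambda x: np.array([norm.pdf(x, 0, 1),    # in-control emission
                           norm.pdf(x, 2, 1)])   # out-of-control emission

p = np.array([0.99, 0.01])                       # prior state probabilities
for x in [0.1, -0.4, 2.3, 2.8, 1.9, 0.2]:        # toy observations
    p = (P.T @ p) * like(x)                      # predict, then update
    p /= p.sum()
    print(f"P(in control | data so far) = {p[0]:.3f}")
```
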
stat.CO 2026-05-05

Few linear contrasts recover exact GP conditionals

Fast and accurate conditioning for large-scale and online Gaussian process prediction problems

For smooth kernels, a small set of designed data combinations matches full conditional distributions to machine precision while enabling O(1) online prediction.

Abstract:
Gaussian Process (GP) models provide a flexible framework for prediction and uncertainty quantification. For most covariance functions, however, exact GP prediction with $n$ points scales as $\mathcal{O}(n^3)$, making it prohibitively expensive for large datasets or large numbers of prediction points. While nearest neighbor-based prediction can work well in certain settings, non-pathological circumstances (for example measurement noise) can severely restrict its efficiency. This work presents a complementary approach where one conditions on carefully designed linear combinations of data, which is particularly effective in the setting of predicting many values in large connected regions of the data domain. For kernel functions that are smooth away from the origin, conditioning on a small number $r$ of such data contrasts can be machine-precision accurate for the full exact conditional distributions. These contrasts cost $\mathcal{O}(T r^2)$ work to compute where $T$ is the cost of solving a linear system with the data covariance matrix, and so in many cases can be computed in linear or near-linear cost by exploiting rank structure in well-behaved covariance matrices. At the cost of $\mathcal{O}(nr^2)$ additional precomputation work, this approach can also provide predictions at arbitrary points of a designated region in $\mathcal{O}(1)$ online work, making it particularly attractive for problems where prediction points are not known in advance.
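
The underlying linear algebra is the standard formula for conditioning a Gaussian vector on linear functionals A y of the data rather than on y itself. The paper's contribution is designing A so that a small r is near-exact; the random A in this sketch is only a placeholder for those designed contrasts:

```python
# Conditioning a GP prediction on r linear contrasts A @ y instead of on the
# full data y. The random A below is a placeholder; the paper designs the
# contrasts so that small r reproduces the exact conditional to high accuracy.
import numpy as np

rng = np.random.default_rng(1)
n, r = 500, 30
X = np.sort(rng.uniform(0, 10, n))
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-6 * np.eye(n)
y = np.linalg.cholesky(K) @ rng.normal(size=n)

ks = np.exp(-0.5 * (5.0 - X) ** 2)       # cross-covariance k(x*, X) at x* = 5

A = rng.normal(size=(r, n))              # placeholder contrasts
S = A @ K @ A.T                          # covariance of the contrasts, r x r
mean_r = ks @ A.T @ np.linalg.solve(S, A @ y)    # conditional mean given A @ y
var_r = 1.0 - ks @ A.T @ np.linalg.solve(S, A @ ks)

mean_full = ks @ np.linalg.solve(K, y)   # exact conditional mean, O(n^3)
print(f"contrast-based mean: {mean_r:.4f}  exact mean: {mean_full:.4f}")
```
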
stat.CO 2026-05-04

R package fits Dirichlet process models without custom MCMC code

dirichletprocess: An R Package for Fitting Complex Bayesian Nonparametric Models

Users apply pre-built or custom models for density estimation and clustering while the software manages the sampling.

Abstract:
The dirichletprocess package provides software for creating flexible Dirichlet process objects. Users can perform nonparametric Bayesian analysis using Dirichlet processes without the need to program their own inference algorithms. Instead, the user can utilise our pre-built models or specify their own models whilst allowing the dirichletprocess package to handle the Markov chain Monte Carlo sampling. Our Dirichlet process objects can act as building blocks for a variety of statistical models including density estimation, clustering, and prior distributions in hierarchical models.
stat.CO 2026-05-04

Uniform generators speed Pearson IV sampling for all shapes

The Pearson IV distribution: Random variate generation and applications

New methods generate random numbers from this distribution at constant speed regardless of shape parameters and support Bayesian models.

Abstract:
We develop uniformly fast random variate generators for the Pearson IV distribution that can be used over the entire range of both shape parameters and highlight some applications in a Bayesian setting.
stat.CO 2026-05-04

Parallel subset chains boost MCMC sampling for multimodal targets

Modular Markov chain Monte Carlo with application to multimodal sampling

Modular weighting by transition probabilities combines estimates while handling modes of different scales.

Abstract:
We develop a modular approach to Markov chain Monte Carlo (MCMC) sampling for unnormalized target densities. In this approach, Markov chains are constructed in parallel, each constrained to a subset of the target space. The Monte Carlo estimates from the constrained chains are then combined with appropriate weights, calculated from the transition probabilities between subsets. In addition to the computational advantages arising from its parallelized structure, this modular MCMC approach enables variance reduction for Monte Carlo estimation in settings where sampling from low-density regions is required. We develop a central limit theorem-type result for the resulting Monte Carlo estimates and propose a method for estimating their standard errors. Furthermore, by applying this modular sampling technique to simulated tempering, we propose a method for Monte Carlo estimation of expectations with respect to multimodal target distributions. This approach effectively addresses a well-known challenge of tempering-based methods: sampling efficiency can be greatly reduced when separated modes of the target distribution have different scales. We demonstrate the efficiency of the proposed methods through numerical examples, including one arising from Bayesian sparse regression with a spike-and-slab prior.
stat.CO 2026-05-01

Three streaming covariance algorithms agree exactly in exact arithmetic

2B or Not 2B: A Tale of Three Algorithms for Streaming Covariance Estimation after Welford and Chan-Golub-LeVeque

Gram favors batch speed, Welford resists mean shifts, CGL enables merging, and conformal sets give valid entrywise intervals.

Abstract:
We place three algorithms for computing the unbiased sample covariance matrix in streaming and distributed settings on a common algebraic, numerical, and statistical foundation. The Gram algorithm, derived from the variance reformulation, maintains the running cross-product matrix $G_t = \sum_{i=1}^t x_i x_i^\top$ and the column-sum vector $s_t = \sum_{i=1}^t x_i$, yielding the unbiased covariance estimator $S_t = (t-1)^{-1}(G_t - t^{-1}s_t s_t^\top)$ in $O(p^2)$ time per update. The Welford algorithm propagates a running mean $m_t$ and outer-product corrections $M_t$, with updates $m_t = m_{t-1} + (x_t - m_{t-1})/t$ and $M_t = M_{t-1} + (x_t - m_{t-1})(x_t - m_t)^\top$, achieving the same asymptotic cost with improved numerical stability under large data shifts. The Chan-Golub-LeVeque algorithm supports block-parallel merging through the exact identity $M = M_A + M_B + \frac{n_A n_B}{n_A+n_B}(m_B - m_A)(m_B - m_A)^\top$, making it the natural choice for distributed and map-reduce architectures. All three algorithms produce the same estimator $S_t = M_t/(t-1)$ in exact arithmetic, although their finite-precision behavior differs markedly. Beyond runtime and numerical comparisons, we introduce a conformal prediction framework for streaming covariance estimation that yields finite-sample, distribution-free confidence sets $C_{t,jk}$ for each entry $S_{t,jk}$ of the covariance matrix at any step $t$ of the data stream. Experiments confirm that the Gram algorithm is fastest for batch computation, Welford is uniquely robust to catastrophic cancellation under large mean shifts, CGL is optimal for distributed settings, and conformal intervals achieve the nominal coverage level across all three algorithms.
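
The update rules quoted in the abstract can be checked directly against numpy; a minimal sketch of the Welford recursion and the Chan-Golub-LeVeque (CGL) merge identity:

```python
# Welford covariance updates and the CGL merge identity from the abstract,
# verified against numpy's batch covariance.
import numpy as np

def welford(X):
    m = np.zeros(X.shape[1])
    M = np.zeros((X.shape[1],) * 2)
    for t, x in enumerate(X, start=1):
        d = x - m
        m += d / t                     # m_t = m_{t-1} + (x_t - m_{t-1}) / t
        M += np.outer(d, x - m)        # M_t = M_{t-1} + (x_t - m_{t-1})(x_t - m_t)^T
    return m, M

X = np.random.default_rng(0).normal(size=(1000, 4))
mA, MA = welford(X[:600])
mB, MB = welford(X[600:])
nA, nB = 600, 400
# CGL exact merge: M = M_A + M_B + nA*nB/(nA+nB) * (mB - mA)(mB - mA)^T
M = MA + MB + nA * nB / (nA + nB) * np.outer(mB - mA, mB - mA)
S = M / (len(X) - 1)                   # unbiased covariance estimator
print(np.allclose(S, np.cov(X, rowvar=False)))   # True
```
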
stat.CO 2026-05-01

R packages unify forecast reconciliation across three frameworks

FoReco and FoRecoML: A Unified Toolbox for Forecast Reconciliation in R

FoReco and FoRecoML implement linear and machine-learning methods for cross-sectional, temporal, and cross-temporal settings in one toolbox.

Abstract:
Forecast reconciliation has become key to improving the accuracy and coherence of forecasts for linearly constrained multiple time series, such as hierarchical and grouped series. Yet, comprehensive software that jointly covers cross-sectional, temporal, and cross-temporal reconciliation has so far been lacking. The R packages FoReco and FoRecoML address this gap by offering a comprehensive and unified framework. The packages respectively implement classical and regression-based linear reconciliation approaches, and non-linear approaches based on machine learning for cross-sectional, temporal and cross-temporal frameworks. Designed for accessibility and flexibility, these packages provide sensible default options that allow new users to apply reconciliation methods with minimal effort, while still giving expert users full control to explore state-of-the-art extensions through customized settings. With this dual focus, FoReco and FoRecoML are versatile tools for practitioners and researchers working on forecast reconciliation.
stat.CO 2026-05-01

Bridge sampling approximates martingale posteriors with O(Δ) bias

Martingale Posteriors for Discretely Observed Diffusions

The method controls time-discretization error at order O(Δ) and delivers orders-of-magnitude faster inference than MCMC for diffusion parameters.

Abstract:
In this paper we consider parameter estimation for discretely observed diffusion processes. In particular, we focus on data that are observed at low frequency and methodology that can estimate parameters with uncertainty quantification. Most statistical work in this domain develops advanced Markov chain Monte Carlo (MCMC) algorithms for sampling from the posterior of the parameters, a task which is often complicated by the fact that one seldom has access to the transition density of the diffusion process; one has to employ sophisticated MCMC methods which are robust to the required time discretization of the diffusion, which can yield expensive algorithms. We focus on developing the martingale posterior method for the context of interest, when one can only numerically approximate the transition density of the diffusion. Using types of diffusion bridges, we introduce a new martingale posterior method for parameter estimation for discretely observed diffusion processes. We prove that this algorithm approximates, in some sense, the martingale posterior which has no time-discretization bias up to $\mathcal{O}(\Delta)$ if $\Delta$ is the time discretization step. Our approach is illustrated on several examples, showing orders of magnitude speed up versus state-of-the-art MCMC algorithms.
stat.CO 2026-04-29

Skew-Laplace cuts Dirichlet mixture posterior error by ~30% vs Laplace

Laplace and skew-Laplace approximations for Dirichlet process mixture posterior density

The skew correction improves recovery of complex densities while staying far faster than slice-sampling MCMC across simulations and real data.

Abstract:
Posterior inference for Dirichlet process mixture models is analytically intractable and typically relies on Markov chain Monte Carlo methods, which can become computationally prohibitive at moderate to large sample sizes. In this work, we investigate the performance of Laplace and skew-Laplace posterior approximations for density estimation in this setting. Through an extensive numerical study covering four simulation scenarios with sample sizes ranging from n = 20 to n = 2,000 and four standard real datasets, we compare the standard Laplace approximation, its skew-corrected extension, and a slice sampling benchmark, assessing accuracy through total variation distance and computational efficiency through runtime. Our results show that the Gaussian Laplace approximation is more effective in this setting than might be anticipated, and that the skew-Laplace approximation consistently improves posterior recovery while remaining substantially faster than state-of-the-art Markov chain Monte Carlo samplers across all settings considered. In particular, the use of skew-Laplace in place of the standard Laplace approximation is especially beneficial in more complex density structures, where we observe error reductions typically on the order of 30%.
stat.CO 2026-04-28

First-order bias bounds for stochastic gradient Langevin

Theoretical guarantees for stochastic gradient sampling methods via Gaussian convolution inequalities

New convolution inequalities show invariant measures accurate to stepsize order under weak noise assumptions.

Abstract:
We derive first-order (in the stepsize) bounds on the bias in Wasserstein distances of the invariant measure of stochastic gradient kinetic Langevin dynamics with minimal assumptions on the stochastic gradient noise. These bounds sharpen existing non-asymptotic guarantees for stochastic-gradient MCMC methods and provide a quantitative resolution of a previously open problem on invariant measure accuracy. The main technical ingredients are new Gaussian convolution inequalities controlling the Wasserstein-$p$ distance between a Gaussian convolved with a mean-zero perturbation and the Gaussian itself. We anticipate that these inequalities will be of independent interest beyond the present application.
stat.CO 2026-04-27

GPU workflow computes stats for 10 billion rows in one pass

Building a GPU-Accelerated Multivariate Statistics Platform

Column sums and cross-product matrices enable covariance and PCA without reloading the raw data

Abstract:
Classical multivariate statistical methods such as covariance estimation and principal component analysis are well understood mathematically, yet their application at extreme data scales remains challenging. When the number of observations reaches billions, performance is limited by data movement, input-output bottlenecks, and numerical stability rather than arithmetic complexity. This work presents a case study of scaling classical multivariate statistics on a single multi-GPU node. Using C++ and CUDA, a GPU-accelerated workflow was developed to compute sufficient statistics in a single pass over a 10-billion-row dataset. Column sums and cross-product matrices are used to enable downstream computation of means, covariance, correlation, and principal component analysis without revisiting the raw data. The results highlight the importance of data representation, validation using known invariants, and careful numerical treatment when applying established statistical methods at large scale.
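
The sufficient-statistics idea is straightforward to sketch on CPU, with numpy as a stand-in for the C++/CUDA workflow: accumulate the column-sum vector s and cross-product matrix G in one pass over chunks, then derive mean, covariance, and PCA from them alone:

```python
# Single-pass sufficient statistics (numpy stand-in for the GPU workflow):
# accumulate column sums s and cross-products G chunk by chunk, then derive
# mean, covariance and PCA without revisiting the raw rows.
import numpy as np

def sufficient_stats(chunks, p):
    s, G, n = np.zeros(p), np.zeros((p, p)), 0
    for X in chunks:                 # one pass over the data stream
        s += X.sum(axis=0)
        G += X.T @ X
        n += len(X)
    return s, G, n

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 5))
s, G, n = sufficient_stats(np.array_split(data, 10), p=5)

mean = s / n
cov = (G - np.outer(s, s) / n) / (n - 1)   # unbiased covariance from s and G
eigvals, eigvecs = np.linalg.eigh(cov)     # PCA from the covariance alone
print(np.allclose(cov, np.cov(data, rowvar=False)))   # True
```
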
stat.CO 2026-04-27

R package ragR matches Python RAGAS for RAG evaluation

ragR: Retrieval-Augmented Generation and RAG Assessment in R

It unifies document retrieval, generation, and metric scoring inside R for reproducible workflows.

Abstract:
Retrieval-augmented generation (RAG) combines document retrieval with large language models to produce responses grounded in external evidence. While several R packages support core components of RAG workflows, integrated evaluation of RAG systems in R remains limited and is often conducted through Python-based tools, most notably the RAG assessment (RAGAS) framework. To address this gap, we introduce ragR, an R package that unifies document ingestion, embedding and vector storage, similarity-based retrieval, grounded generation, structured question-answer logging, and RAGAS-style evaluation within a single R-native workflow. The current implementation provides LLM-based scoring for four core RAGAS metrics: context precision, context recall, faithfulness, and answer relevance. Validation experiments under controlled settings show that ragR captures similar metric behavior to the reference Python RAGAS workflow across multiple use cases. By integrating RAG construction and evaluation within a reproducible workflow in R, ragR provides a practical framework for research, teaching, and moderate-scale experimentation on RAG systems entirely within the R ecosystem.
stat.CO 2026-04-24

Constrained particle filters enforce compact state support

On a class of constrained particle filters for continuous-discrete state space models

Acting on the dynamics at observation times produces uniform error bounds and accounts for SDE solver errors.

Abstract:
Particle filters (PFs) are recursive Monte Carlo algorithms for Bayesian tracking and prediction in state space models. This paper addresses continuous-discrete filtering problems, where the hidden state evolves as an Itô stochastic differential equation (SDE) and observations arrive at discrete times. We propose a novel class of constrained PFs that enforce compact support on the state at each observation instant, thereby limiting exploration to plausible regions of the state space. Unlike earlier approaches that truncate the likelihood, the proposed method constrains the dynamics directly, yielding improved numerical stability. Under standard regularity assumptions, we prove convergence of the constrained filter, derive uniform-in-time error estimates, and extend the analysis to account for discretisation errors arising from numerical SDE solvers. A numerical study on a stochastic Lorenz-96 system demonstrates the practical application of the methodology when the constraint is implemented via barrier functions.
stat.CO 2026-04-22

Annealed Langevin Monte Carlo yields low-variance flow ODE estimates

Annealed Langevin Monte Carlo for Flow ODE Sampling

Jarzynski reweighting from annealed chains achieves O(1/n) error bound for sampling multimodal targets.

Abstract:
We propose Annealed Langevin Monte Carlo for Flow ODE Sampling (ALMC-ODE), a method for generating samples from unnormalized target distributions, with a particular emphasis on multimodal densities that are challenging for standard Markov chain Monte Carlo methods. ALMC-ODE is based on a probability-flow ordinary differential equation (ODE) derived from stochastic interpolants, which continuously transports a standard Gaussian reference distribution at $t = 0$ to the target distribution $\rho$ at $t = 1$. The key innovation lies in an annealed Langevin Markov chain that evolves through a sequence of intermediate distributions bridging the reference and the target. The resulting importance-weighted particles, reweighted via a Jarzynski-based scheme, yield a low-variance estimator of the velocity field governing the ODE. On the theoretical side, we establish a Jarzynski-type reweighting identity for general time-inhomogeneous transition kernels, characterize the optimal backward kernel that minimizes the variance of the importance weights, and prove an $\mathcal{O}(1/n)$ mean squared error bound for the resulting velocity-field estimator. Numerical experiments on challenging benchmarks, including Gaussian mixture models and a 64-dimensional Allen--Cahn field system, demonstrate that ALMC-ODE significantly outperforms both direct Monte Carlo ODE approaches and Hamiltonian Monte Carlo when applied to highly multimodal target distributions.
stat.CO 2026-04-22

Hybrid model forecasts steam generator clogging life from physics and data

Digital twin-based hybrid framework for steam generator clogging prognostics

The framework merges simulations with limited observations and uncertainty methods to support maintenance planning in nuclear plants.

Abstract:
We present a hybrid framework to support prognostics of the clogging degradation phenomenon in tube support plates for digital twins of steam generators in pressurized water reactors. The proposed approach combines a physics-based simulation code, heterogeneous and sparse observational data, and several uncertainty quantification techniques to obtain a robust estimate of the steam generator remaining useful life associated with the clogging rate. The proposed framework is compatible with a digital twin platform to assist maintenance planning of EDF steam generators.
stat.CO 2026-04-21

Simulations settle conflicting MANOVA error-rate reports

A simulation study to resolve conflicting evidence on the error rates from MANOVA group tests

A systematic evaluation finds the four standard tests maintain type I error rates near nominal levels under common conditions.

Abstract:
Popular software packages report four generalizations of the ANOVA F test when conducting a multivariate analysis of variance (MANOVA). The reported operating characteristics of these four tests vary widely depending on which research article the reader chooses. Some studies report extremely high type I error rates for a particular test even under ideal assumptions of multivariate normality and homoskedasticity; other studies report rates near the nominal level despite violations of the model assumptions. This simulation study seeks to clarify this apparent contradiction by providing a systematic evaluation of the type I error rates of the four statistics used to test for a group effect in MANOVA.
stat.CO 2026-04-20

Markov embedding shrinks state space of ranked trees for exact means

Markov embedding of ranked unlabelled evolutionary trees and its applications

The reduced chain yields every Fréchet mean, the joint law of balance indices, and moments of any order for F-matrices under neutral coalescent models.

Abstract:
Rooted bifurcating trees are mathematical objects used to model evolutionary relationships and arise naturally in both coalescent theory and phylogenetics. Recent numerical representations of tree topologies, known as F-matrices, allow for summarizing a sample of trees via Fréchet means and provide new measures of tree balance. However, the number of ranked unlabelled trees grows super-exponentially with the number of leaves. This makes computation intensive and current methods rely on mixed integer programming and simulation-based methods. Moreover, F-matrices are difficult to interpret, and their distribution is only described in terms of first- and second-order moments under neutral branching. In this paper, we introduce a Markov chain embedding of ranked and unlabelled trees that drastically decreases the size of the state space. Leveraging this embedding, we develop an algorithm that efficiently computes all Fréchet means and use discrete phase-type theory to obtain the joint distribution of tree balance indices. We also use discrete phase-type theory to generalize previous results regarding moments of F-matrices to arbitrary order for any time homogeneous and bifurcating coalescent model. Using this framework, we construct three tests for neutrality and demonstrate their improved power compared to previous methods on simulated data.
stat.CO 2026-04-17

Penalizing Kriging's theta hyperparameter improves accuracy and stability

Theta-regularized Kriging: Modelling and Algorithms

Theta-regularized Kriging penalizes the theta hyperparameter in Gaussian stochastic processes using Lasso, Ridge, or Elastic-net penalties, yielding better accuracy and stability than other penalized Kriging models.

Abstract:
To obtain more accurate model parameters and improve prediction accuracy, we proposed a regularized Kriging model that penalizes the hyperparameter theta in the Gaussian stochastic process, termed the Theta-regularized Kriging. We derived the optimization problem for this model from a maximum likelihood perspective. Additionally, we presented specific implementation details for the iterative process, including the regularized optimization algorithm and the geometric search cross-validation tuning algorithm. Three distinct penalty methods, Lasso, Ridge, and Elastic-net regularization, were meticulously considered. Meanwhile, the proposed Theta-regularized Kriging models were tested on nine common numerical functions and two practical engineering examples. The results demonstrate that, compared with other penalized Kriging models, the proposed model performs better in terms of accuracy and stability.
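
The general shape of such an objective (a hedged sketch: the paper's exact formulation and algorithms may differ) is the Kriging negative log-likelihood plus a penalty on theta, shown here with a Gaussian correlation kernel and a Ridge penalty:

```python
# Generic sketch of a theta-penalized Kriging objective: Gaussian-kernel
# negative log-likelihood plus a Ridge penalty on theta. Illustrative only;
# the paper's precise objective, tuning and algorithms are not reproduced.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 2))
y = np.sin(X.sum(axis=1)) + 0.05 * rng.normal(size=40)

def penalized_nll(log_theta, lam=0.1):
    theta = np.exp(log_theta)                   # keep theta positive
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 * theta).sum(-1)
    R = np.exp(-d2) + 1e-6 * np.eye(len(X))     # Gaussian correlation matrix
    L = np.linalg.cholesky(R)
    a = np.linalg.solve(L, y - y.mean())
    nll = a @ a / 2 + np.log(np.diag(L)).sum()  # Gaussian log-likelihood terms
    return nll + lam * (theta ** 2).sum()       # Ridge penalty on theta

res = minimize(penalized_nll, x0=np.zeros(2), method="Nelder-Mead")
print("fitted theta:", np.exp(res.x))
```
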
stat.CO 2026-04-15

Adaptive sparse group lasso delivers dual sparsity for quantile regression

Adaptive Sparse Group Lasso Penalized Quantile Regression via Dual ADMM

Dual ADMM optimization with adaptive penalties delivers simultaneous within- and between-group sparsity plus global convergence.

Abstract:
Sparse penalized quantile regression provides an effective framework for variable selection and robust estimation in high-dimensional data analysis. When explanatory variables are organized into groups, achieving sparsity both within and between groups is essential. However, existing quantile regression methods often fail to meet this dual objective. To address this gap, we introduce the adaptive sparse group lasso penalized quantile regression, which integrates adaptive lasso and adaptive group lasso penalties. We optimize the model parameters via the alternating direction method of multipliers (ADMM) applied to the dual problem, and establish global convergence. Through extensive simulation studies and real data analyses, we demonstrate (i) the efficacy of the proposed method in achieving simultaneous within- and between-group sparsity, and (ii) the computational efficiency of our algorithm relative to existing alternatives.
stat.CO 2026-04-15

New algorithm fits linear models in p-adics under digit noise

p-adic Linear Regression for Random Sampling with Digitwise Noise

Probabilistic method recovers coefficients from random p-adic samples where individual digits are corrupted; covers modulo p case as well.

Abstract:
We propose a new probabilistic algorithm of $p$-adic linear regression for random sampling with digitwise noise. This includes a new probabilistic algorithm of modulo $p$ linear regression.
stat.CO 2026-04-15

Multi-object posterior sampled via explicit Bernoulli conditionals

Multi-Object Posterior Computation via Gibbs Sampling

Gibbs updates become tractable because each conditional is a Bernoulli random finite set with closed-form existence probability and attribute density.

Abstract:
This work presents a tractable approach to multi-object posterior computation under a generic measurement likelihood function. While filtering is a popular solution, valuable historical information is discarded. Posterior inference, which captures the full history of the multi-object states, provides a more comprehensive solution but is notoriously difficult and has received limited attention. Our proposed approach uses Gibbs Sampling (GS) to generate samples from the multi-object posterior. In particular, we establish that the conditional distributions of the multi-object posterior are Bernoulli random finite sets with explicit existence probabilities and attribute densities. These conditionals are straightforward to evaluate and sample from, enabling the construction of an efficient Gibbs sampler with standard convergence guarantees. To demonstrate its versatility, we develop the first multi-scan multi-object smoothing algorithm for superpositional measurements. Numerical experiments show that the proposed method delivers robust performance in challenging low-SNR scenarios where detection based smoothing deteriorates. Moreover, posterior samples obtained from our approach provide statistical characterizations of key variables and parameters, highlighting the advantages of posterior inference. This approach enriches multi-object estimation techniques, which historically lacked smoothing capabilities for non-standard measurements.
stat.CO 2026-04-14

Sobolev CLR penalties align functional data without derivative noise

Sobolev-Regularized Objective Functions for Robust Pairwise Alignment of Functional Data

Penalizing velocity and acceleration ensures monotonic warps that separate phase from amplitude in noisy curves.

Abstract:
Functional data registration is a critical challenge in modern statistics, essential for separating phase variability from amplitude variability. While derivative-based frameworks offer mathematically elegant solutions, their dependence on signal velocities renders them susceptible to additive noise. This study proposes and evaluates a family of robust, Sobolev-regularized objective functions for the pairwise alignment of functional data, operating entirely within the original function space to avoid the need for numerical differentiation of the data. We define our optimization over a second-order Sobolev space and utilize the Centered Log-Ratio (CLR) transform to represent the warping functions. By penalizing both the velocity and acceleration of the centered log-derivative, this geometric approach preempts degenerate "pinching" artifacts and ensures the resulting warps are strictly monotonic, valid diffeomorphisms. In practice, this allows for highly efficient, unconstrained optimization within a finite-dimensional space. We systematically investigate four distinct pairwise data mismatch formulations: a Standard L2 baseline, a Symmetric L2 formulation, an Isometry (L2-preserving) mapping, and a Jacobian-weighted L2 functional. We establish robust theoretical foundations for these methods, proving the existence of optimal warps and the asymptotic consistency of the finite-dimensional estimators. Our results demonstrate that this CLR-regularized framework offers a powerful, computationally scalable, and noise-robust alternative to traditional derivative-based registration.
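
The CLR-style representation can be sketched in a few lines: parametrize the warp by an unconstrained log-derivative w, so that exponentiating and normalising automatically yields a strictly increasing warp with fixed endpoints. (Representation only; the paper's Sobolev penalties and mismatch functionals are not reproduced here.)

```python
# Representing a warping function by an unconstrained log-derivative w: after
# exponentiating and normalising, gamma is automatically strictly increasing
# with gamma(0) = 0 and gamma(1) = 1. Sketch of the representation only.
import numpy as np

t = np.linspace(0, 1, 101)
w = np.sin(2 * np.pi * t)                 # any unconstrained function works
dg = np.exp(w)                            # positive derivative => monotone warp
gamma = np.concatenate([[0.0],            # trapezoid-rule cumulative integral
                        np.cumsum((dg[1:] + dg[:-1]) / 2 * np.diff(t))])
gamma /= gamma[-1]                        # normalise so gamma(1) = 1
assert np.all(np.diff(gamma) > 0)         # strictly monotonic, valid warp
```
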
stat.CO 2026-04-14

iglm package enables regression under interference in networks

R Package iglm: Regression under Interference in Connected Populations

Scalable convex optimization with theoretical guarantees lets users study spillovers in connected populations such as social media platforms and student communication networks.

Abstract:
We introduce R package iglm, which implements a comprehensive framework for studying relationships among predictors and outcomes under interference. The implemented regression framework facilitates the study of spillover and other phenomena in connected populations and has important advantages over existing packages, among them scalability and provable theoretical guarantees. On the computational side, the regression framework relies on scalable methods that can be applied to small and large data sets, by solving a convex optimization program based on pseudo-likelihoods using Minorization-Maximization and Quasi-Newton algorithms. On the statistical side, the regression framework comes with provable theoretical guarantees. To increase the versatility of iglm, users can add custom-built model terms. We showcase iglm using two data sets, including hate speech on the social media platform X and communications among students.
stat.CO 2026-04-14

Fixed uniforms turn into exact Beta(a,1-a) samples

Extended One-Liners for the Beta, Gamma, and Dirichlet Distributions with Shape Parameters Below One

Elementary operations on a fixed number of uniforms produce exact draws from Beta, Gamma and Dirichlet distributions when parameters are below one.

Abstract:
We present an explicit deterministic transformation of a fixed number of i.i.d. uniform random variables with exact Beta$(a,1-a)$ law for $0<a<1$, using only elementary operations (an "extended one-liner", see \cite{devroye1996oneline}). As corollaries, the families Beta$(a,b)$ with $\min(a,b)<1$, Gamma$(c)$ with $c<1$, and Dirichlet$(\alpha_1,\dots,\alpha_d)$ with $0<\alpha_i<1$, for fixed $d$, also have extended one-liners.
stat.CO 2026-04-14

R package blocks data leakage to lower optimistic bias in biomedical ML

bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R

Simulations and a transcriptomic case study show guarded pipelines can yield materially different conclusions from leaky ones.

Abstract:
Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how apparent performance changes under controlled leakage mechanisms, and the case study illustrates how guarded and leaky pipelines can yield materially different conclusions on multi-study transcriptomic data. The emphasis throughout is on software design, reproducible workflows, and interpretation of diagnostic output.
stat.CO 2026-04-13

Hierarchical mass matrix unlocks closed-form leapfrog for RMHMC

Adaptive Riemannian Manifold Hamiltonian Monte Carlo with Hierarchical Metric

Adaptive tuning of the imposed structure enables efficient dynamic sampling in high-dimensional problems without requiring target hierarchy.

Abstract:
Hamiltonian Monte Carlo (HMC) and its dynamic extensions, such as the No-U-Turn Sampler (NUTS), are powerful Markov chain Monte Carlo methods for sampling from complex, high-dimensional probability distributions. Riemannian manifold Hamiltonian Monte Carlo (RMHMC) extends HMC by allowing the mass matrix to depend on position, which can substantially improve mixing but also makes implementation considerably more challenging. In this paper, we study an adaptive hierarchical version of RMHMC that is well suited to many hierarchical sampling problems. A key feature of hierarchical RMHMC is that, unlike general RMHMC, it admits a closed-form explicit leapfrog integrator, enabling efficient implementation and direct use within dynamic HMC methods such as NUTS. We introduce an adaptive scheme that automatically tunes the parameters of the hierarchical mass matrix during simulation. Importantly, the target density need not exhibit any hierarchical or block structure; the hierarchy is instead imposed on the mass matrix as a modeling device to capture the local geometry of the target distribution. Numerical experiments demonstrate appealing empirical performance in high-dimensional Bayesian inference problems.
stat.CO 2026-04-13 Recognition

Sparse MCMC preconditioner learns correlations at O(m^2 d) cost

High-dimensional Adaptive MCMC with Reduced Computational Complexity

Online PCA plus reflection matrices give better time-normalized performance than diagonal or full dense alternatives on correlated targets.

Abstract:
We propose an adaptive MCMC method that learns a linear preconditioner which is dense in its off-diagonal elements but sparse in its parametrisation. Due to this sparsity, we achieve a per-iteration computational complexity of $O(m^2d)$ for a user-determined parameter $m$, compared with the $O(d^2)$ complexity of existing adaptive strategies that can capture correlation information from the target. Diagonal preconditioning has an $O(d)$ per-iteration complexity, but is known to fail in the case that the target distribution is highly correlated, see \citet[Section 3.5]{hird2025a}. Our preconditioner is constructed using eigeninformation from the target covariance which we infer using online principal components analysis on the MCMC chain. It is composed of a diagonal matrix and a product of carefully chosen reflection matrices. On various numerical tests we show that it outperforms diagonal preconditioning in terms of absolute performance, and that it outperforms traditional dense preconditioning and multiple diagonal plus low-rank alternatives in terms of time-normalised performance.
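
The structure of such a preconditioner, a diagonal matrix composed with m Householder reflections, can be applied to a vector in O(md) flops; how the paper adapts the reflections and diagonal from online PCA is its contribution and is not reproduced in this sketch:

```python
# Applying a "diagonal times product of reflections" preconditioner to a
# vector in O(m d) flops, versus O(d^2) for a dense matrix. The reflection
# vectors and diagonal here are random placeholders, not the paper's adapted
# choices from online PCA.
import numpy as np

rng = np.random.default_rng(0)
d, m = 10_000, 5
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit reflection vectors
diag = np.exp(rng.normal(size=d))               # positive diagonal scaling

def apply_preconditioner(x):
    for v in V:                                 # each reflection costs O(d)
        x = x - 2.0 * v * (v @ x)               # Householder: H x = x - 2 v (v^T x)
    return diag * x

z = apply_preconditioner(rng.normal(size=d))    # e.g. precondition a proposal
print(z[:3])
```
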
stat.CO 2026-04-10

Python tool fuses heart data types for earlier disease detection

mmid: Multi-Modal Integration and Downstream analyses for healthcare analytics in Python

The package combines MRI scans, ECG readings and genetic scores to outperform single sources while filling gaps in incomplete records.

Abstract:
mmid (Multi-Modal Integration and Downstream analyses for healthcare analytics) is a Python package that offers multi-modal fusion and imputation, classification, time-to-event prediction and clustering functionalities under a single interface, filling a gap in sequential data integration and downstream analysis for healthcare applications in a structured and flexible environment. mmid wraps several algorithms for multi-modal decomposition, prediction and clustering in a single package; these can be combined smoothly with a single command and appropriate configuration files, facilitating reproducibility and transferability of studies involving heterogeneous health data sources. A showcase on personalised cardiovascular risk prediction is used to highlight the relevance of a composite pipeline enabling proper treatment and analysis of complex multi-modal data. We employed mmid in a real application scenario involving cardiac magnetic resonance imaging, electrocardiogram, and polygenic risk scores data from the UK Biobank. We showed that the three modalities captured joint and individual information that was used to (1) identify cardiovascular disease early, before clinically relevant manifestations, and (2) do so better than single data sources alone. Moreover, mmid allowed us to impute partially observable data modalities without considerable performance losses in downstream disease prediction, demonstrating its relevance for real-world health analytics applications, which are often characterised by missing data.
stat.CO 2026-04-10 Recognition

Vine copulas build dependence trees from mixed EHR data

Vine Copulas for Analyzing Multivariate Conditional Dependencies in Electronic Health Records Data

The trees rank variables and isolate central ones, letting analysts explore co-morbid links without assuming Gaussian distributions.

Abstract:
Electronic health records (EHR) store hundreds of demographic and laboratory variables from large patient populations. Traditional statistical methods have limited capacity in processing mixed-type data (continuous, ordinal) and capturing non-linear relationships in large multivariate data when oversimplified assumptions are made about the distribution (e.g., Gaussian) of disparate variables in EHR data. This paper addresses the limitations mentioned above by repurposing the vine copula method, which is primarily used to synthesize a multivariate distribution from many bivariate cumulative distribution functions (copulas). Vine copulas produce tree structures that represent bivariate conditional dependencies at varying hierarchical levels, decomposing a multivariate distribution. The tree structure is used to rank variables by conditional dependence and to identify a subset of central variables with local dependence, thus simplifying probabilistic mining of high-dimensional EHR data. The proposed application of vine copulas is used to identify conditional dependence between co-morbid conditions and is validated for characterizing different cohorts of EHR patients. The contribution of this paper is a novel approach to probabilistic mining and exploration of healthcare data that provides data-driven explanations, visualization, and variable selection to prognosticate a healthcare outcome. The source code is shared publicly.
stat.CO 2026-04-08 Recognition

Niching stabilizes importance sampling on multi-modal failure surfaces

Niching Importance Sampling for Multi-modal Rare-event Simulation

The combined estimator avoids degeneracy where standard methods collapse, yielding reliable probability estimates across numerical test cases.

Abstract:
This paper proposes niching importance sampling, a framework that combines concepts from reliability analysis, e.g. Markov chains, importance sampling, and relative cross entropy minimisation, with niching techniques from evolutionary multi-modal optimisation. The result is a highly robust estimator of the probability of failure, that can tackle sampling challenges posed by the underlying geometry of a reliability problem. Niching importance sampling is tested on a range of numerical examples and is shown to consistently avoid the degenerate behaviour observed for existing reliability methods on several multi-modal performance functions.
stat.CO 2026-04-08 Recognition

S-learner ranks top 20% to capture 78% of campaign lift

A Large-Scale Empirical Comparison of Meta-Learners and Causal Forests for Heterogeneous Treatment Effect Estimation in Marketing Uplift Modeling

On 14 million records the S-learner with LightGBM outperforms T-learner, X-learner and causal forest by Qini score and cumulative gain.

Abstract:
Estimating Conditional Average Treatment Effects (CATE) at the individual level is central to precision marketing, yet systematic benchmarking of uplift modeling methods at industrial scale remains limited. We present UpliftBench, an empirical evaluation of four CATE estimators: S-Learner, T-Learner, X-Learner (all with LightGBM base learners), and Causal Forest (EconML), applied to the Criteo Uplift v2.1 dataset comprising 13.98 million customer records. The near-random treatment assignment (propensity AUC = 0.509) provides strong internal validity for causal estimation. Evaluated via Qini coefficient and cumulative gain curves, the S-Learner achieves the highest Qini score of 0.376, with the top 20% of customers ranked by predicted CATE capturing 77.7% of all incremental conversions, a 3.9x improvement over random targeting. SHAP analysis identifies f8 as the dominant heterogeneous treatment effect (HTE) driver among the 12 anonymized covariates. Causal Forest uncertainty quantification reveals that 1.9% of customers are confident persuadables (lower 95% CI > 0) and 0.1% are confident sleeping dogs (upper 95% CI < 0). Our results provide practitioners with evidence-based guidance on method selection for large-scale uplift modeling pipelines.
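
An S-learner in this spirit takes a handful of lines. The sketch below uses LightGBM as in the paper, but with synthetic data standing in for Criteo Uplift v2.1; the generating process is our own illustration:

```python
# Minimal S-learner sketch: one LightGBM classifier fit on covariates plus the
# treatment indicator; CATE is the difference of predicted conversion
# probabilities at t=1 and t=0. Synthetic data replaces Criteo Uplift v2.1.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 12))
t = rng.integers(0, 2, n)                        # near-random assignment
p = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * t * X[:, 1])))   # heterogeneous effect
y = rng.binomial(1, p)

model = LGBMClassifier(n_estimators=200).fit(np.column_stack([X, t]), y)
cate = (model.predict_proba(np.column_stack([X, np.ones(n)]))[:, 1]
        - model.predict_proba(np.column_stack([X, np.zeros(n)]))[:, 1])

top = np.argsort(-cate)[: n // 5]                # rank and target the top 20%
print("mean predicted uplift in top 20%:", cate[top].mean())
```
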
stat.CO 2026-04-08 2 theorems

Stochastic emulators reduce high-dimensional RBDO to deterministic optimization

High-dimensional reliability-based design optimization using stochastic emulators

Modeling conditional response distributions in design space avoids nested Monte Carlo and scales better than Kriging as dimensionality grows

Abstract:
Reliability-based design optimization (RBDO) is traditionally formulated as a nested optimization and reliability problem. Although surrogate models are generally employed to improve efficiency, the approach remains computationally prohibitive in high-dimensional settings. This paper proposes a novel RBDO framework based on a stochastic simulator viewpoint, in which the deterministic limit-state function and the uncertainty in the model inputs are combined into a unified stochastic representation. Under this formulation, the system response conditioned on a given design is modeled directly through its output distribution, rather than through an explicit limit-state function. Stochastic emulators are constructed in the design space to approximate the conditional response distribution, enabling the semi-analytical evaluation of failure probabilities or associated quantiles without resorting to Monte Carlo simulation. Two classes of stochastic emulators are investigated, namely generalized lambda models and stochastic polynomial chaos expansions. Both approaches provide a deterministic mapping between design variables and reliability constraints, which breaks the classical double-loop structure of RBDO and allows the use of standard deterministic optimization algorithms. The performance of the proposed approach is evaluated on a set of benchmark problems with dimensionality ranging from low to very high, including a case with stochastic excitation. The results are compared against a Kriging-based approach formulated in the full input space. The proposed method yields substantial computational gains, particularly in high-dimensional settings. While its efficiency is comparable to Kriging for low-dimensional problems, it significantly outperforms Kriging as the dimensionality increases.
stat.CO 2026-04-07 Recognition

R package stops preprocessing from inflating ML scores

fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R

Guarded resampling inside each fold yields realistic performance numbers while matching tidymodels accuracy with less code.

Abstract:
Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.
stat.CO 2026-04-07 Recognition

R package aggregates frequency tables with bounded disclosure risk

iLBA: An R package for confidentially disseminating aggregated frequency tables

iLBA combines small-cell adjustment and controlled aggregation to let agencies release usable tables while hiding individual records.

Abstract:
Statistical agencies frequently release frequency tables derived from microdata, but small frequency cells may lead to disclosure risks. We present iLBA, an open-source R package for confidential dissemination of aggregated frequency tables. The package implements the Information-Loss-Bounded Aggregation (iLBA) algorithm, which combines Small Cell Adjustment (SCA) at the finest level table with an aggregation procedure that introduces controlled ambiguity while bounding information loss. The software enables users to construct masked finest level tables, generate confidential aggregated tables for selected variables, and obtain masked frequencies for single-cell queries. By providing an accessible implementation of the iLBA method, the package facilitates reproducible and efficient disclosure control for tabular data derived from microdata.
stat.CO 2026-04-06 2 theorems

SMC samplers receive explicit finite-sample error bounds

On the complexity of standard and waste-free SMC samplers

Bounds apply to expectations and normalising constants and yield complexity scaling in T or dimension d.

Abstract:
We establish finite sample bounds for the error of standard and waste-free SMC samplers. Our results cover estimates of both expectations and normalising constants of the target distributions. We consider first an arbitrary sequence of distributions, and then specialise our results to tempering sequences. We use our results to derive the complexity of SMC samplers with respect to the parameters of the problem, such as $T$, the number of target distributions, in the general case, or $d$, the dimension of the ambient space, in the tempering case. We use these bounds to derive practical recommendations for the implementation of SMC samplers for end users.
