Extending Sheldon M. Ross's Method for Efficient Large-Scale Variance Computation

Jiawen Li

arxiv: 2410.21922 · v10 · submitted 2024-10-29 · 📊 stat.CO · math.ST· stat.TH

Extending Sheldon M. Ross's Method for Efficient Large-Scale Variance Computation

Jiawen Li This is my paper

Pith reviewed 2026-05-23 19:33 UTC · model grok-4.3

classification 📊 stat.CO math.STstat.TH

keywords variance computationbatch updatessufficient statisticsstreaming dataRoss methodChan-Golub-LeVeque formulaonline statistics

0 comments

The pith

Batch variance updates using stored prior statistics beat Ross's method except when the new batch has size one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Prior Knowledge Acceleration (PKA), a method that reuses the count, sum, and sum of squares already computed for an existing dataset when new observations arrive in a batch. This produces an update formula algebraically identical to the 1983 Chan-Golub-LeVeque pairwise method, yet the authors supply a runtime analysis that yields an explicit acceleration factor and shows the cost advantage over both full recomputation and Ross's single-sample update. They prove Ross's approach is cheapest only for a batch of size exactly one. The same reuse principle extends directly to covariance and other decomposable statistics. Experiments on synthetic and real streaming data confirm the predicted speedups, reaching hundreds of times faster when the prior set is large relative to the incoming batch.

Core claim

Ross's single-sample variance update is preferable only when the new batch contains exactly one observation (N2 = 1); for any larger batch the PKA reuse of prior sufficient statistics delivers strictly lower arithmetic cost, with the crossover point governed by the derived acceleration factor τ_a.

What carries the argument

The Prior Knowledge Acceleration (PKA) batch-update identity that combines stored prior count/sum/sum-of-squares with the new batch's own count/sum/sum-of-squares.

If this is right

For every batch size greater than one, the batch-update formula has lower arithmetic cost than Ross's single-sample recurrence.
The same reuse pattern applies without change to covariance matrices and any other statistic whose update depends only on a fixed set of power sums.
When the prior dataset is orders of magnitude larger than the incoming batch, observed wall-clock speedups reach hundreds of times relative to Welford, Chan pairwise, or naive two-pass baselines.
The method remains correct for streaming data that arrives in variable-sized batches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems that already maintain running statistics for monitoring or checkpointing can adopt the batch formula at essentially zero extra implementation cost.
The same algebraic structure suggests analogous accelerations for higher-order moments or for any decomposable U-statistic.
In distributed or federated settings the prior statistics can be shipped once and then updated locally on each new batch without retransmitting raw data.

Load-bearing premise

The count, sum, and sum of squares for the prior dataset have already been computed and stored, so the marginal cost is only the arithmetic on the new batch.

What would settle it

A controlled timing run on a machine where the new batch size exceeds the prior dataset size and no prior statistics are pre-stored; if the PKA method is still faster than naive two-pass recomputation, the central claim does not hold.

read the original abstract

We introduce Prior Knowledge Acceleration (PKA), a batch-update method for variance that reuses previously computed sufficient statistics to avoid full recomputation. The update identity is algebraically equivalent to the pairwise formula of Chan, Golub, and LeVeque (1983); our contribution is a runtime-cost analysis that derives an explicit acceleration factor $\tau_a$ and identifies the data-size regime where batch updating outperforms both na\"ive recomputation and Ross's single-sample method. We prove that Ross's approach is preferable only when the new batch contains a single sample ($N_2 = 1$). We further generalise the framework to covariance and other decomposable statistics. Benchmarks against Welford, Chan pairwise, and na\"ive two-pass baselines on synthetic and real-world streaming data confirm the theoretical predictions, with speedups of up to $454\times$ when the prior dataset is large relative to the new batch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper recycles the 1983 Chan pairwise variance update, adds a runtime cost model with an explicit acceleration factor, and proves Ross's method wins only for N2=1 batches.

read the letter

The central point is a cost comparison for batch variance updates on large prior data. They restate an algebraic identity they say matches Chan, Golub, and LeVeque 1983 exactly, then derive an acceleration factor tau_a, show that Ross's single-sample method is cheaper only when the new batch has size one, and extend the setup to covariance. Benchmarks on synthetic and real streaming data are said to match the predictions and reach speedups of hundreds of times when the prior set dwarfs the new batch. That regime analysis and the concrete factor are the actual additions, and the benchmarks give a practical check on the claims. The generalization to other decomposable statistics is also straightforward to follow if the algebra carries over. The main limitation is that the update formula itself is not new, so the work reduces to analyzing when each known method pays off and proving the N2=1 restriction. If that proof is short algebra, the paper is essentially a focused note rather than a broad advance. The abstract is silent on numerical stability or floating-point error growth, which could matter once N gets large. No circularity or hidden fitting shows up in the stated claims. This is useful for readers who maintain streaming pipelines and need to decide between Welford, pairwise, or Ross-style updates on real hardware. A practitioner who cares about the exact crossover points would get value from the tau_a derivation and the reported timings. It is worth sending to a referee to check the derivations and benchmark reproducibility, even with the limited novelty.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Prior Knowledge Acceleration (PKA), a batch-update method for variance that reuses precomputed sufficient statistics (count, sum, sum of squares) from a prior dataset. It asserts algebraic equivalence to the Chan-Golub-LeVeque pairwise formula (1983), derives an explicit acceleration factor τ_a, proves that Ross's single-sample method is preferable only when the new batch size N2 = 1, generalizes the framework to covariance and other decomposable statistics, and reports benchmarks against Welford, Chan pairwise, and naïve baselines on synthetic and real streaming data with speedups up to 454× when the prior dataset is large relative to the new batch.

Significance. If the runtime-cost analysis and the stated algebraic equivalence hold, the work supplies a concrete regime-based guideline (via τ_a) for selecting among incremental variance methods in large-scale streaming settings. The explicit comparison to Ross's method and the generalization to covariance add practical value; the benchmarks are presented as confirming the derived predictions.

major comments (2)

[Abstract / derivation of τ_a] Abstract and the section deriving τ_a: the central claim that Ross's method is preferable only for N2 = 1 rests on an explicit acceleration factor τ_a whose derivation is asserted but not supplied with the intermediate algebraic steps or the precise cost-model assumptions (e.g., arithmetic-operation counts for each method). Without these steps the load-bearing runtime comparison cannot be verified.
[Benchmarks] Benchmark section: the reported speedups (up to 454×) and confirmation of theoretical predictions are given without accompanying error analysis, floating-point stability discussion, or condition-number checks for large prior datasets. Because the method is claimed to be algebraically equivalent, any deviation in observed accuracy would undermine the practical recommendation.

minor comments (1)

Notation for the prior and new batch sizes (N1, N2) and the sufficient statistics should be introduced once with a clear table or equation block to avoid repeated re-definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments. We address each major point below and will incorporate the suggested clarifications into a revised manuscript.

read point-by-point responses

Referee: [Abstract / derivation of τ_a] Abstract and the section deriving τ_a: the central claim that Ross's method is preferable only for N2 = 1 rests on an explicit acceleration factor τ_a whose derivation is asserted but not supplied with the intermediate algebraic steps or the precise cost-model assumptions (e.g., arithmetic-operation counts for each method). Without these steps the load-bearing runtime comparison cannot be verified.

Authors: We agree that the intermediate steps in the derivation of τ_a were presented at a summary level. In the revision we will expand the relevant section to include the complete algebraic derivation from the operation counts of each method. The cost model will be stated explicitly (counting additions, multiplications, and divisions under standard floating-point arithmetic) so that the comparison yielding τ_a and the optimality of Ross only for N2=1 can be verified directly. revision: yes
Referee: [Benchmarks] Benchmark section: the reported speedups (up to 454×) and confirmation of theoretical predictions are given without accompanying error analysis, floating-point stability discussion, or condition-number checks for large prior datasets. Because the method is claimed to be algebraically equivalent, any deviation in observed accuracy would undermine the practical recommendation.

Authors: Because the PKA update is algebraically equivalent to the Chan-Golub-LeVeque formula, the computed values are identical in exact arithmetic. We will add to the benchmark section a short floating-point analysis together with relative-error measurements against a higher-precision reference for the reported datasets. We will also report the condition numbers of the test matrices to place the observed numerical behavior in context; these checks are feasible and will be included. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation chain rests on algebraic equivalence to the external Chan et al. (1983) pairwise formula, an explicit runtime-cost model deriving τ_a, and a proof that Ross's single-sample method is cheaper only for N2=1. These steps are presented as direct algebraic identities and cost comparisons under the stated premise of precomputed prior statistics; no fitted parameters are relabeled as predictions, no self-citations are load-bearing, and no ansatz or uniqueness claim reduces to the paper's own inputs. Benchmarks are described as external confirmation rather than the source of the claimed identities. The central claims therefore remain independent of the result they evaluate.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no free parameters, invented entities, or non-standard axioms are identifiable from the given text. The work relies on standard algebraic identities for sums and sums of squares.

pith-pipeline@v0.9.0 · 5679 in / 1227 out tokens · 24719 ms · 2026-05-23T19:33:24.183830+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

(1993, December)

Agterberg, F.P. (1993, December). Calculation of the variance of mean values for blocks in regional resource evaluation studies. Nonrenewable Resources, 2(4), 312–324, https://doi.org/10.1007/bf02257541

work page doi:10.1007/bf02257541 1993
[2]

Bekci, R.Y. (2024). Online learning of delayed choices. A. Globerson et al. (Eds.), Advances in neural information processing systems (Vol. 37, pp. 2292–2322). Curran Associates, Inc

work page 2024
[3]

Chan, T.F., Golub, G.H., LeVeque, R.J. (1983). Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician , 37(3), 242–247, https://doi.org/10.1080/00031305.1983.10483115

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1080/00031305.1983.10483115 1983
[4]

(1977, September)

Dempster, A.P., Laird, N.M., Rubin, D.B. (1977, September). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statisti- cal Society Series B: Statistical Methodology , 39(1), 1–22, Retrieved from http://dx.doi.org/10.1111/j.2517-6161.1977.tb01600.x

work page doi:10.1111/j.2517-6161.1977.tb01600.x 1977
[5]

Fisher, R.A. (1935). The Design of Experiments . Edinburgh: Oliver and Boyd

work page 1935
[6]

Guan, X., & Burton, H. (2022). Bias-variance tradeoff in machine learning: Theoretical formulation and implications to structural engineering applications. Structures, 46, 17-30, https://doi.org/10.1016/j.istruc.2022.10.004

work page doi:10.1016/j.istruc.2022.10.004 2022
[7]

Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning . Springer New York

work page 2009
[8]

Hebrail, G., & Berard, A. (2006). Individual Household Electric Power Consumption. UCI Machine Learning Repository. (DOI: https://doi.org/10.24432/C58K54)

work page doi:10.24432/c58k54 2006
[9]

Hu, S., Li, G., Shi, W. (2023). Lars: A latency-aware and real-time scheduling framework for edge-enabled internet of vehicles. IEEE Transactions on Services Computing, 16(1), 398-411, https://doi.org/10.1109/TSC.2021.3106260

work page doi:10.1109/tsc.2021.3106260 2023
[10]

Knuth, D.E. (2005). The art of computer programming, volume 1, fascicle 1: Mmix a risc computer for the new millenium . Addison-Wesley. 15

work page 2005
[12]

Ross, S.M. (2021). Introduction to probability and statistics for engineers and scientists. London, United Kingdom: Academic Press

work page 2021
[13]

Schmitt, S.M., & Fessler, J.A. (2012). Fast variance computation for quadratically penalized iterative reconstruction of 3d axial ct images. 2012 ieee nuclear science symposium and medical imaging conference record (nss/mic) (p. 3287-3292). Searc´ oid, M. (2006).Lipschitz functions . Berlin, New York: Springer-Verlag

work page 2012
[14]

Zhai, J., Wang, J., Wang, X. (2014). Ensemble online sequential extreme learning machine for large data set classification. 2014 ieee international conference on systems, man, and cybernetics (smc) (p. 2250-2255)

work page 2014
[15]

Zhang, X., & Lu, X. (2021). Online algorithm for variance components estimation. Communications in Nonlinear Science and Numerical Simulation , 97, 105722, https://doi.org/https://doi.org/10.1016/j.cnsns.2021.105722 16

work page doi:10.1016/j.cnsns.2021.105722 2021

[1] [1]

(1993, December)

Agterberg, F.P. (1993, December). Calculation of the variance of mean values for blocks in regional resource evaluation studies. Nonrenewable Resources, 2(4), 312–324, https://doi.org/10.1007/bf02257541

work page doi:10.1007/bf02257541 1993

[2] [2]

Bekci, R.Y. (2024). Online learning of delayed choices. A. Globerson et al. (Eds.), Advances in neural information processing systems (Vol. 37, pp. 2292–2322). Curran Associates, Inc

work page 2024

[3] [3]

Chan, T.F., Golub, G.H., LeVeque, R.J. (1983). Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician , 37(3), 242–247, https://doi.org/10.1080/00031305.1983.10483115

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1080/00031305.1983.10483115 1983

[4] [4]

(1977, September)

Dempster, A.P., Laird, N.M., Rubin, D.B. (1977, September). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statisti- cal Society Series B: Statistical Methodology , 39(1), 1–22, Retrieved from http://dx.doi.org/10.1111/j.2517-6161.1977.tb01600.x

work page doi:10.1111/j.2517-6161.1977.tb01600.x 1977

[5] [5]

Fisher, R.A. (1935). The Design of Experiments . Edinburgh: Oliver and Boyd

work page 1935

[6] [6]

Guan, X., & Burton, H. (2022). Bias-variance tradeoff in machine learning: Theoretical formulation and implications to structural engineering applications. Structures, 46, 17-30, https://doi.org/10.1016/j.istruc.2022.10.004

work page doi:10.1016/j.istruc.2022.10.004 2022

[7] [7]

Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning . Springer New York

work page 2009

[8] [8]

Hebrail, G., & Berard, A. (2006). Individual Household Electric Power Consumption. UCI Machine Learning Repository. (DOI: https://doi.org/10.24432/C58K54)

work page doi:10.24432/c58k54 2006

[9] [9]

Hu, S., Li, G., Shi, W. (2023). Lars: A latency-aware and real-time scheduling framework for edge-enabled internet of vehicles. IEEE Transactions on Services Computing, 16(1), 398-411, https://doi.org/10.1109/TSC.2021.3106260

work page doi:10.1109/tsc.2021.3106260 2023

[10] [10]

Knuth, D.E. (2005). The art of computer programming, volume 1, fascicle 1: Mmix a risc computer for the new millenium . Addison-Wesley. 15

work page 2005

[11] [12]

Ross, S.M. (2021). Introduction to probability and statistics for engineers and scientists. London, United Kingdom: Academic Press

work page 2021

[12] [13]

Schmitt, S.M., & Fessler, J.A. (2012). Fast variance computation for quadratically penalized iterative reconstruction of 3d axial ct images. 2012 ieee nuclear science symposium and medical imaging conference record (nss/mic) (p. 3287-3292). Searc´ oid, M. (2006).Lipschitz functions . Berlin, New York: Springer-Verlag

work page 2012

[13] [14]

Zhai, J., Wang, J., Wang, X. (2014). Ensemble online sequential extreme learning machine for large data set classification. 2014 ieee international conference on systems, man, and cybernetics (smc) (p. 2250-2255)

work page 2014

[14] [15]

Zhang, X., & Lu, X. (2021). Online algorithm for variance components estimation. Communications in Nonlinear Science and Numerical Simulation , 97, 105722, https://doi.org/https://doi.org/10.1016/j.cnsns.2021.105722 16

work page doi:10.1016/j.cnsns.2021.105722 2021