Gradient-Free Warm-Start Library Recovery: an Amortized-Regret Separation

Germany); Jianwei Lou (RailMind Systems; Neuss

arxiv: 2606.21253 · v1 · pith:HVZKS7JHnew · submitted 2026-06-19 · 💻 cs.LG · cs.NE

Gradient-Free Warm-Start Library Recovery: an Amortized-Regret Separation

Jianwei Lou (RailMind Systems , Neuss , Germany) This is my paper

Pith reviewed 2026-06-26 14:53 UTC · model grok-4.3

classification 💻 cs.LG cs.NE

keywords continual learningwarm-start libraryamortized regretregime recognitiongradient-free learningrecurring streamsedge deployment

0 comments

The pith

A warm-start library learner attains amortized recovery cost O(KD/ε²+(R-K)logK/Δ²) versus Θ(RD/ε²) for memoryless re-estimation on recurring streams with given segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that on streams with recurring regimes, a gradient-free warm-start library learner can achieve lower amortized recovery cost than a memoryless re-estimator when segmentation is provided externally. The key is that recognizing which of K known regimes is active costs only O(log K/Δ²) independent of dimension D, while estimating a new regime always costs Θ(D/ε²). This leads to an advantage that grows with the number of recurrences and the dimension. The separation is tight according to matching lower bounds and does not require backpropagation or suffer forgetting.

Core claim

Given segmentation of the stream into regimes, the warm-start library learner attains an amortized recovery cost of O(KD/ε²+(R-K)logK/Δ²) versus a memoryless re-estimator's Θ(RD/ε²), because regime recognition costs O(log K/Δ²) independent of D whereas regime estimation costs Θ(D/ε²); matching lower bounds establish that each term is individually minimax-tight.

What carries the argument

Decoupling of regime recognition, which costs O(log K/Δ²) and is independent of dimension, from regime estimation, which costs Θ(D/ε²).

If this is right

The advantage scales with dimension D and recurrence density (R-K).
Recognition is Θ(log K/Δ²) and memoryless estimation is Ω(RD/ε²) in the worst case.
The method requires no end-to-end backprop and guarantees zero forgetting by construction.
A count-calibrated variant matches the leading constant of a Bayesian baseline up to bounded overshoot.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If regimes are well-separated, library methods could support efficient continual learning on resource-constrained devices without gradients.
High-dimensional data with many distinct regimes may benefit most from this separation, up to the simplex packing limit of e^Θ(D) regimes.
The cost structure might generalize to other local, online learners that can store and recall past estimates.

Load-bearing premise

The input stream is externally segmented into regimes that remain separated by a minimum distance Δ allowing cheap recognition.

What would settle it

Run the library learner and a memoryless re-estimator on a synthetic stream with known segmentation, R recurrences of K regimes, and measure whether the amortized costs match the stated bounds as D and R vary.

Figures

Figures reproduced from arXiv: 2606.21253 by Germany), Jianwei Lou (RailMind Systems, Neuss.

**Figure 1.** Figure 1: Empirical corroboration of the decoupling and the count-calibrated recurrence tie (synthetic recurring-regime streams, supplied segmentation, five seeds per cell). Throughout, B is the conjugate-Dirichlet library learner Blib of Proposition 2 run end-to-end (a full Bayesian posteriorpredictive over the K components with log-evidence weighting), not an abstracted reference curve. (A) Recognition–estimation… view at source ↗

**Figure 2.** Figure 2: The decoupling on real regime distributions (K=4 regimes = empirical k-mer laws of four genomes, GC 19–72%; i.i.d. block draws; supplied segmentation; five seeds × two recurrence densities). (A) Memoryless estimation cost grows ∝ D1.04 (Θ(D/ε2 ), zero censoring) while Wcc and B recognition cost stays dimension-independent. (B) A label-shuffle null (the same Wcc predictions scored against a permuted regime … view at source ↗

read the original abstract

Continual learning that is gradient-free, local, online, and append-only is attractive for edge and streaming deployment, but its value is usually argued informally. We give a provable account on recurring-regime streams. Given segmentation, a warm-start library learner attains amortized recovery cost $O\!\big(KD/\varepsilon^2+(R-K)\logK/\Delta^2\big)$ versus a memoryless re-estimator's $\Theta(RD/\varepsilon^2)$, an advantage $(R-K)\,\Theta(D/\varepsilon^2)$ growing with dimension $D$ and recurrence density. The mechanism is a decoupling: recognizing which of $K$ seen regimes is active costs $O(\log K/\Delta^2)$, independent of $D$, whereas estimating a regime costs $\Theta(D/\varepsilon^2)$. We prove this is tight: matching lower bounds give recognition $\Theta(\log K/\Delta^2)$ and a memoryless-class bound $\Omega(RD/\varepsilon^2)$, so each term is individually minimax-tight (the joint statement is conditional). The separation is born-immune (a memoryless learner's advantage is identically zero) and paradigm-level: it matches, and does not beat, a fair spawn-capable Bayesian baseline; the contribution is attaining this cost structure without end-to-end backprop and with zero forgetting by construction. A count-calibrated variant ties the baseline's leading constant up to a bounded, never-negative per-recurrence overshoot, hyperparameter-free and with no per-step transcendentals. We bound the scope: recognizable regimes are capped by simplex packing (walls $e^{\Theta(D)}$); autonomous segmentation is impossible at the packing wall (no detector escapes the false-alarm/delay frontier as regimes overlap); the advantage vanishes under overlap. The dimension-dependent separation is corroborated on synthetic streams and real $k$-mer genome distributions (memoryless cost $\propto D^{1.04}$, recognition $D$-independent); the one real sequential stream sits in the $D{=}1$ near-null corner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper cleanly separates D-independent recognition from D-dependent estimation in gradient-free warm-start recovery on recurring streams, with matching lower bounds, but only when segmentation is supplied externally.

read the letter

The main takeaway is that a warm-start library can cut amortized cost from Θ(RD/ε²) down to O(KD/ε² + (R-K)logK/Δ²) when regimes recur and segmentation is given, because recognition costs only logK/Δ² regardless of dimension while estimation still costs Θ(D/ε²). They back this with matching lower bounds on each term separately and show the joint upper bound is conditional on the separation Δ.

What stands out is the explicit decoupling and the fact that the advantage scales with both D and recurrence density. The count-calibrated version ties a Bayesian baseline up to a bounded overshoot without extra hyperparameters. The synthetic scaling checks and the k-mer genome data confirm the predicted D-dependence for memoryless cost versus D-independence for recognition.

The obvious limitation is the external segmentation requirement. The paper states outright that the gain vanishes under overlap and that autonomous detection hits a packing-wall barrier, so the result is scoped to cases where regimes are handed to the algorithm. The single real sequential stream sits at D=1, which is the corner where the claimed advantage is zero. Derivations rely on standard concentration, which is fine but means the tightness claims rest on those usual arguments.

This is useful for people working on gradient-free continual or edge learning who want a formal cost model rather than informal arguments. It organizes when library recovery helps and when it cannot. The scoped claim and the lower-bound matching make it worth sending to referees even if the practical regime is narrow.

Referee Report

0 major / 3 minor

Summary. The manuscript provides a theoretical analysis of gradient-free, local, online, append-only continual learning on recurring-regime streams. Conditional on externally supplied segmentation into K well-separated regimes (minimum distance Δ), it claims that a warm-start library learner attains amortized recovery cost O(KD/ε² + (R-K)logK/Δ²) versus Θ(RD/ε²) for a memoryless re-estimator, yielding advantage (R-K)Θ(D/ε²) that grows with dimension D and recurrence density. The separation arises from decoupling recognition (O(logK/Δ²), D-independent) from estimation (Θ(D/ε²)); matching lower bounds establish individual minimax tightness (joint bound conditional). The result is birth-immune, matches but does not beat a fair Bayesian baseline, and is hyperparameter-free in a count-calibrated variant. Scope is bounded by simplex packing (recognizable regimes ≲ e^Θ(D)); autonomous segmentation is impossible at the packing wall and the advantage vanishes under overlap. Empirical corroboration is given on synthetic streams and real k-mer genome data.

Significance. If the derivations hold, the work supplies a rigorous, paradigm-level cost separation for library-based warm-start recovery in continual learning without gradients or forgetting. Explicit matching lower bounds, the D-dependent advantage, the birth-immune property, and the hyperparameter-free variant are concrete strengths. The scope analysis (packing wall, overlap) and empirical scaling (memoryless cost ∝ D^1.04, recognition D-independent) add value for edge/streaming settings.

minor comments (3)

[Abstract] Abstract: the symbols R, K, D, ε, Δ are used in the cost expressions without an immediate parenthetical definition; a one-sentence glossary on first appearance would aid readability.
[Experiments] Empirical section: the description of error-bar handling, data-exclusion rules, and the precise synthetic-stream generation protocol is not fully detailed in the visible material; adding these would strengthen reproducibility.
[Main theorems] Notation: the distinction between the joint upper bound (conditional) and the individually tight lower bounds could be restated once in the main theorem statement for emphasis.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the accurate and positive summary of the manuscript together with the recommendation of minor revision. No specific major comments appear in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central bounds are derived from standard concentration and minimax arguments applied to externally supplied segmentation into regimes separated by Δ. Recognition cost decouples as O(log K / Δ²) independent of D while estimation remains Θ(D / ε²); the amortized expressions and lower bounds follow directly without reducing to fitted parameters or self-referential definitions inside the paper. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear. The analysis is explicitly conditional on the segmentation assumption and states that the advantage vanishes under overlap, keeping the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard probabilistic concentration inequalities to obtain the recognition and estimation costs; no free parameters are fitted to data, no new entities are postulated, and the main addition is the separation analysis itself.

axioms (1)

standard math Standard concentration inequalities suffice to bound regime recognition at O(log K / Δ²) and regime estimation at Θ(D / ε²)
Invoked to establish the two-term amortized cost and the matching lower bounds.

pith-pipeline@v0.9.1-grok · 5917 in / 1352 out tokens · 35206 ms · 2026-06-26T14:53:47.537398+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989

1989
[2]

Parisi, Ronald Kemker, Jose L

German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural Networks, 113:54–71, 2019

2019
[3]

Cambridge University Press

Nicol` o Cesa-Bianchi and G´ abor Lugosi.Prediction, Learning, and Games. Cambridge University Press. Cambridge University Press, Cambridge, 2006

2006
[4]

Hicl: Hippocampal- inspired continual learning.arXiv preprint arXiv:2508.16651, 2025

Kushal Kapoor, Wyatt Mackey, Yiannis Aloimonos, and Xiaomin Lin. Hicl: Hippocampal- inspired continual learning.arXiv preprint arXiv:2508.16651, 2025

work page arXiv 2025
[5]

Weinberger

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verd´ u, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution.Hewlett-Packard Labs Technical Report, HPL-2003-97R1, 2003. Technical report

2003
[6]

Tsybakov.Introduction to Nonparametric Estimation

Alexandre B. Tsybakov.Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009

2009
[7]

Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection.arXiv preprint arXiv:0710.3742, 2007

work page internal anchor Pith review Pith/arXiv arXiv 2007
[8]

Persistent memory through triple-loop consolidation in a non-gradient dissipative cognitive architecture.arXiv preprint arXiv:2603.27188, 2026

Jianwei Lou. Persistent memory through triple-loop consolidation in a non-gradient dissipative cognitive architecture.arXiv preprint arXiv:2603.27188, 2026. Under review, Frontiers in Computational Neuroscience. 11

work page arXiv 2026
[9]

John Wiley & Sons, New York, 1947

Abraham Wald.Sequential Analysis. John Wiley & Sons, New York, 1947

1947
[10]

Dragalin, Alexander G

Vladimir P. Dragalin, Alexander G. Tartakovsky, and Venugopal V. Veeravalli. Multihypothesis sequential probability ratio tests — Part I: Asymptotic optimality.IEEE Transactions on Information Theory, 45(7):2448–2461, 1999

1999
[11]

A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, 23(4):493–507, 1952

Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, 23(4):493–507, 1952

1952
[12]

Estimation des densit´ es: risque minimax.Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 47:119–137, 1979

Jean Bretagnolle and Catherine Huber. Estimation des densit´ es: risque minimax.Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 47:119–137, 1979

1979
[13]

Edgar N. Gilbert. A comparison of signalling alphabets.The Bell System Technical Journal, 31(3):504–522, 1952

1952
[14]

Varshamov

Rom R. Varshamov. Estimate of the number of signals in error correcting codes.Doklady Akademii Nauk SSSR, 117:739–741, 1957

1957
[15]

Kabatiansky and Vladimir I

Grigorii A. Kabatiansky and Vladimir I. Levenshtein. Bounds for packings on a sphere and in space.Problems of Information Transmission, 14(1):1–17, 1978

1978
[16]

Beal, Zoubin Ghahramani, and Carl Edward Rasmussen

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen. The infinite hidden Markov model. InAdvances in Neural Information Processing Systems, volume 14, pages 577–584, 2002

2002
[17]

Fox, Erik B

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. A sticky HDP-HMM with application to speaker diarization.The Annals of Applied Statistics, 5(2A):1020–1056, 2011

2011
[18]

The infinite gaussian mixture model.Advances in Neural Information Processing Systems, 12, 2000

Carl Edward Rasmussen. The infinite gaussian mixture model.Advances in Neural Information Processing Systems, 12, 2000. In NIPS 12

2000
[19]

Procedures for reacting to a change in distribution.The Annals of Mathematical Statistics, 42(6):1897–1908, 1971

Gary Lorden. Procedures for reacting to a change in distribution.The Annals of Mathematical Statistics, 42(6):1897–1908, 1971

1908
[20]

Moustakides

George V. Moustakides. Quickest detection of abrupt changes for a class of random processes. IEEE Transactions on Information Theory, 44(5):1965–1968, 1998

1965
[21]

Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, An- drei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic for- getting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3...

2017
[22]

Gradient episodic memory for continual learning

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. InAdvances in Neural Information Processing Systems, volume 30, pages 6467–6476, 2017

2017
[23]

Standardized and reproducible measurement of decision- making in mice.eLife, 10:e63711, 2021

The International Brain Laboratory. Standardized and reproducible measurement of decision- making in mice.eLife, 10:e63711, 2021

2021
[24]

A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951. 12

1951
[25]

McClelland, Bruce L

James L. McClelland, Bruce L. McNaughton, and Randall C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory.Psychological Review, 102(3):419– 457, 1995

1995
[26]

McClelland

Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated.Trends in Cognitive Sciences, 20(7):512–534, 2016

2016
[27]

Michael James McCulloch. Self-evidencing through hierarchical gradient decomposition: A dissipative system that maintains non-equilibrium steady-state by minimizing variational free energy.arXiv preprint arXiv:2510.17916, 2025

work page arXiv 2025
[28]

Equilibrium propagation: Bridging the gap between energy-based models and backpropagation.Frontiers in Computational Neuroscience, 11:24, 2017

Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation.Frontiers in Computational Neuroscience, 11:24, 2017

2017
[29]

A model of inductive bias learning.Journal of Artificial Intelligence Research, 12:149–198, 2000

Jonathan Baxter. A model of inductive bias learning.Journal of Artificial Intelligence Research, 12:149–198, 2000

2000
[30]

The benefit of multitask representation learning.Journal of Machine Learning Research, 17(81):1–32, 2016

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning.Journal of Machine Learning Research, 17(81):1–32, 2016

2016
[31]

Cl´ ement L. Canonne. A survey on distribution testing: Your data is big. but is it blue?Theory of Computing, (9):1–100, 2020

2020
[32]

Sequential algorithms for testing identity and closeness of distributions.arXiv preprint arXiv:2205.06069, 2022

Omar Fawzi, Nicolas Flammarion, Aur´ elien Garivier, and Aadil Oufkir. Sequential algorithms for testing identity and closeness of distributions.arXiv preprint arXiv:2205.06069, 2022

work page arXiv 2022
[33]

Information bounds and quick detection of parameter changes in stochastic systems.IEEE Transactions on Information Theory, 44(7):2917–2929, 1998

Tze Leung Lai. Information bounds and quick detection of parameter changes in stochastic systems.IEEE Transactions on Information Theory, 44(7):2917–2929, 1998

1998
[34]

Upper Bounds on the Relative Entropy and R\'enyi Divergence as a Function of Total Variation Distance for Finite Alphabets

Igal Sason. An improved reverse pinsker inequality for probability distributions on a finite set. arXiv preprint arXiv:1503.03417, 2015. A Proofs Proof of Lemma 1. Writebθn for the empirical histogram from n i.i.d. draws from θ∈ ∆D−1. Since TV(bθn, θ) = 1 2 ∥bθn −θ∥ 1, the Weissmanℓ 1 tail bound [5] gives Pr θ TV(bθn, θ)> ε = Pr θ ∥bθn −θ∥ 1 >2ε ≤(2 D −2)...

work page internal anchor Pith review Pith/arXiv arXiv 2015
[35]

, K}isF τ-measurable

bJ∈ {1, . . . , K}isF τ-measurable
[36]

, qK}lies in the simplex interior and satisfies min i̸=j TV(qi, qj)≥∆

the library{q 1, . . . , qK}lies in the simplex interior and satisfies min i̸=j TV(qi, qj)≥∆
[37]

recognition

the maximal error is controlled uniformly: sup j Pj(bJ̸=j)≤δ, for some fixedδ <1/2. The theorem concerns inf (τ, bJ)∈S(K,∆,δ) sup 1≤j≤K Ejτ. Step 1: the1/∆ 2 factor.Take two Bernoulli laws q0 = ( 1 2 + ∆, 1 2 −∆), q 1 = ( 1 2 −∆, 1 2 + ∆), which satisfy TV(q0, q1) = ∆ for ∆ < 1/2. For any sequential test with P0(bJ̸ = 0) ∨P 1(bJ̸ = 1) ≤δ , the standard ch...

[1] [1]

Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989

1989

[2] [2]

Parisi, Ronald Kemker, Jose L

German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural Networks, 113:54–71, 2019

2019

[3] [3]

Cambridge University Press

Nicol` o Cesa-Bianchi and G´ abor Lugosi.Prediction, Learning, and Games. Cambridge University Press. Cambridge University Press, Cambridge, 2006

2006

[4] [4]

Hicl: Hippocampal- inspired continual learning.arXiv preprint arXiv:2508.16651, 2025

Kushal Kapoor, Wyatt Mackey, Yiannis Aloimonos, and Xiaomin Lin. Hicl: Hippocampal- inspired continual learning.arXiv preprint arXiv:2508.16651, 2025

work page arXiv 2025

[5] [5]

Weinberger

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verd´ u, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution.Hewlett-Packard Labs Technical Report, HPL-2003-97R1, 2003. Technical report

2003

[6] [6]

Tsybakov.Introduction to Nonparametric Estimation

Alexandre B. Tsybakov.Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009

2009

[7] [7]

Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection.arXiv preprint arXiv:0710.3742, 2007

work page internal anchor Pith review Pith/arXiv arXiv 2007

[8] [8]

Persistent memory through triple-loop consolidation in a non-gradient dissipative cognitive architecture.arXiv preprint arXiv:2603.27188, 2026

Jianwei Lou. Persistent memory through triple-loop consolidation in a non-gradient dissipative cognitive architecture.arXiv preprint arXiv:2603.27188, 2026. Under review, Frontiers in Computational Neuroscience. 11

work page arXiv 2026

[9] [9]

John Wiley & Sons, New York, 1947

Abraham Wald.Sequential Analysis. John Wiley & Sons, New York, 1947

1947

[10] [10]

Dragalin, Alexander G

Vladimir P. Dragalin, Alexander G. Tartakovsky, and Venugopal V. Veeravalli. Multihypothesis sequential probability ratio tests — Part I: Asymptotic optimality.IEEE Transactions on Information Theory, 45(7):2448–2461, 1999

1999

[11] [11]

A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, 23(4):493–507, 1952

Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.The Annals of Mathematical Statistics, 23(4):493–507, 1952

1952

[12] [12]

Estimation des densit´ es: risque minimax.Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 47:119–137, 1979

Jean Bretagnolle and Catherine Huber. Estimation des densit´ es: risque minimax.Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 47:119–137, 1979

1979

[13] [13]

Edgar N. Gilbert. A comparison of signalling alphabets.The Bell System Technical Journal, 31(3):504–522, 1952

1952

[14] [14]

Varshamov

Rom R. Varshamov. Estimate of the number of signals in error correcting codes.Doklady Akademii Nauk SSSR, 117:739–741, 1957

1957

[15] [15]

Kabatiansky and Vladimir I

Grigorii A. Kabatiansky and Vladimir I. Levenshtein. Bounds for packings on a sphere and in space.Problems of Information Transmission, 14(1):1–17, 1978

1978

[16] [16]

Beal, Zoubin Ghahramani, and Carl Edward Rasmussen

Matthew J. Beal, Zoubin Ghahramani, and Carl Edward Rasmussen. The infinite hidden Markov model. InAdvances in Neural Information Processing Systems, volume 14, pages 577–584, 2002

2002

[17] [17]

Fox, Erik B

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. A sticky HDP-HMM with application to speaker diarization.The Annals of Applied Statistics, 5(2A):1020–1056, 2011

2011

[18] [18]

The infinite gaussian mixture model.Advances in Neural Information Processing Systems, 12, 2000

Carl Edward Rasmussen. The infinite gaussian mixture model.Advances in Neural Information Processing Systems, 12, 2000. In NIPS 12

2000

[19] [19]

Procedures for reacting to a change in distribution.The Annals of Mathematical Statistics, 42(6):1897–1908, 1971

Gary Lorden. Procedures for reacting to a change in distribution.The Annals of Mathematical Statistics, 42(6):1897–1908, 1971

1908

[20] [20]

Moustakides

George V. Moustakides. Quickest detection of abrupt changes for a class of random processes. IEEE Transactions on Information Theory, 44(5):1965–1968, 1998

1965

[21] [21]

Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, An- drei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic for- getting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3...

2017

[22] [22]

Gradient episodic memory for continual learning

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. InAdvances in Neural Information Processing Systems, volume 30, pages 6467–6476, 2017

2017

[23] [23]

Standardized and reproducible measurement of decision- making in mice.eLife, 10:e63711, 2021

The International Brain Laboratory. Standardized and reproducible measurement of decision- making in mice.eLife, 10:e63711, 2021

2021

[24] [24]

A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951. 12

1951

[25] [25]

McClelland, Bruce L

James L. McClelland, Bruce L. McNaughton, and Randall C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory.Psychological Review, 102(3):419– 457, 1995

1995

[26] [26]

McClelland

Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated.Trends in Cognitive Sciences, 20(7):512–534, 2016

2016

[27] [27]

Michael James McCulloch. Self-evidencing through hierarchical gradient decomposition: A dissipative system that maintains non-equilibrium steady-state by minimizing variational free energy.arXiv preprint arXiv:2510.17916, 2025

work page arXiv 2025

[28] [28]

Equilibrium propagation: Bridging the gap between energy-based models and backpropagation.Frontiers in Computational Neuroscience, 11:24, 2017

Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation.Frontiers in Computational Neuroscience, 11:24, 2017

2017

[29] [29]

A model of inductive bias learning.Journal of Artificial Intelligence Research, 12:149–198, 2000

Jonathan Baxter. A model of inductive bias learning.Journal of Artificial Intelligence Research, 12:149–198, 2000

2000

[30] [30]

The benefit of multitask representation learning.Journal of Machine Learning Research, 17(81):1–32, 2016

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning.Journal of Machine Learning Research, 17(81):1–32, 2016

2016

[31] [31]

Cl´ ement L. Canonne. A survey on distribution testing: Your data is big. but is it blue?Theory of Computing, (9):1–100, 2020

2020

[32] [32]

Sequential algorithms for testing identity and closeness of distributions.arXiv preprint arXiv:2205.06069, 2022

Omar Fawzi, Nicolas Flammarion, Aur´ elien Garivier, and Aadil Oufkir. Sequential algorithms for testing identity and closeness of distributions.arXiv preprint arXiv:2205.06069, 2022

work page arXiv 2022

[33] [33]

Information bounds and quick detection of parameter changes in stochastic systems.IEEE Transactions on Information Theory, 44(7):2917–2929, 1998

Tze Leung Lai. Information bounds and quick detection of parameter changes in stochastic systems.IEEE Transactions on Information Theory, 44(7):2917–2929, 1998

1998

[34] [34]

Upper Bounds on the Relative Entropy and R\'enyi Divergence as a Function of Total Variation Distance for Finite Alphabets

Igal Sason. An improved reverse pinsker inequality for probability distributions on a finite set. arXiv preprint arXiv:1503.03417, 2015. A Proofs Proof of Lemma 1. Writebθn for the empirical histogram from n i.i.d. draws from θ∈ ∆D−1. Since TV(bθn, θ) = 1 2 ∥bθn −θ∥ 1, the Weissmanℓ 1 tail bound [5] gives Pr θ TV(bθn, θ)> ε = Pr θ ∥bθn −θ∥ 1 >2ε ≤(2 D −2)...

work page internal anchor Pith review Pith/arXiv arXiv 2015

[35] [35]

, K}isF τ-measurable

bJ∈ {1, . . . , K}isF τ-measurable

[36] [36]

, qK}lies in the simplex interior and satisfies min i̸=j TV(qi, qj)≥∆

the library{q 1, . . . , qK}lies in the simplex interior and satisfies min i̸=j TV(qi, qj)≥∆

[37] [37]

recognition

the maximal error is controlled uniformly: sup j Pj(bJ̸=j)≤δ, for some fixedδ <1/2. The theorem concerns inf (τ, bJ)∈S(K,∆,δ) sup 1≤j≤K Ejτ. Step 1: the1/∆ 2 factor.Take two Bernoulli laws q0 = ( 1 2 + ∆, 1 2 −∆), q 1 = ( 1 2 −∆, 1 2 + ∆), which satisfy TV(q0, q1) = ∆ for ∆ < 1/2. For any sequential test with P0(bJ̸ = 0) ∨P 1(bJ̸ = 1) ≤δ , the standard ch...