Active Continual Learning with Metaplastic Binary Bayesian Neural Networks

Damien Querlioz; Djohan Bonnet; Kellian Cottart; Théo Ballet

arxiv: 2605.30198 · v1 · pith:YJUHBDGLnew · submitted 2026-05-28 · 💻 cs.LG

Pith reviewed 2026-06-29 09:01 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual learning · Bayesian neural networks · active learning · binary neural networks · metaplasticity · variational inference · online learning

The pith

A bounded-memory variational update prevents saturation in binary Bayesian neural networks, enabling buffer-free active continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that mean-field Bernoulli posteriors in binary Bayesian neural networks tend to saturate on long non-stationary data streams, losing epistemic uncertainty and plasticity. BiMU counters this with a variational objective that relaxes toward the prior in a controlled way and adjusts the step size based on uncertainty. This keeps the posterior informative, allowing the network to perform active learning by querying labels only where models disagree, all without storing past data. A reader would care because always-on edge devices need to adapt to changing conditions while using little compute and knowing when predictions are unreliable.
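To make "saturate" concrete: the Figure 7 caption on this page gives the synaptic probability as p = sigmoid(2λ), and the Figure 1 caption writes a binary weight as ω ∈ {−1, +1}. Assuming only that parameterization (nothing else about the paper's derivation is used here), the mean and variance of a single weight follow from standard identities:

```latex
p = P(\omega = +1) = \sigma(2\lambda), \qquad
\mathbb{E}[\omega] = 2p - 1 = \tanh(\lambda), \qquad
\operatorname{Var}[\omega] = 1 - \tanh^{2}(\lambda)
\;\xrightarrow{\;|\lambda| \to \infty\;}\; 0 .
```

As |λ| grows, each weight becomes effectively deterministic, so networks sampled from the posterior stop disagreeing. That is the loss of epistemic uncertainty the paragraph above refers to.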

Core claim

BiMU combines a data term with controlled relaxation toward the prior and an uncertainty-dependent step size that prevents saturation and sustains informative uncertainty. This non-degenerate posterior enables fully online, buffer-free active querying via Monte Carlo disagreement, reducing label queries and backpropagation updates under imbalance. BiMU sustains learning and strong OOD detection on 1000-task Permuted-MNIST, and on OpenLORIS-Object achieves up to 32× label/update savings at matched accuracy under class imbalance and feature compression.
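A minimal sketch of what "active querying via Monte Carlo disagreement" can look like, assuming the variation-ratio (VR) thresholding with K = 10 sampled predictors mentioned in the Figure 5 caption. The single linear layer, the function names, and the threshold value of 0.2 are illustrative assumptions, not the paper's architecture or settings.

```python
import math
import random
from collections import Counter

def p_plus(lam):
    """P(w = +1) for a weight in {-1, +1}; equals sigmoid(2 * lam)."""
    return 0.5 * (1.0 + math.tanh(lam))

def sample_weights(lam_matrix, rng):
    """Sample one binary weight matrix from independent Bernoulli synapses."""
    return [[1.0 if rng.random() < p_plus(lam) else -1.0 for lam in row]
            for row in lam_matrix]

def predict(x, weights):
    """Index of the class with the largest score; weights[c] is the row for class c."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def variation_ratio(x, lam_matrix, k=10, rng=None):
    """1 - (votes for the most common class) / K over K sampled predictors."""
    if rng is None:
        rng = random.Random()
    votes = Counter(predict(x, sample_weights(lam_matrix, rng)) for _ in range(k))
    return 1.0 - votes.most_common(1)[0][1] / k

def should_query(x, lam_matrix, threshold=0.2, k=10, rng=None):
    """Request a label (and update) only when sampled predictors disagree enough.

    The threshold is a hypothetical value chosen for illustration.
    """
    return variation_ratio(x, lam_matrix, k=k, rng=rng) > threshold
```

With a saturated posterior every sampled network is identical, so the variation ratio is zero and nothing is ever queried. This is why the querying rule depends on the posterior staying non-degenerate.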

What carries the argument

BiMU, the metaplastic update derived from a bounded-memory variational objective that uses an uncertainty-dependent step size to maintain non-degenerate posteriors in binary Bayesian neural networks.

If this is right

Learning continues across 1000 sequential tasks on Permuted-MNIST without performance collapse.
Active querying reduces the number of required labels and updates by factors up to 32 while matching accuracy.
The method works under class imbalance and when features are compressed.
Out-of-distribution detection remains strong without extra mechanisms.
Training stays fully online and buffer-free.
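The "up to 32" factor in the list above can be checked against the Figure 5 caption on this page, which reports updates on only 3.1% of the stream. The helper below is a hypothetical name; it simply inverts the queried fraction.

```python
def savings_factor(queried_fraction):
    """How many times fewer labels/updates than updating on every sample."""
    if not 0.0 < queried_fraction <= 1.0:
        raise ValueError("queried_fraction must be in (0, 1]")
    return 1.0 / queried_fraction

# 3.1% of the stream, as stated in the Figure 5 caption.
factor = savings_factor(0.031)
print(f"{factor:.1f}x fewer labels and updates")
```

The result is about 32.3, which the page rounds to 32×.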

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the same relaxation principle applies to other posterior approximations, it could extend to larger-scale continual learning problems.
Reducing backpropagation updates this way may lower energy consumption on resource-constrained devices.
Testing on streams longer than 1000 tasks or with different imbalance levels would further validate the approach.
The disagreement-based querying could combine with other selection strategies for even greater efficiency.

Load-bearing premise

Mean-field Bernoulli posteriors saturate and lose plasticity on long non-stationary streams unless balanced by controlled relaxation and uncertainty-dependent steps.
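The premise can be illustrated with a toy model of a single synapse. This is not the BiMU update: it uses a constant step size, omits the uncertainty-dependent step, and the leak term with memory window N is an assumed stand-in for "controlled relaxation toward the prior". It only shows why some relaxation keeps the natural parameter λ bounded under a persistent gradient.

```python
def run_stream(steps, eta, grad, memory_window=None, lam0=0.0, lam_prior=0.0):
    """Iterate a toy update on one synapse's natural parameter lam.

    memory_window=None : lam <- lam - eta * grad              (no relaxation)
    memory_window=N    : lam <- lam + (lam_prior - lam) / N - eta * grad
    Both rules are illustrative assumptions, not the paper's equations.
    """
    lam = lam0
    for _ in range(steps):
        if memory_window is not None:
            lam += (lam_prior - lam) / memory_window  # leak toward the prior
        lam -= eta * grad                             # data-driven step
    return lam

# A constant gradient of -1 keeps pushing the weight toward +1.
free = run_stream(10_000, eta=0.01, grad=-1.0)
leaky = run_stream(10_000, eta=0.01, grad=-1.0, memory_window=50)
print(free, leaky)
```

Without the leak, λ grows linearly with the stream length and p = sigmoid(2λ) is driven to 1. With the leak, λ settles at λ_prior − N·η·g, here 0.5, so p stays near 0.73. More generally, for gradients bounded by G the leaky rule keeps |λ − λ_prior| ≤ N·η·G once it starts inside that range.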

What would settle it

An experiment showing that BiMU posteriors still saturate or that active learning performance drops without a replay buffer on the 1000-task Permuted-MNIST benchmark would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.30198 by Damien Querlioz, Djohan Bonnet, Kellian Cottart, Théo Ballet.

**Figure 1.** Schematic of the BiMU update. The next variational state is shaped jointly by the current loss (plasticity), the previous posterior (stability), and the prior (forgetting). Bars show Bernoulli probabilities for ω ∈ {−1, +1}. Eqs. (1)-(3) constitute the background that allows deriving BiMU. Combining the closed-form Bernoulli KL (Prop. A.3) with a second-order expansion around λ_{t−1} yields a per-synapse upd…

**Figure 2.** Three regimes emerge: (i) Uncertain synapses (λ^(i) ≈ 0): updates are conservatively bounded. (ii) Consolidation (λ^(i) g^(i) < 0): consistent gradients reinforce the current weight sign and η approaches its upper bound, enabling fast consolidation. (iii) De-consolidation (λ^(i) g^(i) > 0): gradients opposing the current sign reduce |λ^(i)|, but η shrinks, making sign changes difficult unless such gradie…

**Figure 3.** Final test accuracy as a function of the queried-label fraction, evaluated on a balanced test set (50 images per class). Selective querying consistently improves over random querying at matched budgets, indicating that one-pass active learning is effective in this low-label regime. The horizontal line shows the 100% update baseline under the same one-pass protocol (i.e., always querying/updating on every…

**Figure 4.** Comparison of VR thresholding across Bayesian methods; for the deterministic STE baseline we use aleatoric uncertainty for querying. In this stationary setting, inherent class confusability makes entropy-based querying already useful, but posterior sampling further improves the accuracy-label trade-off when the disagreement signal is well calibrated. Among Bayesian baselines, MESU is sensitive to feature scali…

**Figure 5.** Accuracy versus queried-label fraction (overall, low-frequency, and high-frequency classes). With VR thresholding (computed with K = 10 posterior-sampled predictors unless stated otherwise), BiMU reaches 88.70% while updating on only 3.1% of the stream, corresponding to a 32× reduction in labeled samples and gradient updates relative to the 100% update baseline (87.76%, equivalent to training on t…

**Figure 6.** OpenLORIS-Object active continual learning (8,192 frozen VGG19 features): accuracy vs. queried-label fraction for BiMU, MESU, and BayesBiNN (VR thresholding) and STE (aleatoric thresholding). Results averaged over five runs.

**Figure 7.** Histograms of the synaptic probabilities p = sigmoid(2λ) at the end of training for different memory windows N and BayesBiNN in 100-task Permuted-MNIST (OCL; 100-hidden-unit MLP). Small N shows broad distributions indicative of forgetting-dominated dynamics, intermediate N yields moderate peaks reflecting a balanced stability-plasticity-forgetting trade-off, and very large N exhibits sharp peaks near 0 an…

**Figure 8.** Performance of BayesBiNN on the Animals dataset. All acquisition strategies perform below random sampling, indicating that the uncertainty estimates are not sufficiently informative to support effective active learning. Unlike settings with a very large number of tasks such as Permuted MNIST, where weight divergence naturally emerges, the Animals dataset contains relatively few training example…

**Figure 9.** Animals dataset under imbalanced active learning using MESU. Performance as a function of the queried-label fraction, illustrating uncertainty-based acquisition with real-valued Bayesian Gaussian weights. Panels: Total, Low frequency, High frequency; x-axis: Data used for training (%); legend: Aleatoric, Epistemic, Predictive, Random, VR, VR-True.

**Figure 10.** Animals dataset under imbalanced active learning using MESU with feature standardisation. Standardisation improves stability and performance across labeling budgets.

**Figure 11.** Animals dataset under imbalanced active learning using STE. Accuracy as a function of the queried-label fraction when acquisition is driven by aleatoric uncertainty in the binary-weight setup. Overall, these complementary figures highlight qualitative differences in acquisition dynamics that are not fully captured by summary metrics alone, especially for BayesBiNN, which, while having reliable OOD detection …

**Figure 12.** Results for BayesBiNN on the OpenLORIS dataset in the imbalanced continual active learning setting. As in the Animals setting, all active learning strategies perform below random acquisition, highlighting the limited usefulness of the uncertainty estimates produced by the model. Although OpenLORIS contains a larger total number of training examples (442,194 images), this corresponds to only appro…

**Figure 13.** OpenLORIS-Object (OCL; linear head) with 8,192-dimensional VGG19 features under imbalanced continual active learning using MESU. Performance under uncertainty-driven querying with real-valued Bayesian weights. Panels: Low frequency, High frequency, Total; x-axis: Data used for training (%); legend: Aleatoric, Epistemic, Predictive, Random, VR, VR-True.

**Figure 14.** OpenLORIS-Object (OCL; linear head) with 8,192-dimensional VGG19 features under imbalanced continual active learning using MESU with feature standardisation. Standardisation improves robustness under continual distribution shifts.

**Figure 15.** Results obtained with STE. STE’s performance degrades more noticeably under successive task shifts, yielding poor aleatoric estimates that fail to single out informative examples to learn from. Panels: Total, Low frequency, High frequency; x-axis: Data used for training (%); legend: Aleatoric, Random.
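The Figure 1 caption refers to a closed-form Bernoulli KL (Prop. A.3) and a second-order expansion, but the equations themselves are not reproduced on this page. As a hedged reference point, the sketch below uses the standard KL divergence between two distributions over ω ∈ {−1, +1} with q(ω) proportional to exp(λω), which matches p = sigmoid(2λ) from the Figure 7 caption. Whether Prop. A.3 is written in exactly this form, and in which argument order, is an assumption.

```python
import math

def kl_natural(lam_q, lam_p):
    """KL(q || p) for w in {-1, +1} with q(w) proportional to exp(lam_q * w)."""
    return ((lam_q - lam_p) * math.tanh(lam_q)
            - math.log(math.cosh(lam_q)) + math.log(math.cosh(lam_p)))

def kl_probability(lam_q, lam_p):
    """Same KL computed from P(w = +1) = sigmoid(2 * lam), as a cross-check."""
    q = 0.5 * (1.0 + math.tanh(lam_q))
    p = 0.5 * (1.0 + math.tanh(lam_p))
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_quadratic(lam, delta):
    """Second-order approximation of KL(q_{lam+delta} || q_lam).

    The curvature 1 - tanh(lam)**2 is the variance of w, i.e. the Fisher
    information of the natural parameter.
    """
    return 0.5 * (1.0 - math.tanh(lam) ** 2) * delta ** 2
```

The curvature term vanishes as |λ| grows, so a quadratic penalty of this kind becomes weak exactly for the most confident synapses. How the paper turns this into its step-size rule cannot be read off the truncated caption.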

Original abstract

Always-on edge systems must keep learning as conditions change under tight compute budgets and must detect unreliable predictions. Bayesian binary neural networks are attractive in this setting, but mean-field Bernoulli posteriors can saturate on long non-stationary streams, wiping out epistemic uncertainty and freezing plasticity. We propose BiMU, derived from a bounded-memory variational objective that balances stability, plasticity, and forgetting. BiMU combines a data term with controlled relaxation toward the prior and an uncertainty-dependent step size that prevents saturation and sustains informative uncertainty. This non-degenerate posterior enables fully online, buffer-free active querying via Monte Carlo disagreement, reducing label queries and backpropagation updates under imbalance. BiMU sustains learning and strong OOD detection on 1000-tasks Permuted-MNIST, and on OpenLORIS-Object achieves up to 32× label/update savings at matched accuracy under class imbalance and feature compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BiMU adds a bounded-memory variational term and uncertainty-dependent step size to binary Bayesian nets to avoid saturation on long streams and enable buffer-free active learning, but the abstract gives no equations or analysis to check if the mechanism actually works.

The paper's core proposal is BiMU, which combines a data term with controlled relaxation to the prior and an uncertainty-modulated learning rate inside a mean-field Bernoulli variational setup. This is meant to keep posteriors from collapsing to 0/1 on non-stationary streams so that Monte Carlo disagreement can drive active label queries without any replay buffer.

What stands out is the concrete target: 1000-task Permuted-MNIST plus OpenLORIS-Object under class imbalance and feature compression, with reported 32× reductions in labels and updates at matched accuracy plus retained OOD detection. That is a practical angle for edge continual learning.

The soft spot is the central mechanism itself. The abstract asserts that the step-size rule prevents saturation and sustains informative uncertainty, yet supplies no derivation, no ablation on the relaxation strength or scaling factor, and no analysis of posterior degeneracy after hundreds of tasks. Without those, it is impossible to tell whether the claimed non-degenerate posterior is an outcome of the method or an assumption that happens to hold on the chosen benchmarks. The mean-field Bernoulli saturation problem is real in the literature, but the fix here remains a black box from the given text.

This is for people already working on resource-constrained continual learning who need a drop-in idea for active querying. It is not a foundational rethinking of variational continual learning.

I would send it to review so the equations and long-stream diagnostics can be checked; the experimental claims are specific enough to be worth referee time even if the central mechanism needs tightening.

Referee Report

2 major / 0 minor

Summary. The paper proposes BiMU, a method for active continual learning using metaplastic binary Bayesian neural networks. It introduces a bounded-memory variational objective that combines a data term with controlled relaxation toward the prior and an uncertainty-dependent step size to prevent saturation of mean-field Bernoulli posteriors on long non-stationary streams. This sustains informative epistemic uncertainty, enabling fully online, buffer-free active querying via Monte Carlo disagreement. The approach is claimed to reduce label queries and backpropagation updates under class imbalance, with empirical results showing sustained performance and OOD detection on 1000-task Permuted-MNIST and up to 32× savings on OpenLORIS-Object at matched accuracy.

Significance. If the core mechanism holds, the result would be significant for resource-constrained continual learning on edge devices, addressing the challenge of maintaining plasticity and uncertainty without replay buffers or large memory in non-stationary settings. The combination of active querying with binary BNNs could enable efficient adaptation and reliable OOD detection under tight compute budgets.

major comments (2)

[Abstract] The central claim that the uncertainty-dependent step size in the bounded-memory variational objective prevents saturation (posterior means approaching 0 or 1) over 1000-task streams is not supported by any derivation, analysis, or ablation in the provided text. Without showing that the effective learning rate remains non-degenerate under class imbalance or feature compression, the non-degenerate posterior required for MC-disagreement active querying does not follow.
No equations, variational objective derivation, or step-size rule are supplied, making it impossible to assess whether the claimed balance of stability, plasticity, and forgetting is achieved by construction or requires additional fitted hyperparameters beyond the two free parameters noted in the axiom ledger.
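The first comment asks, in effect, for a long-stream degeneracy diagnostic. A minimal sketch of one such check is given below, assuming access to the per-synapse natural parameters λ and the mapping p = sigmoid(2λ) from the Figure 7 caption. The function names and the tolerance eps are hypothetical choices, not something the paper specifies.

```python
import math

def bernoulli_entropy_bits(p):
    """Entropy of a Bernoulli(p) variable in bits; 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def saturation_report(lams, eps=1e-3):
    """Summarise how degenerate a set of synaptic posteriors is.

    lams : iterable of natural parameters, one per synapse
    eps  : a synapse counts as saturated if p < eps or p > 1 - eps
    """
    probs = [0.5 * (1.0 + math.tanh(lam)) for lam in lams]  # sigmoid(2 * lam)
    saturated = sum(1 for p in probs if p < eps or p > 1.0 - eps)
    mean_entropy = sum(bernoulli_entropy_bits(p) for p in probs) / len(probs)
    return {
        "saturated_fraction": saturated / len(probs),
        "mean_entropy_bits": mean_entropy,
    }
```

Logging these two numbers after each task on the 1000-task stream would show directly whether the posterior drifts toward 0/1, which is the evidence the referee finds missing.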

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and agree that the manuscript requires revisions to provide the missing derivation, equations, and analysis.

Referee: [Abstract] The central claim that the uncertainty-dependent step size in the bounded-memory variational objective prevents saturation (posterior means approaching 0 or 1) over 1000-task streams is not supported by any derivation, analysis, or ablation in the provided text. Without showing that the effective learning rate remains non-degenerate under class imbalance or feature compression, the non-degenerate posterior required for MC-disagreement active querying does not follow.

Authors: We agree that the abstract and current text do not supply the requested derivation, analysis, or ablation. In revision we will add a new subsection deriving the bounded-memory variational objective, explicitly defining the uncertainty-dependent step size, and providing analysis (including effective learning-rate bounds) plus an ablation on class imbalance and feature compression to demonstrate that the posterior remains non-degenerate over long streams.
Revision: yes

Referee: [—] No equations, variational objective derivation, or step-size rule are supplied, making it impossible to assess whether the claimed balance of stability, plasticity, and forgetting is achieved by construction or requires additional fitted hyperparameters beyond the two free parameters noted in the axiom ledger.

Authors: We acknowledge that the manuscript as provided does not include the equations or step-by-step derivation. We will revise by inserting the full variational objective (data term plus controlled prior relaxation), the precise uncertainty-dependent step-size rule, and a short proof that the balance is achieved with only the two stated hyperparameters. Pseudocode and a clarifying paragraph will also be added.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation presented as independent variational construction

The provided abstract and context introduce BiMU as derived from a bounded-memory variational objective that explicitly combines a data term, controlled prior relaxation, and an uncertainty-dependent step size chosen to prevent saturation. No equations, self-citations, or uniqueness theorems are quoted that would reduce the non-degenerate posterior property or the active-querying performance to a fitted parameter or prior self-referential definition by construction. The claims about sustained uncertainty and label savings are framed as consequences of the proposed objective rather than inputs renamed as outputs. Absent any load-bearing self-citation chain or ansatz smuggled via citation in the given text, the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; the central addition is a new variational objective whose internal parameters and assumptions cannot be audited without the full derivation.

free parameters (2)

relaxation strength toward prior
Controlled relaxation is part of the bounded-memory objective and is expected to be a tunable hyperparameter.
uncertainty-dependent step-size scaling
The step-size rule is introduced to prevent saturation and is therefore likely fitted or chosen per dataset.

axioms (1)

domain assumption: Mean-field Bernoulli posteriors saturate on long non-stationary streams
Stated directly as the motivating failure mode of existing binary Bayesian networks.

pith-pipeline@v0.9.1-grok · 5685 in / 1382 out tokens · 38133 ms · 2026-06-29T09:01:55.457672+00:00 · methodology
