{"total":16,"items":[{"citing_arxiv_id":"2605.22731","ref_index":14,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-21T17:03:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15961","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:54:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAE-FT uses a sparse autoencoder on pre-trained CLIP visual representations to regularize fine-tuning by penalizing changes to semantically meaningful features, aiming for robust performance on ImageNet and distribution shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13835","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:56:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and 
pseudo-feature replay to reduce forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12998","ref_index":4,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DRIFT: A Benchmark for Task-Free Continual Graph Learning with Continuous Distribution Shifts","primary_cat":"cs.LG","submitted_at":"2026-05-13T04:54:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09608","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-10T15:40:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"control signal derived from task-induced covariance geometry, and construct a shared merging metric via Bures-Wasserstein geometry [20] and Gaussian Wasserstein barycenters [52]. 3 What Governs Forgetting in Continual Post-Training? Before introducing GCWM, we first ask what makes a continual post-training step harmful. Across Qwen3 models [53] from 0.6B to 14B and four representative strategies-Seq. SFT, EWC regularization [54], FOREVER replay [35], and AIMMerging [17]-we compare forgetting with update norm, 3 SAR, gradient conflict, and our geometry conflict. 
Here, retention loss is the positive old-task drop from each task's best previous score, reported in percentage points (pp) when scaled by 100, and ρs denotes Spearman rank correlation. The analysis yields four findings: update norm is only a coarse"},{"citing_arxiv_id":"2605.09355","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning","primary_cat":"cs.LG","submitted_at":"2026-05-10T06:09:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FLAME is an MoE architecture using modality-specific routers and low-rank compression of expert knowledge to support efficient continual multimodal multi-task learning while reducing catastrophic forgetting.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the set of tasks available during pretraining is rarely exhaustive: after deployment, new clinical questions, sensor modalities, and institutional protocols continually emerge [34, 44], introducing tasks with previously unseen modality combinations that must be absorbed without retraining the model from scratch or sacrificing performance on existing tasks [32, 8, 56]. This variability is the norm rather than the exception. A straightforward solution is training separate models for each task, but this would incur substantial computational overhead, as each model requires extensive tuning, validation, and approval, and forfeits the transfer gains available when related tasks share modalities. 
Recent state-of-the-art multimodal models have demonstrated strong generalization across different"},{"citing_arxiv_id":"2605.08949","ref_index":1,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning","primary_cat":"cs.LG","submitted_at":"2026-05-09T13:42:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"effective alternative to Frobenius-norm projection for continual learning in LLMs. 1 Introduction Large-scale neural networks, including Large Language Models (LLMs), are increasingly deployed in settings that require sequential adaptation across diverse domains. A central challenge in such continual or post-training regimes is catastrophic forgetting [1, 2]: when models are fine-tuned on new tasks, their performance on previously learned tasks can degrade rapidly. This phenomenon has been extensively documented in the literature on neural networks and, more recently, in large-scale models such as large language models (LLMs). ∗Co-second authors. Preprint. 
arXiv:2605.08949v1 [cs.LG] 9 May 2026 Early empirical and theoretical studies demonstrate that standard gradient-based training does not"},{"citing_arxiv_id":"2605.07886","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Characterizing and Correcting Effective Target Shift in Online Learning","primary_cat":"stat.ML","submitted_at":"2026-05-08T15:34:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This method is also conceptually similar to reward shaping [52-54] in the reinforcement learning (RL) literature, which accelerates learning by augmenting the environmental reward. In the context of continual learning, our target correction is related to methods for mitigating catastrophic forgetting by regularizing the weights to protect prior knowledge [10, 11] or by sparsifying/orthogonalizing gradients to prevent interference [9, 55, 15]. The proposed target correction framework offers a conceptually different perspective: rather than constraining the weights or gradients directly, we modulate the target outputs to naturally align the online trajectory with the multi-task (offline) ideal. 
This is also related to Dark Experience Replay [13], in which past predictions,"},{"citing_arxiv_id":"2605.03866","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Memory-Efficient Continual Learning with CLIP Models","primary_cat":"cs.LG","submitted_at":"2026-05-05T15:27:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A per-class loss reweighting scheme based on distributional robustness allows CLIP models to perform class-incremental and domain-incremental learning with minimal memory while limiting forgetting on CIFAR-100, ImageNet1K, and DomainNet.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08143","ref_index":11,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing","primary_cat":"cs.LG","submitted_at":"2026-05-02T15:51:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HoReN is a parameter-preserving editor that wraps an MLP with a Hopfield codebook memory and scales to 50K sequential edits on ZsRE while maintaining performance above 0.93.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In International Conference on Learning Representations, 2020. [11] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521-3526, 2017. [12] Dmitry Krotov. A new frontier for hopfield networks. Nature Reviews Physics, 5:366-367, 2023. URL https://api.semanticscholar.org/CorpusID:258812300. 
[13] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115, 2017. [14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman"},{"citing_arxiv_id":"2604.24637","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks","primary_cat":"cs.LG","submitted_at":"2026-04-27T16:06:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FTN achieves near-zero forgetting on continual learning benchmarks by isolating task subnetworks via self-organizing binary masks generated through gradient descent, smoothing, and k-winner-take-all.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14259","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay","primary_cat":"q-bio.TO","submitted_at":"2026-04-15T16:08:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A structure-aware VAE generates realistic FC matrices for replay, combined with multi-level knowledge distillation and hierarchical contextual bandit sampling, to enable continual fMRI-based brain disorder diagnosis across sequentially arriving multi-site data without catastrophic forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.22479","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Efficient Continual Learning in Language Models via Thalamically Routed Cortical 
Columns","primary_cat":"cs.LG","submitted_at":"2026-02-25T23:38:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRC² is a brain-inspired decoder-only architecture that localizes fast plasticity and uses thalamic and hippocampal pathways to substantially reduce cumulative forgetting in sequential language model training on streams like C4, WikiText-103, and GSM8K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16175","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning to Discover at Test Time","primary_cat":"cs.LG","submitted_at":"2026-01-22T18:24:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.16664","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"$\\boldsymbol{\\lambda}$-Orthogonality Regularization for Compatible Representation Learning","primary_cat":"cs.LG","submitted_at":"2025-09-20T12:35:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"λ-Orthogonality regularization enables distribution-specific adaptation of representations via affine transformations while retaining original learned structures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19519","ref_index":61,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Preserve and Personalize: Personalized Text-to-Image Diffusion 
Models without Distributional Drift","primary_cat":"cs.CV","submitted_at":"2025-05-26T05:03:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes Lipschitz regularization during fine-tuning to prevent distributional drift in personalized diffusion models, improving subject fidelity and prompt adherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}