In high-dimensional continual linear regression, optimal fixed L2 regularization strength scales as T/ln T with the number of tasks and mitigates label noise for arbitrary linear teachers.
Canonical reference. 90% of citing Pith papers cite this work as background.
citing papers explorer
-
Optimal L2 Regularization in High-dimensional Continual Linear Regression
In high-dimensional continual linear regression, optimal fixed L2 regularization strength scales as T/ln T with the number of tasks and mitigates label noise for arbitrary linear teachers.
-
Dynamic Bidirectional Pattern Memory: A Production-Scale Empirical Characterisation of Inference-Time Gating in Clinical NLP
Empirical study on production-scale clinical NLP shows direct learning from verifier rejections fails due to sparse data while fixed ontology and evidence-support filters succeed, with selectivity determined by matching verifier evidence.
-
Continuous-time Optimal Stopping through Deep Reinforcement Learning
CARLOS employs an aggregate deep neural network trained on progressively finer time grids with adaptive sampling to learn continuous-time exercise boundaries for optimal stopping, delivering higher values than discrete Bermudan methods.
-
EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing
EST-PRM stress-tests five process reward models (PRMs) on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
-
Continual Model Routing in Evolving Model Hubs
Formalizes continual model routing (CMR), releases CMRBench with over 2000 models, and presents CARvE which outperforms retrieval, fine-tuning and adapter-merging baselines on model/family/domain accuracy.
-
Measuring the Depth of LLM Unlearning via Activation Patching
Introduces Unlearning Depth Score (UDS) via activation patching to quantify LLM unlearning depth and claims it outperforms 20 other metrics in faithfulness and robustness on 150 models.
-
Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions
Introduces a unified benchmark for continual anomaly detection with discrete and continuous protocols plus a training-free DINOSaur method that outperforms prior CAD approaches with zero forgetting and sub-100ms edge inference.
-
PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks
PMF-CL derives Pareto-minimal-forgetting algorithms for linear/basis-function regression and quadratic-bounded losses like logistic regression, achieving static O(d²) memory for d-parameter models.
-
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
-
Language Acquisition Device in Large Language Models
Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.
-
Streaming Adversarial Robustness in Fuzzy ARTMAP: Mechanism-Aligned Evaluation, Progressive Training, and Interpretable Diagnostics
Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training under adaptive evaluation.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
-
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
-
Episodic-to-Semantic Consolidation Without Identity Drift
A deterministic episodic-to-semantic consolidation function with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.
-
Personalized Object Identification and Localization via In-Context Inference with Vision-Language Models
IPLoc-ID extends prior localization-only work to full identification and localization by using a self-posed query in VLMs to reject negative images while preserving comparable localization accuracy.
-
Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
Quality-aware self-distillation using soft correctness-aware gating and teacher-probability scaling improves VLM performance on GUI grounding benchmarks when both components are combined.
-
Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
-
Rethinking Molecular Text Representations for LLMs: An Empirical Study
Structured text representations like CML and MolJSON outperform SMILES variants on structural tasks while IUPAC dominates semantic tasks such as molecule retrieval across all tested LLMs.
-
Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment
AdvCL repurposes adversarial perturbations into geometric control signals for continual learning using Intra-Smooth, Proto-Clip, and Inter-Align modules, reporting gains in performance, robustness, lower forgetting, and stronger transfer.
-
Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning
PROXYMIX learns a dynamic replay controller on a small proxy model and transfers it to a large target model, improving accuracy by 3.4 points and reducing forgetting by 3.5 points on LLaMA-3-8B continual tuning sequences.
-
Divergence Decoding: Inference-Time Unlearning via Auxiliary Models
Divergence Decoding steers LLM logits using small auxiliary models to unlearn specific data at inference time, outperforming baselines and generalizing to images.
-
A Systematic Study of Behavioral Cloning for Scientific Data Annotation
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
-
TypedCSIP: Typed Counterfactual Pretraining for Chinese Legislative Conflict Classification
TypedCSIP applies typed counterfactual selective intervention pretraining on expert revisions to lift macro-F1 by 0.9-1.3 pp on the LCR-CN Chinese legislative conflict classification benchmark under a pre-registered multi-seed test.
-
Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems
Introduces a representation-geometry-based taxonomy for continual learning in speech and audio, identifies mismatches with current CL assumptions in foundation models, and lists open challenges.
-
Unlocking Biological Workflows for Robust Protein-Text Question Answering: A Dual-Dimensional RAG Framework
2D-ProteinRAG is a dual-dimensional RAG framework that incorporates BLAST workflows plus horizontal attribute alignment and vertical homology denoising to improve protein-text QA on both in-distribution and out-of-distribution cases.
-
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
MixSD uses dynamic mixing of the model's expert and naive conditionals to create distribution-aligned supervision that improves the memorization-retention tradeoff over standard SFT.
-
Silent Collapse in Recursive Learning Systems
Silent collapse in recursive learning contracts internal distributions like entropy and diversity despite stable metrics, preceded by three precursors that enable the MTR monitoring framework to intervene early.
-
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
-
Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning
SLICE applies gradient surgery via projection and truncated SVD to initialize LoRA adapters, yielding better stability-plasticity trade-offs on continual learning benchmarks including adversarial task sequences.
-
Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
-
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
-
DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures
DynaMiCS uses short probing runs to build a slope matrix of cross-domain effects and solves a constrained optimization over mixture weights to improve targets while respecting performance bounds on constrained domains.
-
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.
-
RareCP: Regime-Aware Retrieval for Efficient Conformal Prediction
RareCP improves interval efficiency for time series conformal prediction by retrieving and weighting regime-specific calibration examples while adapting to drift and maintaining coverage.
-
Rotation-Preserving Supervised Fine-Tuning
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
-
Adaptive Data Compression and Reconstruction for Memory-Bounded EEG Continual Learning
ADaCoRe enables memory-bounded UICL for EEG by compressing and reconstructing signals while preserving key morphologies, outperforming baselines with gains of at least +2.7 and +15.3 ACC on ISRUC and FACED datasets.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on goals-based wealth management (GBWM) problems delivers near-optimal dynamic strategies in 0.01 s, achieving 97.8% of the dynamic-programming (DP) optimal utility, and handles larger problems where DP fails.
-
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
-
EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure
EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.
-
NORACL: Neurogenesis for Oracle-free Resource-Adaptive Continual Learning
NORACL dynamically grows network capacity via neurogenesis-inspired signals to achieve oracle-level continual learning performance without pre-specifying architecture size.
-
Fine-Tuning Regimes Define Distinct Continual Learning Problems
The relative rankings of continual learning methods are not preserved across different fine-tuning regimes defined by trainable parameter depth.
-
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
-
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.
-
Physics-Informed Neural Networks for Methane Sorption: Cross-Gas Transfer Learning, Ensemble Collapse Under Physics Constraints, and Monte Carlo Dropout Uncertainty Quantification
A PINN transfer learning framework for coal methane sorption reaches R²=0.932 on held-out data with 227% improvement over classical isotherms and identifies Monte Carlo Dropout as the best uncertainty method while ensembles degrade under shared physics constraints.
-
Parameter-efficient Quantum Multi-task Learning
QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.
-
Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover
Heavy supervised fine-tuning on formal math suppresses tool-calling in Goedel-Prover-V2 from 89.4% to near 0%, but 100 Lean agentic traces restore it to 83.8% on the Berkeley Function Calling Leaderboard with in-domain gains on ProofNet.
-
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
-
Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.