In high-dimensional continual linear regression, optimal fixed L2 regularization strength scales as T/ln T with the number of tasks and mitigates label noise for arbitrary linear teachers.
Canonical reference
Title resolution pending
Canonical reference. 90% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
representative citing papers
CARLOS employs an aggregate deep neural network trained on progressively finer time grids with adaptive sampling to learn continuous-time exercise boundaries for optimal stopping, delivering higher values than discrete Bermudan methods.
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.
Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training under adaptive evaluation.
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
Quality-aware self-distillation using soft correctness-aware gating and teacher-probability scaling improves VLM performance on GUI grounding benchmarks when both components are combined.
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
Structured text representations like CML and MolJSON outperform SMILES variants on structural tasks while IUPAC dominates semantic tasks such as molecule retrieval across all tested LLMs.
AdvCL repurposes adversarial perturbations into geometric control signals for continual learning using Intra-Smooth, Proto-Clip, and Inter-Align modules, reporting gains in performance, robustness, lower forgetting, and stronger transfer.
PROXYMIX learns a dynamic replay controller on a small proxy model and transfers it to a large target model, improving accuracy by 3.4 points and reducing forgetting by 3.5 points on LLaMA-3-8B continual tuning sequences.
Divergence Decoding steers LLM logits using small auxiliary models to unlearn specific data at inference time, outperforming baselines and generalizing to images.
2D-ProteinRAG is a dual-dimensional RAG framework that incorporates BLAST workflows plus horizontal attribute alignment and vertical homology denoising to improve protein-text QA on both in-distribution and out-of-distribution cases.
Silent collapse in recursive learning contracts internal distributions like entropy and diversity despite stable metrics, preceded by three precursors that enable the MTR monitoring framework to intervene early.
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
SLICE applies gradient surgery via projection and truncated SVD to initialize LoRA adapters, yielding better stability-plasticity trade-offs on continual learning benchmarks including adversarial task sequences.
Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
DynaMiCS uses short probing runs to build a slope matrix of cross-domain effects and solves a constrained optimization over mixture weights to improve targets while respecting performance bounds on constrained domains.
Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.
citing papers explorer
-
Physics-Informed Neural Networks for Methane Sorption: Cross-Gas Transfer Learning, Ensemble Collapse Under Physics Constraints, and Monte Carlo Dropout Uncertainty Quantification
A PINN transfer learning framework for coal methane sorption reaches R²=0.932 on held-out data with 227% improvement over classical isotherms and identifies Monte Carlo Dropout as the best uncertainty method while ensembles degrade under shared physics constraints.