{"total":15,"items":[{"citing_arxiv_id":"2606.23276","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Exposing the Illusion of Erasure in Knowledge Editing for LLMs","primary_cat":"cs.LG","submitted_at":"2026-06-22T12:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Knowledge editing methods redistribute and suppress rather than overwrite facts in LLMs, creating narrow vulnerable regions in representation space that adversarial prompts can exploit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03935","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent","primary_cat":"cs.NE","submitted_at":"2026-06-02T17:26:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"QIF neurons outperform LIF neurons in spike-based gradient descent training of spiking neural networks by avoiding discontinuities that fragment the loss landscape.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16134","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Navigating Potholes with Geometry-Aware Sharpness Minimization","primary_cat":"cs.LG","submitted_at":"2026-05-15T16:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLQR+SAM pairs a slow learned geometry preconditioner with fast SAM perturbations to amplify escape from locally sharp 'potholes' while stabilizing flat basins, producing consistent gains over SAM and LLQR 
alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14200","ref_index":134,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:32:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13143","ref_index":14,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"On the Generalization of Knowledge Distillation: An Information-Theoretic View","primary_cat":"cs.IT","submitted_at":"2026-05-13T08:10:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05436","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Estimating Implicit Regularization in Deep Learning","primary_cat":"stat.ML","submitted_at":"2026-05-06T20:52:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and 
dropout in neural networks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1c9ac0159c94d8d0cbedc973445af2da-Abstract.html. 10 [15] David P. Helmbold and Philip M. Long. On the inductive bias of dropout. J. Mach. Learn. Res., 16(1):3403-3454, January 2015. ISSN 1532-4435. [16] Sepp Hochreiter and Jürgen Schmidhuber. Flat Minima. Neural Computation, 9(1):1-42, January 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.1.1. URL https://doi.org/10.1162/neco.1997.9.1.1. [17] Alexander Immer, Lucas Torroba Hennigen, Vincent Fortuin, and Ryan Cotterell. Probing as Quantifying Inductive Bias. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1839-1851, Dublin, Ireland, May 2022. Asso-"},{"citing_arxiv_id":"2605.05341","ref_index":19,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Feature Starvation as Geometric Instability in Sparse Autoencoders","primary_cat":"cs.LG","submitted_at":"2026-05-06T18:11:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, ICML, 2010. URL https://icml.cc/Conferences/2010/papers/449.pdf. [18] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006. URL https://doi.org/10.1126/science.1127647. [19] S. Hochreiter and J. 
Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997. URL https://doi.org/10.1162/neco.1997.9.1.1. [20] A. J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4):263-265, 1952. URL https://nvlpubs.nist.gov/nistpubs/jres/049/4/v49.n04.a05.pdf."},{"citing_arxiv_id":"2605.02105","ref_index":52,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting","primary_cat":"cs.LG","submitted_at":"2026-05-04T00:02:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00939","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity","primary_cat":"cs.LG","submitted_at":"2026-05-01T04:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lemma 3.3 (First-order Gradient Sensitivity Approximation). Let f_θ be a deep neural network decomposed into a body Φ and a final block T_last. 
For a small input embedding perturbation δ, there exists an induced parameter perturbation ν_δ in the final block such that the shift in the loss landscape is locally invariant: L(θ∗, E+δ, ŷ) ≈ L(θ∗+ν_δ, E, ŷ). (2) This equivalence holds up to first-order Taylor approximation around the current activation point and is not claimed to be exact beyond local neighborhoods. Proof. We utilize the Input-Output Jacobian of the network body to map embedding noise to hidden state noise. We then construct a rank-1 update to the final block's weights that produces an identical shift in the pre-activation of the output."},{"citing_arxiv_id":"2604.25150","ref_index":20,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Role of Symmetry in Optimizing Overparameterized Networks","primary_cat":"cs.LG","submitted_at":"2026-04-28T02:53:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Overparameterization adds symmetries that precondition the Hessian for better minima and increase the probability mass of global minima near typical initializations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15167","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence","primary_cat":"cs.LG","submitted_at":"2026-04-16T15:46:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate 
decay.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10202","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks","primary_cat":"cs.LG","submitted_at":"2026-04-11T13:16:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3: Tokyo City University (Japan) studies, it has been posited that the sharpness of a critical point is closely related to generalization performance [4], [5]. Specifically, flat minima have often been associated with better generalization, whereas sharp minima are linked to poorer generalization performance, as the loss is more sensitive to small parameter perturbations [6]. This sharpness is characterized by the quadratic term of the Taylor expansion of the loss function around a critical point [7]. Since the eigenvalues of the Hessian represent the curvature of the loss surface, they serve as a metric for sharpness [8]. 
In particular, the maximum eigenvalue λ_1 is employed as a key indicator of the curvature of the loss landscape [9]."},{"citing_arxiv_id":"2604.02653","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability","primary_cat":"cs.LG","submitted_at":"2026-04-03T02:35:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"For losses with product-stable minima, gradient descent on l(xy) converges provably at the edge of stability, with bifurcation diagrams characterizing the resulting stable oscillations and sharpness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05209","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Are Flat Minima an Illusion?","primary_cat":"cs.LG","submitted_at":"2026-03-24T06:14:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"doi: 10.1145/307400.307435. URL https://doi.org/10.1145/307400.307435. Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. In Advances in Neural Information Processing Systems, volume 34, pages 18420-18432, 2021. Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465-471, 1978. doi: 10.1016/0005-1098(78)90005-5. URL https://doi.org/10.1016/0005-1098(78)90005-5. Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. 
Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using pac-bayesian analysis. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of"},{"citing_arxiv_id":"2505.13196","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Physics-Inspired Optimizer: Velocity Regularized Adam","primary_cat":"cs.LG","submitted_at":"2025-05-19T14:51:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}