{"total":14,"items":[{"citing_arxiv_id":"2605.13473","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-13T12:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"this by introducing a diagonal preconditioner that dynamically scales the update directions. Right: The OSDN computation flow is decoupled into two phases: a lightweight preconditioner update (Phase 1) followed by the primary state update (Phase 2). This decoupled design strictly preserves the efficiency of hardware-friendly chunkwise parallelization. chunkwise WY parallelisation [63] and gated variants [61, 55], DeltaNet-style models close much of the recall gap. 
The optimisation view, however, exposes one structural choice that has remained untouched: the learning rate βt is a single scalar, applied uniformly to every key dimension - the recurrent counterpart of vanilla SGD, forgoing the by-now-standard role of diagonal preconditioning in adaptive optimisation [17, 28, 21]."},{"citing_arxiv_id":"2605.08587","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kaczmarz Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-09T01:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"recur, repeated writes can redundantly accumulate related information instead of editing the existing association. Delta-rule models address this redundancy by writing the residual between the target value and the value predicted by the current state. DeltaNet therefore overwrites a specific key direction instead of only adding or decaying information [44]. To control the residual-write magnitude more finely, Gated DeltaNet (GDN) introduces a learned scalar that balances forgetting and updating and improves long-context retrieval [45]. However, this scalar is primarily an empirical design choice rather than a theoretically specified step size. 
An inappropriate coefficient can over-correct large-norm keys,"},{"citing_arxiv_id":"2605.05838","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MDN: Parallelizing Stepwise Momentum for Delta Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21100","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences","primary_cat":"cs.LG","submitted_at":"2026-04-22T21:38:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08542","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video 
sequences.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reconstructing high-quality kilometer-scale 3D scenes from RGB-only sequences. Building upon VGGT [78]'s strong visual geometry reasoning capability, we address the loss of global information inherent in chunk-wise processing [17] by introducing a neural global context representation for sequence-level information aggregation. Inspired by recent advances in subquadratic sequence modeling [79, 100], we realize this representation with a set of online-adapted, lightweight sub-networks that efficiently aggregate long-range context during inference via self-supervised objectives. The resulting neural global context representation offers strong expressive capacity to compactly encode and preserve extensive context, effectively mitigating the long-range de-"},{"citing_arxiv_id":"2604.07350","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast Spatial Memory with Elastic Test-Time Training","primary_cat":"cs.CV","submitted_at":"2026-04-08T17:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06169","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-Place Test-Time Training","primary_cat":"cs.LG","submitted_at":"2026-04-07T17:59:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and 
chunk-wise updates, enabling better long-context performance as a drop-in enhancement.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[53] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021. [54] Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. arXiv preprint arXiv:2501.12352, 2025. [55] Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, and Julian McAuley. Memoryllm: Towards self-updatable large language models, 2024. URL https://arxiv.org/abs/2402.04624. [56] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and"},{"citing_arxiv_id":"2603.04385","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training","primary_cat":"cs.CV","submitted_at":"2026-03-04T18:49:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.21204","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Training with KV Binding Is Secretly Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-02-24T18:59:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Test-time training with KV binding reduces to learned linear 
attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.22766","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparse Attention as Compact Kernel Regression","primary_cat":"cs.LG","submitted_at":"2026-01-30T09:45:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Sparse attention arises from compact kernel regression, with Epanechnikov and similar kernels mapping to normalized ReLU, sparsemax, and alpha-entmax attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21016","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression","primary_cat":"cs.LG","submitted_at":"2025-11-26T03:26:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26645","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TTT3R: 3D Reconstruction as Test-Time Training","primary_cat":"cs.CV","submitted_at":"2025-09-30T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose 
estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22630","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StateX: Enhancing RNN Recall via Post-training State Expansion","primary_cat":"cs.CL","submitted_at":"2025-09-26T17:55:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StateX post-trains RNNs to expand recurrent state size, improving recall and in-context learning with negligible parameter growth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22321","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Distributed Associative Memory via Online Convex Optimization","primary_cat":"cs.LG","submitted_at":"2025-09-26T13:20:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A distributed online convex optimization protocol for associative memory achieves sublinear regret guarantees and outperforms baselines in experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}