{"total":15,"items":[{"citing_arxiv_id":"2605.28507","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Universal Time Series Generation with Neural Controlled Differential Equations","primary_cat":"cs.LG","submitted_at":"2026-05-27T14:10:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proves SLiCEs are universal time-series generators approximating path laws in W_∞ and proposes G-SLiCEs for path-space flow matching with benefits on irregular grids.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22791","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention","primary_cat":"cs.AI","submitted_at":"2026-05-21T17:44:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gated DeltaNet-2 decouples channel-wise erase and write gates in linear attention, generalizing prior DeltaNet and KDA models while showing stronger results on language modeling and long-context retrieval at 1.3B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20670","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LT2: Linear-Time Looped Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-20T03:36:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12992","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting","primary_cat":"q-bio.NC","submitted_at":"2026-05-13T04:45:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12491","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Elastic Attention Cores for Scalable Vision Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[78] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621, 2023. [79] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. [80] Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024. [81] Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, et al. Gated slot attention for efficient linear-time sequence"},{"citing_arxiv_id":"2605.06501","ref_index":61,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cubit: Token Mixer with Kernel Ridge Regression","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:18:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Transformer ArchitectureTransformer architecture [ 80] was proposed in 2017, with Feed- Forward Network and Attention. In the following years, there are modifications of Transformer. From the view of FFN, there are mixture-of-expert and different activation functions. From the view of attention, there are GQA [1], MQA [69], MLA [43], TPA [95] and so on. There is also gated attention [61] to improve the performance. Also, there are works that are trying to replace the softmax attention with ReLU attention [87] and sigmoid attention [62]. The skip connection [30] is also discussed, such as hyper-connection [97], attention residual [76], deepnorm [81], and sandwitchnorm [16]. However, these modifications do not modify the modeling of the attention"},{"citing_arxiv_id":"2605.05066","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Impossibility Triangle of Long-Context Modeling","primary_cat":"cs.CL","submitted_at":"2026-05-06T16:01:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"resented inb-bit floating-point precision. The initial state can encode at mostd·bbits of information. Consider two initial statess 0 ands ′ 0 differing by ∆s0 =s ′ 0 −s 0 with∥∆s 0∥= 2 −b (the smallest representable perturbation). After one transition step with inputx 1, the Lipschitz condition (Axiom 3) gives ∥s1 −s ′ 1∥=∥δ(s 0, x1)−δ(s ′ 0, x1)∥ ≤L· ∥∆s 0∥=L·2 −b.(28) AfterTsteps, applying the Lipschitz condition recursively, ∥sT −s ′ T ∥ ≤L T · ∥∆s0∥=L T ·2 −b.(29) The effective precision of the state afterTsteps is determined by the smallest pertur- bation ins 0 that produces a distinguishable difference ins T . A perturbation of magnitude 27 Zhou ϵins 0 grows to at mostL T ·ϵins T . For this to exceed the representation threshold 2 −b,"},{"citing_arxiv_id":"2604.19021","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control","primary_cat":"cs.LG","submitted_at":"2026-04-21T03:15:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.15031","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Attention Residuals","primary_cat":"cs.CL","submitted_at":"2026-03-16T09:32:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.21204","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Training with KV Binding Is Secretly Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-02-24T18:59:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Test-time training with KV binding reduces to learned linear attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.17388","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Selective Rotary Position Embedding","primary_cat":"cs.CL","submitted_at":"2025-11-21T16:50:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.26692","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi Linear: An Expressive, Efficient Attention Architecture","primary_cat":"cs.CL","submitted_at":"2025-10-30T16:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Input Length Extrapolation\". In:Proceedings of ICLR. 2022.URL: https://openreview.net/forum?id= R8sQPpGCv0. [76] Krishna C. Puvvada et al.SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling. 2025. arXiv:2504.08719 [cs.CL]. [77] Zhen Qin et al.HGRN2: Gated Linear RNNs with State Expansion. 2024. arXiv:2404.07904 [cs.CL]. [78] Zhen Qin et al.TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer. 2024. arXiv:2307.14995 [cs.CL]. [79] Zihan Qiu et al.Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free. 2025. arXiv:2505.06708 [cs.CL]. 21 Kimi Linear: An Expressive, Efficient Attention ArchitectureTECHNICALREPORT"},{"citing_arxiv_id":"2510.26083","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism","primary_cat":"cs.LG","submitted_at":"2025-10-30T02:41:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.21060","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","primary_cat":"cs.LG","submitted_at":"2024-05-31T17:50:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"\"TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer\". In: arXiv preprint arXiv:2307.14995 (2023). [83] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. \"CosFormer: Rethinking Softmax in Attention\". In:The International Conference on Learning Representations (ICLR). 2022. [84] Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. \"HGRN2: Gated Linear RNNs with State Expansion\". In: arXiv preprint arXiv:2404.07904 (2024). [85] Zhen Qin, Songlin Yang, and Yiran Zhong. \"Hierarchically Gated Recurrent Neural Network for Sequence Model- ing\". In: Advances in Neural Information Processing Systems 36 (2023)."},{"citing_arxiv_id":"2312.06635","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated Linear Attention Transformers with Hardware-Efficient Training","primary_cat":"cs.LG","submitted_at":"2023-12-11T18:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}