{"total":11,"items":[{"citing_arxiv_id":"2605.25966","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training","primary_cat":"cs.LG","submitted_at":"2026-05-25T15:42:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Factorial experiments with over 1300 runs falsify the hypothesis that INT6 QAT needs a different LR schedule from higher precision and identify a 50M-parameter boundary for INT4 schedule sensitivity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17659","ref_index":20,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes","primary_cat":"cs.LG","submitted_at":"2026-05-17T21:29:20+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10775","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the global convergence of gradient descent for wide shallow models with bounded nonlinearities","primary_cat":"math.OC","submitted_at":"2026-05-11T16:08:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable 
under the dynamics.","context_count":1,"top_context_role":"extension","top_context_polarity":"extend","context_text":"The first corresponds to settings where ϕ is positively 1-homogeneous, meaning that ϕ(λθ) = λϕ(θ) for every λ > 0. In this case, Φ is positively 2-homogeneous. This essentially models two-layer networks with ReLU activations. The second case corresponds to the situation where ϕ is bounded, and Φ is hence only partially 1-homogeneous. This models two-layer networks with sigmoid activations. In [CB18, Theorems 3.4 and 3.5], it is proved that, in both cases and under suitable assumptions on the initial distribution, the Wasserstein gradient flow of F can only converge to global minimizers. This result is established for every d_w ⩾ 1 in the 2-homogeneous case but only for d_w = 1 in the partially 1-homogeneous case. Section outline. In this section, we investigate the extension of the global convergence result mentioned"},{"citing_arxiv_id":"2605.03667","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity","primary_cat":"cs.LG","submitted_at":"2026-05-05T12:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14430","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Three-Phase Transformer","primary_cat":"cs.CL","submitted_at":"2026-04-15T21:23:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Three-Phase Transformer partitions hidden states into N cyclic channels with 
phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x faster convergence than a matched RoPE baseline at 123M parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.04572","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums","primary_cat":"cs.AI","submitted_at":"2026-02-04T13:58:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20856","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","primary_cat":"cs.CL","submitted_at":"2025-12-24T00:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12744","ref_index":71,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via 
Spontaneity","primary_cat":"cs.LG","submitted_at":"2025-12-14T15:47:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.17192","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast Inference from Transformers via Speculative Decoding","primary_cat":"cs.LG","submitted_at":"2022-11-30T17:33:28+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.14198","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flamingo: a Visual Language Model for Few-Shot Learning","primary_cat":"cs.CV","submitted_at":"2022-04-29T16:29:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"10864, 2020. [103] Jake Snell, Kevin Swersky, and Richard Zemel. 
Prototypical networks for few-shot learning. Conference on Neural Information Processing Systems, 2017. [104] David R So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. Primer: Searching for efficient transformers for language modeling. arXiv:2109.08668, 2021. [105] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. arXiv:1906.02243, 2019. [106] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv:1908.08530, 2019. [107] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid."},{"citing_arxiv_id":"2202.08906","ref_index":200,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models","primary_cat":"cs.CL","submitted_at":"2022-02-17T21:39:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}