{"total":30,"items":[{"citing_arxiv_id":"2605.13225","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings","primary_cat":"cs.LG","submitted_at":"2026-05-13T09:17:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12438","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Causal Language Modeling Detour Improves Encoder Continued Pretraining","primary_cat":"cs.CL","submitted_at":"2026-05-12T17:34:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11838","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters","primary_cat":"cs.LG","submitted_at":"2026-05-12T09:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11255","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-11T21:27:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10655","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization","primary_cat":"cs.LG","submitted_at":"2026-05-11T14:40:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10108","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GLiNER-Relex: A Unified Framework for Joint 
Named Entity Recognition and Relation Extraction","primary_cat":"cs.CL","submitted_at":"2026-05-11T07:23:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GLiNER-Relex unifies NER and RE in one zero-shot transformer-based model that achieves competitive results on CoNLL04, DocRED, FewRel, and CrossRE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09990","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference","primary_cat":"cs.CL","submitted_at":"2026-05-11T05:06:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09751","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes","primary_cat":"cs.CL","submitted_at":"2026-05-10T21:00:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fixed 16-bit binary token codes can replace trainable input embeddings in 32-layer decoder-only models while maintaining comparable held-out perplexity on 17B tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09611","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks","primary_cat":"cs.CL","submitted_at":"2026-05-10T15:48:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09331","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dimension-Free Saddle-Point Escape in Muon","primary_cat":"cs.LG","submitted_at":"2026-05-10T05:06:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09165","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse Layers are Critical to Scaling Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-09T20:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Looped MoE models scale better than standard transformers because 
different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07815","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling","primary_cat":"cs.LG","submitted_at":"2026-05-08T14:47:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06366","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Layer Collapse in Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:39:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05683","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization","primary_cat":"stat.ML","submitted_at":"2026-05-07T05:19:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04478","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training","primary_cat":"cs.DC","submitted_at":"2026-05-06T04:07:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02144","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Projection-Free Transformers via Gaussian Kernel Attention","primary_cat":"cs.LG","submitted_at":"2026-05-04T01:57:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 
0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00072","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"XekRung Technical Report","primary_cat":"cs.CR","submitted_at":"2026-04-30T11:50:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25374","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Language corpora for the Dutch medical domain","primary_cat":"cs.CL","submitted_at":"2026-04-28T08:38:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A 35-billion-token Dutch medical corpus was assembled from translated, mined, and extracted sources and released publicly on Hugging Face as the first large-scale resource of its kind.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24809","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model","primary_cat":"cs.LG","submitted_at":"2026-04-27T08:07:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06683","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T20:37:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05227","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods","primary_cat":"cs.LG","submitted_at":"2026-04-19T14:23:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal 
compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12946","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Parcae: Scaling Laws For Stable Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-14T16:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18603","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings","primary_cat":"q-bio.QM","submitted_at":"2026-04-09T19:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05688","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion","primary_cat":"cs.CL","submitted_at":"2026-04-07T10:40:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02685","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Finding Belief Geometries with Sparse Autoencoders","primary_cat":"cs.LG","submitted_at":"2026-04-03T03:29:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new pipeline identifies candidate simplex geometries in Gemma-2-9B representations, with five clusters showing significant barycentric prediction advantages consistent with belief-state encoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.29496","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Metriplector: From Field Theory to Neural Architecture","primary_cat":"cs.AI","submitted_at":"2026-03-31T09:40:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small parameter 
counts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.16982","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Muon is Scalable for LLM Training","primary_cat":"cs.LG","submitted_at":"2025-02-24T09:12:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.02737","ref_index":208,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model","primary_cat":"cs.CL","submitted_at":"2025-02-04T21:43:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.23218","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","primary_cat":"cs.CL","submitted_at":"2024-10-30T17:10:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.18223","ref_index":234,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"be identified and removed. In addition to the above methods, LLMs (especially rela- tively small models) can be also employed for data selection, either by computing perplexity [232] or directly prompting LLMs [233] for measuring the sample importance. However, using LLMs is unavoidably computationally intensive for large-scale data selection. De-duplication.Existing work [234] has found that dupli- cate data in a corpus would reduce the diversity of language models, which may cause the training process to become un- stable and thus affect the model performance. Therefore, it is necessary to de-duplicate the pre-training corpus. Specially, de-duplication can be performed at different granularities, including sentence-level, document-level, and dataset-level"}],"limit":50,"offset":0}