{"total":17,"items":[{"citing_arxiv_id":"2605.23893","ref_index":45,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T17:56:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20005","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15676","ref_index":36,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Dynamic Chunking for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-15T06:56:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B 
parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16423","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization","primary_cat":"cs.CV","submitted_at":"2026-05-14T14:55:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13997","ref_index":55,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:07:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13026","ref_index":60,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Understanding and Accelerating the Training of Masked Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-13T05:29:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by 
countering locality bias in language data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12327","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Grid Games: The Power of Multiple Grids for Quantizing Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T16:09:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"to isolate the effect of grid choice (we provide results with Hadamard transforms in the Appendix D). We measure KL divergence against BF16 logits on WikiText-2 and C4, as well as Expected Acceptance Rate (EAR) between the original model and the quantized one [17]. We run models on downstream tasks using Harness [14] and report accuracies on Winogrande [32], ARC-C, ARC-E [7], Lambada (standard) [30], PIQA [2], Hellaswag (10-shot) [39], MMLU [18], IFEval (Prompt) [40], and GSM8K-CoT [8]. We compare several single-grids NVFP4, BOF4 [3], NF4 [11], Split87, and several multi-grid variants IF4 (per-block INT4/FP4 selection [10]), PO2(NF4), and PO2(Split87). 
We also compare with Four-Over-Six [9] and the SFP4 described in Section 4."},{"citing_arxiv_id":"2605.11558","ref_index":59,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Composite Activation Function for Learning Stable Binary Representations","primary_cat":"cs.LG","submitted_at":"2026-05-12T05:41:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[57] Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience, 11:682, 2017. [58] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986. [59] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021. [60] Johannes Schmidt-Hieber. Deep relu network approximation of functions on a manifold. arXiv preprint arXiv:1908.00695, 2019. [61] Johannes Schmidt-Hieber. 
Nonparametric regression using deep neural networks with ReLU"},{"citing_arxiv_id":"2605.08568","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression","primary_cat":"cs.LG","submitted_at":"2026-05-09T00:02:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We conduct experiments on open-source LLMs and widely used language modeling and zero-shot reasoning benchmarks. For model selection, we consider representative architectures including LLaMA-7B, LLaMA-13B, LLaMA-30B [6] and Qwen2.5-7B [38]. For evaluation, we report perplexity on WikiText2 [39], PTB [40], and C4 [41], and zero-shot accuracy on OpenBookQA [42], ARC-e, ARC-c [43], WinoGrande [44], HellaSwag [45], PIQA [46], and MathQA [47]. All downstream reasoning tasks are evaluated in the zero-shot setting using the LM-Evaluation-Harness [48]. Baselines. Our method is not a standalone matrix decomposition framework, but a rank selection mechanism built on top of existing SVD-based compression methods. 
Therefore, we evaluate it by integrating it with several representative and influential SVD-based methods, including SVD-LLM"},{"citing_arxiv_id":"2605.06665","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"UniPool: A Globally Shared Expert Pool for Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06501","ref_index":67,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Cubit: Token Mixer with Kernel Ridge Regression","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:18:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05007","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation","primary_cat":"cs.AI","submitted_at":"2026-05-06T15:07:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04913","ref_index":20,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training","primary_cat":"cs.CL","submitted_at":"2026-05-06T13:41:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LoPT splits LLM post-training at the midpoint with task loss on the second half and feature reconstruction on the first half to reduce cost and interference.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"from the matched E2E baseline only by the midpoint stop-gradient boundary and the auxiliary reconstruction update for the first-half block. D.2 Evaluation Details For downstream benchmark, we assess LoPT with lm-eval-harness [5] under a unified protocol. The main evaluation benchmark includes MMLU (5-shot) [8], IFEval [34], ARC-Challenge (25-shot) [2], GSM8K (4-shot) [3], HellaSwag (5-shot) [31], TruthfulQA MC2 [16], and Winogrande (5-shot) [20]. These benchmarks cover general knowledge, instruction following, reasoning, commonsense understanding, truthfulness, and mathematical problem solving. For trained checkpoints, all reported lm-eval scores are means over three independently trained checkpoints with different random seeds. 
We compute aggregate averages and method differences from these seed-averaged scores."},{"citing_arxiv_id":"2604.24715","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling","primary_cat":"cs.CL","submitted_at":"2026-04-27T17:23:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23818","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2025-10-27T19:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01352","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network","primary_cat":"cs.LG","submitted_at":"2025-06-02T06:13:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 
3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.09992","ref_index":116,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Large Language Diffusion Models","primary_cat":"cs.CL","submitted_at":"2025-02-14T08:23:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:1905.07830, 2019. [114] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021. [115] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021. [116] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020. [117] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al."}],"limit":50,"offset":0}