{"total":12,"items":[{"citing_arxiv_id":"2605.21699","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-20T19:59:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20005","ref_index":72,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18753","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and 
InfLLMv2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16215","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Fully Open Meditron: An Auditable Pipeline for Clinical LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-15T17:29:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13997","ref_index":69,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:07:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12327","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Grid Games: The Power of Multiple Grids for Quantizing Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T16:09:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Allowing each quantization group to select among multiple 4-bit grids improves accuracy 
over single-grid FP4 for both post-training and pre-training of LLMs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We measure KL divergence against BF16 logits on WikiText-2 and C4, as well as Expected Acceptance Rate (EAR) between the original model and the quantized one [17]. We run models on downstream tasks using Harness [14] and report accuracies on Winogrande [32], ARC-C, ARC-E [7], Lambada (standard) [30], PIQA [2], Hellaswag (10-shot) [39], MMLU [18], IFEval (Prompt) [40], and GSM8K-CoT [8]. We compare several single-grids NVFP4, BOF4 [3], NF4 [11], Split87, and several multi-grid variants IF4 (per-block INT4/FP4 selection [10]), PO2(NF4), and PO2(Split87). We also compare with Four-Over-Six [9] and the SFP4 described in Section 4.4. Weight-and-Activation PTQ Results. Tables 3 and 4 report the W4A4 results. Dual-grid methods"},{"citing_arxiv_id":"2605.08755","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss","primary_cat":"cs.LG","submitted_at":"2026-05-09T07:35:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Models and Benchmarks. We apply LAQuant on Llama-3.1 8B [15], DeepSeek-R1-distilled Llama-3.1 8B [8], and Qwen3 (Base + chat variants at 1.7B, 4B, 8B) [56]. 
Following ParoQuant [30], we evaluate LAQuant on three different types of benchmarks: i) perplexity on WikiText2 and C4, ii) zero-shot accuracy on ARC-Challenge, ARC-Easy [6], BoolQ [5], and HellaSwag [58], and iii) reasoning accuracy on AIME2024 [23], AIME2025 [39], GPQA Diamond [45], LSAT-AR [62], MMLU-Pro [51], and LiveCodeBench [22] (2024.10-2025.02 subset as used in the official Qwen3 evaluation). Table 1: Pass@1 ↑ on reasoning tasks under 3- and 4-bit quantization with a group size of 128. R-QAT and ParoQ denote ReasoningQAT and ParoQuant, respectively; ParoQ++ uses the same"},{"citing_arxiv_id":"2605.08568","ref_index":45,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression","primary_cat":"cs.LG","submitted_at":"2026-05-09T00:02:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We conduct experiments on open-source LLMs and widely used language modeling and zero-shot reasoning benchmarks. For model selection, we consider representative architectures including LLaMA-7B, LLaMA-13B, LLaMA-30B [6] and Qwen2.5-7B [38]. For evaluation, we report perplexity on WikiText2 [39], PTB [40], and C4 [41], and zero-shot accuracy on OpenBookQA [42], ARC-e, ARC-c [43], WinoGrande [44], HellaSwag [45], PIQA [46], and MathQA [47]. All downstream reasoning tasks are evaluated in the zero-shot setting using the LM-Evaluation-Harness [48]. 
Baselines. Our method is not a standalone matrix decomposition framework, but a rank selection mechanism built on top of existing SVD-based compression methods. Therefore, we evaluate it by integrating it with several representative and influential SVD-based methods, including SVD-LLM"},{"citing_arxiv_id":"2605.07977","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback","primary_cat":"cs.LG","submitted_at":"2026-05-08T16:35:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect ones, outperforming baselines without ground-truth contexts.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"All experiments are conducted on a server with a NVIDIA A100-40GB GPU, utilizing the HuggingFace [15] and PyTorch [28] libraries for implementation. More detailed specifications can be found in Appendix D. Datasets. We consider four benchmark datasets encompassing a diverse range of domains: ARC-Challenge [7] for science-based question answering, HellaSwag [46] for common-sense reasoning sentence completion, MathMCQA for competition-level mathematics [4], and StrategyQA [9] for multi-hop reasoning. We measure the accuracy of the final produced answer after reasoning from the testing dataset to evaluate the performance of SPEAR and the baselines. 
In terms of the incomplete but informative feedback given to the model, for HellaSwag and ARC-Challenge, we include the"},{"citing_arxiv_id":"2605.06632","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Crafting Reversible SFT Behaviors in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:44:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06548","ref_index":107,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Continuous Latent Diffusion Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-07T16:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04913","ref_index":31,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training","primary_cat":"cs.CL","submitted_at":"2026-05-06T13:41:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LoPT splits LLM post-training at the midpoint with task loss on the second half and feature reconstruction on the first half to reduce cost and 
interference.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"from the matched E2E baseline only by the midpoint stop-gradient boundary and the auxiliary reconstruction update for the first-half block. D.2 Evaluation Details For downstream benchmark, we assess LoPT with lm-eval-harness [5] under a unified protocol. The main evaluation benchmark includes MMLU (5-shot) [8], IFEval [34], ARC-Challenge (25-shot) [2], GSM8K (4-shot) [3], HellaSwag (5-shot) [31], TruthfulQA MC2 [16], and Winogrande (5-shot) [20]. These benchmarks cover general knowledge, instruction following, reasoning, commonsense understanding, truthfulness, and mathematical problem solving. For trained checkpoints, all reported lm-eval scores are means over three independently trained checkpoints with different random seeds. We compute aggregate averages and method"}],"limit":50,"offset":0}