{"total":12,"items":[{"citing_arxiv_id":"2605.20295","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization","primary_cat":"cs.LG","submitted_at":"2026-05-19T10:48:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Quant.npu provides a fully static quantization pipeline for on-device LLMs on NPUs by combining rotation matrices, bit-width-aware initialization, two-stage selective optimization, and adaptive mixed precision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15508","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"STS: Efficient Sparse Attention with Speculative Token Sparsity","primary_cat":"cs.LG","submitted_at":"2026-05-15T01:05:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10777","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Locking Pretrained Weights via Deep Low-Rank Residual Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-11T16:09:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08755","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss","primary_cat":"cs.LG","submitted_at":"2026-05-09T07:35:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"(per-token average) on AIME25 traces both strongly anti-correlate with downstream Pass@1 (Pearson r≈ −0.85 for both), and are themselves highly correlated ( r= +0.85 ), consistent with a shared KV-cache fidelity factor (Figure 2b). LRM Quantization is Sensitive to Calibration Data, Especially under Challenging Scenarios. Existing LLM quantization algorithms typically calibrate on pre-training text corpora (e.g., Wiki- Text2 [41], C4 [44], RedPajama [52]), and the generalization capability of modern LLMs has made this choice sufficient for most tasks. For LRMs, however, calibration data choice is a first-order factor. Figure 3a compares GPTQ-quantized LRMs under two calibration sources: agenericmixture of pre-training text, and areasoningcorpus of DeepSeek-R1 traces on OpenR1-Math-220k problems."},{"citing_arxiv_id":"2605.08575","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution","primary_cat":"cs.LG","submitted_at":"2026-05-09T00:34:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pre-trained MoE models exhibit up to 90% intra-expert activation sparsity that enables up to 2.5x faster MoE layer execution when exploited in the vLLM inference system.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Shared Experts Granite-1B-A400M [15] 400M / 1B 32 8 512 False OLMoE-1B-7B [33] 1B / 7B 64 8 1024 False DeepSeek-V2-Lite [8] 2.4B / 16B 64 6 1408 True GPT-OSS-20B [36] 3B / 20B 32 4 2880 False Qwen3.5-35B-A3B [38] 3B / 35B 256 8 512 True Qwen3.5-122B-A10B [38] 10B / 122B 256 8 1024 True Qwen3.5-397B-A17B [38] 17B / 397B 512 10 1024 True Llama-4-Maverick [31] 17B / 400B 128 1 8192 True Table 1: Architectural details of the tested MoE models sparsity as a practical optimization technique for MoE model execution and lay the groundwork for future efficient LLM research. 2 Background and Related Work Feed-Forward Network:The feed-forward network (FFN) of the Transformer architecture [ 46] applies a non-linear activation function in a higher-dimensional latent space to encode and process"},{"citing_arxiv_id":"2605.08568","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression","primary_cat":"cs.LG","submitted_at":"2026-05-09T00:02:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5 Experiment 5.1 Experimental Setup Models and Datasets.We conduct experiments on open-source LLMs and widely used language modeling and zero-shot reasoning benchmarks. For model selection, we consider representative architectures including LLaMA-7B, LLaMA-13B, LLaMA-30B [ 6] and Qwen2.5-7B [ 38]. For evaluation, we report perplexity on WikiText2 [39], PTB [40], and C4 [41], and zero-shot accuracy on OpenBookQA [42], ARC-e, ARC-c [ 43], WinoGrande [44], HellaSwag [45], PIQA [46], and MathQA [47]. All downstream reasoning tasks are evaluated in the zero-shot setting using the LM-Evaluation-Harness [48]. Baselines.Our method is not a standalone matrix decomposition framework, but a rank selection"},{"citing_arxiv_id":"2605.05971","ref_index":37,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Training Transformers for KV Cache Compressibility","primary_cat":"cs.LG","submitted_at":"2026-05-07T10:17:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"[35] Zhuoling Li, Xiaogang Xu, Zhenhua Xu, SerNam Lim, and Hengshuang Zhao. Larm: Large auto-regressive model for long-horizon embodied intelligence.arXiv preprint arXiv:2405.17424, 2024. [36] Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, and Muhan Zhang. Shine: A scalable in-context hypernetwork for mapping context to lora in a single pass.arXiv preprint arXiv:2602.06358, 2026. [37] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. [38] CA Micchelli and Allan Pinkus. Moment theory for weak chebyshev systems with applications to monosplines, quadrature formulae and best one-sided lˆ1-approximation by spline functions with fixed knots.SIAM Journal on Mathematical Analysis, 8(2):206-230, 1977."},{"citing_arxiv_id":"2604.12946","ref_index":52,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Parcae: Scaling Laws For Stable Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-14T16:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.09136","ref_index":103,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG","primary_cat":"cs.AI","submitted_at":"2025-01-15T20:40:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agentic RAG embeds agents with reflection, planning, tool use, and collaboration into retrieval pipelines to overcome static RAG limitations, and the survey offers a taxonomy by agent count, control, autonomy, and knowledge representation plus applications and open challenges.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"tion NarrativeQA (NQA) [74], QMSum [76] Text Genera- tion Biography Biography Dataset Text ClassificationSentiment Analysis SST-2 [98] General Classification VioLens [99], TREC [60] Code SearchProgramming Search CodeSearchNet [100] Robustness Retrieval Robustness NoMIRACL [101] Language Modeling Ro- bustness WikiText-103 [102] MathMath Reasoning GSM8K [103] Machine Trans- lation Translation Tasks JRC-Acquis [104] 12.3 Memory Management and Long-Term Adaptation Long-term memory design remains open challenge. Persistent memory risks knowledge drift and bias reinforcement, while frequent updates can amplify hallucinations. Key questions include balancing persistence with adaptability, selective retention, and reconciling external knowledge with stored agent experiences."},{"citing_arxiv_id":"2412.12636","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TrainMover: An Interruption-Resilient Runtime for ML Training","primary_cat":"cs.DC","submitted_at":"2024-12-17T07:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TrainMover achieves ~20s downtime for interruptions in 1024-GPU LLM training via two-phase delta-based communication setup, communication-free sandboxed warmup, and general standby design, projecting 55% reduction in wasted GPU hours.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.06587","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TouchAI: Exploring human-AI perceptual alignment in touch through language model representations","primary_cat":"cs.CL","submitted_at":"2024-06-05T08:46:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show partial and variable perceptual alignment with human touch on textiles, succeeding on samples like silk satin but failing on cotton denim when matching descriptive language to embedding similarity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2108.12409","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation","primary_cat":"cs.CL","submitted_at":"2021-08-27T17:35:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ALiBi enables transformers trained on length-1024 sequences to extrapolate to length-2048 with the same perplexity as a sinusoidal model trained on 2048, while training 11% faster and using 11% less memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}