{"total":17,"items":[{"citing_arxiv_id":"2606.27866","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-26T09:08:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlexMoE produces nested pruned subnetworks for MoE LLMs across budgets via channel importance ranking and discrete action learning, plus one mid-budget recovery fine-tune, retaining 99.8% performance at 50% expert parameter pruning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06526","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions","primary_cat":"cs.AI","submitted_at":"2026-06-02T20:38:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15626","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression","primary_cat":"cs.LG","submitted_at":"2026-05-15T05:19:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank 
allocation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08894","ref_index":1,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-09T11:19:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"In this work, we highlight smoothness as an important but overlooked objective in extreme quantization. Building on input-gradient analysis and sequence neighborhood modeling, we introduce LGP for PTQ and LGR for QAT as simple smoothness-preserving instantiations, and advocate explicitly incorporating smoothness into future quantization design. References [1] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2357-2367, 2019."},{"citing_arxiv_id":"2604.19520","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SimDiff: Depth Pruning via Similarity and Difference","primary_cat":"cs.AI","submitted_at":"2026-04-21T14:43:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15306","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Generalization in LLM Problem Solving: The Case of the Shortest Path","primary_cat":"cs.AI","submitted_at":"2026-04-16T17:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06515","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees","primary_cat":"cs.LG","submitted_at":"2026-04-07T23:17:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A router-norm and variance-based bit allocation strategy for quantizing MoE models that claims higher accuracy and lower cost than prior mixed-precision 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04131","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents","primary_cat":"cs.AI","submitted_at":"2026-04-05T14:27:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.25412","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-03-26T13:08:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.02764","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-12-02T13:44:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21285","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark","primary_cat":"cs.CL","submitted_at":"2025-11-26T11:18:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11794","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DataComp-LM: In search of the next generation of training sets for language models","primary_cat":"cs.LG","submitted_at":"2024-06-17T17:42:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"https://arxiv.org/abs/2301.03988. [10] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra- Aimée Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023. [11] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp."},{"citing_arxiv_id":"2311.16867","ref_index":55,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Falcon Series of Open Language Models","primary_cat":"cs.CL","submitted_at":"2023-11-28T15:12:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.05653","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning","primary_cat":"cs.CL","submitted_at":"2023-09-11T17:47:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.14233","ref_index":128,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Enhancing Chat Language Models by Scaling High-quality Instructional Conversations","primary_cat":"cs.CL","submitted_at":"2023-05-23T16:49:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce 
UltraLLaMA, which outperforms prior open-source chat models including Vicuna.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.10403","ref_index":226,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PaLM 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2023-05-17T17:46:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2206.04615","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","primary_cat":"cs.CL","submitted_at":"2022-06-09T17:05:34+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BIG-bench is a 204-task benchmark that measures scaling trends, calibration, and absolute limitations of language models across knowledge, reasoning, and social domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"in East Africa and written with a Latin script. Figure 15a displays a trend of increasing performance with model scale, with the best model achieving 43% accuracy on when given four choices. 
However, it is not clear 20 Task Name Description Languages [mC4 rank] Conlang Translation Problems Decipher language rules and lexicon from a few examples English [1], German [4], Finnish [24], Abma [100+], Apinayé [100+], Inapuri [100+], Ndebele [100+], Palauan [100+] Kannada Riddles Answer Kannada riddles Kannada [65] Language Identification Identify the language a given sentence is written in 1000 languages Swahili English Proverbs For a given proverb in Kiswahili, choose a proverb in English which is closest in meaning"}],"limit":50,"offset":0}