{"total":15,"items":[{"citing_arxiv_id":"2606.26530","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-25T02:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiARC improves LLM performance on ARC-like benchmarks by constructing and training on preference pairs from three types of negative samples while keeping demonstrations fixed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.14150","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Small LLMs: Pruning vs. Training from Scratch","primary_cat":"cs.LG","submitted_at":"2026-06-12T06:24:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pruned initializations from an 8B model outperform random starts with equal training tokens, but with full token budgets fine-grained pruning retains advantage while coarse structured pruning does not.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07604","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Contribution Weights: A Geometrical Analysis of Self-Attention Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-29T09:40:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm 
relationship.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28207","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pruning and Distilling Mixture-of-Experts into Dense Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-27T09:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A systematic MoE-to-dense conversion via expert scoring, grouping, and distillation yields +6.3 pp average accuracy over dense-to-dense pruning at matched parameter count on tested models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21171","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FTerViT: Fully Ternary Vision Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-20T13:41:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24380","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency","primary_cat":"cs.CL","submitted_at":"2026-04-27T12:10:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B 
models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"removal is disproportionately harmful on benchmarks that place higher demands on multi-step reasoning and/or free-form generation. Crucially, our findings in the multimodal domain strongly align with recent discoveries in the unimodal LLM pruning literature, pointing to a fundamental property of transformer architectures. For instance, Sreenivas et al. [66] recently demonstrated a similar phenomenon when pruning LLMs: widthwise pruning proved vastly superior to layerwise pruning, especially on reasoning-intensive benchmarks such as GSM8K [67]. The fact that layer removal disproportionately impairs complex reasoning across both pure language models and vision-language models underscores that this degradation is not merely an artifact"},{"citing_arxiv_id":"2604.10627","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment","primary_cat":"cs.CL","submitted_at":"2026-04-12T13:06:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01997","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Limits of Layer Pruning for Generative Reasoning in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-02-02T11:57:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Layer pruning preserves classification 
performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08584","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ministral 3","primary_cat":"cs.CL","submitted_at":"2026-01-13T14:06:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20856","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","primary_cat":"cs.CL","submitted_at":"2025-12-24T00:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12744","ref_index":70,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity","primary_cat":"cs.LG","submitted_at":"2025-12-14T15:47:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPON adds a small set of trainable input-independent activation vectors as representational anchors, 
trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.08008","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2025-10-09T09:45:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.12876","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs","primary_cat":"cs.LG","submitted_at":"2025-06-15T15:02:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.03610","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video 
Games","primary_cat":"cs.AI","submitted_at":"2025-06-04T06:40:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.08464","ref_index":137,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing","primary_cat":"cs.CL","submitted_at":"2024-06-12T17:52:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}