{"total":18,"items":[{"citing_arxiv_id":"2605.10504","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining","primary_cat":"cs.CL","submitted_at":"2026-05-11T13:01:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08740","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure","primary_cat":"cs.LG","submitted_at":"2026-05-09T07:05:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07105","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Theoretical Limits of Language Model Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-08T01:32:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06366","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Layer Collapse in Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:39:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05341","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Feature Starvation as Geometric Instability in Sparse Autoencoders","primary_cat":"cs.LG","submitted_at":"2026-05-06T18:11:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02968","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency","primary_cat":"cs.LG","submitted_at":"2026-05-03T12:21:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00195","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diversity in Large Language Models under Supervised Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-30T20:20:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18239","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner","primary_cat":"cs.LG","submitted_at":"2026-04-20T13:23:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16197","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation","primary_cat":"cs.LG","submitted_at":"2026-04-17T16:07:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence analysis on LLMs up to 32B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13950","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs","primary_cat":"cs.CL","submitted_at":"2026-04-15T15:03:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in extractable vs. conjunctive uses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11119","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DDO-RM: Distribution-Level Policy Improvement after Reward Learning","primary_cat":"stat.ML","submitted_at":"2026-04-13T07:33:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05171","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach","primary_cat":"cs.LG","submitted_at":"2025-02-07T18:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.21787","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","primary_cat":"cs.LG","submitted_at":"2024-07-31T17:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.08543","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniLLM: On-Policy Distillation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-06-14T14:44:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.01116","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only","primary_cat":"cs.CL","submitted_at":"2023-06-01T20:03:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.14314","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QLoRA: Efficient Finetuning of Quantized LLMs","primary_cat":"cs.LG","submitted_at":"2023-05-23T17:50:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.06161","ref_index":213,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StarCoder: may the source be with you!","primary_cat":"cs.CL","submitted_at":"2023-05-09T08:16:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.18223","ref_index":98,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mT0 [94] Nov-2022 13 mT5✓- - - - -✓- Galactica [35] Nov-2022 120 - - - 106B tokens - - -✓ ✓ BLOOMZ [94] Nov-2022 176 BLOOM✓- - - - -✓- OPT-IML [95] Dec-2022 175 OPT✓- - - 128 40G A100 -✓ ✓ LLaMA [57] Feb-2023 65 - - - 1.4T tokens - 2048 80G A100 21 d✓- Pythia [96] Apr-2023 12 - - - 300B tokens - 256 40G A100 -✓- CodeGen2 [97] May-2023 16 - - - 400B tokens - - -✓- StarCoder [98] May-2023 15.5 - - - 1T tokens - 512 40G A100 -✓ ✓ LLaMA2 [99] Jul-2023 70 -✓ ✓2T tokens - 2000 80G A100 -✓- Baichuan2 [100] Sep-2023 13 -✓ ✓2.6T tokens - 1024 A800 -✓- QWEN [101] Sep-2023 14 -✓ ✓3T tokens - - -✓- FLM [102] Sep-2023 101 -✓- 311B tokens - 192 A800 22 d✓- Publicly Available Skywork [103] Oct-2023 13 - - - 3.2T tokens - 512 80G A800 -✓-"}],"limit":50,"offset":0}