{"total":11,"items":[{"citing_arxiv_id":"2606.18587","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dual Dimensionality for Local and Global Attention","primary_cat":"cs.CL","submitted_at":"2026-06-17T01:27:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Distance-Adaptive Representation (DAR) keeps full KV dimensionality inside a local window and reduces it to 1/4 outside, matching full-dimensional baselines on pretraining (70M-410M) and 1B-scale fine-tuning while uniform reduction performs worse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20105","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing","primary_cat":"cs.LG","submitted_at":"2026-05-19T16:56:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In high-dimensional analysis, pretrained PCA representations for linear probing generalize best at low dimensionality when pretraining data is plentiful but labeled data scarce, with an exact trade-off showing how much unlabeled data replaces one labeled sample.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06901","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reflections and New Directions for Human-Centered Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-07T20:02:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in 
post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04952","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-06T14:15:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04116","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering","primary_cat":"cs.CR","submitted_at":"2026-05-05T08:19:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Black-box membership inference attacks on retrieval-based in-context learning for document QA succeed via query prefixes, with a novel weighted-averaging method outperforming priors even under paraphrasing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.06395","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies","primary_cat":"cs.CL","submitted_at":"2024-04-09T15:36:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that 
yields a higher optimal data-to-model ratio than Chinchilla scaling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.10631","ref_index":127,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Llemma: An Open Language Model For Mathematics","primary_cat":"cs.CL","submitted_at":"2023-10-16T17:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.01373","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling","primary_cat":"cs.CL","submitted_at":"2023-04-03T20:58:15+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.08112","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Eliciting Latent Predictions from Transformers with the Tuned Lens","primary_cat":"cs.LG","submitted_at":"2023-03-14T17:47:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction 
trajectories.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"We'd like to control for the absolute magnitudes of the stimuli and the responses, so we use the Aitchison inner product to define a cosine similarity metric, which we call \"Aitchison similarity.\" Then the stimulus-response alignment at layer ℓ under g is simply the Aitchison similarity between the stimulus and response: sim(S(h_ℓ), R(h_ℓ)) = ⟨S(h_ℓ), R(h_ℓ)⟩_w / (∥S(h_ℓ)∥_w ∥R(h_ℓ)∥_w) (18) We propose to use CBE (Section 4.1) to define a \"natural\" choice for the intervention g. Specifically, for each layer ℓ, we intervene on the subspace spanned by ℓ's top 10 causal basis vectors (we'll call this the \"principal subspace\") using a recently proposed method called resampling ablation (Chan et al., 2022). Given a hidden state h_ℓ = M_{≤ℓ}(x), resampling ablation re-"},{"citing_arxiv_id":"2211.05100","ref_index":204,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BLOOM: A 176B-Parameter Open-Access Multilingual Language Model","primary_cat":"cs.CL","submitted_at":"2022-11-09T18:48:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.06745","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GPT-NeoX-20B: An Open-Source Autoregressive Language Model","primary_cat":"cs.CL","submitted_at":"2022-04-14T04:00:27+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the 
Pile that shows strong gains in five-shot reasoning over similarly sized prior models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}