{"total":14,"items":[{"citing_arxiv_id":"2605.23180","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Self-Improving In-Context Learning","primary_cat":"cs.CL","submitted_at":"2026-05-22T03:01:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A test-time zeroth-order optimization of prompt embeddings using a bounded self-supervised proxy from demonstration log-probabilities improves ICL accuracy and correlates with gains across tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15676","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Dynamic Chunking for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-15T06:56:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12327","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Grid Games: The Power of Multiple Grids for Quantizing Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T16:09:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"to isolate the effect of grid choice (we provide results 
with Hadamard transforms in the Appendix D). We measure KL divergence against BF16 logits on WikiText-2 and C4, as well as Expected Acceptance Rate (EAR) between the original model and the quantized one [17]. We run models on downstream tasks using Harness [14] and report accuracies on Winogrande [32], ARC-C, ARC-E [7], Lambada (standard) [30], PIQA [2], Hellaswag (10-shot) [39], MMLU [18], IFEval (Prompt) [ 40], and GSM8K-CoT [8]. We compare several single-grids NVFP4, BOF4 [3], NF4 [11], Split87, and several multi-grid variants IF4 (per-block INT4/FP4 selection [10]), PO2(NF4), and PO2(Split87). We also compare with Four-Over-Six [9] and the SFP4 described in Section 4.4. Weight-and-Activation PTQ Results.Tables 3 and 4 report the W4A4 results."},{"citing_arxiv_id":"2605.07268","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-08T05:33:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning gap rather than knowledge deficits.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"86% ), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit. 
1 Introduction Multiple-choice questions (MCQs) remain the dominant paradigm for evaluating large language models (LLMs) [11, 34, 2], with logical reasoning benchmarks garnering particular attention for their ability to isolate pure reasoning from domain-specific knowledge. The recent emergence of Large Reasoning Models (LRMs) [7, 29] has accelerated this trend by leveraging test-time scaling and extended chain-of-thought (CoT) with reflection [3] to achieve unprecedented performance on"},{"citing_arxiv_id":"2605.06663","ref_index":45,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EMO: Pretraining Mixture of Experts for Emergent Modularity","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EMO pretrains MoEs using document boundaries to induce semantic expert specialization, enabling modular subset deployment with minimal accuracy loss unlike standard MoEs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"on five evaluation suites: (1)MC9, an average over nine multiple-choice benchmarks including ARC-Easy [32], ARC-Challenge [32], BoolQ [33], CSQA [34], HellaSwag [35], OpenBookQA [36], PIQA [37], SocialIQa [38], and WinoGrande [39]; (2)Gen5, an average over five generative tasks including CoQA [40], SQuAD [41], Natural Questions [ 42], TriviaQA [43], and DROP [44]; (3) MMLU[45] 1; (4)MMLU-Pro[46] 1; and (5)GSM8K[47]. Selective Expert Use.We next evaluate whether models can be deployed using only a subset of experts for each downstream domain (Figure 1). 
We consider coarse-grained domain grouping of MMLU and MMLU-Pro, e.g., math, physics, health, philosophy, history, which contain 161 and 131 domains, respectively, as well as GSM8K."},{"citing_arxiv_id":"2605.06523","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:30:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06382","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rethinking Vacuity for OOD Detection in Evidential Deep Learning","primary_cat":"cs.AI","submitted_at":"2026-05-07T15:00:56+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Vacuity-based OOD detection in evidential deep learning is highly sensitive to class cardinality differences between ID and OOD, which can artificially inflate AUROC and AUPR without any change in model predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05842","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Taklif.AI: LLM-Powered Platform for Interest-Based Personalized College Assignments","primary_cat":"cs.AI","submitted_at":"2026-05-07T08:17:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Taklif.AI generates interest-based personalized college assignments via LLMs with prompt engineering and guardrails, receiving positive 
feedback from 84% of 68 preliminary users.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05703","ref_index":44,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-05-07T05:48:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An ensemble-based information-theoretic active learning method using ensemble Kalman inversion selects valuable tasks to optimize communication structures in LLM multi-agent systems more reliably than random sampling under limited training budgets.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"positions in the dataset manifold can be identified. The most intuitive approach is to directly apply 6 a text encoder based on sentence transformers [ 41, 42]. Recent work has proposed instruction embeddings to enhance this process [ 43]. For tasks with labels that distinguish them into several groups within the whole dataset, e.g., subjects in MMLU [44], these labels can also be included in the embedding process to specify different manifolds for each group. For tasks without explicit labels, e.g., GSM8K [45], we simply input the whole task into a single encoder without further modification. Once the embedding vectors are obtained, we employ a greedy approach to form the intermediate pool. 
We first choose the center point, and then sequentially choose the farthest point from the"},{"citing_arxiv_id":"2605.01899","ref_index":61,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-03T14:28:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01853","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-03T12:46:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24429","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Multi-Dimensional Audit of Politically Aligned Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-27T12:57:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A multi-dimensional audit framework for politically aligned LLMs finds consistent trade-offs: larger models are more effective and truthful but less fair with higher 
bias, while fine-tuned models reduce bias but increase hallucinations and reasoning decline, and all tested models show deficiencies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12744","ref_index":54,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity","primary_cat":"cs.LG","submitted_at":"2025-12-14T15:47:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.07517","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning","primary_cat":"cs.AI","submitted_at":"2025-10-08T20:29:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5 Experiments 5.1 Setup Models and Datasets.We evaluate across five model families:Qwen2.5-7b-instruct, Qwen2.5-32b-instruct [16], Llama3.1-8b-instruct [24], Mistral-7b-v0.3 [25], and the latest GPT-OSS-20b [26], and evaluate on four benchmark datasets covering diverse reasoning tasks: Google-Proof QA (GPQA) [27], MMLU Professional Medicine subset [28, 29], HellaSwag [30], and the 
Grade-School Math 8K (GSM8K) [31]. See Appendix B.1 for more dataset details, and Appendix B.2 for other experimental details. 5.2 Experimental Results Identity bias is pervasive across models and tasks, and is dominated by sycophancy. Table 1 reports the Identity Bias Coefficient (IBC) values across models and datasets."}],"limit":50,"offset":0}