{"total":20,"items":[{"citing_arxiv_id":"2605.30844","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-Tuning Improves Information Conveyance in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-29T05:05:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning reorganizes uncertainty in LLMs into more efficient information conveyance, as shown by stronger length-entropy correlations and a tripling of entropy-semantic diversity links after controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29448","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions","primary_cat":"cs.LG","submitted_at":"2026-05-28T06:40:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Vendi Score and scaling-law objectives belong to the class of matrix spectral functions, which are submodular, enabling efficient greedy selection of training data that outperforms random subsets in predicting held-out performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22564","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:45:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent 
testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20086","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Do Evolutionary Coding Agents Evolve?","primary_cat":"cs.NE","submitted_at":"2026-05-19T16:41:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17193","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-LLM Systems Exhibit Robust Semantic Collapse","primary_cat":"cs.MA","submitted_at":"2026-05-16T23:29:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Closed-loop multi-LLM systems exhibit robust semantic collapse across model families and interventions, consistent with intrinsic properties of autoregressive generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17187","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media","primary_cat":"cs.CL","submitted_at":"2026-05-16T22:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online 
communities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11494","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T04:10:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"tiPrompts [41] for broad category coverage, and GenEval [14] for compositional accuracy. For MS-COCO, we use a subset of 2,000 captions from the 2014 validation split. See Supp. Sec B.1. Metrics. We report four primary metrics. InBatchSim (InBSim) ↓ measures the average pairwise CLIP similarity among images generated from the same prompt, directly quantifying mode collapse, where lower values indicate greater diversity. Vendi Score [8] per prompt (VD/pDINO) ↑ computes the effective number of distinct outputs per prompt using DINO [29] features, providing a feature-level diversity measure complementary to InBSim. CLIP Score [17] ↑ measures text-image alignment, ensuring diversity gains do not come at the expense of prompt faithfulness. Human Preference Score (HPSv2) [39] ↑ evaluates perceptual quality and aesthetic appeal; we refer to HPSv2 simply as"},{"citing_arxiv_id":"2605.11258","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unlocking LLM Creativity in Science through Analogical 
Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T21:35:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The Virtual Lab simulates a real-life lab environment through collaborating cross-domain LLM agents, resulting in the design of novel COVID-19 binders [38]. Kosmos performs long-horizon iteration that discovered novel mechanisms for aging [26]. The AI Co-scientist applies tournament evolution to hypothesis generation, resulting in promising new disease targets [16]. These efforts demonstrate AI's immense potential to augment the scientific process. However, the success of autonomous science relies on the ability of AI systems to consistently generate novel and diverse approaches to research problems. Proposed solutions must be novel to drive research progress past existing work. 
Furthermore, even the most promising solutions may be invalidated after empirical testing."},{"citing_arxiv_id":"2605.11142","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rank Is Not Capacity: Spectral Occupancy for Latent Graph Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T18:46:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Spectra defines and controls effective capacity in graph embeddings via the Shannon effective rank of a trace-normalized kernel spectrum, making capacity a post-fit property rather than a pre-training hyperparameter.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"The kernel is invariant under the gauge symmetries of the factorization, trace normalization fixes total spectral mass, and the normalized eigenvalues form a probability distribution on latent modes, which we call the spectral occupancy distribution. We summarize this distribution by the effective spectral dimension dspec, the exponential of its Shannon entropy [16, 54]. Spectral entropy has been used as a diagnostic in language models [27, 66], training dynamics [67], and adaptive-rank compression [13]. Here, it becomes a controllable, end-to-end training-time capacity coordinate for latent graph models. 
This positioning connects to two views of overparameterized models: effective complexity can govern generalization even when parameter counts are large [7, 9], and gradient descent on factorized matrix"},{"citing_arxiv_id":"2605.08472","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-08T20:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. [7] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025. [8] Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022. [9] Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. 
Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503."},{"citing_arxiv_id":"2605.05104","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Building informative materials datasets beyond targeted objectives","primary_cat":"cond-mat.mtrl-sci","submitted_at":"2026-05-06T16:39:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A diversity-aware selection framework builds materials datasets that improve prediction performance on both targeted (up to 25% gain) and untargeted properties (up to 10% gain) compared to random or non-diverse sampling in noisy experimental settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23540","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization","primary_cat":"cs.CV","submitted_at":"2026-04-26T05:32:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"accelerating convergence without risking divergence or triggering reward hacking vulnerabilities. 5 Empirical Analysis 5.1 Experimental Settings Datasets & Metrics. We evaluate our framework on four diverse benchmarks to capture different generative capabilities. 
We use MS-COCO 2017 [21] (5k val) to test zero-shot fidelity, DrawBench [31] for complex spatial relations, and GenEval [10] for fine-grained compositional reasoning and object counting. Additionally, we utilize Pick-a-Pic [19] to assess alignment with complex human preferences. To comprehensively quantify performance, generative quality and diversity are measured using FID [34] and Vendi Score [9], while CLIP Score [29] evaluates overall text-image semantic matching. Finally, detailed intent alignment and visual appeal are"},{"citing_arxiv_id":"2604.03472","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution","primary_cat":"cs.CL","submitted_at":"2026-04-03T21:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03380","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation","primary_cat":"cs.CL","submitted_at":"2026-04-03T18:19:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.24480","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories","primary_cat":"cs.CV","submitted_at":"2026-03-25T16:22:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PF-MA is a new active learning rule that favors likely-positive uncertain samples to speed up discovery of rare categories in imbalanced visual retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.07633","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flow-Based Conformal Predictive Distributions","primary_cat":"stat.ML","submitted_at":"2026-02-07T17:26:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Differentiable nonconformity scores induce flows that sample conformal prediction set boundaries, and mixing flows across levels produces conformal predictive distributions whose quantiles match the sets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.00090","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models","primary_cat":"cs.CV","submitted_at":"2025-12-31T19:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output 
fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12072","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs","primary_cat":"cs.CL","submitted_at":"2025-12-12T22:39:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Voyager iteratively optimizes a determinantal point process diversity measure to generate synthetic LLM datasets, delivering 1.5-3 times higher diversity than baselines in experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25424","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Polychromic Objectives for Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-09-29T19:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.15689","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization","primary_cat":"cs.CV","submitted_at":"2024-12-20T09:07:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DOLLAR combines variational score and consistency distillation for few-step video generation plus latent reward optimization, reporting 82.57 VBench score and up to 278x speedup over the teacher 
diffusion model for 128-frame 10-second videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}