{"total":11,"items":[{"citing_arxiv_id":"2605.18113","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"iPOE: Interpretable Prompt Optimization via Explanations","primary_cat":"cs.CL","submitted_at":"2026-05-18T09:21:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"iPOE generates and optimizes annotation guidelines from explanations to produce interpretable prompts, reporting up to 39% gains over baselines on four datasets with LLM explanations substituting for human ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17790","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery","primary_cat":"cs.AI","submitted_at":"2026-05-18T03:14:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STRIDE is a self-reflective agent framework that improves accuracy, OOD robustness, and structural recovery in LLM-based symbolic regression by integrating generation, evaluation, repair, and diversity-preserving memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13149","ref_index":61,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions","primary_cat":"cs.CL","submitted_at":"2026-05-13T08:15:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to 
forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12382","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Pretraining Exposure Explains Popularity Judgments in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-12T16:45:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"wise popularity comparisons between two entities and their aliases. The model predicts which of the two entities is more popular. For each entity pair, LLM is queried three times, and the final comparison outcome is determined by majority vote. The prompt additionally requests a brief justification for each decision, as prior work has shown that generating explanations can lead to more reliable model outputs [8]. The resulting pairwise preferences are subsequently converted into listwise popularity scores using the Bradley-Terry model [3]. 
4 Experimental Setup Indexing the OLMo corpora is performed on 16 compute nodes, each equipped with 380 CPU cores and 760 GB of RAM, over a period of 10https://pageviews.wmcloud.org 11https://en.wikipedia.org/wiki/Greenland"},{"citing_arxiv_id":"2605.16379","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"An Information-Theoretic Criterion for Efficient Data Synthesis","primary_cat":"cs.LG","submitted_at":"2026-05-11T01:27:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better generalization by converging to the most information-efficient component.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09079","ref_index":57,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators","primary_cat":"cs.AI","submitted_at":"2026-05-09T17:39:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks, pages 1-8. IEEE, 2020. doi: 10.1109/IJCNN48605.2020.9207304. [56] Baixu Chen, Junguang Jiang, Ximei Wang, Pengfei Wan, Jianmin Wang, and Mingsheng Long. Debiased self-training for semi-supervised learning. In Advances in Neural Information Processing Systems, 2022. 
[57] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. AI models collapse when trained on recursively generated data. Nature, 631:755-759, 2024. doi: 10.1038/s41586-024-07566-y. [58] Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G."},{"citing_arxiv_id":"2604.17886","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Latent Preference Modeling for Cross-Session Personalized Tool Calling","primary_cat":"cs.CL","submitted_at":"2026-04-20T06:57:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17769","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF","primary_cat":"cs.CL","submitted_at":"2026-04-20T03:49:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17574","ref_index":129,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor 
Generation","primary_cat":"cs.CL","submitted_at":"2026-04-19T18:29:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.07974","ref_index":203,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","primary_cat":"cs.SE","submitted_at":"2024-03-12T17:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.13116","ref_index":79,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Survey on Knowledge Distillation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-02-20T16:17:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}