{"total":17,"items":[{"citing_arxiv_id":"2605.05076","ref_index":66,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"High-Dimensional Statistics: Reflections on Progress and Open Problems","primary_cat":"math.ST","submitted_at":"2026-05-06T16:11:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This review synthesizes representative advances in high-dimensional statistics, highlights common themes and open problems, and points to key entry works.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.03714","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines","primary_cat":"cs.CL","submitted_at":"2023-10-05T17:37:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16264","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Data-Constrained Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-25T17:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the 2nd Workshop 
on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46-51. [70] Niklas Muennighoff. 2020. Vilio: State-of-the-art visio-linguistic models applied to hateful memes. arXiv preprint arXiv:2012.07788. [71] Niklas Muennighoff. 2022. SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904. [72] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2023. OctoPack: Instruction Tuning Code Large Language Models. arXiv preprint arXiv:2308.07124. [73] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. MTEB: Massive Text Embedding Benchmark."},{"citing_arxiv_id":"2304.06364","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models","primary_cat":"cs.CL","submitted_at":"2023-04-13T09:39:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.18223","ref_index":125,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"word, the higher the fidelity, the more resolution 
you get in this process...\"16. Capacity Leap. Although GPT-2 is intended to be an \"unsupervised multitask learner\", it overall has an inferior performance compared with supervised fine-tuning state-of-the-art methods. Because it has a relatively small model size, it has been widely fine-tuned in downstream tasks, especially the dialog tasks [124, 125]. Based on GPT-2, GPT-3 15. To better understand this sentence, we put some explanation words in parentheses. 16. https://lifearchitect.ai/ilya/ 8 TABLE 1: Statistics of large language models (having a size larger than 10B in this survey) in recent years, including the capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs."},{"citing_arxiv_id":"2301.13688","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Flan Collection: Designing Data and Methods for Effective Instruction Tuning","primary_cat":"cs.AI","submitted_at":"2023-01-31T15:03:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.09110","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Holistic Evaluation of Language Models","primary_cat":"cs.CL","submitted_at":"2022-11-16T18:51:34+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HELM establishes a multi-metric evaluation covering 30 language models on 42 scenarios (16 core) to raise average scenario coverage from 17.9% to 96% under uniform conditions while releasing all prompts, completions, and a 
toolkit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2206.07682","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emergent Abilities of Large Language Models","primary_cat":"cs.CL","submitted_at":"2022-06-15T17:32:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.01068","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OPT: Open Pre-trained Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2022-05-02T17:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2110.08207","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multitask Prompted Training Enables Zero-Shot Task Generalization","primary_cat":"cs.LG","submitted_at":"2021-10-15T17:08:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2104.08773","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cross-Task Generalization via Natural Language Crowdsourcing Instructions","primary_cat":"cs.CL","submitted_at":"2021-04-18T08:44:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents the NATURAL INSTRUCTIONS meta-dataset and shows generative pre-trained language models achieve 19% better generalization to unseen tasks when using task instructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2009.01325","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to summarize from human feedback","primary_cat":"cs.CL","submitted_at":"2020-09-02T19:54:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2005.14165","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Models are Few-Shot Learners","primary_cat":"cs.CL","submitted_at":"2020-05-28T17:29:03+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without 
fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1910.10683","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","primary_cat":"cs.LG","submitted_at":"2019-10-23T17:37:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1909.05858","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CTRL: A Conditional Transformer Language Model for Controllable Generation","primary_cat":"cs.CL","submitted_at":"2019-09-11T17:57:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1905.00537","ref_index":123,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems","primary_cat":"cs.CL","submitted_at":"2019-05-02T00:41:50+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond 
GLUE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1804.07461","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding","primary_cat":"cs.CL","submitted_at":"2018-04-20T06:35:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GLUE is a multi-task benchmark for general natural language understanding that includes a diagnostic test suite and finds limited gains from current multi-task learning methods over single-task training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}