{"total":37,"items":[{"citing_arxiv_id":"2606.27981","ref_index":173,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ToxiREX: A Dataset on Toxic REasoning in ConteXt","primary_cat":"cs.CL","submitted_at":"2026-06-26T11:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22211","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Open AI in the Wild: Adoption and Adaptation of Open Models on r/LocalLLaMA","primary_cat":"cs.HC","submitted_at":"2026-06-20T20:14:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Thematic analysis of r/LocalLLaMA discussions finds users define openness via reliability, local control, privacy, and adaptation under compute, licensing, and usability constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31175","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Efficient LLMs Annealing with Principled Sample Selection","primary_cat":"cs.CL","submitted_at":"2026-05-29T11:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiReCT reformulates LLM annealing sample selection as a constrained optimization problem that enforces per-sample gradient directions aligned with the loss landscape's 
curvature.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10288","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:50:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"01degrades accuracy slightly. Data-mixture learning. We also evaluate data-mixture learning on 17 copyright-free Pile domains. A 280M GPT-style proxy model first learns the data-mixture weights, and a 280M main model is then trained from scratch with the learned mixture. Both proxy and main models use 16 layers, hidden size 1024, and 16 heads with the EleutherAI/gpt-neox-20b [4] tokenizer. Figure 2 reports the final per-domain evaluation. BROS achieves average loss 2.8109, essentially matching MA-SOBA (2.8098) and ZOFO (2.8114), while avoiding the unstable domain concentration observed for DoReMi [58]. 
The proxy-memory profiling results are included in the legend of Figure 2. BROS reduces peak memory by about 45% relative to MA-SOBA and about 27% relative to Penalty, while retaining a substantially higher"},{"citing_arxiv_id":"2605.02255","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Privacy of LLMs: An Ablation Study","primary_cat":"cs.CR","submitted_at":"2026-05-04T06:06:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01699","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance","primary_cat":"cs.LG","submitted_at":"2026-05-03T03:44:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13392","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold","primary_cat":"cs.AI","submitted_at":"2026-04-15T01:43:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning 
generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and financial tabular benchmarks with new faithfulness metrics.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":", 2016); (3) TabNet (Arik & Pfister, 2020), a representative deep learning method for tabular prediction; (4) TabPFN (Hollmann et al., 2023), a trained Transformer to approximate probabilistic inference for tabular classification tasks. For baselines involving fine-tuning LLMs, we further include (5) Direct SFT and (6) Direct Reasoning Curation (detailed in Section 3.1) followed by SFT (DRC+SFT); (7) Direct RL approach that directly conducts RL on the base LLM to exploit the existing reasoning capabilities of the base LLM, corresponding to the approach in (Xu et al., 2025). For all LLM fine-tuning methods, we use Qwen-2.5-3B-Instruct as the base model. For RL, we use the recently proposed DisCO algorithm (Li et al., 2025), which has been observed to be better than GRPO."},{"citing_arxiv_id":"2511.17388","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Selective Rotary Position Embedding","primary_cat":"cs.CL","submitted_at":"2025-11-21T16:50:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.16745","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling","primary_cat":"cs.LG","submitted_at":"2025-08-22T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In a cellular automata rule-inference task designed to block memorization, neural 
models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.12120","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws","primary_cat":"cs.LG","submitted_at":"2025-02-17T18:45:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.08313","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MiniMax-01: Scaling Foundation Models with Lightning Attention","primary_cat":"cs.CL","submitted_at":"2025-01-14T18:50:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.21316","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading","primary_cat":"cs.LG","submitted_at":"2024-10-26T00:43:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Deep Optimizer States splits LLMs into subgroups and uses a performance model 
to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.10819","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads","primary_cat":"cs.CL","submitted_at":"2024-10-14T17:59:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.04620","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to (Learn at Test Time): RNNs with Expressive Hidden States","primary_cat":"cs.LG","submitted_at":"2024-07-05T16:23:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Our main codebase is based on EasyLM [25], an open-source project for training and serving LLMs in JAX. All experiments can be reproduced using the publicly available code and datasets provided at the bottom of the first page. Datasets. Following the Mamba paper [27], we perform standard experiments with 2k and 8k context lengths on the Pile [24], a popular dataset of documents for training open-source LLMs [8]. 
However, the Pile contains few sequences of length greater than 8k [19]. To evaluate capabilities in long context, we also experiment with context lengths ranging from 1k to 32k in 2× increments, on a subset of the Pile called Books3, which has been widely used to train LLMs in long context [52, 3]. Backbone architecture. As discussed in Subsection 2."},{"citing_arxiv_id":"2406.00515","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Large Language Models for Code Generation","primary_cat":"cs.CL","submitted_at":"2024-06-01T17:48:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Magicoder[278], StarCoder2-instruct [304] Pre-training (Sec. 
5.3) Model Architectures Encoder-Decoder PyMT5[57], PLBART[7], CodeT5[271], JuPyT5[41] AlphaCode[151], CodeRL[139], ERNIE-Code[40] PPOCoder[238], CodeT5+[269], CodeFusion[241] AST-T5[81] Decoder-Only GPT-C[244], GPT-Neo[30], GPT-J[258], Codex[48] CodeGPT[172], CodeParrot[254], PolyCoder[290] CodeGen[193], GPT-NeoX[29], PaLM-Coder[54] InCoder[77], PanGu-Coder[55], PyCodeGPT[306] CodeGeeX[321], BLOOM[140], ChatGPT[196] SantaCoder[9], LLaMA[252], GPT-4[5] CodeGen2[192], replit-code[223], StarCoder[147] WizardCoder[173], phi-1[84], ChainCoder[323] CodeGeeX2[321], PanGu-Coder2[234], Llama 2[253] OctoPack[187], Code Llama[227], MFTCoder[160] phi-1.5[150], CodeShell[285], Magicoder[278]"},{"citing_arxiv_id":"2405.21060","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","primary_cat":"cs.LG","submitted_at":"2024-05-31T17:50:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Large Language Models across Training and Scaling\". In:The International Conference on Machine Learning (ICML). PMLR. 2023, pp. 2397-2430. [11] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. \"PIQA: Reasoning about Physical Commonsense in Natural Language\". In: Proceedings of the AAAI conference on Artificial Intelligence . Vol. 34. 2020. [12] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. \"Gpt-NeoX-20B: An Open-source Autoregressive Language Model\". In: arXiv preprint arXiv:2204.06745 (2022). 
[13] Guy E Blelloch. \"Prefix Sums and Their Applications\". In: (1990). [14] Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard"},{"citing_arxiv_id":"2405.14782","ref_index":251,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lessons from the Trenches on Reproducible Evaluation of Language Models","primary_cat":"cs.CL","submitted_at":"2024-05-23T16:50:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.10981","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on Retrieval-Augmented Text Generation for Large Language Models","primary_cat":"cs.IR","submitted_at":"2024-04-17T01:27:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AAR [157] 2023 ANCE [146], Contriever Flan-T5, InstructGPT Query2doc [137] 2023 BM25, DPR GPT-3 (text-davinci-003) Step-Back [163] 2023 PaLM-2L [23] PaLM-2L, GPT-4 ITER-RETGEN [121] 2023 Contriever InstructGPT (text-davinci-003), LLaMA2 RECITE [125] 2023 PaLM, UL2 [127], OPT [161], Codex [16] PROMPTAGATOR [27] 2023 T5 FLAN UPRISE [20] 2023 GPT-Neo-2.7B [8] BLOOM-7.1B [142], OPT-66B, GPT-3-175B GENREAD [156] 2023 InstructGPT LAPDOG [52] 2023 Contriever T5 KnowledGPT [140] 2023 GPT-4 Selfmem [21] 2023 BM25 XGLM [90], XLM-Rbase [25] MEMWALKER [13] 2023 LLaMA2 LLaMA2 RECOMP [147] 2023 BM25 T5-Large Rewrite-Retrieve-Read [94]2023 Bing T5-Large, ChatGPT(gpt-3.5-turbo), Vicuna-13B Atlas [94] 
2023 Contriever T5"},{"citing_arxiv_id":"2403.07974","ref_index":216,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","primary_cat":"cs.SE","submitted_at":"2024-03-12T17:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.01411","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology","primary_cat":"cs.SE","submitted_at":"2024-02-02T13:42:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.00752","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","primary_cat":"cs.LG","submitted_at":"2023-12-01T18:01:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference 
throughput.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Large Language Models across Training and Scaling\". In: The International Conference on Machine Learning (ICML) . PMLR. 2023, pp. 2397-2430. [8] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. \"PIQA: Reasoning about Physical Commonsense in Natural Language\". In: Proceedings of the AAAI conference on Artificial Intelligence . Vol. 34. 2020. [9] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. \"Gpt-NeoX-20B: An Open-source Autoregressive Language Model\". In:arXiv preprint arXiv:2204.06745 (2022). [10] Guy E Blelloch. \"Prefix Sums and Their Applications\". In: (1990). [11] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher."},{"citing_arxiv_id":"2311.05232","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","primary_cat":"cs.CL","submitted_at":"2023-11-09T09:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, inappropriate retrieval granularity can compromise the semantic integrity and affect the relevance of retrieved information [ 224], thereby affecting the performance of LLMs. Fixed-size chunking, which typically breaks down the documents into chunks of a specified length such as 100-word paragraphs, serves as the most crude and prevalent strategy of chunking, which is widely used in RAG systems [24, 109, 165]. 
Considering fixed-size chunking falls short in capture structure and dependency of lengthy documents, Sarthi et al. [267] proposed RAPTOR, an indexing and retrieval system. By recursively embedding, clustering, and summarizing chunks of text, RAPTOR constructs a tree to capture both high-level and low-level details. When retrieval, RAPTOR enables LLMs to integrate information from different levels of abstraction, providing"},{"citing_arxiv_id":"2311.01378","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vision-Language Foundation Models as Effective Robot Imitators","primary_cat":"cs.RO","submitted_at":"2023-11-02T16:34:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.16789","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting Pretraining Data from Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-10-25T17:21:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"are guaranteed not to be present in the pretraining data. The temporal nature of events ensures that non-member data is indeed unseen and not mentioned in the pretraining data. 
(2) General: our benchmark is not confined to any specific model and can be applied to various models pretrained using Wikipedia (e.g., OPT, LLaMA, GPT-Neo) since Wikipedia is a commonly used pretraining data source. (3) Dynamic: we will continually update our benchmark by gathering newer non-member data (i.e., more recent events) from Wikipedia since our data construction pipeline is fully automated. MIA methods for finetuning (Carlini et al., 2022; Watson et al., 2022) usually calibrate the target model probabilities of an example using a shadow reference model that is trained on a similar"},{"citing_arxiv_id":"2310.11511","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection","primary_cat":"cs.CL","submitted_at":"2023-10-17T18:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.00071","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"YaRN: Efficient Context Window Extension of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-08-31T18:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation beyond fine-tuning 
lengths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.06435","ref_index":118,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Overview of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-07-12T20:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tual consistency, ERNIE 3.0 Titan adds another task, Credible and Controllable Generations, to its multi-task learning setup. It introduces additional self-supervised adversarial and controllable language modeling losses to the pre-training step, which enables ERNIE 3.0 Titan to beat other LLMs in their manually selected Factual QA task set evaluations. GPT-NeoX-20B [118]: An auto-regressive model that largely follows GPT-3 with a few deviations in architecture design, trained on the Pile dataset without any data deduplication. GPT-NeoX has parallel attention and feed-forward layers in a transformer block, given in Eq. 4, that increases throughput by 15%. 
It uses rotary positional embedding [66], applying it to only"},{"citing_arxiv_id":"2306.00978","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration","primary_cat":"cs.CL","submitted_at":"2023-06-01T17:59:10+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16264","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Data-Constrained Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-25T17:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"soning about Physical Commonsense in Natural Language. In Thirty-Fourth AAAI Conference on Artificial Intelligence. [13] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. 2022. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arXiv preprint arXiv:2204.06745. [14] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow. If you use this software, please cite it using these metadata, 58. 
[15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al."},{"citing_arxiv_id":"2305.07922","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CodeT5+: Open Code Large Language Models for Code Understanding and Generation","primary_cat":"cs.CL","submitted_at":"2023-05-13T14:23:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CodeT5+ is a flexible encoder-decoder LLM family for code pretrained with diverse objectives on multilingual corpora and initialized from existing LLMs, achieving state-of-the-art results on code generation, completion, math programming, and retrieval tasks including new SoTA on HumanEval with the 1","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.06161","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StarCoder: may the source be with you!","primary_cat":"cs.CL","submitted_at":"2023-05-09T08:16:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.02301","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distilling Step-by-Step! 
Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes","primary_cat":"cs.CL","submitted_at":"2023-05-03T17:50:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.18223","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mT5 [83] Oct-2020 13 - - - 1T tokens - - -✓- PanGu-α[84] Apr-2021 13* - - - 1.1TB - 2048 Ascend 910 -✓- CPM-2 [85] Jun-2021 198 - - - 2.6TB - - - - - T0 [28] Oct-2021 11 T5✓- - - 512 TPU v3 27 h✓- CodeGen [86] Mar-2022 16 - - - 577B tokens - - -✓- GPT-NeoX-20B [87] Apr-2022 20 - - - 825GB - 96 40G A100 -✓- Tk-Instruct [88] Apr-2022 11 T5✓- - - 256 TPU v3 4 h✓- UL2 [89] May-2022 20 - - - 1T tokens Apr-2019 512 TPU v4 -✓ ✓ OPT [90] May-2022 175 - - - 180B tokens - 992 80G A100 -✓- NLLB [91] Jul-2022 54.5 - - - - - - -✓- CodeGeeX [92] Sep-2022 13 - - - 850B tokens - 1536 Ascend 910 60 d✓- GLM [93] Oct-2022 130 - - - 400B tokens - 768 40G A100 60 d✓- Flan-T5 [69] Oct-2022 11 T5✓- - - - -✓ ✓ BLOOM [78] Nov-2022 176 - - - 366B tokens - 384 80G A100 105 
d✓-"},{"citing_arxiv_id":"2303.08112","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Eliciting Latent Predictions from Transformers with the Tuned Lens","primary_cat":"cs.LG","submitted_at":"2023-03-14T17:47:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"⟨pnew −p old,q new −q old⟩w >0.(15) Measuring alignment.Let g:R d →R d be an arbitrary function for intervening on hidden states, and let hℓ be the hidden state at layer ℓ on some input x. We'll define the stimulusto be the Aitchison difference between the tuned lens output before and after the intervention: S(hℓ) = TunedLensℓ(g(hℓ))−TunedLens ℓ(hℓ)(16) Analogously, theresponsewill be defined as the Aitchison difference between the final layer output before and after the intervention: R(hℓ) =M >ℓ(g(hℓ))− M >ℓ(hℓ)(17) 15 20 25 Layer 0.0 0.2 0.4 0.6 0.8 1.0Accuracy Accuracy for Direct Questions Lens T ype Logit Lens T uned Lens 15 20 25 Layer 0.0 0.2 0.4 0.6 0.8 1.0Accuracy Accuracy for Standard Questions"},{"citing_arxiv_id":"2211.09085","ref_index":99,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Galactica: A Large Language Model for Science","primary_cat":"cs.CL","submitted_at":"2022-11-16T18:06:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on 
BIG-bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.05100","ref_index":209,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BLOOM: A 176B-Parameter Open-Access Multilingual Language Model","primary_cat":"cs.CL","submitted_at":"2022-11-09T18:48:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.01068","ref_index":293,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OPT: Open Pre-trained Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2022-05-02T17:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}