GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
How much knowledge can you pack into the parameters of a language model?
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
representative citing papers
PAS encodes locations via relative anchors and bins, delivering roughly 370–400 m of adversarial error in spatial RAG while retaining over half of the baseline retrieval performance and keeping generation quality robust.
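A minimal sketch of the anchor-plus-bin idea behind that summary (the bin size, anchor set, and field names are assumptions, not the paper's PAS scheme): a coordinate is stored as the nearest anchor plus a coarsely binned offset, so the exact location never enters the retrievable text.

```python
# Sketch of anchor-plus-bin location encoding; anchors and bin_size are
# illustrative assumptions, not the paper's parameters.
import math

def encode_location(x, y, anchors, bin_size=400.0):
    """anchors: list of (anchor_id, ax, ay); bin_size in meters (assumed)."""
    anchor_id, ax, ay = min(anchors, key=lambda a: math.hypot(x - a[1], y - a[2]))
    dx_bin = round((x - ax) / bin_size)
    dy_bin = round((y - ay) / bin_size)
    return {"anchor": anchor_id, "dx_bin": dx_bin, "dy_bin": dy_bin}

anchors = [("plaza", 0.0, 0.0), ("station", 1500.0, -800.0)]
print(encode_location(620.0, 210.0, anchors))  # {'anchor': 'plaza', 'dx_bin': 2, 'dy_bin': 1}
```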
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
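A minimal sketch of top-1 expert routing as the summary describes it, in PyTorch; the capacity factor, load-balancing auxiliary loss, and distributed expert dispatch from the paper are omitted.

```python
# Top-1 ("switch") routing: each token goes to the single expert with the
# highest router probability, so per-token compute stays constant as the
# number of experts grows.
import torch
import torch.nn as nn

class SwitchLayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)     # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

y = SwitchLayer()(torch.randn(10, 64))
```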
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
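A retrieve-then-condition sketch of the pattern the summary describes; RAG proper marginalizes over retrieved documents during generation, which this simplification skips, and `embed` and `generate` stand in for a real dense retriever and seq2seq generator.

```python
# Retrieve top-k passages by inner product and prepend them to the
# generator's prompt; a simplification of RAG, not its exact formulation.
import numpy as np

def retrieve(query_vec, passage_vecs, passages, k=3):
    scores = passage_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

def rag_answer(question, embed, generate, passages, passage_vecs):
    context = "\n".join(retrieve(embed(question), passage_vecs, passages))
    return generate(f"context: {context}\nquestion: {question}\nanswer:")
```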
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer trainable parameters.
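A sketch of SVD-seeded LoRA initialization in the spirit of that summary, assuming the task statistic is a weight-shaped matrix M; TLoRA's actual construction and its sensitivity-driven rank allocation are not reproduced here.

```python
# Seed the LoRA factors from the top-r singular directions of a task matrix M,
# instead of the usual zero / random-normal initialization.
import numpy as np

def lora_init_from_svd(M, r):
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    B = U[:, :r] * np.sqrt(S[:r])            # (out_dim, r)
    A = np.sqrt(S[:r])[:, None] * Vt[:r]     # (r, in_dim)
    return B, A                              # delta_W ~= B @ A, the best rank-r fit

M = np.random.randn(768, 768)
B, A = lora_init_from_svd(M, r=8)
print(np.linalg.norm(M - B @ A) / np.linalg.norm(M))  # residual of the rank-8 truncation
```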
Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
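A sketch of a Condition-Proof-Conclusion store where only the Condition field anchors retrieval, as the summary describes; the field names and the `embed` placeholder are assumptions, not Tri-RAG's schema.

```python
# Only the Condition field is embedded and searched; the full triplet is
# returned to the generator.
from dataclasses import dataclass
import numpy as np

@dataclass
class Triplet:
    condition: str
    proof: str
    conclusion: str

def build_index(triplets, embed):
    return np.stack([embed(t.condition) for t in triplets])

def retrieve(query, triplets, index, embed, k=3):
    scores = index @ embed(query)
    return [triplets[i] for i in np.argsort(-scores)[:k]]
```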
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.
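A sketch of the two ingredients named in that summary, under assumed interfaces: filter training examples by a loss score and cap how often each fact appears so the frequency distribution flattens. The paper's actual scoring rule, threshold, and cap are not reproduced here, and preferring low-loss examples is an assumption.

```python
# Keep examples that pass a loss threshold, and cap repetitions per fact so
# no fact dominates the training distribution.
from collections import defaultdict

def prune_and_flatten(examples, fact_of, loss_of, loss_threshold, max_per_fact):
    kept, counts = [], defaultdict(int)
    for ex in sorted(examples, key=loss_of):   # assumed preference for low-loss examples
        fact = fact_of(ex)
        if loss_of(ex) <= loss_threshold and counts[fact] < max_per_fact:
            kept.append(ex)
            counts[fact] += 1
    return kept
```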
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
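A sketch of the closed-loop pattern that summary describes: the model proposes a step, the step is executed, and textual feedback is appended to the running prompt before the next step. `llm`, `execute`, and `get_feedback` are placeholders, not the paper's interfaces.

```python
# Closed-loop planning: language feedback from each executed step is folded
# back into the prompt that produces the next step.
def inner_monologue(instruction, llm, execute, get_feedback, max_steps=10):
    transcript = f"Task: {instruction}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Robot action:")
        if step.strip().lower() == "done":
            break
        result = execute(step)
        transcript += f"Robot action: {step}\n{get_feedback(result)}\n"
    return transcript
```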
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
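One of the stability techniques ST-MoE introduces is a router z-loss that penalizes large router logits; a minimal version is below, with the coefficient treated as an assumed default rather than a prescribed value.

```python
# Router z-loss: squared log-partition of the router logits, averaged over
# tokens and scaled by a small coefficient.
import torch

def router_z_loss(router_logits, coeff=1e-3):
    """router_logits: (num_tokens, num_experts)."""
    z = torch.logsumexp(router_logits, dim=-1)   # per-token log-partition
    return coeff * (z ** 2).mean()
```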
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
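A minimal in-batch-negative contrastive loss of the kind used to train such unsupervised dense retrievers; the temperature is an assumed hyperparameter, and positive-pair construction (for example, independent crops of one document) is left to the caller.

```python
# InfoNCE with in-batch negatives: row i of p_emb is the positive for query i,
# every other row in the batch acts as a negative.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.05):
    """q_emb, p_emb: (batch, dim)."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```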
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
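The crossover claim can be written out directly; the functional forms of the two timescales are the paper's, only the crossover condition is restated here, and the delay notation is introduced for illustration.

```latex
% Critical model size P_c where the memorization and generalization
% timescales meet; where T_gen exceeds T_mem, the delayed generalization
% spans roughly their difference.
T_{\mathrm{mem}}(P_c) = T_{\mathrm{gen}}(P_c),
\qquad
\Delta t_{\mathrm{grok}}(P) \approx T_{\mathrm{gen}}(P) - T_{\mathrm{mem}}(P).
```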
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
Chunk-as-a-Service with the UCOSA online algorithm enables budget-constrained selection of prompts for chunk enrichment in RAG, outperforming random selection by 52% on a combined performance metric and delivering higher performance-to-budget ratios than standard RaaS.
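UCOSA itself is an online algorithm and is not reproduced here; as a point of reference, the simplest offline baseline for the same budget-constrained selection problem is a greedy pick by estimated value per unit cost. All names and the scoring interface below are assumptions.

```python
# Greedy value-per-cost selection of chunk-enrichment prompts under a budget;
# an offline baseline, not the UCOSA algorithm.
def greedy_enrichment_selection(candidates, budget):
    """candidates: list of dicts with 'cost' and 'est_value' keys."""
    chosen, spent = [], 0.0
    for c in sorted(candidates, key=lambda c: c["est_value"] / c["cost"], reverse=True):
        if spent + c["cost"] <= budget:
            chosen.append(c)
            spent += c["cost"]
    return chosen
```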
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Ensemble voting across multiple LLMs improves results on EHR question answering subtasks, with best dev scores of 88.81 micro F1 on evidence-answer alignment.
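A minimal majority-vote ensemble of the kind that summary describes; tie-breaking, answer normalization, and any per-subtask weighting are left out, and the interface is an assumption.

```python
# Majority vote across several LLMs' answers to the same question.
from collections import Counter

def ensemble_answer(question, models):
    """models: list of callables mapping a question string to an answer string."""
    answers = [m(question) for m in models]
    return Counter(answers).most_common(1)[0][0]
```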
citing papers explorer
Inner Monologue: Embodied Reasoning through Planning with Language Models
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.