{"work":{"id":"214732c0-2edd-44a0-af9e-28184a2b8279","openalex_id":null,"doi":null,"arxiv_id":"2005.14165","raw_key":null,"title":"Language Models are Few-Shot Learners","authors":null,"authors_text":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal","year":2020,"venue":"cs.CL","abstract":"Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. 
We discuss broader societal impacts of this finding and of GPT-3 in general.","external_url":"https://arxiv.org/abs/2005.14165","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T21:19:28.977680+00:00","pith_arxiv_id":"2005.14165","created_at":"2026-05-08T18:13:53.729131+00:00","updated_at":"2026-05-14T21:19:28.977680+00:00","title_quality_ok":true,"display_title":"Language Models are Few-Shot Learners","render_title":"Language Models are Few-Shot Learners"},"hub":{"state":{"work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":148,"external_cited_by_count":null,"distinct_field_count":20,"first_pith_cited_at":"2020-06-05T19:54:34+00:00","last_pith_cited_at":"2026-05-13T15:00:29+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T22:06:14.523922+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":1}],"polarity_counts":[{"context_polarity":"unclear","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Language Models are Few-Shot Learners","claims":[{"claim_text":"Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performan","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models are Few-Shot Learners because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T22:53:50.523190+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"eba2929d-c6c0-4c6b-9618-0628e5acfc97","orcid":null,"display_name":"Tom B. 
Brown"},{"id":"f54cf421-84fe-41eb-8542-6d488ae5eda1","orcid":null,"display_name":"Benjamin Mann"},{"id":"07d78f01-c830-471b-a513-c3b9859a4427","orcid":null,"display_name":"Nick Ryder"},{"id":"2995d5ea-ea4e-4824-9356-40f09d6332b3","orcid":null,"display_name":"Melanie Subbiah"},{"id":"419c50dd-141f-4361-af4e-2e0acde44908","orcid":null,"display_name":"Jared Kaplan"},{"id":"be640ed3-1548-4b69-a2ac-76fa45adb08b","orcid":null,"display_name":"Prafulla Dhariwal"}]},"error":null,"updated_at":"2026-05-13T22:53:59.016303+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T22:43:48.497067+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":27},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":27},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":27},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":26},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":24},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":20},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":18},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":16},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":16},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":16},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":15},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":15},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":15},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":13},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":13},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":13},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":12},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":12},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Finetuned Language Models Are Zero-Shot 
Learners","work_id":"7ed6cdaa-ed67-4db4-aceb-b7e1b0e6e7c4","shared_citers":10},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":10},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":10}],"time_series":[{"n":6,"year":2020},{"n":12,"year":2021},{"n":10,"year":2022},{"n":7,"year":2023},{"n":11,"year":2024},{"n":4,"year":2025},{"n":94,"year":2026}]},"error":null,"updated_at":"2026-05-13T22:43:56.043625+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T22:44:02.060377+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Language Models are Few-Shot Learners","claims":[{"claim_text":"Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performan","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models are Few-Shot Learners because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T22:43:57.364903+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Language Models are Few-Shot Learners","claims":[{"claim_text":"Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performan","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models are Few-Shot Learners because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T22:43:55.901221+00:00"}},"summary":{"title":"Language Models are Few-Shot Learners","claims":[{"claim_text":"Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. 
By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performan","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models are Few-Shot Learners because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":27},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":27},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":27},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":26},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":24},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":20},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":18},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":16},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":16},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":16},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":15},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":15},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":15},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":13},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":13},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":13},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":12},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":12},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Finetuned Language Models Are Zero-Shot Learners","work_id":"7ed6cdaa-ed67-4db4-aceb-b7e1b0e6e7c4","shared_citers":10},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":10},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model 
Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":10}],"time_series":[{"n":6,"year":2020},{"n":12,"year":2021},{"n":10,"year":2022},{"n":7,"year":2023},{"n":11,"year":2024},{"n":4,"year":2025},{"n":94,"year":2026}]},"authors":[{"id":"f54cf421-84fe-41eb-8542-6d488ae5eda1","orcid":null,"display_name":"Benjamin Mann","source":"manual","import_confidence":0.72},{"id":"419c50dd-141f-4361-af4e-2e0acde44908","orcid":null,"display_name":"Jared Kaplan","source":"manual","import_confidence":0.72},{"id":"2995d5ea-ea4e-4824-9356-40f09d6332b3","orcid":null,"display_name":"Melanie Subbiah","source":"manual","import_confidence":0.72},{"id":"07d78f01-c830-471b-a513-c3b9859a4427","orcid":null,"display_name":"Nick Ryder","source":"manual","import_confidence":0.72},{"id":"be640ed3-1548-4b69-a2ac-76fa45adb08b","orcid":null,"display_name":"Prafulla Dhariwal","source":"manual","import_confidence":0.72},{"id":"eba2929d-c6c0-4c6b-9618-0628e5acfc97","orcid":null,"display_name":"Tom B. Brown","source":"manual","import_confidence":0.72}]}}