{"work":{"id":"5cd5f484-8e24-440f-8bed-e8a801f4ac40","openalex_id":null,"doi":null,"arxiv_id":"2104.10350","raw_key":null,"title":"Carbon Emissions and Large Neural Network Training","authors":null,"authors_text":"David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild","year":2021,"venue":"cs.LG","abstract":"The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. Estimating the energy cost helps measure its environmental impact and finding greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models-T5, Meena, GShard, Switch Transformer, and GPT-3-and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e): Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters. Geographic location matters for ML workload scheduling since the fraction of carbon-free energy and resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained. Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems. Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. 
To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.","external_url":"https://arxiv.org/abs/2104.10350","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-15T04:49:44.165277+00:00","pith_arxiv_id":"2104.10350","created_at":"2026-05-08T18:44:01.386909+00:00","updated_at":"2026-05-15T04:49:44.165277+00:00","title_quality_ok":true,"display_title":"Carbon Emissions and Large Neural Network Training","render_title":"Carbon Emissions and Large Neural Network Training"},"hub":{"state":{"work_id":"5cd5f484-8e24-440f-8bed-e8a801f4ac40","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":37,"external_cited_by_count":null,"distinct_field_count":14,"first_pith_cited_at":"2021-10-15T17:08:57+00:00","last_pith_cited_at":"2026-05-14T09:39:15+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T05:37:29.893163+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":4}],"polarity_counts":[{"context_polarity":"background","n":3},{"context_polarity":"support","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:20:11.513286+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":11},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":8},{"title":"Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700","work_id":"7bc98d11-b344-40f2-b27f-2e08a08d1b95","shared_citers":8},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":6},{"title":"Strubell, A","work_id":"33ac678f-b75d-4caf-b8a6-9a4d65b1748c","shared_citers":6},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":5},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":5},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":4},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":4},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":4},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":4},{"title":"SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text
processing","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":4},{"title":"SQuAD: 100,000+ Questions for Machine Comprehension of Text","work_id":"0492dd16-26e8-48d9-874c-3dd90cae7b85","shared_citers":4},{"title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","work_id":"f43c4955-a965-4897-a11b-c4b25d2aeaa8","shared_citers":4},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":4},{"title":"Alignment of language agents","work_id":"2dc6ed25-0b66-42f5-b67e-eb7e67977011","shared_citers":3},{"title":"arXiv preprint arXiv:2107.02137 , year=","work_id":"a6d1bbcd-82f9-438e-b837-c250e0bea6d9","shared_citers":3},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":3},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":3},{"title":"Don’t give me the details, just the summary! topic- aware convolutional neural networks for extreme summarization","work_id":"83dfe48d-b12e-425e-a8c9-d62ac86d1373","shared_citers":3},{"title":"Ethical and social risks of harm from Language Models","work_id":"b4ce1c45-ef69-445a-a872-dbb785b485e9","shared_citers":3},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":3}],"time_series":[{"n":3,"year":2021},{"n":5,"year":2022},{"n":2,"year":2023},{"n":1,"year":2024},{"n":24,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:20:07.424043+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:16:25.572296+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Carbon Emissions and Large Neural Network Training","claims":[{"claim_text":"The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. Estimating the energy cost helps measure its environmental impact and finding greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models-T5, Meena, GShard, Switch Transformer, and GPT-3-and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e): Large","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Carbon Emissions and Large Neural Network Training because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:19:51.277342+00:00"}},"summary":{"title":"Carbon Emissions and Large Neural Network Training","claims":[{"claim_text":"The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. 
Estimating the energy cost helps measure its environmental impact and finding greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models (T5, Meena, GShard, Switch Transformer, and GPT-3) and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e): Large","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Carbon Emissions and Large Neural Network Training because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":11},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":8},{"title":"Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700","work_id":"7bc98d11-b344-40f2-b27f-2e08a08d1b95","shared_citers":8},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":6},{"title":"Strubell, A","work_id":"33ac678f-b75d-4caf-b8a6-9a4d65b1748c","shared_citers":6},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":5},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":5},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":4},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":4},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":4},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":4},{"title":"SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":4},{"title":"SQuAD: 100,000+ Questions for Machine Comprehension of Text","work_id":"0492dd16-26e8-48d9-874c-3dd90cae7b85","shared_citers":4},{"title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","work_id":"f43c4955-a965-4897-a11b-c4b25d2aeaa8","shared_citers":4},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":4},{"title":"Alignment of language agents","work_id":"2dc6ed25-0b66-42f5-b67e-eb7e67977011","shared_citers":3},{"title":"arXiv preprint arXiv:2107.02137","work_id":"a6d1bbcd-82f9-438e-b837-c250e0bea6d9","shared_citers":3},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":3},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":3},{"title":"Don’t give me the details,
just the summary! topic-aware convolutional neural networks for extreme summarization","work_id":"83dfe48d-b12e-425e-a8c9-d62ac86d1373","shared_citers":3},{"title":"Ethical and social risks of harm from Language Models","work_id":"b4ce1c45-ef69-445a-a872-dbb785b485e9","shared_citers":3},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":3}],"time_series":[{"n":3,"year":2021},{"n":5,"year":2022},{"n":2,"year":2023},{"n":1,"year":2024},{"n":24,"year":2026}],"dependency_candidates":[]},"authors":[]}}
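The abstract quoted in this record lists four multiplicative efficiency levers (sparsely activated models, datacenter location, cloud vs. typical datacenters, ML accelerators) and an overall reduction of ~100-1000X. As a rough sanity check of that arithmetic, the sketch below simply multiplies the low and high ends of each quoted range; the factor names, the low/high pairing, and the whole calculation are my own illustration, not part of the record or the paper's methodology.

# Minimal sketch (assumed variable names, not from the paper): multiply the
# efficiency ranges quoted in the abstract to see how they compound into the
# ~100-1000X overall carbon-footprint reduction it cites.
factors = {
    "sparsely activated vs. dense model": (10.0, 10.0),   # "<1/10th the energy"
    "datacenter location (carbon-free mix)": (5.0, 10.0), # "~5X-10X"
    "cloud vs. typical datacenter": (1.4, 2.0),           # "~1.4-2X"
    "ML accelerator vs. off-the-shelf": (2.0, 5.0),       # "~2-5X"
}

low = high = 1.0
for name, (lo, hi) in factors.items():
    low *= lo
    high *= hi
    print(f"{name}: {lo}-{hi}X")

# Prints roughly 140X to 1000X, consistent with the ~100-1000X figure.
print(f"combined reduction: ~{low:.0f}X to ~{high:.0f}X")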