Carbon Emissions and Large Neural Network Training
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-11 23:43 UTC · model grok-4.3
The pith
Choices of neural network architecture, training location, and hardware can reduce the carbon footprint of training large AI models by up to roughly 1000 times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Calculations for several recent large models show that large but sparsely activated DNNs can consume less than one tenth the energy of large dense DNNs without loss of accuracy. Geographic location changes the fraction of carbon-free energy and the resulting CO2e by factors of roughly five to ten. Cloud datacenters can be 1.4 to 2 times more energy efficient than typical facilities, and the ML-oriented accelerators inside them can be 2 to 5 times more effective than general-purpose systems. The combined choice of DNN architecture, datacenter, and processor therefore allows carbon footprint reductions of up to roughly 100 to 1000 times.
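To make the arithmetic concrete, the sketch below estimates a training run's energy and CO2e from chip count, per-chip power, training time, datacenter PUE, and grid carbon intensity, then applies the paper's four reduction levers at their upper ends. Every numeric input is an illustrative placeholder, not a figure from the paper.

```python
# Minimal sketch of the estimation arithmetic behind the core claim.
# All numeric values are illustrative placeholders, not figures from the paper.

def training_co2e_kg(num_chips, avg_chip_power_w, hours, pue, kg_co2e_per_kwh):
    """Estimate CO2e for one training run.

    energy (kWh) = chips * average chip power (kW) * hours * PUE
    CO2e (kg)    = energy * grid carbon intensity
    """
    energy_kwh = num_chips * (avg_chip_power_w / 1000.0) * hours * pue
    return energy_kwh * kg_co2e_per_kwh

# Hypothetical dense-model run in a typical facility on a carbon-heavy grid.
baseline = training_co2e_kg(num_chips=512, avg_chip_power_w=300,
                            hours=500, pue=1.6, kg_co2e_per_kwh=0.6)

# Upper-end reduction factors for the paper's four levers:
# sparsity ~10x, location ~10x, datacenter ~2x, accelerator ~5x.
reduction = 10 * 10 * 2 * 5  # ~1000x, the headline upper bound
improved = baseline / reduction

print(f"baseline: {baseline:,.0f} kg CO2e")
print(f"improved: {improved:,.1f} kg CO2e ({reduction}x lower)")
```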
What carries the argument
The side-by-side energy and CO2e calculations across dense versus sparse DNN architectures, different geographic carbon intensities, datacenter infrastructure levels, and general versus ML-specific processors.
If this is right
- Sparsely activated models achieve comparable accuracy while using under one tenth the energy of equivalent dense models.
- Scheduling workloads in locations with higher carbon-free energy shares reduces emissions by factors of five to ten.
- Cloud data centers combined with ML accelerators improve energy efficiency by roughly three to ten times over standard setups.
- Reporting energy consumption and CO2e in papers that involve large-scale training would prevent inaccurate retroactive estimates.
- Adding energy usage to benchmarks such as MLPerf would make efficiency a primary evaluation criterion.
Where Pith is reading between the lines
- Workload schedulers could move large training jobs to low-carbon periods and regions in real time to cut emissions without altering model code (see the placement sketch after this list).
- The large variability implies that past aggregate estimates of AI's total carbon impact may require downward revision when actual training conditions are taken into account.
- Model selection processes may begin to treat energy efficiency as a first-class constraint alongside accuracy and speed.
- Similar calculations could be applied to inference workloads, affecting choices about where and how deployed models run.
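A minimal placement sketch for the scheduler idea above: given a forecast of carbon intensity per region and hour, pick the slot with the lowest intensity for a deferrable training job. The region names and intensity values are hypothetical.

```python
# Hypothetical carbon-aware placement: choose the (region, hour) slot with the
# lowest forecast grid carbon intensity for a deferrable training job.
# Region names and intensity values are invented for illustration.

forecast_kg_co2e_per_kwh = {
    ("region-a", "02:00"): 0.08,  # e.g. high wind output overnight
    ("region-a", "14:00"): 0.35,
    ("region-b", "02:00"): 0.45,
    ("region-b", "14:00"): 0.12,  # e.g. solar peak
}

def best_slot(forecast):
    """Return the (region, hour) key with the minimum carbon intensity."""
    return min(forecast, key=forecast.get)

region, hour = best_slot(forecast_kg_co2e_per_kwh)
print(f"schedule training in {region} at {hour} "
      f"({forecast_kg_co2e_per_kwh[(region, hour)]} kg CO2e/kWh)")
```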
Load-bearing premise
The carbon intensity figures for specific datacenters and the power draw estimates for accelerators and systems are taken as accurate without independent verification.
What would settle it
Direct metering of electricity consumption during training of a model such as Switch Transformer or GPT-3 at two locations with documented carbon intensities, followed by comparison of the measured emission ratio against the predicted five- to tenfold geographic difference.
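If such a metering experiment were run, the comparison itself is one line of arithmetic: multiply each site's metered energy by its documented carbon intensity and take the ratio. The numbers below are placeholders for what the meters and grid records would supply.

```python
# Hypothetical check of the 5-10x geographic claim from direct metering.
# Metered energy and documented carbon intensities are placeholder values.

site_a = {"metered_kwh": 1_250_000, "kg_co2e_per_kwh": 0.08}  # low-carbon grid
site_b = {"metered_kwh": 1_300_000, "kg_co2e_per_kwh": 0.55}  # carbon-heavy grid

co2e_a = site_a["metered_kwh"] * site_a["kg_co2e_per_kwh"]
co2e_b = site_b["metered_kwh"] * site_b["kg_co2e_per_kwh"]

print(f"measured emission ratio: {co2e_b / co2e_a:.1f}x "
      f"(the paper predicts roughly 5-10x from location alone)")
```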
read the original abstract
The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. Estimating the energy cost helps measure its environmental impact and finding greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models (T5, Meena, GShard, Switch Transformer, and GPT-3) and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e): Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters. Geographic location matters for ML workload scheduling since the fraction of carbon-free energy and resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained. Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems. Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper estimates the energy consumption and carbon footprint of several large neural network models including T5, Meena, GShard, Switch Transformer, and GPT-3, while refining earlier estimates for the Evolved Transformer found via neural architecture search. It identifies four main opportunities for reducing CO2e: sparsely activated DNNs consuming <1/10 the energy of dense models, geographic location affecting carbon intensity by 5-10X even within the same country, Cloud datacenters being 1.4-2X more efficient than typical ones, and ML accelerators being 2-5X more effective than off-the-shelf hardware. These factors are multiplied to claim potential carbon footprint reductions of up to ~100-1000X. The authors advocate making energy consumption and CO2e explicit in ML papers and incorporating energy metrics into benchmarks such as MLPerf.
Significance. If the estimates and multiplicative factors are robust, the work is significant for quantifying the environmental costs of scaling ML and for outlining concrete, high-impact mitigation strategies based on model architecture, scheduling location, infrastructure, and hardware. The explicit numerical baselines for multiple recent models and the call for energy to become a standard evaluation metric alongside accuracy provide a useful reference point for the community and could encourage more reproducible reporting practices.
major comments (3)
- Abstract: the headline claim that DNN/datacenter/processor choice yields up to ~100-1000X lower CO2e is obtained by multiplying four independent factors (sparsity <1/10, location 5-10X, datacenter 1.4-2X, accelerator 2-5X); the manuscript provides no sensitivity analysis or error bounds showing how plausible 2-3X variations in any single input (carbon intensity or power-draw model) would affect the upper end of the reported range (see the sensitivity sketch after this list).
- Sections detailing per-model energy calculations (T5, Meena, GShard, Switch, GPT-3): the baseline energy figures are derived from hardware specifications, assumed utilization rates, and training durations without primary measurement data, cross-validation, or explicit exclusion rules; any error in these baselines directly scales all subsequent relative savings and the 100-1000X claim.
- Geographic location and datacenter efficiency paragraphs: the 5-10X carbon-free energy variation and 1.4-2X datacenter efficiency gains are stated without citing the specific carbon-intensity tables, PUE values, or regional grid data sources used, leaving the load-bearing numerical inputs un-auditable.
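A sensitivity sketch along the lines the first major comment asks for: perturb each of the four factors by a 2-3x uncertainty and see how the combined upper bound moves. The nominal factors are the abstract's upper ends; the perturbation grid is an assumption added here.

```python
# Sensitivity sketch: response of the ~1000x combined upper bound to 2-3x
# uncertainty in any single factor. Nominal factors are the abstract's upper
# ends; the perturbation multipliers are assumptions for illustration.
from itertools import product
from math import prod

nominal = {"sparsity": 10, "location": 10, "datacenter": 2, "accelerator": 5}
print(f"nominal combined reduction: {prod(nominal.values())}x")

for name, scale in product(nominal, (1 / 3, 1 / 2, 2, 3)):
    perturbed = {**nominal, name: nominal[name] * scale}
    print(f"{name} scaled by {scale:.2f}: combined ~{prod(perturbed.values()):.0f}x")
```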
minor comments (3)
- Abstract: the forward-looking statement 'we are now optimizing where and when large models are trained' lacks any accompanying details on methodology or preliminary results.
- Throughout: several efficiency ranges (1.4-2X, 2-5X) would benefit from explicit references to the supporting studies or internal measurements.
- Final paragraph: the proposal to add energy metrics to MLPerf could be strengthened by discussing how consistent measurement protocols would be defined across heterogeneous hardware.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing our responses and indicating where revisions have been made to improve clarity, transparency, and auditability.
read point-by-point responses
- Referee: Abstract: the headline claim that DNN/datacenter/processor choice yields up to ~100-1000X lower CO2e is obtained by multiplying four independent factors (sparsity <1/10, location 5-10X, datacenter 1.4-2X, accelerator 2-5X); the manuscript provides no sensitivity analysis or error bounds showing how plausible 2-3X variations in any single input (carbon intensity or power-draw model) would affect the upper end of the reported range.
  Authors: The ~100-1000X range illustrates the cumulative potential obtained by multiplying the upper ends of each independent factor (sparsity savings, location variation, datacenter efficiency, and accelerator gains), each drawn from observed ranges in practice. These are presented as separate opportunities rather than a single combined scenario. We agree that noting the impact of input variations would strengthen the presentation. In the revised manuscript, we have added a sentence in the abstract and a short paragraph in the discussion clarifying that the upper bound is illustrative, that the factors are multiplicative and independent, and that even with 2-3X uncertainty in any one input the order-of-magnitude potential remains substantial. (revision: partial)
- Referee: Sections detailing per-model energy calculations (T5, Meena, GShard, Switch, GPT-3): the baseline energy figures are derived from hardware specifications, assumed utilization rates, and training durations without primary measurement data, cross-validation, or explicit exclusion rules, which directly scales all subsequent relative savings and the 100-1000X claim.
  Authors: The baseline energy figures are retrospective estimates constructed from publicly reported training durations, hardware power specifications, and typical utilization rates (e.g., 30-50% for large-scale training) as documented in the source papers for each model. Primary measurement data from the original training runs is not available to us, as the models were developed by multiple organizations. We have expanded the methods and appendix sections in the revision to list the exact sources, assumptions, and any exclusion criteria used for each model, improving traceability while preserving the original estimates. (revision: yes)
- Referee: Geographic location and datacenter efficiency paragraphs: the 5-10X carbon-free energy variation and 1.4-2X datacenter efficiency gains are stated without citing the specific carbon-intensity tables, PUE values, or regional grid data sources used, leaving the load-bearing numerical inputs un-auditable.
  Authors: We appreciate this observation. In the revised manuscript we have inserted explicit citations for the carbon-intensity ranges (drawing on regional grid data from electricityMap and U.S. EPA reports showing 5-10X differences even within countries) and for datacenter PUE values (citing industry reports from Google, Microsoft, and the Uptime Institute documenting Cloud PUEs of ~1.1-1.4 versus typical values of 1.5-2.0). These additions render the numerical inputs fully auditable. (revision: yes)
Circularity Check
No circularity: estimates are direct calculations from external hardware and grid inputs
full rationale
The paper computes energy use and CO2e for models including T5, Meena, GShard, Switch Transformer, and GPT-3 using stated training durations, utilization rates, power draws, and regional carbon-intensity values as direct inputs. The 100-1000X reduction range is obtained by multiplying independent factors (location 5-10X, datacenter 1.4-2X, accelerator 2-5X, sparsity <1/10) drawn from hardware comparisons and grid data rather than any fitted parameter or self-referential definition. No equation or claim reduces a reported result to its own inputs by construction, and the derivation chain remains self-contained against the provided external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- PUE (power usage effectiveness)
- carbon intensity per kWh
axioms (1)
- domain assumption: published hardware specifications and prior energy models accurately reflect actual training power draw
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced contradicts: "the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult."
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel contradicts: "Geographic location matters... fraction of carbon-free energy and resulting CO2e vary ~5X-10X... Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient... ML-oriented accelerators... ~2-5X more effective"
- IndisputableMonolith.Foundation.PhiForcing.phi_equation echoes: "Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters"
Forward citations
Cited by 35 Pith papers
- DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
  DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
- Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
  TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...
- SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
  SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
- Training single-electron and single-photon stochastic physical neural networks
  Single-electron and single-photon stochastic physical neural networks achieve over 97% MNIST test accuracy when trained with empirical outputs in the backward pass using few trials per layer.
- The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
  In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...
- Segment Anything
  A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- High-Resolution Image Synthesis with Latent Diffusion Models
  Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
- Multitask Prompted Training Enables Zero-Shot Task Generalization
  Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
- Recasting AI Data Centers as Engines for Carbon Removal
  AI data center waste heat upgraded by heat pumps can drive direct air capture to achieve net CO2 removal and offset operational emissions in several US states under current and 2030 scenarios.
- Language-Conditioned Visual Grounding with CLIP Multilingual
  Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.
- A Hardware-aware Hopfield Network with a Nonlinear Memristor Array for Robust Associative Memory with Superlinear Capacity
  A memristor-array Hopfield network uses device nonlinearity to exceed classical memory capacity with K ~ 0.14N experimentally and superlinear K ~ 0.3 N^1.2 in simulations.
- OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
  OpenG2G is a new extensible simulation platform that lets users implement and compare classic, optimization, and learning-based controllers for AI datacenter power flexibility coordinated with the grid.
- A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
  MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
- Are Large Language Models Economically Viable for Industry Deployment?
  Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
- TRON: Trainable, architecture-reconfigurable random optical neural networks
  TRON demonstrates a trainable and reconfigurable optical neural network that combines multi-scattering media with DMD-based matrix multiplication and performs in-situ optimization plus neural architecture search on th...
- Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
  Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.
- SAM 2: Segment Anything in Images and Videos
  SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
  BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models
  ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
- LaMDA: Language Models for Dialog Applications
  LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
- Ethical and social risks of harm from Language Models
  The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
- Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
  LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
- UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
  UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
- ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity
  ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.
- Toward a Sustainable Software Architecture Community: Evaluating ICSA's Environmental Impact
  The study provides exploratory estimates of carbon emissions from GenAI inference in ICSA papers and from the full operations of the ICSA 2025 conference.
- DINOv2: Learning Robust Visual Features without Supervision
  Pith review generated a malformed one-line summary.
- From Cradle to Cloud: A Life Cycle Review of AI's Environmental Footprint
  A review of AI sustainability studies finds inconsistent life cycle definitions and predominant reliance on coarse CO2e proxies, with limited coverage of water, materials, and multi-impact assessments.
- Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
  CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
- AI-Native Autonomous Infrastructure (ANAI): A Formal Framework for the Next General-Purpose Technology
  Introduces ANAI framework with Autonomy Index (AIx), Infrastructure Coupling Coefficient (ICC), and Technological Transition Potential (TTP) to model AI-driven infrastructural transition via nonlinear coevolution and ...
- minAction.net: Energy-First Neural Architecture Design -- From Biological Principles to Systematic Validation
  Large-scale experiments show architecture performance depends on task type, not universality, and a single-parameter energy penalty reduces computational energy by ~1000x with negligible accuracy cost.
- SymptomWise: A Deterministic Reasoning Layer for Reliable and Efficient AI Systems
  SymptomWise uses expert knowledge and deterministic rules for diagnosis after LLM-based symptom extraction, achieving 88% top-5 accuracy on 42 challenging pediatric neurology cases.
- Analytic Framework for Estimating Memory Cost
  An analytic framework is introduced to estimate memory-related energy costs of AI models and quantify their ecological footprint.
- Unbox Responsible GeoAI: Navigating Climate Extreme and Disaster Mapping
  Responsible GeoAI for disaster mapping requires governance across data, applications, and society rather than algorithm improvements alone.
Reference graph
Works this paper leans on
- [1] "We use Google Georgia datacenter’s PUE from the period in which the search computation was run (1.10 in Table 4) instead of the US average in 2018 (1.58)" (work page, 2018)
- [2] "Strubell et al. used the US average CO2 per kilowatt hour (kWh) as calculated by the U.S. Environmental Protection Agency (EPA) of 0.423 kg per kWh in 2018. For Google, we use the Georgia datacenter’s average CO2e/kWh for the month when NAS was performed (0.431 CO2e/kWh in Table 4)" (work page, 2018)
- [3] "So et al. used Google TPU v2 accelerators, not NVIDIA P100 GPUs as modeled in [Str19]. TPU v2s are much faster, so the search process takes 32,633 TPU v2 hours instead of 117,780 P100 hours. We measured the power when running the [So19] NAS computation on TPU v2, including the memory, fans, network interfaces, and the CPU host. The average power was 208 W..."