pith. machine review for the scientific record.

arxiv: 2604.19342 · v1 · submitted 2026-04-21 · 💻 cs.CL

Recognition: unknown

Are Large Language Models Economically Viable for Industry Deployment?


Pith reviewed 2026-05-10 02:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM deployment · economic viability · energy efficiency · small language models · benchmarking framework · quantization · legacy hardware · industrial tasks

The pith

Models under 2 billion parameters outperform larger ones in economic returns and energy use for industry tasks on legacy hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that accuracy-focused benchmarks overlook the energy, latency, and hardware costs that matter most in real industrial settings such as healthcare support and financial analytics. It introduces the EDGE-EVAL framework to measure full-lifecycle viability on legacy NVIDIA Tesla T4 GPUs, using five new metrics that track profitability break-even, intelligence per watt, hardware density, cold-start overhead, and quantization safety. When applied to LLaMA and Qwen variants across three tasks, the results show that the efficiency frontier sits with sub-2B models. LLaMA-3.2-1B quantized to 4 bits reaches median ROI break-even after 14 requests and delivers three times the energy-normalized intelligence of 7B models while sustaining over 6,900 tokens per second per gigabyte.

Core claim

By running the EDGE-EVAL framework on legacy NVIDIA Tesla T4 GPUs across three industrial tasks, the authors establish that models with fewer than 2 billion parameters dominate larger baselines on combined economic and ecological criteria. LLaMA-3.2-1B in INT4 quantization reaches median ROI break-even in 14 requests, supplies three times the energy-normalized intelligence of 7B models, and exceeds 6,900 tokens per second per gigabyte. The evaluation also finds that QLoRA adaptation raises energy costs by up to seven times for small models despite its smaller memory footprint.

What carries the argument

EDGE-EVAL framework, which evaluates models across their full lifecycle using five deployment metrics—Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret)—on legacy hardware.
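The paper defines these five metrics formally in its Section 4.2, which this summary does not reproduce. As a rough sketch only, plausible stand-in formulas might look like the following; every field name, formula, and number here is an illustrative assumption, not the paper's definition:

```python
from dataclasses import dataclass

@dataclass
class LifecycleLog:
    """Hypothetical per-configuration measurements; all fields invented."""
    adapt_cost_usd: float        # one-time adaptation (fine-tuning) cost
    cost_per_request_usd: float  # marginal serving cost per request
    revenue_per_request_usd: float
    quality_score: float         # task quality in [0, 1]
    avg_power_watts: float       # mean draw during inference
    tokens_per_second: float
    memory_gb: float             # resident model footprint
    cold_start_s: float          # load-to-first-token latency
    warm_latency_s: float        # steady-state per-request latency
    quality_fp16: float          # same task, unquantized baseline quality

def n_break(m: LifecycleLog) -> float:
    """Economic Break-Even: requests until per-request margin repays adaptation."""
    return m.adapt_cost_usd / (m.revenue_per_request_usd - m.cost_per_request_usd)

def ipw(m: LifecycleLog) -> float:
    """Intelligence-Per-Watt: quality normalized by power draw."""
    return m.quality_score / m.avg_power_watts

def rho_sys(m: LifecycleLog) -> float:
    """System Density: throughput per GB of resident memory (tokens/s/GB)."""
    return m.tokens_per_second / m.memory_gb

def c_tax(m: LifecycleLog) -> float:
    """Cold-Start Tax: cold-start latency relative to warm latency."""
    return m.cold_start_s / m.warm_latency_s

def q_ret(m: LifecycleLog) -> float:
    """Quantization Fidelity: fraction of full-precision quality retained."""
    return m.quality_score / m.quality_fp16

small = LifecycleLog(adapt_cost_usd=2.0, cost_per_request_usd=0.001,
                     revenue_per_request_usd=0.15, quality_score=0.62,
                     avg_power_watts=35.0, tokens_per_second=3500.0,
                     memory_gb=0.5, cold_start_s=4.0, warm_latency_s=0.8,
                     quality_fp16=0.66)
print(f"break-even ≈ {n_break(small):.1f} requests, "
      f"density = {rho_sys(small):.0f} tok/s/GB")
```

The point of the sketch is only that each metric is a ratio of directly measurable lifecycle quantities, which is what makes the framework auditable.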

If this is right

  • Small models achieve median ROI break-even after only 14 requests in the tested industrial tasks.
  • Sub-2B models provide three times the energy-normalized intelligence of 7B models under the new metrics.
  • 4-bit quantization on small models sustains throughput above 6,900 tokens per second per gigabyte.
  • QLoRA adaptation increases energy use by up to 7x for small models, contrary to expectations for compression.
  • The efficiency frontier for economic and ecological performance lies with models under 2 billion parameters.
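To make the 4-bit quantization bullets concrete: symmetric INT4 maps weights onto a 16-level integer grid, and fidelity (the idea behind Qret) is how much signal survives the rounding. A stdlib-only toy sketch, with invented weight values and a per-tensor scheme the paper may or may not use:

```python
def quantize_int4(weights):
    """Symmetric per-tensor INT4: map floats to integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48]   # toy weights
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

# Crude fidelity proxy: relative L1 reconstruction error; rounding
# guarantees each per-weight error is at most scale / 2.
err = sum(abs(w - r) for w, r in zip(weights, restored)) / sum(abs(w) for w in weights)
print(f"int4 codes: {q}, relative error ≈ {err:.3f}")
```

Storing 4-bit codes instead of 16-bit floats is where the roughly 4× memory and throughput-per-GB gains come from; the reconstruction error is what Qret would measure at the task level.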

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Industry deployments may favor running many specialized small models in parallel rather than fewer large ones to control costs and energy.
  • Hardware and software optimizations could shift toward sub-2B model families instead of continued emphasis on larger scales.
  • The QLoRA energy penalty suggests re-examination of quantization-aware training specifically for edge and legacy settings.
  • If the pattern holds, model selection in cost-sensitive sectors would prioritize energy and break-even metrics over raw parameter count.

Load-bearing premise

The three chosen industrial tasks and the five new metrics fully capture the operational and economic constraints of real industry deployments on legacy hardware.

What would settle it

A broader test on additional industry tasks or different hardware that shows larger models reaching faster overall ROI or lower total energy and cost per useful output than the sub-2B class.

Figures

Figures reproduced from arXiv: 2604.19342 by Abdullah Mohammad, Ebad Shabbir, Gautam Siddharth Kashyap, Jiechao Gao, Pushkar Arora, Rafiq Ali, Sushant Kumar Ray, Usman Naseem.

Figure 1
Figure 1: Illustration of the Deployment-Evaluation Gap. QLoRA reduces memory by ∼60% yet increases fine-tuning energy up to 7.2× for small models, showing that memory efficiency does not equal energy efficiency. view at source ↗
Figure 2
Figure 2: Lifecycle benchmarking pipeline of EDGE-EVAL. For each configuration (f, p, t, a), models pass through three stages (adaptation, compression, and inference) under uniform hardware constraints. The recorded lifecycle variables are subsequently aggregated into the five deployment metrics defined in Section 4.2. view at source ↗
Figure 3
Figure 3: Multidimensional efficiency under legacy deployment – compact (…) view at source ↗
Figure 4
Figure 4: Systems-level deployment landscape on legacy T4 hardware. Compact (…) view at source ↗
read the original abstract

Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics: Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret), capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces EDGE-EVAL, an industry-oriented benchmarking framework for evaluating LLMs on legacy NVIDIA Tesla T4 GPUs across three industrial tasks (healthcare, finance, enterprise). It defines five new deployment metrics—Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret)—and reports that <2B-parameter models (e.g., LLaMA-3.2-1B INT4) dominate larger baselines, achieving median ROI break-even in 14 requests, 3× higher energy-normalized intelligence than 7B models, and >6,900 tokens/s/GB under 4-bit quantization, while noting a QLoRA efficiency anomaly that increases adaptation energy up to 7× for small models.

Significance. If the new metrics are shown to be robust proxies, the work could meaningfully shift industry deployment practices toward smaller models for cost, energy, and hardware-constrained settings, directly addressing the deployment-evaluation gap in accuracy-centric benchmarks. The empirical focus on legacy T4 GPUs and concrete numbers (break-even, IPW ratios) provide actionable, falsifiable predictions for practitioners.

major comments (3)
  1. [§3] §3 (Metric Definitions): The five custom metrics (Nbreak, IPW, ρsys, Ctax, Qret) are introduced without sensitivity analysis to alternative cost assumptions (hardware amortization, per-token revenue, workload mix) or comparison to standard TCO models; the central claim that <2B models dominate rests on these being faithful proxies, yet the abstract and results provide no indication of stress-testing against healthcare/finance reliability constraints.
  2. [Results] Results (LLaMA-3.2-1B numbers): The reported median break-even of 14 requests, 3× IPW advantage, and 6,900 tokens/s/GB lack error bars, data exclusion rules, or controls for the three tasks and hardware variability; this undercuts the cross-model dominance claim given the low soundness noted in abstract-only review.
  3. [Experimental setup] Experimental setup (QLoRA anomaly): The claim that QLoRA increases adaptation energy by up to 7× for small models is presented without baseline numbers, controls, or quantification of the efficiency anomaly, which is load-bearing for challenging quantization-aware training assumptions.
minor comments (2)
  1. [Abstract] Notation: ρsys uses a non-standard symbol that should be defined explicitly on first use and checked for consistency with system-density literature.
  2. [Abstract] The abstract states results without referencing the specific sections or tables containing the underlying data and task definitions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the robustness of our metric definitions, statistical reporting, and experimental controls. We address each major comment below and will incorporate the suggested enhancements in the revised manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Metric Definitions): The five custom metrics (Nbreak, IPW, ρsys, Ctax, Qret) are introduced without sensitivity analysis to alternative cost assumptions (hardware amortization, per-token revenue, workload mix) or comparison to standard TCO models; the central claim that <2B models dominate rests on these being faithful proxies, yet the abstract and results provide no indication of stress-testing against healthcare/finance reliability constraints.

    Authors: We agree that sensitivity analysis and explicit comparisons would strengthen the presentation of the metrics. In the revised manuscript, we will expand §3 with a dedicated sensitivity analysis subsection varying hardware amortization (1–3 years), per-token revenue assumptions, and workload mixes across the three tasks. We will also benchmark our metrics against standard TCO models and add stress-testing for reliability constraints by incorporating conservative proxies for failure rates in healthcare and finance scenarios, verifying that the <2B model dominance holds under these conditions. revision: yes

  2. Referee: [Results] Results (LLaMA-3.2-1B numbers): The reported median break-even of 14 requests, 3× IPW advantage, and 6,900 tokens/s/GB lack error bars, data exclusion rules, or controls for the three tasks and hardware variability; this undercuts the cross-model dominance claim given the low soundness noted in abstract-only review.

    Authors: We acknowledge the value of greater statistical transparency. The revised results section will include error bars computed from at least five independent runs per configuration, explicit data exclusion rules (e.g., removal of runs affected by transient hardware faults), and per-task breakdowns with controls for T4 GPU variability. These additions will provide clearer support for the reported median values and the cross-model dominance findings. revision: yes

  3. Referee: [Experimental setup] Experimental setup (QLoRA anomaly): The claim that QLoRA increases adaptation energy by up to 7× for small models is presented without baseline numbers, controls, or quantification of the efficiency anomaly, which is load-bearing for challenging quantization-aware training assumptions.

    Authors: We will revise the experimental setup to include baseline adaptation energy measurements without QLoRA, detailed controls (fixed batch sizes, training steps, and hardware), and full per-model quantification of the energy increase. This will make the anomaly claim more precise and better substantiate its implications for small-model edge deployment. revision: yes
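The sensitivity analysis promised in response 1 could take the shape of a parameter sweep over amortization horizon and per-request revenue. A sketch of what such a sweep might look like; every cost figure below is an invented placeholder, not data from the paper:

```python
from itertools import product

def break_even_requests(adapt_cost, hw_cost, amort_years,
                        revenue_per_req, energy_cost_per_req, reqs_per_year):
    """Requests needed to repay the one-time adaptation cost when each
    request's margin also absorbs amortized hardware. All inputs invented."""
    hw_per_req = hw_cost / (amort_years * reqs_per_year)
    margin = revenue_per_req - energy_cost_per_req - hw_per_req
    return float("inf") if margin <= 0 else adapt_cost / margin

for years, rev in product([1, 2, 3], [0.05, 0.10, 0.15]):
    n = break_even_requests(adapt_cost=2.0, hw_cost=600.0, amort_years=years,
                            revenue_per_req=rev, energy_cost_per_req=0.001,
                            reqs_per_year=1_000_000)
    print(f"amortization={years}y, revenue=${rev:.2f}/req -> "
          f"break-even ≈ {n:.0f} requests")
```

The dominance claim survives this kind of sweep only if the break-even ordering between small and large models is stable across the grid, which is exactly what the referee asks the authors to show.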

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with independently defined metrics

full rationale

The paper presents an empirical evaluation framework (EDGE-EVAL) that applies five newly defined deployment metrics (Nbreak, IPW, ρsys, Ctax, Qret) to benchmark results on specific tasks and hardware. These metrics are introduced as direct operational proxies without any equations, fitted parameters, or predictions that reduce to the input data by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in the provided abstract or description to justify core claims. The derivation chain consists of straightforward measurement and comparison, grounded in external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 5 invented entities

The central claim rests on the assumption that the selected tasks and metrics represent industry constraints; no free parameters are explicitly fitted in the abstract, but the metrics themselves are invented constructs.

axioms (2)
  • domain assumption The three industrial tasks adequately proxy real deployment workloads in healthcare, finance, and automation.
    Invoked when claiming dominance across economic dimensions without broader validation.
  • domain assumption Legacy T4 GPU performance and energy measurements generalize to other hardware.
    Central to all reported numbers but not justified in abstract.
invented entities (5)
  • Economic Break-Even (Nbreak) no independent evidence
    purpose: Quantify number of requests needed for positive ROI
    New metric introduced to capture profitability
  • Intelligence-Per-Watt (IPW) no independent evidence
    purpose: Measure energy-normalized intelligence
    New metric for efficiency comparison
  • System Density (ρsys) no independent evidence
    purpose: Capture hardware scaling capacity
    New metric for server utilization
  • Cold-Start Tax (Ctax) no independent evidence
    purpose: Measure serverless feasibility cost
    New metric for startup overhead
  • Quantization Fidelity (Qret) no independent evidence
    purpose: Assess compression safety
    New metric for quality loss under quantization

pith-pipeline@v0.9.0 · 5619 in / 1428 out tokens · 29082 ms · 2026-05-10T02:24:18.234351+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    Ahmed Hadi Ali Al-Jumaili, Ravie Chandren Muniyandi, Mohammad Kamrul Hasan, Johnny Koh Siaw Paw, and Mandeep Jit Singh. 2023. Big data analytics using cloud computing based frameworks for power management systems: Status, constraints, and future recommendations. Sensors, 23(6):2952

  2. [2]

    Basem Almadani, Hunain Kaisar, Irfan Rashid Thoker, and Farouq Aliyu. 2025. A systematic survey of distributed decision support systems in healthcare. Systems, 13(3):157

  3. [4]

    Christian Bauer, Samira Afzal, Sandro Linder, Radu Prodan, and Christian Timmerer. 2024. Greem: An open-source energy measurement tool for video processing. In Proceedings of the 15th ACM Multimedia Systems Conference, pages 264--270

  4. [5]

    Bogdan-Iulian Ciubotaru. 2025. Generative ai and large language models: A comprehensive scientific review

  5. [6]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088--10115

  6. [7]

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029--3051

  7. [9]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The llama 3... Preprint, arXiv:2407.21783

  8. [10]

    Rakibul Hasan and Samia Akter. 2022. Information system-based decision support tools: A systematic review of strategic applications in service-oriented enterprises. Review of Applied Science and Technology, 1(04):26--65

  9. [13]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3

  10. [14]

    Zhe Jia, Marco Maggioni, Jeffrey Smith, and Daniele Paolo Scarpazza. 2019. Dissecting the NVIDIA Turing T4 GPU via microbenchmarking. Preprint, arXiv:1903.07486

  11. [16]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611--626

  12. [17]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, and 1 others. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459--9474

  13. [19]

    Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74--81

  14. [20]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87--100

  15. [21]

    Vasile Denis Manolescu, Hamzah AlZu'bi, and Emanuele Lindo Secco. 2025. Interactive conversational ai with iot devices for enhanced human-robot interaction. Journal of Intelligent Communication

  16. [23]

    C Nvidia. 2018. Nvidia turing gpu architecture. NVIDIA Whitepaper, 1

  17. [24]

    David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R So, Maud Texier, and Jeff Dean. 2022. The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18--28

  18. [25]

    Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115

  19. [27]

    Nikolaos Schizas, Aristeidis Karras, Christos Karras, and Spyros Sioutas. 2022. Tinyml for ultra-low power ai and large scale iot deployments: A systematic review. Future Internet, 14(12):363

  20. [28]

    Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, and Usman Naseem. 2025. Llms on a budget? say hola. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1035--1043

  21. [29]

    Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in neural information processing systems, 35:27168--27183

  22. [30]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and 1 others. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595--46623

  23. [31]

    The Llama 3 Herd of Models. 2024.

  24. [32]

    Qwen2.5 Technical Report. 2025.

  25. [33]

    Qwen2 Technical Report. 2024.

  26. [34]

    GPT-4 Technical Report. 2024.

  27. [35]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

  28. [36]

    AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems.

  29. [37]

    A survey of quantization methods for efficient neural network inference. Low-Power Computer Vision. 2022.

  30. [38]

    ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems.

  31. [39]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems.

  32. [40]

    A White Paper on Neural Network Quantization. 2021.

  33. [41]

    QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems.

  34. [42]

    LoRA: Low-rank adaptation of large language models. ICLR.

  35. [43]

    Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.

  36. [44]

    The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

  37. [45]

    Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles.

  38. [46]

    TinyML for ultra-low power AI and large scale IoT deployments: A systematic review. Future Internet. 2022.

  39. [47]

    Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350.

  40. [48]

    The carbon footprint of machine learning training will plateau, then shrink. Computer. 2022.

  41. [49]

    Estimating the carbon footprint of BLOOM, a 176B parameter language model. Journal of Machine Learning Research.

  42. [50]

    Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

  43. [51]

    MLPerf Tiny benchmark. arXiv preprint arXiv:2106.07597.

  44. [52]

    Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

  45. [53]

    Evaluating quantized large language models. arXiv preprint arXiv:2402.18158.

  46. [54]

    Systematic characterization of LLM quantization: A performance, energy, and quality perspective. arXiv preprint arXiv:2508.16712. 2025.

  47. [55]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.

  48. [56]

    SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

  49. [57]

    Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

  50. [58]

    Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.

  51. [59]

    Enhancing chat language models by scaling high-quality instructional conversations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

  52. [60]

    ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

  53. [61]

    TRUE: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991.

  54. [62]

    Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

  55. [63]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.

  56. [64]

    Fine-tuning aligned language models compromises safety, even when users do not intend to.

  57. [65]

    Investigating the impact of quantization methods on the safety and reliability of large language models. arXiv preprint arXiv:2502.15799.

  58. [66]

    Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.

  59. [67]

    Stanford Alpaca: An instruction-following LLaMA model. 2023.

  60. [68]

    Performance modeling of serverless computing platforms. IEEE Transactions on Cloud Computing. 2020.

  61. [69]

    Dissecting the NVIDIA Turing T4 GPU via microbenchmarking. 2019.

  62. [70]

    NVIDIA Turing GPU architecture. NVIDIA Whitepaper.

  63. [71]

    GREEM: An open-source energy measurement tool for video processing. Proceedings of the 15th ACM Multimedia Systems Conference.

  64. [72]

    Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

  65. [73]

    PEFT: State-of-the-art parameter-efficient fine-tuning methods.

  66. [74]

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec.

  67. [75]

    Electricity 2025: Analysis and Forecast to 2027. 2025.

  68. [76]

    The 2022 EPA Automotive Trends Report: Greenhouse Gas Emissions, Fuel Economy, and Technology Since 1975.

  69. [77]

    Generative language models and automated influence operations: Emerging threats and potential mitigations. 2023.

  70. [78]

    A systematic survey of distributed decision support systems in healthcare. Systems. 2025.

  71. [79]

    Generative AI and large language models: A comprehensive scientific review. 2025.

  72. [80]

    Big data analytics using cloud computing based frameworks for power management systems: Status, constraints, and future recommendations. Sensors. 2023.

  73. [81]

    Information system-based decision support tools: A systematic review of strategic applications in service-oriented enterprises. Review of Applied Science and Technology.

  74. [82]

    Interactive conversational AI with IoT devices for enhanced human-robot interaction. Journal of Intelligent Communication.

  75. [83]

    LLMs on a budget? Say HOLA. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track.

  76. [84]

    Can Argus judge them all? Comparing VLMs across domains. arXiv preprint arXiv:2507.01042.