pith. machine review for the scientific record.

arxiv: 2211.09110 · v2 · submitted 2022-11-16 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Holistic Evaluation of Language Models

Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Deepak Narayanan, Diana Acosta-Navas, Dilara Soylu, Dimitris Tsipras, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Michihiro Yasunaga, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Percy Liang, Peter Henderson, Qian Huang, Rishi Bommasani, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Tony Lee, Vishrav Chaudhary, William Wang, Xuechen Li, Yian Zhang, Yifan Mai, Yuhuai Wu, Yuhui Zhang, Yuta Koreeda

classification 💻 cs.CL cs.AI cs.LG
keywords models scenarios language metrics evaluation helm core accuracy
read the original abstract

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
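The abstract's multi-metric design can be made concrete with a small sketch. This is not the HELM toolkit's API, just illustrative arithmetic: with 7 metrics and 16 core scenarios there are 112 (scenario, metric) pairs, and the stated 87.5% feasibility corresponds to 98 of them being measured (the count of 98 is inferred here, not quoted from the paper).

```python
# Illustrative sketch of HELM's dense multi-metric design (not the actual
# HELM toolkit API). The paper measures 7 metrics on each of 16 core
# scenarios "when possible (87.5% of the time)".
METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]
N_CORE_SCENARIOS = 16

total_pairs = len(METRICS) * N_CORE_SCENARIOS  # 112 (scenario, metric) pairs
measured_pairs = 98                            # inferred: 112 * 0.875
coverage = measured_pairs / total_pairs
print(f"{coverage:.1%}")                       # -> 87.5%
```

The same kind of tabulation underlies the reported jump from 17.9% to 96.0% coverage: a models-by-scenarios matrix that the benchmark fills in densely rather than sparsely.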

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring

    cs.CY 2026-05 accept novelty 8.0

    Continuous auditing creates an unavoidable cover regime in which static auditors cannot simultaneously eliminate coverage and granularity failures, shown via new policies, strategies, and a reproducible simulator.

  2. SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

    q-bio.NC 2026-05 unverdicted novelty 7.0

    SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...

  3. HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

    cs.CL 2026-05 unverdicted novelty 7.0

    Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

  4. Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

    cs.HC 2026-05 accept novelty 7.0

    LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

  5. CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

    cs.CV 2026-05 conditional novelty 7.0

    Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

  6. iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

    cs.CV 2026-05 unverdicted novelty 7.0

    iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).

  7. LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMSpace is the first modeling framework that jointly calculates operational and embodied carbon emissions for LLM inference on LEO satellites, incorporating radiation-hardened hardware, peripheral systems, and LLM wo...

  8. LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMSpace is the first framework to jointly model operational and embodied carbon for LLM inference on LEO satellites, incorporating radiation-hardened hardware, peripheral systems, and workload patterns such as prefil...

  9. Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

  10. The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

    cs.LG 2026-05 unverdicted novelty 7.0

    An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

  11. TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

    cs.CV 2026-04 accept novelty 7.0

    TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

  12. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  13. SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

    cs.CL 2026-04 accept novelty 7.0

    SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

  14. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  15. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

The GAIA benchmark shows that humans (92% accuracy) far outperform current AI systems (15%) on simple real-world questions, proposing this gap as a key milestone for general AI.

  16. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

Causal Bias Detection in Generative Artificial Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

  18. Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

    cs.CR 2026-05 unverdicted novelty 6.0

    GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.

  19. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  20. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  21. Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

    cs.HC 2026-05 unverdicted novelty 6.0

    A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.

  22. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  23. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  24. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.

  25. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE detects positive distributional Jensen Gaps across five multi-agent LLM architectures on a banking-risk benchmark, showing that quality drops under semantic stress can coexist with statistically detectable antifr...

  26. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  27. What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

  28. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  29. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  30. Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

    cs.IR 2026-04 conditional novelty 6.0

    RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.

  31. Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...

  32. Are Large Language Models Economically Viable for Industry Deployment?

    cs.CL 2026-04 unverdicted novelty 6.0

    Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

  33. Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier

    cs.AI 2026-04 unverdicted novelty 6.0

    ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domain...

  34. Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

  35. The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

    cs.AI 2026-04 unverdicted novelty 6.0

    Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.

  36. BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

    cs.CL 2026-04 unverdicted novelty 6.0

    BERT-as-a-Judge fine-tunes a BERT encoder on synthetic question-candidate-reference triplets to judge answer correctness, outperforming lexical baselines and matching larger LLM judges across 36 models and 15 tasks.

  37. AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.

  38. SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

    cs.SE 2026-04 unverdicted novelty 6.0

    SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

  39. Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.

  40. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  41. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  42. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  43. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  44. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  45. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  46. Mental Health AI Safety Claims Must Preserve Temporal Evidence

    cs.AI 2026-05 unverdicted novelty 5.0

    Mental health AI safety evaluations that discard temporal sequence and accumulation produce invalid conclusions; the paper formalizes this as Temporal Safety Non-Identifiability and proposes SCOPE-MH as a reporting st...

  47. When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    cs.LG 2026-05 unverdicted novelty 5.0

    A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.

  48. DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration

    cs.MA 2026-05 unverdicted novelty 5.0

    DePAI is a proposed democratic architecture that integrates DAOs, DePIN, and AI to enable human oversight of machine execution in physical-digital systems under transparent on-chain rules.

  49. Reasoning emerges from constrained inference manifolds in large language models

    cs.LG 2026-05 unverdicted novelty 5.0

    Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.

  50. Heterogeneous Scientific Foundation Model Collaboration

    cs.AI 2026-04 unverdicted novelty 5.0

    Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

  51. TRUST: A Framework for Decentralized AI Service v.0.1

    cs.AI 2026-04 unverdicted novelty 5.0

    TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...

  52. To Copilot and Beyond: 22 AI Systems Developers Want Built

    cs.SE 2026-04 unverdicted novelty 5.0

    Survey of 860 developers reveals 22 desired AI systems for non-coding tasks with explicit constraints on authority, provenance, and quality signals, framed as bounded delegation where AI handles assembly work but not ...

  53. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  54. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  55. Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

    cs.AI 2026-05 unverdicted novelty 4.0

    Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.

  56. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  57. Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation

    cs.CL 2026-04 unverdicted novelty 4.0

    Closure of the Perspective API exposes structural dependence on a single proprietary toxicity scorer, leaving non-updatable benchmarks and irreproducible results while risking continued reliance on closed LLMs.

  58. From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation

    cs.CL 2026-04 unverdicted novelty 4.0

    A dependency-aware prompt pipeline with structured JSON intermediates produces coherent, scalable RPG worlds and quests from LLMs.

  59. Statistical Software Engineering with Tuned Variables

    cs.SE 2026-04 unverdicted novelty 4.0

    AI system maintenance requires treating configuration choices as versioned governed tuned variables promoted via statistical evidence from sampled evaluations.

  60. An Empirical Study of Perceptions of General LLMs and Multimodal LLMs on Hugging Face

    cs.SE 2026-04 unverdicted novelty 4.0

    Hugging Face discussions show that access barriers, output quality, and setup complexity are the main user concerns for both general and multimodal LLMs.