Holistic Evaluation of Language Models
Original abstract
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
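The abstract's multi-metric design can be made concrete with a small sketch: a dense (scenario, metric) matrix in which every feasible cell is measured, so coverage can be reported as a single number (HELM's 17.9% → 96.0% improvement). The scenario names, the feasibility mask, and the coverage helper below are illustrative assumptions for exposition, not the released HELM toolkit.

```python
from itertools import product

# Hypothetical illustration of HELM-style dense benchmarking:
# every core scenario is evaluated, where feasible, under every metric.
CORE_SCENARIOS = ["narrative_qa", "natural_questions", "imdb"]  # HELM has 16
METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]                    # HELM's 7

# Feasibility mask: some (scenario, metric) pairs cannot be measured.
# In HELM, 87.5% of pairs are feasible; one pair is dropped here for illustration.
infeasible = {("imdb", "calibration")}

def coverage(results: dict[tuple[str, str], float]) -> float:
    """Fraction of feasible (scenario, metric) cells actually measured."""
    feasible = [p for p in product(CORE_SCENARIOS, METRICS) if p not in infeasible]
    measured = sum(1 for p in feasible if p in results)
    return measured / len(feasible)

# Dense benchmarking fills every feasible cell with a score (placeholder 0.0 here).
results = {p: 0.0 for p in product(CORE_SCENARIOS, METRICS) if p not in infeasible}
print(f"matrix coverage: {coverage(results):.1%}")  # -> 100.0%
```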
This paper has not been read by Pith yet.
Forward citations
Cited by 60 Pith papers
-
A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring
Continuous auditing creates an unavoidable cover regime in which static auditors cannot simultaneously eliminate coverage and granularity failures, a result demonstrated through new audit policies, auditee strategies, and a reproducible simulator.
-
SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations
LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models
iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
-
LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites
LLMSpace is the first framework to jointly model operational and embodied carbon for LLM inference on LEO satellites, incorporating radiation-hardened hardware, peripheral systems, and workload patterns such as prefil...
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
-
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
-
GAIA: a benchmark for General AI Assistants
The GAIA benchmark shows humans reaching 92% accuracy on conceptually simple real-world questions where current AI systems reach only 15%, proposing closing this gap as a key milestone for general AI.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
Causal Bias Detection in Generative Artificial Intelligence
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
-
Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing
GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
-
Query-efficient model evaluation using cached responses
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
-
ModelLens: Finding the Best for Your Task from Myriads of Models
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
-
When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s, achieving 97.8% of DP-optimal utility, and handles larger problems where DP fails.
-
What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.
-
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
Are Large Language Models Economically Viable for Industry Deployment?
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
-
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domain...
-
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
-
The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment
Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.
-
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
BERT-as-a-Judge fine-tunes a BERT encoder on synthetic question-candidate-reference triplets to judge answer correctness, outperforming lexical baselines and matching larger LLM judges across 36 models and 15 tasks.
-
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
-
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
-
Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing
Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Mental Health AI Safety Claims Must Preserve Temporal Evidence
Mental health AI safety evaluations that discard temporal sequence and accumulation produce invalid conclusions; the paper formalizes this as Temporal Safety Non-Identifiability and proposes SCOPE-MH as a reporting st...
-
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
-
DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration
DePAI is a proposed democratic architecture that integrates DAOs, DePIN, and AI to enable human oversight of machine execution in physical-digital systems under transparent on-chain rules.
-
Reasoning emerges from constrained inference manifolds in large language models
Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
TRUST: A Framework for Decentralized AI Service v.0.1
TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...
-
To Copilot and Beyond: 22 AI Systems Developers Want Built
Survey of 860 developers reveals 22 desired AI systems for non-coding tasks with explicit constraints on authority, provenance, and quality signals, framed as bounded delegation where AI handles assembly work but not ...
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation
Closure of the Perspective API exposes structural dependence on a single proprietary toxicity scorer, leaving non-updatable benchmarks and irreproducible results while risking continued reliance on closed LLMs.
-
From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation
A dependency-aware prompt pipeline with structured JSON intermediates produces coherent, scalable RPG worlds and quests from LLMs.
-
Statistical Software Engineering with Tuned Variables
AI system maintenance requires treating configuration choices as versioned, governed tuned variables that are promoted via statistical evidence from sampled evaluations.
-
An Empirical Study of Perceptions of General LLMs and Multimodal LLMs on Hugging Face
Hugging Face discussions show that access barriers, output quality, and setup complexity are the main user concerns for both general and multimodal LLMs.