
arXiv: 2211.09110 · v2 · submitted 2022-11-16 · cs.CL · cs.AI · cs.LG

Holistic Evaluation of Language Models

Pith reviewed 2026-05-24 10:04 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG
keywords language models · evaluation · benchmarking · scenarios · metrics · transparency · multi-metric · standardized conditions

The pith

Thirty language models are now densely benchmarked on the same 16 core scenarios and 7 metrics under standardized conditions, within an evaluation spanning 42 scenarios in total.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method for evaluating language models more transparently. It first taxonomizes the space of use cases (scenarios) and desired properties (metrics), then selects a broad, feasible subset while noting gaps such as question answering for neglected English dialects and metrics for trustworthiness. It measures seven metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) on sixteen core scenarios where possible, and adds seven targeted evaluations built on twenty-six further scenarios. Thirty models spanning open, limited-access, and closed types are run on all forty-two scenarios, raising average coverage of the core scenarios from 17.9 percent to 96 percent and producing twenty-five top-level findings, with all raw prompts and completions released. A sympathetic reader would care because prior evaluations covered so few shared scenarios, with some prominent models sharing none, that direct comparisons and risk assessments were unreliable.

Core claim

HELM taxonomizes the vast space of scenarios and metrics for language models, selects a broad subset based on coverage and feasibility while noting missing areas, adopts a multi-metric approach measuring seven metrics on sixteen core scenarios when possible, performs seven targeted evaluations, and conducts a large-scale evaluation of thirty prominent language models on all forty-two scenarios, improving coverage to 96 percent and surfacing twenty-five top-level findings, with full release of raw data and a modular toolkit.

What carries the argument

The HELM taxonomy of scenarios (use cases) and metrics (desiderata) combined with a multi-metric measurement protocol that applies accuracy plus six additional metrics to each core scenario.

If this is right

  • Trade-offs across the seven metrics become visible for every model rather than accuracy alone determining perceived quality.
  • All thirty models can be compared directly because they share the same core scenarios and metrics under identical conditions.
  • Twenty-one previously unused scenarios enter mainstream evaluation, expanding the range of tested capabilities.
  • The released raw prompts and completions enable independent further analysis by the community.
  • A modular toolkit supports continuous addition of new scenarios, metrics, and models as a living benchmark.
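
The direct comparison described in the second bullet can be summarized in several ways. One is the head-to-head win rate that the Figure 26 caption describes: the fraction of pairwise comparisons, across scenarios, in which a model scores higher on a metric. The sketch below is illustrative only; the function name, the nested-dictionary layout, the toy scores, and the choice to count ties as non-wins are assumptions, not details taken from the paper or its toolkit.

```python
from typing import Dict


def head_to_head_win_rate(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return each model's fraction of pairwise wins across all scenarios.

    scores[scenario][model] is a metric value where higher is better.
    Ties are counted as non-wins (an assumption made for this sketch).
    """
    wins: Dict[str, int] = {}
    comparisons: Dict[str, int] = {}
    for per_model in scores.values():
        for model, value in per_model.items():
            for other, other_value in per_model.items():
                if other == model:
                    continue
                comparisons[model] = comparisons.get(model, 0) + 1
                if value > other_value:
                    wins[model] = wins.get(model, 0) + 1
    return {m: wins.get(m, 0) / n for m, n in comparisons.items()}


# Hypothetical toy scores for three models on two scenarios.
toy_scores = {
    "scenario_1": {"model_a": 0.9, "model_b": 0.5, "model_c": 0.7},
    "scenario_2": {"model_a": 0.6, "model_b": 0.8, "model_c": 0.4},
}
win_rates = head_to_head_win_rate(toy_scores)
print(win_rates)
```

A model that is highest on every scenario would score 1.0 under this definition, which matches the reading given in the Figure 26 caption.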

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might shift focus from maximizing accuracy to balancing multiple metrics when the standardized results show consistent trade-offs.
  • The public data release could support targeted studies on specific failure modes that the top-level findings only flag.
  • The approach of noting explicit gaps in the taxonomy could encourage parallel efforts to fill areas like trustworthiness metrics.
  • Similar taxonomy-plus-multi-metric structures might apply to evaluating other foundation models beyond language.

Load-bearing premise

The chosen subset of scenarios and metrics is broad enough to give a holistic view of model capabilities, limitations, and risks even with acknowledged gaps in coverage.

What would settle it

Repeating the full set of evaluations on the same thirty models but with an alternate selection of scenarios that still meets the coverage criteria produces substantially different top-level findings or model rankings.

Figures

Figures reproduced from arXiv: 2211.09110 by Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Deepak Narayanan, Diana Acosta-Navas, Dilara Soylu, Dimitris Tsipras, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Michihiro Yasunaga, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Percy Liang, Peter Henderson, Qian Huang, Rishi Bommasani, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Tony Lee, Vishrav Chaudhary, William Wang, Xuechen Li, Yian Zhang, Yifan Mai, Yuhuai Wu, Yuhui Zhang, Yuta Koreeda.

Figure 1
Figure 1. Figure 1: Language model. A language model takes text (a prompt) and generates text (a completion) probabilistically. Despite their simple interface, language models can be adapted to a wide range of language tasks from question answering to summarization. 1 Introduction Benchmarks orient AI. They encode values and priorities (Ethayarajh & Jurafsky, 2020; Birhane et al., 2022) that specify directions for the AI comm… view at source ↗
Figure 2
Figure 2. Figure 2: The importance of the taxonomy to HELM. Previous language model benchmarks (e.g. SuperGLUE, EleutherAI LM Evaluation Harness, BIG-Bench) are collections of datasets, each with a standard task framing and canonical metric, usually accuracy (left). In comparison, in HELM we take a top-down approach of first explicitly stating what we want to evaluate (i.e. scenarios and metrics) by working through their und… view at source ↗
Figure 3
Figure 3. Figure 3: Many metrics for each use case. In comparison to most prior benchmarks of language technologies, which primarily center accuracy and often relegate other desiderata to their own bespoke datasets (if at all), in HELM we take a multi-metric approach. This foregrounds metrics beyond accuracy and allows one to study the tradeoffs between the metrics. This multi-metric perspective conveys a position we take on … view at source ↗
Figure 4
Figure 4. Figure 4: Standardizing language model evaluation. Prior to our effort (top), the evaluation of language models was uneven. Several of our 16 core scenarios had no models evaluated on them, and only a few scenarios (e.g. BoolQ, HellaSwag) had a considerable number of models evaluated on them. Note that this is cumulative: in the top plot, we not only document instances where the work introducing the model evaluated … view at source ↗
Figure 5
Figure 5. Figure 5: Evaluation components. Each evaluation run requires the specification of a scenario (what we want), a model with an adaptation process (how we get it), and one or more metrics (how good are the results). Instance. Input: Which of the following terms describes the body's ability to maintain its normal state? References: Anabolism; Catabolism; Tolerance; Homeostasis [correct]. Scenario: MMLU (subject: anatomy). Input: Which of… view at source ↗
Figure 7
Figure 7. Figure 7: Adaptation. During adaptation, we construct a prompt for each evaluation instance which may include in-context training instances as well. Given decoding parameters, a language model generates a completion (in red). The multiple choice example is shown using two different adaptation strategies that we describe subsequently, with left version being the joint strategy (all answer choices are presented at onc… view at source ↗
Figure 8
Figure 8. Figure 8: Scenario structure. Scenarios are what we want the language model to do. To specify a scenario, we break it down into a task, domain, and language, further subdividing the domain into properties of the text (what), speaker (who), and the time/circumstances (when). Examples of scenarios include (question answering, (clinical notes, doctors, now), English) and (toxicity detection, (tweets, Egypt, Internet-er… view at source ↗
Figure 9
Figure 9. Figure 9: Modern use cases for language models. An assortment of (largely novel/historically unexplored) potential use cases for language models. Figure sourced from https://beta.openai.com/ examples/. topics” of study in NLP at the time of writing.32 For each track, we map the associated subarea of NLP to canonical tasks for that track in [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The world’s languages. Only a tiny percentage of the world’s languages are currently represented in language models. There are over 6,000 languages in the world, with estimates varying due to the inherent uncertainty of what constitutes a separate language (Nordhoff & Hammarström, 2011). This map shows the languages of the world, with each dot representing one language and its color indicating the top-lev… view at source ↗
Figure 12
Figure 12. Figure 12: Example of information retrieval (passage ranking). An example instance for information retrieval from MS MARCO. We focus here on the passage ranking task: given a query q and a large corpus C of passages, systems must output a list of the top-k passages from C in decreasing “relevance” to q. We specifically study this in the context of re-ranking: since C is typically extremely large (e.g. |C| > 10M pass… view at source ↗
Figure 14
Figure 14. Figure 14: Example of sentiment analysis. An example instance for sentiment analysis from IMDB. news), particularly towards domains where there is greater demand for summaries (see Reiter, 2022). And we especially highlight that these two datasets have been the subject of critique, and that broader change is required for dataset and evaluation design in summarization and natural language generation (Gehrmann et al.,… view at source ↗
Figure 15
Figure 15. Figure 15: Example of toxicity detection. An example instance for toxicity detection from CivilComments. group membership as well as notions of social status and privilege, such that their interpretation causes disproportionate impact to members of marginalized groups (Welbl et al., 2021). We emphasize that the stakes for toxicity detection are as high as they can be. Failures in content moderation due to failures … view at source ↗
Figure 17
Figure 17. Figure 17: Calibration Metrics. A demonstration of how we measure calibration and selective classification. The model probabilities refer to the probabilities the model assigns to its prediction. For simplicity, the figure uses 2 bins for ECE computation, but we use 10 bins in practice. models inform decision-making (e.g. resume screening), which we increasingly see for language technology as its scope broadens. Fo… view at source ↗
Figure 18
Figure 18. Figure 18 [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Fairness Perturbations. An example of how we perturb examples to measure fairness with respect to subject properties (e.g. the gender of the entities mentioned in the text). of transformed versions of existing datasets (generated by the authors of the original datasets), aimed to test equivariance through counterfactually-augmented data (Kaushik et al., 2019). Since such contrast sets only exist for a few … view at source ↗
Figure 20
Figure 20. Figure 20: Bias Metrics. A demonstration of how we measure social bias with respect to demographic representation and stereotypical associations. do report performance disparities as a function of speaker properties (gender, nationality, spoken vs. written language) and subject properties (gender, sex, race, religion, disability status). Discussion. We additionally bring attention to the important question for futur… view at source ↗
Figure 21
Figure 21. Figure 21: Toxicity Metrics. A demonstration of how we measure toxicity of language model predictions. These measures depend on the co-occurrence statistics of demographic words with these stereotyped terms across model generations (see [PITH_FULL_IMAGE:figures/full_fig_p032_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Inference Efficiency Metrics. A demonstration of how we measure inference efficiency. We compute two metrics: denoised inference runtime and idealized inference runtime. F returns the runtime of encoding a prompt of given size, and g is the runtime of generating each additional output token for the given model. For some models, like the AI21 models, we do not have enough information to make a reliable est… view at source ↗
Figure 23
Figure 23. Figure 23: Prompt formatting. An example of how we structure and format the prompt for querying the language model. [Flattened table from the adjacent page, columns Language Modeling / TruthfulQA / CNN/DailyMail; prompt format: §J.1: prompting-test, §J.2: prompting-remainder. Instructions: None / None / Summarize the given documents. Input prefix: None / Question: / Document:. Reference prefix: None / None / None. Output prefix: None / Answer: / Summary: {. Instance prefix: None / None / N…] view at source ↗
Figure 24
Figure 24. Figure 24: Accuracy vs. X. The relationship between accuracy (x-axis) and each of the 6 metrics (calibration, robustness, fairness, social bias, toxicity, efficiency) we study in this work across all core scenarios and for all models. For calibration error, we measure ECE-10; for bias, we measure bias in gender representation; and for efficiency, we measure denoised inference time. Therefore, we harness our evaluat… view at source ↗
Figure 25
Figure 25. Figure 25: Correlation between metrics. The Pearson correlation between each metric and every other metric (x-axis). The small grey dots denote the correlation on each individual scenario. Trends are qualitatively similar for other correlation measures (e.g. Spearman correlation). For calibration error, we measure ECE-10; for bias, we measure bias in gender representation; and for efficiency, we measure denoised … view at source ↗
Figure 26
Figure 26. Figure 26: Head-to-head win rate per each model. We report the fraction of head-to-head comparisons between the given model and all other models, across all scenarios, where the given model is higher along the metric (e.g. more accurate in the accuracy subfigure). If a model was the highest for the given metric for every scenario, it would receive a score of 1.0; if a model received a score of 0.5, then if a scenari… view at source ↗
Figure 27
Figure 27. Figure 27: Cumulative accuracy over time. The relationship between time (x-axis) and the accuracy of the most accurate model released up to that point (y-axis) across 16 core scenarios. That is, the graph tracks the progress in the state-of-the-art (SOTA) accuracy over time for each scenario. in the top half of this tier. Similarly, within a model family, we see model scale is perfectly monotonically correlated with… view at source ↗
Figure 28
Figure 28. Figure 28: Accuracy as a function of model access. The relationship between access (open vs. limited vs. closed) and model accuracy for each of the 16 core scenarios. Shaded bars indicate the performance of the best model for that scenario, whereas the solid bars indicate the performance of the overall most accurate model across all core scenarios based on [PITH_FULL_IMAGE:figures/full_fig_p053_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Cumulative Scale vs. Accuracy. The relationship between model parameter size (x-axis) and the accuracy of the most accurate model released up to that scale on each core scenario. That is, the graph tracks the progress in the state-of-the-art (SOTA) accuracy as a function of scale for each scenario. [Plot residue from the adjacent figure: x-axis log(The Pile BPB), y-axis Accuracy; point labels include J1-Jumbo, text-davinci-002, J1-Grande, J1-Large, d…] view at source ↗
Figure 30
Figure 30. Figure 30: The Pile loss vs. Accuracy. The relationship between log bits-per-byte (BPB) on The Pile and the accuracy on each core scenario. 54 [PITH_FULL_IMAGE:figures/full_fig_p054_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Variance across seeds. For a subset of models and scenarios, we evaluate each scenario with three different random sets of in-context examples. We compute the range of the accuracy metric (maximum minus minimum value over the three random seeds) and visualize across models and scenarios. 8.2 Prompting analysis While the benchmark we design is general, we evaluate 30 models by adapting them through few-sho… view at source ↗
Figure 32
Figure 32. Figure 32: Number of in-context examples. For each model, we set the maximum number of in-context examples to [0, 1, 2, 4, 8, 16] and fit as many in-context examples as possible within the context window. We plot performance as a function of the average number of in-context examples actually used. Number of in-context examples. By default, we either use 5 in-context examples, or fewer examples for scenarios where 5 … view at source ↗
Figure 33
Figure 33. Figure 33: Multiple-choice adaptation. For each adaptation method (joint, separate, and separate calibrated), we compare models across scenarios. Formulation of multiple choice scenarios. Beyond the details of the prompt, we can conceptually imagine different ways to make use of the language interface to perform the same underlying scenario. As we discuss in Appendix J, in the case of multiple choice scenarios, we c… view at source ↗
Figure 34
Figure 34. Figure 34: Metric spread for core scenarios. Metrics for every model on every core scenario as a means for indicating the spread on a per-metric basis. 8.3 Task-specific results for core scenarios Since we organize the 16 core scenarios by the broader task, we highlight findings at the task level. To provide a sense of the spread in accuracies for each of these scenarios, we provide [PITH_FULL_IMAGE:figures/full_fi… view at source ↗
Figure 35
Figure 35. Figure 35: Robustness–equivariance via contrast sets. For the two scenarios where we have access to hand-crafted contrast sets, for each model, we plot the robustness of the model on that scenario (worst-case performance across perturbations of each instance) as a function of its standard accuracy. is the most accurate model for all 9 scenarios. The margin however varies greatly across different scenarios: the large… view at source ↗
Figure 36
Figure 36. Figure 36: Targeted evaluation of language. Model accuracy on the four scenarios for evaluating linguistic understanding. 8.4 Targeted evaluations Language. To further explore the results for this targeted evaluation, see https://crfm.stanford.edu/ helm/v0.1.0/?group=language and [PITH_FULL_IMAGE:figures/full_fig_p067_36.png] view at source ↗
Figure 37
Figure 37. Figure 37: Targeted evaluation of knowledge. Model accuracy on the six scenarios (5 question answering, WikiFact) for evaluating knowledge acquisition. various downstream tasks, this may indicate either a con of instruction-tuning or an over-generalization of linguistic rules, especially given the poor performance is on irregular forms in particular. Knowledge. To further explore the results for this targeted evalua… view at source ↗
Figure 38
Figure 38. Figure 38: Targeted evaluation of reasoning. Model accuracy on 12 scenarios (5 question answering, WikiFact) for evaluating reasoning capabilities. et al., 2021) and GSM8K (Cobbe et al., 2020) yielding low accuracies. Overall, we find code-davinci-002 is consistently the most accurate model across reasoning scenarios, even in spite of some scenarios being posed fully in natural language. For both synthetic reasoning… view at source ↗
Figure 39
Figure 39. Figure 39: Targeted evaluation of copyright and memorization. Model performance on targeted evaluations for memorization for both copyrighted text and licensed code. [Plot residue: axis ticks and model labels, including YaLM (100B), T5 (11B), ada (350M), J1-Jumbo v1 (178B), text-ada-001, Cohere large v20220720 (13.1B), text-curie-001, TNLG v2 (6.7B), davinci (175B), curie (6.7B), text-babbage-001, Cohere medium v20220720 (6.1B), J1-Large v1 (7.5B), J1-Grande v1 (17B), ba…] view at source ↗
Figure 40
Figure 40. Figure 40: Targeted evaluation of social bias. Model performance on targeted evaluations for bias on BBQ. 70 [PITH_FULL_IMAGE:figures/full_fig_p070_40.png] view at source ↗
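
The Figure 17 caption mentions expected calibration error (ECE) computed with 10 bins. A minimal sketch of that kind of computation follows, assuming equal-width confidence bins. The binning scheme, the function name, and the toy data are assumptions for illustration and are not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    if any(p < 0.0 or p > 1.0 for p in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, c))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece


# Hypothetical predictions: two confident and correct, two less confident with one error.
toy_ece = expected_calibration_error(
    [0.95, 0.95, 0.65, 0.65], [True, True, True, False]
)
print(toy_ece)
```

In the toy data, the high-confidence bin has a gap of 0.05 and the lower bin a gap of 0.15, each holding half of the predictions, so the result is about 0.10.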
Original abstract

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 4 minor

Summary. The paper introduces HELM, a framework for holistic evaluation of language models. It first taxonomizes the space of scenarios (use cases) and metrics (desiderata), then selects a feasible subset of 16 core scenarios and 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) for multi-metric evaluation (achieved 87.5% of the time). It evaluates 30 models (open, limited-access, closed) on these plus 26 targeted scenarios, achieving 96% dense coverage on the core set (up from a prior average of 17.9%), surfaces 25 top-level findings, and releases all raw prompts and completions together with a modular toolkit.

Significance. If the results hold, this provides a substantial advance in standardized, multi-metric LM evaluation that exposes trade-offs and improves transparency over prior fragmented benchmarks. Explicit credit is due for the public release of raw model outputs and the modular toolkit, which directly support reproducibility and community extensions. The documented gaps (e.g., QA for neglected dialects, trustworthiness metrics) and the 96% coverage claim are presented as concrete improvements rather than exhaustive holism.

major comments (1)
  1. [evaluation section / abstract] The central coverage claim (96.0% on 16 core scenarios across all 30 models) is a direct measurement and load-bearing for the contribution, but the manuscript should clarify in the evaluation section how the prior 17.9% average was computed (e.g., which models and scenarios were included in the baseline calculation) to allow readers to assess the improvement magnitude.
minor comments (4)
  1. [abstract] Abstract: the 87.5% multi-metric figure is stated without noting it corresponds to 14 out of 16 scenarios; adding this parenthetical would improve immediate clarity.
  2. [abstract / introduction] The 25 top-level findings are referenced but not summarized or enumerated in the abstract or introduction; a concise bullet list or table reference would help readers locate the key outputs.
  3. [taxonomy section] Notation for scenarios and metrics is introduced in the taxonomy section but could benefit from a single consolidated table early in the paper to reduce cross-referencing.
  4. [targeted evaluations section] The targeted evaluations (7 evaluations on 26 scenarios) are described at a high level; a brief table mapping each targeted evaluation to its scenarios and metrics would aid navigation.
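
The 87.5% figure in minor comment 1 can be checked by hand. The arithmetic below is a sketch. The first equality follows the referee's reading (14 of 16 scenarios); the second assumes the denominator is every scenario-metric pair (16 scenarios times 7 metrics), which this page does not state.

```latex
\[
\frac{14}{16} = 0.875 = 87.5\%,
\qquad
\frac{98}{16 \times 7} = \frac{98}{112} = 0.875 = 87.5\%.
\]
```

Both readings give the same percentage, so a parenthetical in the abstract would need to name the unit being counted.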

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment and recommendation for minor revision. We address the major comment below.

  1. Referee: [evaluation section / abstract] The central coverage claim (96.0% on 16 core scenarios across all 30 models) is a direct measurement and load-bearing for the contribution, but the manuscript should clarify in the evaluation section how the prior 17.9% average was computed (e.g., which models and scenarios were included in the baseline calculation) to allow readers to assess the improvement magnitude.

    Authors: We agree that providing more detail on the baseline would improve clarity. The 17.9% average was computed by surveying the published evaluations of the 30 models against the 16 core scenarios prior to HELM (i.e., counting how many of the 16 scenarios each model had been evaluated on in the literature, then averaging). In the revised manuscript we will add an explicit paragraph in the evaluation section describing this survey methodology, the sources consulted, and the per-model counts that underlie the average. revision: yes
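
The survey methodology described in the response (count how many core scenarios each model had been evaluated on, then average) can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the set-based data layout, and the toy models and scenarios are hypothetical and do not reproduce the paper's per-model counts.

```python
from typing import Dict, Set


def mean_core_coverage(
    evaluated: Dict[str, Set[str]], core_scenarios: Set[str]
) -> float:
    """Average over models of the fraction of core scenarios each was evaluated on."""
    if not evaluated or not core_scenarios:
        raise ValueError("need at least one model and one core scenario")
    per_model = [
        # Scenarios outside the core set are ignored by the intersection.
        len(scenarios & core_scenarios) / len(core_scenarios)
        for scenarios in evaluated.values()
    ]
    return sum(per_model) / len(per_model)


# Hypothetical toy survey: 4 core scenarios and 3 models.
core = {"s1", "s2", "s3", "s4"}
prior_evaluations = {
    "model_a": {"s1", "s2"},        # 2 of 4 core scenarios
    "model_b": set(),               # none
    "model_c": {"s1", "non_core"},  # 1 of 4 core scenarios
}
coverage = mean_core_coverage(prior_evaluations, core)
print(coverage)
```

With the toy data the per-model fractions are 0.5, 0.0, and 0.25, so the mean is 0.25. The same function applied to a survey of the 30 models against the 16 core scenarios would yield a figure comparable to the 17.9% baseline, provided the survey uses the same counting rule.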

Circularity Check

0 steps flagged

No significant circularity identified


The paper's central claims consist of (1) a taxonomy and feasibility-based selection of scenarios/metrics with explicit documentation of gaps, (2) direct empirical measurements of 7 metrics across 16 core scenarios for 30 models, and (3) descriptive coverage statistics (e.g., prior 17.9% to 96.0% dense benchmarking). These are factual outputs of running the evaluations under standardized conditions, not quantities derived from or fitted to the results themselves. No equations, parameter fitting, self-citation chains, or uniqueness theorems appear in the derivation; the 25 findings are reported measurements rather than premises. The selection process is presented as an improvement over prior fragmentation with acknowledged incompleteness, rendering the evaluation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The contribution rests on the domain assumption that the chosen 16 scenarios and 7 metrics capture the most important dimensions despite acknowledged gaps, plus the practical selection of which 30 models and 42 scenarios to run under standardized conditions.

free parameters (2)
  • Choice of 16 core scenarios
    Selected from the taxonomy on the basis of coverage and feasibility rather than a formal optimality criterion.
  • Choice of 7 metrics
    Accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency; selected to balance breadth and measurability.
axioms (1)
  • domain assumption: Standardized evaluation conditions produce comparable and meaningful metric values across open, limited-access, and closed models.
    Invoked when claiming the 96% coverage improvement and the 25 top-level findings.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

    econ.EM 2026-05 accept novelty 8.0

    EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

  3. A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring

    cs.CY 2026-05 accept novelty 8.0

    Continuous auditing creates an unavoidable cover regime in which static auditors cannot simultaneously eliminate coverage and granularity failures, shown via new policies, strategies, and a reproducible simulator.

  4. MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

    cs.SE 2026-01 accept novelty 8.0

    MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

  5. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

  6. GRASP: Deterministic argument ranking in interaction graphs

    cs.LG 2026-05 unverdicted novelty 7.0

    GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging a...

  7. SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

    q-bio.NC 2026-05 unverdicted novelty 7.0

    SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...

  8. Causal Bias Detection in Generative Artificial Intelligence

    cs.AI 2026-05 unverdicted novelty 7.0

    Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.

  9. HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

    cs.CL 2026-05 unverdicted novelty 7.0

    Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

  10. Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

    cs.HC 2026-05 accept novelty 7.0

    LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

  11. CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

    cs.CV 2026-05 conditional novelty 7.0

    Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

  12. iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

    cs.CV 2026-05 unverdicted novelty 7.0

    iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).

  13. LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMSpace is the first framework to jointly model operational and embodied carbon for LLM inference on LEO satellites, incorporating radiation-hardened hardware, peripheral systems, and workload patterns such as prefil...

  14. LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMSpace is the first modeling framework that jointly calculates operational and embodied carbon emissions for LLM inference on LEO satellites, incorporating radiation-hardened hardware, peripheral systems, and LLM wo...

  15. Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

  16. The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

    cs.LG 2026-05 unverdicted novelty 7.0

    An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

  17. TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

    cs.CV 2026-04 accept novelty 7.0

    TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

  18. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  19. SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

    cs.CL 2026-04 accept novelty 7.0

    SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

  20. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  21. MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

    cs.SE 2026-01 unverdicted novelty 7.0

    MCP-Atlas introduces a benchmark of 36 real MCP servers, 220 tools, and 1,000 natural-language tasks to measure LLM tool-use competency in multi-server workflows.

  22. PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

    cs.AI 2026-01 conditional novelty 7.0

    PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.

  23. Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

    cs.SE 2026-01 conditional novelty 7.0

    Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

  24. Automatic Replication of LLM Mistakes in Medical Conversations

    cs.CL 2025-12 unverdicted novelty 7.0

    MedMistake automatically generates 3,390 single-shot QA pairs capturing LLM mistakes in medical conversations, with expert validation on a 211-question subset showing performance differences among 12 frontier models.

  25. Classification Trees with Valid Inference via the Exponential Mechanism

    stat.ME 2025-11 unverdicted novelty 7.0

    Classification trees built with the exponential mechanism generate asymptotically valid inference pivots from sampling probabilities without major accuracy loss.

  26. Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

    cs.LG 2025-05 conditional novelty 7.0

    A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of mode...

  27. PRIMETIME: Limits of LLMs in Temporal Primitives

    cs.NE 2025-04 unverdicted novelty 7.0

    PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

  28. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  29. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  30. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

    cs.CL 2023-05 accept novelty 7.0

    Chain-of-thought explanations in LLMs are frequently unfaithful: models systematically omit mention of biasing prompt features that change their answers and instead produce rationalizations for those biased outputs.

  31. Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

    cs.SE 2026-05 unverdicted novelty 6.0

    The paper introduces contractual skills as a GovernSpec-inspired framework for AI agent SKILL.md files and evaluates it in text-generation and tool-calling experiments showing gains in checkability over baselines.

  32. Mem-π: Adaptive Memory through Learning When and What to Generate

    cs.CL 2026-05 unverdicted novelty 6.0

    Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

  33. Causal Bias Detection in Generative Artificial Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

  34. Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

    cs.CR 2026-05 unverdicted novelty 6.0

    GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.

  35. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  36. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  37. Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

    cs.HC 2026-05 unverdicted novelty 6.0

    A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.

  38. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  39. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  40. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.

  41. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE detects positive distributional Jensen Gaps across five multi-agent LLM architectures on a banking-risk benchmark, showing that quality drops under semantic stress can coexist with statistically detectable antifr...

  42. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  43. What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

  44. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  45. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  46. Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

    cs.IR 2026-04 conditional novelty 6.0

    RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.

  47. Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...

  48. Are Large Language Models Economically Viable for Industry Deployment?

    cs.CL 2026-04 unverdicted novelty 6.0

    Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

  49. Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier

    cs.AI 2026-04 unverdicted novelty 6.0

    ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domain...

  50. Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

  51. The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

    cs.AI 2026-04 unverdicted novelty 6.0

    Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.

  52. BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

    cs.CL 2026-04 unverdicted novelty 6.0

    BERT-as-a-Judge fine-tunes a BERT encoder on synthetic question-candidate-reference triplets to judge answer correctness, outperforming lexical baselines and matching larger LLM judges across 36 models and 15 tasks.

  53. AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.

  54. SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

    cs.SE 2026-04 unverdicted novelty 6.0

    SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

  55. Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.

  56. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  57. Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

    cs.CL 2026-03 unverdicted novelty 6.0

    Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 1...

  58. Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation

    cs.CL 2026-03 unverdicted novelty 6.0

    CAP-TTA triggers context-aware preconditioned LoRA updates on high bias-risk OOD prompts to reduce toxicity in LLM narrative generation while preserving fluency and avoiding catastrophic forgetting.

  59. Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

    cs.AI 2026-03 conditional novelty 6.0

    Repeated sampling of the same safety prompts reveals substantial differences in LLM failure probabilities across temperatures that conventional single-evaluation benchmarks miss.

  60. Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking

    cs.IR 2026-02 unverdicted novelty 6.0

    Internal attention in LLMs shows a bell-curve relevance distribution across layers, enabling Selective-ICR that cuts inference latency 30-50% and lets an 8B zero-shot model match 14B RL re-rankers on BRIGHT.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 105 Pith papers · 7 internal anchors

  1. [1]

    Language Models are Few-Shot Learners

    Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.385. URL https://www.aclweb.org/anthology/2021.naacl-main.385. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom He...

  2. [2]

    doi: 10.18653/v1/2021.acl-long.150

    Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.150. URL https://aclanthology.org/2021.acl-long.150. Frieda Goldman-Eisler. Speech production and the predictability of words in context. Quarterly Journal of Experimental Psychology, 10(2):96–106, 1958. doi: 10.1080/17470215808416261. URL https://doi.org/10.1080/17470215808416261. ...

  3. [3]

    URL https://glottolog.org/ accessed 2021-08-08

    doi: 10.5281/zenodo.4761960. URL https://glottolog.org/ accessed 2021-08-08. Yiding Hao, William Merrill, Dana Angluin, Robert Frank, Noah Amsel, Andrew Benz, and Simon Mendelsohn. Context-free transductions with neural stacks. EMNLP 2018, pp. 306, 2018. Gilbert Harman. Rationality. John Wiley & Sons, Ltd, 2013. Junxian He, Chunting Zhou, Xuezhe Ma, Taylor ...

  4. [4]

    Measuring Coding Challenge Competence With APPS

    URL https://openreview.net/forum?id=0RDcd5Axok. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=XPZIaotutsD. 95 Published in Transactions on Machine Learning Research (08/20...

  5. [5]

    In Christopher Hitchcock & Alan Hajek, editors: Oxford Handbook of Probability and Philosophy, Oxford University Press, pp

    URL https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199286546.001.0001/oxfordhb-9780199286546-e-6. Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, 2021. Association for Computing Machinery. URL https://arxiv.org/abs/19...

  6. [6]

    Cognition , year =

    doi: https://doi.org/10.1016/j.cognition.2007.05.006. URL https://www.sciencedirect.com/science/article/pii/S0010027707001436. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and co...

  7. [7]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    doi: 10.2466/pr0.1957.3.3.635. URL https://doi.org/10.2466/pr0.1957.3.3.635. Floyd G. Lounsbury. Transitional probability, linguistic structure and systems of habit-family hierarchies. Psycholinguistics: a survey of theory and research, 1954. Henry P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159–165, 19...

  8. [8]

    Red Teaming Language Models with Language Models

    ISSN 2474-7394. URL https://online.ucpress.edu/collabra/article/7/1/25293/117809/A-Practical-Guide-to-Doing-Behavioral-Research-on. 25293. Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processi...

  9. [9]

    Measuring and Narrowing the Compositionality Gap in Language Models

    Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P19-1101. Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5684–5693, 2017. Christopher Potts, Zhengxuan Wu, Atticus Geiger, and Douwe Kiela. DynaSe...

  10. [10]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    doi: 10.48550/ARXIV.2211.05100. URL https://arxiv.org/abs/2211.05100. Anna Schmidt and Michael Wiegand. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pp. 1–10, Valencia, Spain, April 2017. Association for Computational Linguistics. doi...

  11. [11]

    Yes” or “No

    doi: 10.1145/2460276.2460278. URL http://doi.acm.org/10.1145/2460276.2460278. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014. Benedikt Szmrecsanyi, Jason Grafmiller, and Laura Rosseel...

  12. [12]

    Dialect Perturbation: We currently support conversions between Standard American English (SAE) and African American English (AAE) using the mapping between lexical terms provided by Ziems et al. (2022)

  13. [13]

    Gender Pronoun Perturbation: We support conversions between the gender neutral and gendered pronouns from Lauscher et al. (2022)

  14. [14]

    Grandfather

    Gender Term Perturbation: We convert gender terms of a source gender (e.g. “Grandfather”) to their counterparts in a target gender (e.g. “Grandmother”). We build our mapping by improving the union of the mappings from Garg et al. (2018) and Bolukbasi et al. (2016)

  15. [15]

    (2017), which derives its list from Greenwald et al

    First Name Perturbation: We convert first names in a source race or gender to those in the target race or gender, using the names from Caliskan et al. (2017), which derives its list from Greenwald et al. (1998). The associations between demographic category and name are derived from US Census statistics; we note that these statistical relationships may also not be invaria...

  16. [16]

    (2018), which derives its list from Chalabi & Flowers (2017)

    Last Name Perturbation: We convert last names in a source race to those in the target race, using the last names from Garg et al. (2018), which derives its list from Chalabi & Flowers (2017). See the above discussion of the relationship between names and demographic information; we also note that the frequent instance of name change through marriage is po...

  17. [17]

    It came from down here

    subset of the scenario looks like: “It came from down here.” “What were you thinking bringing a stranger here?” “... look out for herself.” “I wouldn’t be alive if it wasn’t for her.” “Yeah, well, I’m protecting you now.” The textual output of a language model should be the same as the input. The main metric for the scenario is bits per byte (BPB). Data...

  18. [18]

    markup for the text itself,

  19. [19]

    parenthetical annotations provided by the authors, and 143 Published in Transactions on Machine Learning Research (08/2023)

  20. [20]

    The capital of France is __

    speaker tags for the spoken texts. Tags in the first category are removed with the enclosed text intact; tags in the second category are removed along with the enclosed text; and speaker tags are left as-is. The final preprocessed texts average 2046 tokens using the GPT-2 tokenizer. Data Resources. This dataset is not made available through our benchmark;...

  21. [21]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    B+-A, Table columns: Relation ID, Relation Name, Prompt. Art: P136 genre, “The genre of [X] is a/an”; P1303 instrument, “The musical instrument [X] plays is”; P50 author, “The author of [X] is”; P170 creator, “The creator of [X] is”; P86 composer, “The composer of [X] is”; P57 director, “The director of [X] is”. Law: P1001 applies to jurisd...
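
Entry 14 above describes gender term perturbation as converting terms of a source gender to their counterparts in a target gender through a term mapping. A minimal Python sketch of that idea follows. The three-pair mapping, the function names `match_case` and `perturb_gender_terms`, and the whole-word, case-preserving matching are assumptions made for illustration only; they are not the paper's implementation, whose mapping is built from Garg et al. (2018) and Bolukbasi et al. (2016).

```python
import re

# Hypothetical three-pair mapping, for illustration only. The paper's actual
# mapping is derived from Garg et al. (2018) and Bolukbasi et al. (2016).
GENDER_TERM_PAIRS = {
    "grandfather": "grandmother",
    "father": "mother",
    "son": "daughter",
}


def match_case(source, target):
    """Give `target` the same capitalization pattern as `source`."""
    if source.isupper():
        return target.upper()
    if source[:1].isupper():
        return target.capitalize()
    return target


def perturb_gender_terms(text, mapping):
    """Replace whole-word source-gender terms with their mapped counterparts."""
    # Longest terms first, so "grandfather" is tried before "father".
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        return match_case(word, mapping[word.lower()])

    return pattern.sub(replace, text)


print(perturb_gender_terms("Grandfather thanked his son.", GENDER_TERM_PAIRS))
# prints: Grandmother thanked his daughter.
```

The pronoun "his" is left unchanged in this sketch because pronouns are not in the illustrative mapping; entry 13 lists gender pronoun perturbation as a separate conversion.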