PRIMETIME: Limits of LLMs in Temporal Primitives

Edward Gaere; Florian Wangenheim

arxiv: 2504.16155 · v2 · submitted 2025-04-22 · 💻 cs.NE

Pith reviewed 2026-05-22 18:30 UTC · model grok-4.3

keywords: temporal reasoning, LLMs, synthetic generator, datetime parsing, arithmetic, fine-tuning, event planning

The pith

Synthetic generator exposes unreliability of LLM datetime primitives but enables their learning through fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to show that LLMs have unreliable performance on the basic building blocks of temporal reasoning—specifically parsing datetimes and performing arithmetic on them—because existing tests mix these with other skills. The PRIMETIME generator creates pure examples for testing each primitive alone, uncovering wide variations in model accuracy. By reusing the generator to produce fine-tuning examples, the authors demonstrate that these primitives can be taught to small models, leading to strong results on more complex tasks such as planning events. A sympathetic reader would care because this provides a concrete method to both identify and correct specific LLM limitations in handling time without enormous computational costs. The broader implication is a reusable pattern for improving model capabilities on targeted skills.

Core claim

The authors claim that temporal primitives are individually unreliable in LLMs, with accuracy varying greatly depending on the model and prompt. Using their PRIMETIME synthetic generator to create isolated test cases, they show these issues clearly. The same generator then supplies training data that makes the primitives fully learnable, allowing small quantized LoRA transformers to attain frontier-level accuracy on composed Event Planning tasks.

What carries the argument

PRIMETIME, the synthetic generator that delivers non-conflated datetime exemplars for decompositional evaluation and fine-tuning of parsing and arithmetic primitives.

If this is right

Individual primitives show unreliable performance ranging from near-zero to perfect accuracy.
Primitives are fully learnable when trained on data from the generator.
Composed Event Planning reaches frontier-level accuracy with small quantized LoRA transformers.
The generator supports both diagnosis and production-ready improvements in one system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach of synthetic diagnosis paired with remediation may generalize to other domains where LLMs show inconsistent reasoning.
Practitioners could use similar generators to create custom training sets for domain-specific temporal tasks.
Future work might explore whether these improvements hold when models are tested on naturally occurring temporal data from the web or documents.

Load-bearing premise

The synthetic generator produces non-conflated, uncontaminated datetime exemplars that accurately isolate parsing and arithmetic without introducing artifacts that would not appear in real-world temporal data or tasks.

What would settle it

Fine-tuned models failing to outperform base models on a held-out set of real-world temporal reasoning examples not created by the generator would falsify the general learnability of the primitives.

Figures

Figures reproduced from arXiv: 2504.16155 by Edward Gaere, Florian Wangenheim.

**Figure 1.** The Translation ISO-8601 task requires the translation of a datetime from its natural representation to its ISO-8601 representation. Text recovered from the figure: ISO-8601 representation 3299-01-28T00:43:15; add 20 days; ISO-8601 representation with 20 days added 3299-02-17T00:43:15.

**Figure 2.** The Computation Add-20 task requires adding 20 days to a datetime provided in ISO-8601 representation, and producing a new ISO-8601 datetime with the result. The datasets for all tasks are synthetically generated. This allows for accurate labels with no noise, an arbitrary number of observations at low cost, and the datasets can be frequently regenerated to reduce contamination. A detailed description of t…

**Figure 3.** Prompt, system prompt, output and evaluation for the …

**Figure 4.** The use of the ISO-8601 format for the input sequences, rather than the natural representation, makes it possible to measure the model's reasoning ability specifically, because no translation is required: the components of an ISO-8601 datetime can be easily and unambiguously extracted. The DATETIME benchmark also includes Mixed tasks that require the same datetime arithmetic, but the input format is a natural represent…

**Figure 5.** Prompt, system prompt, output and evaluation for the …

**Figure 6.** Required Python steps to resolve the Add-250 task. A solution requires converting the date …

**Figure 7.** Underlying CPython function calls. Source code available on GitHub.

**Figure 1.** Prompt, output and evaluation of the Year Translation task. Note that the comparison of output and target sequences is integer-wise, i.e. '01' is the same as '1'. Prompt parameters: max tokens: 50; temperature: 0.0.
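The task types named in the captions above can be illustrated with Python's standard `datetime` module. This is a minimal sketch, not the paper's code: the function names, the `MONTHS` table, and the natural-language input pattern are assumptions made for illustration, while the Add-20 input and expected output are the values visible in the figure text.

```python
import re
from datetime import datetime, timedelta

# Hypothetical month table; the paper's natural-language formats are not specified here.
MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def translate_to_iso(natural: str) -> str:
    """Translation task: parse one assumed natural format into ISO-8601."""
    m = re.fullmatch(r"(\w+) (\d{1,2}), (\d{4}) at (\d{2}):(\d{2}):(\d{2})", natural)
    if m is None:
        raise ValueError(f"unsupported format: {natural!r}")
    month, day, year, hh, mm, ss = m.groups()
    return datetime(int(year), MONTHS[month], int(day),
                    int(hh), int(mm), int(ss)).isoformat()

def add_days_iso(iso: str, days: int) -> str:
    """Computation task: add a number of days to an ISO-8601 datetime."""
    return (datetime.fromisoformat(iso) + timedelta(days=days)).isoformat()

def integer_match(output: str, target: str) -> bool:
    """Integer-wise comparison, so '01' and '1' count as equal."""
    try:
        return int(output.strip()) == int(target.strip())
    except ValueError:
        return False

print(translate_to_iso("January 28, 3299 at 00:43:15"))
print(add_days_iso("3299-01-28T00:43:15", 20))
print(integer_match("01", "1"))
```

Keeping the two functions separate mirrors the decompositional idea: the arithmetic step only ever sees an ISO-8601 string, so an error there cannot be blamed on parsing.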

Original abstract

This paper introduces PRIMETIME, a synthetic generator that supports both benchmarking and fine-tuning of two primitive operations underlying temporal reasoning in Large Language Models (LLMs): parsing and arithmetic on datetimes. Existing temporal benchmarks assume simplified canonical datetime forms, conflate arithmetic, composition, and world knowledge into a single aggregate score, and offer no direct path to remediation. The first contribution is methodological: the PRIMETIME synthetic generator delivers non-conflated, uncontaminated, and unlimited datetime exemplars that enable a decompositional evaluation strategy for each primitive in isolation. The generator is extensible to support complex datetime tasks and is publicly released, alongside generated benchmarks. The second contribution is diagnostic: under this evaluation strategy, the primitives themselves prove individually unreliable, with per-primitive accuracy ranging from near-zero to perfect across models and prompting conditions. The third contribution is constructive: the same generator used for diagnosis also produces new training exemplars for fine-tuning, and the resulting models show that the primitives are fully learnable and the composed Event Planning task reaches frontier-level accuracy using small quantized LoRA transformers. The broader takeaway is that a single synthetic generator can serve both diagnosis and production-ready deployment. This methodological pattern may apply beyond temporal reasoning.
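To make the idea of a dual-purpose generator concrete, the following is a small sketch of how isolated exemplars for each primitive might be produced. It is not the released PRIMETIME generator: every name here (`NATURAL_FORMATS`, `random_datetime`, `parsing_exemplar`, `arithmetic_exemplar`, `generate`), the numeric input formats, the default year range, and the day-offset range are assumptions chosen for illustration.

```python
import random
from datetime import datetime, timedelta

# Hypothetical non-canonical input formats (numeric only, to stay locale-independent).
NATURAL_FORMATS = ["%d/%m/%Y %H:%M:%S", "%Y.%m.%d at %H:%M:%S"]

def random_datetime(rng, min_year=1900, max_year=2100):
    """Sample a datetime uniformly, at one-second resolution, in an assumed year range."""
    start = datetime(min_year, 1, 1)
    end = datetime(max_year + 1, 1, 1)
    seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randrange(seconds))

def parsing_exemplar(rng):
    """Parsing only: non-canonical string in, ISO-8601 out, no arithmetic involved."""
    dt = random_datetime(rng)
    fmt = rng.choice(NATURAL_FORMATS)
    return {"primitive": "parsing", "format": fmt,
            "input": dt.strftime(fmt), "target": dt.isoformat()}

def arithmetic_exemplar(rng, max_days=250):
    """Arithmetic only: the input is already ISO-8601, so no parsing ambiguity."""
    dt = random_datetime(rng)
    days = rng.randint(1, max_days)
    return {"primitive": "arithmetic", "input": dt.isoformat(), "days": days,
            "target": (dt + timedelta(days=days)).isoformat()}

def generate(n, seed):
    """Alternate between the two primitives; the seed makes the set reproducible."""
    rng = random.Random(seed)
    return [parsing_exemplar(rng) if i % 2 == 0 else arithmetic_exemplar(rng)
            for i in range(n)]

for exemplar in generate(4, seed=0):
    print(exemplar)
```

Because labels are computed rather than annotated, the same function can be called with one seed for evaluation and another for fine-tuning data, which is one plausible way to regenerate sets on demand.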

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real contribution is a reusable synthetic generator that isolates datetime parsing and arithmetic for diagnosis and then fixes them via targeted fine-tuning, though the results stay tied to that generator's distribution.

The main thing to know is that PRIMETIME gives a practical split between testing parsing and arithmetic on datetimes separately, then uses the same generator to generate training data that lifts performance on both the primitives and a composed event-planning task. They release the generator and benchmarks, which is the part that could actually get used by others working on temporal reasoning.

Referee Report

3 major / 2 minor

Summary. The paper introduces PRIMETIME, a synthetic generator for producing unlimited, non-conflated datetime exemplars that isolate two temporal primitives—parsing and arithmetic—without conflating them with composition or world knowledge. It uses the generator for a decompositional benchmark showing per-primitive LLM accuracies ranging from near-zero to perfect across models and prompts, then re-uses the generator to create fine-tuning data. Small quantized LoRA transformers trained on this data master the primitives and reach frontier-level accuracy on a composed Event Planning task. The generator and benchmarks are publicly released.

Significance. If the central claims hold, the work demonstrates that a single extensible synthetic generator can serve dual purposes of diagnosis and targeted remediation for temporal reasoning weaknesses in LLMs. The public release of the generator and benchmarks is a clear strength supporting reproducibility. The approach of isolating primitives and showing they are learnable via small-scale fine-tuning offers a practical alternative to aggregate benchmarks that mix multiple skills.

major comments (3)

[§3] §3 (Generator Design): The assertion that PRIMETIME produces 'non-conflated, uncontaminated' exemplars that cleanly isolate parsing and arithmetic lacks a formal specification of the sampling procedure for year ranges, timezone coverage, format variations, and edge-case handling. Without explicit rules or validation against real-world distributions, it is unclear whether the reported near-zero to perfect accuracy ranges and subsequent LoRA fine-tuning success reflect genuine primitive acquisition or generator-specific statistical shortcuts.
[§4.2] §4.2 (Diagnostic Results): The per-primitive accuracy claims (near-zero to perfect) are presented without error bars, number of evaluation runs, or statistical tests for the extremes. This makes it difficult to determine whether the unreliability diagnosis is robust or sensitive to prompt variations and model-specific behaviors.
[§5] §5 (Fine-Tuning Experiments): The claim that the composed Event Planning task reaches 'frontier-level accuracy' using small quantized LoRA transformers is not accompanied by direct comparisons to strong baselines (e.g., larger models without primitive-specific fine-tuning or models trained on real temporal corpora). This weakens the constructive contribution that the primitives are 'fully learnable' in a general sense.

minor comments (2)

[Abstract / §1] The abstract and introduction would benefit from explicit citations to prior temporal benchmarks (e.g., TimeQA, TempReason) to better situate the decompositional strategy.
[Figures in §4] Figure captions for benchmark results should include the exact number of test exemplars per primitive and the prompting conditions used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment in turn below, clarifying our design choices and indicating revisions that will strengthen the presentation without altering the core claims.

Referee: [§3] §3 (Generator Design): The assertion that PRIMETIME produces 'non-conflated, uncontaminated' exemplars that cleanly isolate parsing and arithmetic lacks a formal specification of the sampling procedure for year ranges, timezone coverage, format variations, and edge-case handling. Without explicit rules or validation against real-world distributions, it is unclear whether the reported near-zero to perfect accuracy ranges and subsequent LoRA fine-tuning success reflect genuine primitive acquisition or generator-specific statistical shortcuts.

Authors: We agree that §3 would benefit from greater formalization. The generator isolates primitives by construction: parsing exemplars contain only a datetime string and a target representation with no arithmetic, while arithmetic exemplars provide two datetimes and an operation with no parsing ambiguity. In the revision we will add an explicit algorithmic description of the sampling procedure, including uniform sampling over year ranges 1900–2100, coverage of UTC plus common offsets, controlled format variations (ISO, natural-language, abbreviated), and explicit handling of edge cases (leap seconds, DST transitions, month boundaries). We will also report summary statistics comparing generated distributions to real-world datetime corpora to confirm absence of unintended shortcuts. These additions will make the isolation claim fully verifiable.

Revision: yes

Referee: [§4.2] §4.2 (Diagnostic Results): The per-primitive accuracy claims (near-zero to perfect) are presented without error bars, number of evaluation runs, or statistical tests for the extremes. This makes it difficult to determine whether the unreliability diagnosis is robust or sensitive to prompt variations and model-specific behaviors.

Authors: The referee correctly notes the absence of variability measures. In the revised manuscript we will report all per-primitive accuracies as means over five independent evaluation runs with different random seeds, include standard deviations as error bars, and apply paired statistical tests (McNemar or Wilcoxon) to confirm that the observed near-zero versus near-perfect differences across models and prompts are significant. This will directly address concerns about robustness to prompt and model variation.

Revision: yes

Referee: [§5] §5 (Fine-Tuning Experiments): The claim that the composed Event Planning task reaches 'frontier-level accuracy' using small quantized LoRA transformers is not accompanied by direct comparisons to strong baselines (e.g., larger models without primitive-specific fine-tuning or models trained on real temporal corpora). This weakens the constructive contribution that the primitives are 'fully learnable' in a general sense.

Authors: We maintain that the primary constructive result is that primitive-specific synthetic data enables small LoRA adapters to reach frontier-level performance on the composed task, which is a practical and resource-efficient finding. Nevertheless, we acknowledge the value of additional context. In revision we will add a baseline comparison against the same base models without primitive fine-tuning (showing the lift attributable to PRIMETIME data) and, where feasible, against publicly available models fine-tuned on general temporal corpora. These comparisons will be presented as supplementary evidence rather than exhaustive benchmarking, preserving the paper's focus on the diagnostic-to-remediation pipeline.

Revision: partial
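The variability reporting promised in the second response (means and standard deviations over repeated runs, plus a paired test such as McNemar) could look roughly like the sketch below. It uses only the Python standard library; the helper names `summarize_runs` and `mcnemar_exact` are hypothetical, the exact binomial form of McNemar's test is one common variant and not necessarily the one the authors would use, and the numbers in the usage lines are placeholders rather than results from the paper.

```python
from math import comb
from statistics import mean, stdev

def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracy over repeated runs (needs >= 2 runs)."""
    return mean(accuracies), stdev(accuracies)

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items that condition A got right and condition B got wrong
    c: items that condition A got wrong and condition B got right
    Under the null hypothesis the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Placeholder inputs, for illustration only.
print(summarize_runs([0.5, 0.5, 0.5]))
print(mcnemar_exact(0, 10))
```

The paired test only uses the items on which the two conditions disagree, which suits comparisons where the same generated exemplars are scored under two prompts or two models.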

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation introduces an independent synthetic generator for datetime exemplars, applies it to produce diagnostic benchmarks that reveal per-primitive unreliability, and then uses separate exemplars from the same generator for fine-tuning to demonstrate learnability and improved Event Planning performance. This chain does not reduce any result to its inputs by construction: the diagnostic accuracies and post-fine-tuning gains are empirical measurements on held-out data rather than tautological re-statements of the generator's sampling rules or fitted parameters. No equations, self-citations, uniqueness theorems, or ansatzes are shown to create a load-bearing loop, and the generator functions as an external artifact whose outputs serve as falsifiable test cases. The overall methodology remains self-contained against the produced benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the generator being able to produce clean, representative examples; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Synthetic datetime examples can be generated without contamination or conflation of parsing, arithmetic, and world knowledge.
This premise is required for the decompositional evaluation strategy to isolate the two primitives as claimed.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat ≃ Nat (recovery theorem) unclear

PRIMETIME synthetic generator delivers non-conflated, uncontaminated datetime exemplars... primitives themselves prove individually unreliable... fine-tuning... Event Planning task reaches frontier-level accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · 54 internal anchors

[1]

URL https://openai.com/index/hello-gpt-4o

Hello gpt-4o, . URL https://openai.com/index/hello-gpt-4o

work page
[2]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

URL https://www.llama.com/

Build the future of ai with meta llama 3, 2024. URL https://www.llama.com/

work page 2024
[5]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark (published at neurips 2024 track datasets and benchmarks), 202...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024

Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, and Yanghua Xiao. Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024. URLhttps: //arxiv.org/abs/2306.05783

work page arXiv 2024
[7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv. org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023. URLhttps://arxiv.org/abs/2211.12588

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Neural Arithmetic Logic Units

Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units, 2018. URL https://arxiv.org/abs/1808.00508

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Neural GPUs Learn Algorithms

Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms, 2016. URLhttps://arxiv. org/abs/1511.08228

work page internal anchor Pith review Pith/arXiv arXiv 2016
[12]

Neural power units, 2020

Niklas Heim, Tomáš Pevný, and Václav Šmídl. Neural power units, 2020. URL https: //arxiv.org/abs/2006.01681

work page arXiv 2020
[13]

inalu: Improved neural arithmetic logic unit,

Daniel Schlör, Markus Ring, and Andreas Hotho. inalu: Improved neural arithmetic logic unit,

work page
[14]

URL https://arxiv.org/abs/2003.07629

work page arXiv 2003
[15]

In: Bouamor, H., Pino, J., Bali, K

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 10776–10787, Singa- pore, Dec...

work page doi:10.18653/v1/2023 2023
[16]

Investigating data contamination in modern benchmarks for large language models, 2024

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models, 2024. URL https: //arxiv.org/abs/2311.09783

work page arXiv 2024
[17]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024. URL https://arxiv.org/abs/2402.15938

work page arXiv 2024
[19]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Measuring mathematical problem solving with the math dataset,

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,

work page
[22]

URL https://arxiv.org/abs/2103.03874

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

On the Measure of Intelligence

François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/ 1911.01547

work page internal anchor Pith review Pith/arXiv arXiv 2019
[25]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023

BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id= uyTL5Bvosj

work page 2023
[26]

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination- free llm benchmark, 2024. URL https://arxiv.org/abs/2406.19314

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019

work page 2019
[28]

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2020

work page 2020
[29]

Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization, 2020

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization, 2020

work page 2020
[30]

Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages, 2020

work page 2020
[31]

Tjong Kim Sang

Erik F. Tjong Kim Sang. Introduction to the conll-2002 shared task: Language-independent named entity recognition, 2002

work page 2002
[32]

David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Ala...

work page 2021
[33]

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. Bbq: A hand-built bias benchmark for question answering, 2022. URL https://arxiv.org/abs/2110.08193

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Le, Ed H

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Chal- lenging big-bench tasks and whether chain-of-thought can solve them, 2022

work page 2022
[35]

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task general- ization via natural language crowdsourcing instructions, 2022. URL https://arxiv.org/ abs/2104.08773

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830

work page internal anchor Pith review Pith/arXiv arXiv 2019
[37]

Adversarial nli: A new benchmark for natural language understanding, 2020

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https: //arxiv.org/abs/1910.14599

work page arXiv 2020
[38]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020. URL https://arxiv.org/abs/2009.13081

work page arXiv 2020
[39]

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017. URL https: //arxiv.org/abs/1705.03551

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

PIQA: Reasoning about Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/ 1911.11641

work page internal anchor Pith review Pith/arXiv arXiv 2019
[42]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/ 1907.10641

work page internal anchor Pith review Pith/arXiv arXiv 2019
[43]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018. URL https: //arxiv.org/abs/1809.02789

work page internal anchor Pith review Pith/arXiv arXiv 2018
[44]

Boolq: Exploring the surprising difficulty of natural yes/no questions,

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions,

work page
[45]

URL https://arxiv.org/abs/1905.10044

work page internal anchor Pith review Pith/arXiv arXiv 1905
[46]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge, 2019. URL https: //arxiv.org/abs/1811.00937

work page internal anchor Pith review Pith/arXiv arXiv 2019
[47]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958. 43

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

[49]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

[50]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023. URL https://arxiv.org/abs/2306.05685

[51]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911

[52]

MuSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning, 2024. URL https://arxiv.org/abs/2310.16049

[53]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Assoc...

[54]

A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing English math word problem solvers, 2021. URL https://arxiv.org/abs/2106.15772

[55]

Natural Questions: A Benchmark for Question Answering Research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

[56]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255

[57]

Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning

Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning, 2021. URL https://arxiv.org/abs/2105.04165

[58]

Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations, 2024. URL https://arxiv.org/abs/2310.20246

[59]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250

[60]

Know What You Don't Know: Unanswerable Questions for SQuAD

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822

[61]

The Penn Treebank: Annotating predicate argument structure

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994. URL https://aclanthology.org/H94-1020

[62]

Siva Reddy, Danqi Chen, and Christopher D. Manning. Coqa: A conversational question answering challenge, 2019. URL https://arxiv.org/abs/1808.07042

[63]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031

[64]

A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and evaluation framework for deeper understanding of commonsense stories, 2016. URL https://arxiv.org/abs/1604.01696

[65]

The Winograd Schema Challenge

Hector J. Levesque, Ernest Davis, and L. Morgenstern. The Winograd schema challenge. In International Conference on Principles of Knowledge Representation and Reasoning, 2012. URL https://api.semanticscholar.org/CorpusID:116068945

[66]

QuAC : Question Answering in Context

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context, 2018. URL https://arxiv.org/abs/1808.07036

[67]

RACE: Large-scale ReAding Comprehension Dataset From Examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations, 2017. URL https://arxiv.org/abs/1704.04683

[68]

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903.00161

[69]

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments, 2019. URL https://arxiv.org/abs/1805.12471

[70]

Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference, 2018. URL https://arxiv.org/abs/1704.05426

[71]

SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, 2017. doi: 10.18653/v1/s17-2001. URL http://dx.d...

[72]

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Lan...

[73]

Automatically Constructing a Corpus of Sentential Paraphrases

William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. URL https://aclanthology.org/I05-5002

[75]

URL https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

[76]

Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning

Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, and James Glass. Natural language embedded programs for hybrid language symbolic reasoning, 2024. URL https://arxiv.org/abs/2309.10814

[77]

Opencompass: A universal evaluation platform for foundation models

OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.

[78]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

[79]

A Framework for Few-Shot Language Model Evaluation

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

[80]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...


Showing first 80 references.

[1] [1]

URL https://openai.com/index/hello-gpt-4o

Hello gpt-4o, . URL https://openai.com/index/hello-gpt-4o

work page

[2] [2]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

URL https://www.llama.com/

Build the future of ai with meta llama 3, 2024. URL https://www.llama.com/

work page 2024

[5] [5]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark (published at neurips 2024 track datasets and benchmarks), 202...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024

Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, and Yanghua Xiao. Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024. URLhttps: //arxiv.org/abs/2306.05783

work page arXiv 2024

[7] [7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv. org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023. URLhttps://arxiv.org/abs/2211.12588

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Neural Arithmetic Logic Units

Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units, 2018. URL https://arxiv.org/abs/1808.00508

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

Neural GPUs Learn Algorithms

Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms, 2016. URLhttps://arxiv. org/abs/1511.08228

work page internal anchor Pith review Pith/arXiv arXiv 2016

[12] [12]

Neural power units, 2020

Niklas Heim, Tomáš Pevný, and Václav Šmídl. Neural power units, 2020. URL https: //arxiv.org/abs/2006.01681

work page arXiv 2020

[13] [13]

inalu: Improved neural arithmetic logic unit,

Daniel Schlör, Markus Ring, and Andreas Hotho. inalu: Improved neural arithmetic logic unit,

work page

[14] [14]

URL https://arxiv.org/abs/2003.07629

work page arXiv 2003

[15] [15]

In: Bouamor, H., Pino, J., Bali, K

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 10776–10787, Singa- pore, Dec...

work page doi:10.18653/v1/2023 2023

[16] [16]

Investigating data contamination in modern benchmarks for large language models, 2024

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models, 2024. URL https: //arxiv.org/abs/2311.09783

work page arXiv 2024

[17] [17]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[18] [18]

General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024. URL https://arxiv.org/abs/2402.15938

work page arXiv 2024

[19] [19]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021

[21] [21]

Measuring mathematical problem solving with the math dataset,

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,

work page

[22] [22]

URL https://arxiv.org/abs/2103.03874

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018

[24] [24]

On the Measure of Intelligence

François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/ 1911.01547

work page internal anchor Pith review Pith/arXiv arXiv 2019

[25] [25]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023

BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id= uyTL5Bvosj

work page 2023

[26] [26]

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination- free llm benchmark, 2024. URL https://arxiv.org/abs/2406.19314

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019

work page 2019

[28] [28]

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2020

work page 2020

[29] [29]

Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization, 2020

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization, 2020

work page 2020

[30] [30]

Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages, 2020

work page 2020

[31] [31]

Tjong Kim Sang

Erik F. Tjong Kim Sang. Introduction to the conll-2002 shared task: Language-independent named entity recognition, 2002

work page 2002

[32] [32]

David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Ala...

work page 2021

[33] [33]

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. Bbq: A hand-built bias benchmark for question answering, 2022. URL https://arxiv.org/abs/2110.08193

work page internal anchor Pith review Pith/arXiv arXiv 2022

[34] [34]

Le, Ed H

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Chal- lenging big-bench tasks and whether chain-of-thought can solve them, 2022

work page 2022

[35] [35]

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task general- ization via natural language crowdsourcing instructions, 2022. URL https://arxiv.org/ abs/2104.08773

work page internal anchor Pith review Pith/arXiv arXiv 2022

[36] [36]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830

work page internal anchor Pith review Pith/arXiv arXiv 2019

[37] [37]

Adversarial nli: A new benchmark for natural language understanding, 2020

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https: //arxiv.org/abs/1910.14599

work page arXiv 2020

[38] [38]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020. URL https://arxiv.org/abs/2009.13081

work page arXiv 2020

[39] [39]

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017. URL https: //arxiv.org/abs/1705.03551

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

PIQA: Reasoning about Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/ 1911.11641

work page internal anchor Pith review Pith/arXiv arXiv 2019

[42] [42]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/ 1907.10641

work page internal anchor Pith review Pith/arXiv arXiv 2019

[43] [43]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018. URL https: //arxiv.org/abs/1809.02789

work page internal anchor Pith review Pith/arXiv arXiv 2018

[44] [44]

Boolq: Exploring the surprising difficulty of natural yes/no questions,

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions,

work page

[45] [45]

URL https://arxiv.org/abs/1905.10044

work page internal anchor Pith review Pith/arXiv arXiv 1905

[46] [46]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge, 2019. URL https: //arxiv.org/abs/1811.00937

work page internal anchor Pith review Pith/arXiv arXiv 2019

[47] [47]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958. 43

work page internal anchor Pith review Pith/arXiv arXiv 2022

[48] [48]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021

[49] [49]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https: //arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Musr: Testing the limits of chain-of-thought with multistep soft reasoning, 2024

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning, 2024. URL https://arxiv.org/ abs/2310.16049

work page arXiv 2024

[53] [53]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Assoc...

work page doi:10.18653/v1/ 2021

[54] [54]

arXiv preprint arXiv:2106.15772

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers, 2021. URL https://arxiv.org/abs/ 2106.15772

work page arXiv 2021

[55] [55]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

work page 2019

[56] [56]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/ 2310.02255

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning, 2021

Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning, 2021. URL https://arxiv.org/abs/2105.04165

work page arXiv 2021

[58] [58]

Breaking language barriers in multilingual mathematical reasoning: Insights and observations, 2024

Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations, 2024. URLhttps://arxiv.org/abs/2310.20246

work page arXiv 2024

[59] [59]

Squad: 100,000+ questions for machine comprehension of text, 2016

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606. 05250

work page 2016

[60] [60]

Know What You Don't Know: Unanswerable Questions for SQuAD

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822

work page internal anchor Pith review Pith/arXiv arXiv 2018

[61] [61]

The Penn Treebank: Annotating predicate argument structure

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at 44 Plainsboro, New Jersey, March 8-11, 1994 , 1994. URL https://aclanthology.org/ H94-1020

work page 1994

[62] [62]

Siva Reddy, Danqi Chen, and Christopher D. Manning. Coqa: A conversational question answering challenge, 2019. URL https://arxiv.org/abs/1808.07042

work page internal anchor Pith review Pith/arXiv arXiv 2019

[63] [63]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context, 2016. URL https: //arxiv.org/abs/1606.06031

work page internal anchor Pith review Pith/arXiv arXiv 2016

[64] [64]

A corpus and evaluation framework for deeper understanding of commonsense stories, 2016

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and evaluation framework for deeper understanding of commonsense stories, 2016. URL https://arxiv.org/abs/1604. 01696

work page 2016

[65] [65]

Levesque, Ernest Davis, and L

Hector J. Levesque, Ernest Davis, and L. Morgenstern. The winograd schema challenge. In International Conference on Principles of Knowledge Representation and Reasoning , 2012. URLhttps://api.semanticscholar.org/CorpusID:116068945

work page 2012

[66] [66]

QuAC : Question Answering in Context

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac : Question answering in context, 2018. URL https://arxiv.org/ abs/1808.07036

work page internal anchor Pith review Pith/arXiv arXiv 2018

[67] [67]

RACE: Large-scale ReAding Comprehension Dataset From Examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations, 2017. URL https://arxiv.org/abs/ 1704.04683

[68]

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903.00161

[69]

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments, 2019. URL https://arxiv.org/abs/1805.12471

[70]

Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference, 2018. URL https://arxiv.org/abs/1704.05426

[71]

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, 2017. doi: 10.18653/v1/s17-2001. URL http://dx.d...

[72]

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Lan...

[73]

William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. URL https://aclanthology.org/I05-5002

[75]

URL https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

[76]

Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, and James Glass. Natural language embedded programs for hybrid language symbolic reasoning, 2024. URL https://arxiv.org/abs/2309.10814

[77]

OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.

[78]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

[79]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

[80]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...
