arxiv: 2504.16155 · v2 · submitted 2025-04-22 · 💻 cs.NE

PRIMETIME: Limits of LLMs in Temporal Primitives

Pith reviewed 2026-05-22 18:30 UTC · model grok-4.3

classification 💻 cs.NE
keywords temporal reasoning · LLMs · synthetic generator · datetime parsing · arithmetic · fine-tuning · event planning

The pith

Synthetic generator exposes unreliability of LLM datetime primitives but enables their learning through fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to show that LLMs have unreliable performance on the basic building blocks of temporal reasoning—specifically parsing datetimes and performing arithmetic on them—because existing tests mix these with other skills. The PRIMETIME generator creates pure examples for testing each primitive alone, uncovering wide variations in model accuracy. By reusing the generator to produce fine-tuning examples, the authors demonstrate that these primitives can be taught to small models, leading to strong results on more complex tasks such as planning events. A sympathetic reader would care because this provides a concrete method to both identify and correct specific LLM limitations in handling time without enormous computational costs. The broader implication is a reusable pattern for improving model capabilities on targeted skills.

Core claim

The authors claim that temporal primitives are individually unreliable in LLMs, with accuracy varying greatly depending on the model and prompt. Using their PRIMETIME synthetic generator to create isolated test cases, they show these issues clearly. The same generator then supplies training data that makes the primitives fully learnable, allowing small quantized LoRA transformers to attain frontier-level accuracy on composed Event Planning tasks.

What carries the argument

PRIMETIME, the synthetic generator that delivers non-conflated datetime exemplars for decompositional evaluation and fine-tuning of parsing and arithmetic primitives.
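
To make the idea of non-conflated exemplars concrete, here is a minimal Python sketch of a generator that emits one exemplar per primitive. It is not the PRIMETIME code: the function names, the single natural-language format, the dictionary layout, and the default year range are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta

# Month names are spelled out so the output does not depend on the system locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def random_datetime(rng, min_year=1900, max_year=2100):
    """Sample a datetime uniformly at one-second resolution.

    The year range is an illustrative assumption, not the paper's setting.
    """
    start = datetime(min_year, 1, 1)
    end = datetime(max_year + 1, 1, 1)
    total_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randrange(total_seconds))


def natural_form(dt):
    """Render one hypothetical natural form, e.g. 'March 5, 2031 at 14:07:09'."""
    return f"{MONTHS[dt.month - 1]} {dt.day}, {dt.year} at {dt:%H:%M:%S}"


def parsing_exemplar(rng):
    """Parsing only: natural form in, ISO-8601 out, no arithmetic involved."""
    dt = random_datetime(rng)
    return {"task": "parse", "input": natural_form(dt), "target": dt.isoformat()}


def arithmetic_exemplar(rng, days=20):
    """Arithmetic only: ISO-8601 in and out, so no parsing ambiguity."""
    dt = random_datetime(rng)
    return {
        "task": "add_days",
        "input": dt.isoformat(),
        "days": days,
        "target": (dt + timedelta(days=days)).isoformat(),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # a fixed seed makes the sample reproducible
    print(parsing_exemplar(rng))
    print(arithmetic_exemplar(rng))
```

Passing an explicit `random.Random` instance keeps benchmark and training samples reproducible and lets them be drawn from different seeds, which is one simple way to approximate the separation between diagnostic and fine-tuning data described here.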

If this is right

  • Individual primitives show unreliable performance ranging from near-zero to perfect accuracy.
  • Primitives are fully learnable when trained on data from the generator.
  • Composed Event Planning reaches frontier-level accuracy with small quantized LoRA transformers.
  • The generator supports both diagnosis and production-ready improvements in one system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach of synthetic diagnosis paired with remediation may generalize to other domains where LLMs show inconsistent reasoning.
  • Practitioners could use similar generators to create custom training sets for domain-specific temporal tasks.
  • Future work might explore whether these improvements hold when models are tested on naturally occurring temporal data from the web or documents.

Load-bearing premise

The synthetic generator produces non-conflated, uncontaminated datetime exemplars that accurately isolate parsing and arithmetic without introducing artifacts that would not appear in real-world temporal data or tasks.

What would settle it

Fine-tuned models failing to outperform base models on a held-out set of real-world temporal reasoning examples not created by the generator would falsify the general learnability of the primitives.

Figures

Figures reproduced from arXiv: 2504.16155 by Edward Gaere, Florian Wangenheim.

Figure 1. The Translation ISO-8601 task requires the translation of a datetime from its natural representation to its ISO-8601 representation. Diagram labels: ISO-8601 representation 3299-01-28T00:43:15; add 20 days; ISO-8601 representation with 20 days added 3299-02-17T00:43:15.
Figure 2. The Computation Add-20 task requires adding 20 days to a datetime provided in ISO-8601 representation, and producing a new ISO-8601 datetime with the result. The datasets for all tasks are synthetically generated. This allows for accurate labels with no noise, an arbitrary number of observations at low cost, and the datasets can be frequently regenerated to reduce contamination. A detailed description of t…
Figure 3. Prompt, system prompt, output and evaluation for the …
Figure 4. The use of the ISO-8601 format for the input sequences, and not the natural representation, specifically allows measuring the reasoning ability of the model, because no translation is required: the components of an ISO-8601 datetime can be easily and unambiguously extracted. The DATETIME benchmark also includes Mixed tasks that require the same datetime arithmetic, but the input format is a natural represent…
Figure 5. Prompt, system prompt, output and evaluation for the …
Figure 6. Required Python steps to resolve the Add-250 task. A solution requires converting the date …
Figure 7. Underlying CPython function calls. Source available code on GitHub
Figure 1. Prompt, output and evaluation of the Year Translation task. Note that the comparison of output and target sequences is integer-wise, i.e. '01' is the same as '1'. Prompt parameters: max tokens: 50; temperature: 0.0.
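
The Add-20 values shown in the figure material (3299-01-28T00:43:15 plus 20 days) can be checked with Python's standard `datetime` module. The sketch below is a reference computation only; the helper names `add_days_iso` and `same_integer` are hypothetical, and `same_integer` is just one plausible reading of the integer-wise comparison mentioned in the Year Translation caption.

```python
from datetime import datetime, timedelta


def add_days_iso(iso_string, days):
    """Parse an ISO-8601 datetime, add whole days, and return ISO-8601 again."""
    dt = datetime.fromisoformat(iso_string)
    return (dt + timedelta(days=days)).isoformat()


def same_integer(output, target):
    """Integer-wise comparison, so '01' and '1' count as the same answer."""
    return int(output) == int(target)


if __name__ == "__main__":
    # January has 31 days, so 28 January + 20 days lands on 17 February.
    print(add_days_iso("3299-01-28T00:43:15", 20))  # 3299-02-17T00:43:15
    print(same_integer("01", "1"))                  # True
```

A ground-truth function of this kind is what makes synthetic labels noise-free: the target is computed, not annotated.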
Original abstract

This paper introduces PRIMETIME, a synthetic generator that supports both benchmarking and fine-tuning of two primitive operations underlying temporal reasoning in Large Language Models (LLMs): parsing and arithmetic on datetimes. Existing temporal benchmarks assume simplified canonical datetime forms, conflate arithmetic, composition, and world knowledge into a single aggregate score, and offer no direct path to remediation. The first contribution is methodological: the PRIMETIME synthetic generator delivers non-conflated, uncontaminated, and unlimited datetime exemplars that enable a decompositional evaluation strategy for each primitive in isolation. The generator is extensible to support complex datetime tasks and is publicly released, alongside generated benchmarks. The second contribution is diagnostic: under this evaluation strategy, the primitives themselves prove individually unreliable, with per-primitive accuracy ranging from near-zero to perfect across models and prompting conditions. The third contribution is constructive: the same generator used for diagnosis also produces new training exemplars for fine-tuning, and the resulting models show that the primitives are fully learnable and the composed Event Planning task reaches frontier-level accuracy using small quantized LoRA transformers. The broader takeaway is that a single synthetic generator can serve both diagnosis and production-ready deployment. This methodological pattern may apply beyond temporal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PRIMETIME, a synthetic generator for producing unlimited, non-conflated datetime exemplars that isolate two temporal primitives—parsing and arithmetic—without conflating them with composition or world knowledge. It uses the generator for a decompositional benchmark showing per-primitive LLM accuracies ranging from near-zero to perfect across models and prompts, then re-uses the generator to create fine-tuning data. Small quantized LoRA transformers trained on this data master the primitives and reach frontier-level accuracy on a composed Event Planning task. The generator and benchmarks are publicly released.

Significance. If the central claims hold, the work demonstrates that a single extensible synthetic generator can serve dual purposes of diagnosis and targeted remediation for temporal reasoning weaknesses in LLMs. The public release of the generator and benchmarks is a clear strength supporting reproducibility. The approach of isolating primitives and showing they are learnable via small-scale fine-tuning offers a practical alternative to aggregate benchmarks that mix multiple skills.

major comments (3)
  1. [§3] §3 (Generator Design): The assertion that PRIMETIME produces 'non-conflated, uncontaminated' exemplars that cleanly isolate parsing and arithmetic lacks a formal specification of the sampling procedure for year ranges, timezone coverage, format variations, and edge-case handling. Without explicit rules or validation against real-world distributions, it is unclear whether the reported near-zero to perfect accuracy ranges and subsequent LoRA fine-tuning success reflect genuine primitive acquisition or generator-specific statistical shortcuts.
  2. [§4.2] §4.2 (Diagnostic Results): The per-primitive accuracy claims (near-zero to perfect) are presented without error bars, number of evaluation runs, or statistical tests for the extremes. This makes it difficult to determine whether the unreliability diagnosis is robust or sensitive to prompt variations and model-specific behaviors.
  3. [§5] §5 (Fine-Tuning Experiments): The claim that the composed Event Planning task reaches 'frontier-level accuracy' using small quantized LoRA transformers is not accompanied by direct comparisons to strong baselines (e.g., larger models without primitive-specific fine-tuning or models trained on real temporal corpora). This weakens the constructive contribution that the primitives are 'fully learnable' in a general sense.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction would benefit from explicit citations to prior temporal benchmarks (e.g., TimeQA, TempReason) to better situate the decompositional strategy.
  2. [Figures in §4] Figure captions for benchmark results should include the exact number of test exemplars per primitive and the prompting conditions used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment in turn below, clarifying our design choices and indicating revisions that will strengthen the presentation without altering the core claims.

  1. Referee: [§3] §3 (Generator Design): The assertion that PRIMETIME produces 'non-conflated, uncontaminated' exemplars that cleanly isolate parsing and arithmetic lacks a formal specification of the sampling procedure for year ranges, timezone coverage, format variations, and edge-case handling. Without explicit rules or validation against real-world distributions, it is unclear whether the reported near-zero to perfect accuracy ranges and subsequent LoRA fine-tuning success reflect genuine primitive acquisition or generator-specific statistical shortcuts.

    Authors: We agree that §3 would benefit from greater formalization. The generator isolates primitives by construction: parsing exemplars contain only a datetime string and a target representation with no arithmetic, while arithmetic exemplars provide two datetimes and an operation with no parsing ambiguity. In the revision we will add an explicit algorithmic description of the sampling procedure, including uniform sampling over year ranges 1900–2100, coverage of UTC plus common offsets, controlled format variations (ISO, natural-language, abbreviated), and explicit handling of edge cases (leap seconds, DST transitions, month boundaries). We will also report summary statistics comparing generated distributions to real-world datetime corpora to confirm absence of unintended shortcuts. These additions will make the isolation claim fully verifiable. revision: yes

  2. Referee: [§4.2] §4.2 (Diagnostic Results): The per-primitive accuracy claims (near-zero to perfect) are presented without error bars, number of evaluation runs, or statistical tests for the extremes. This makes it difficult to determine whether the unreliability diagnosis is robust or sensitive to prompt variations and model-specific behaviors.

    Authors: The referee correctly notes the absence of variability measures. In the revised manuscript we will report all per-primitive accuracies as means over five independent evaluation runs with different random seeds, include standard deviations as error bars, and apply paired statistical tests (McNemar or Wilcoxon) to confirm that the observed near-zero versus near-perfect differences across models and prompts are significant. This will directly address concerns about robustness to prompt and model variation. revision: yes

  3. Referee: [§5] §5 (Fine-Tuning Experiments): The claim that the composed Event Planning task reaches 'frontier-level accuracy' using small quantized LoRA transformers is not accompanied by direct comparisons to strong baselines (e.g., larger models without primitive-specific fine-tuning or models trained on real temporal corpora). This weakens the constructive contribution that the primitives are 'fully learnable' in a general sense.

    Authors: We maintain that the primary constructive result is that primitive-specific synthetic data enables small LoRA adapters to reach frontier-level performance on the composed task, which is a practical and resource-efficient finding. Nevertheless, we acknowledge the value of additional context. In revision we will add a baseline comparison against the same base models without primitive fine-tuning (showing the lift attributable to PRIMETIME data) and, where feasible, against publicly available models fine-tuned on general temporal corpora. These comparisons will be presented as supplementary evidence rather than exhaustive benchmarking, preserving the paper’s focus on the diagnostic-to-remediation pipeline. revision: partial
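
The variability reporting promised in the second response (means over several runs, standard deviations, and a paired test such as McNemar) can be sketched with the Python standard library alone. This is an illustrative assumption about how such an analysis might look, not the authors' code; the function names and the choice of the exact two-sided McNemar variant are hypothetical.

```python
from math import comb
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    return mean(accuracies), stdev(accuracies)


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired per-example correctness flags.

    Only discordant pairs matter: examples that exactly one of the two
    systems answers correctly.
    """
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(summarize_runs([1.0, 0.0]))
    model_a = [True] * 10
    model_b = [False] * 10
    print(mcnemar_exact(model_a, model_b))  # 2 / 2**10 = 0.001953125
```

The exact form is used here because the number of discordant pairs can be small when two models are both near zero or both near perfect, where a chi-squared approximation is less reliable.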

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation introduces an independent synthetic generator for datetime exemplars, applies it to produce diagnostic benchmarks that reveal per-primitive unreliability, and then uses separate exemplars from the same generator for fine-tuning to demonstrate learnability and improved Event Planning performance. This chain does not reduce any result to its inputs by construction: the diagnostic accuracies and post-fine-tuning gains are empirical measurements on held-out data rather than tautological re-statements of the generator's sampling rules or fitted parameters. No equations, self-citations, uniqueness theorems, or ansatzes are shown to create a load-bearing loop, and the generator functions as an external artifact whose outputs serve as falsifiable test cases. The overall methodology remains self-contained against the produced benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the generator being able to produce clean, representative examples; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Synthetic datetime examples can be generated without contamination or conflation of parsing, arithmetic, and world knowledge.
    This premise is required for the decompositional evaluation strategy to isolate the two primitives as claimed.

pith-pipeline@v0.9.0 · 5739 in / 1288 out tokens · 84610 ms · 2026-05-22T18:30:48.171280+00:00 · methodology


Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · 54 internal anchors

  1. [1]

    URL https://openai.com/index/hello-gpt-4o

    Hello gpt-4o, . URL https://openai.com/index/hello-gpt-4o

  2. [2]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

  3. [3]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  4. [4]

    URL https://www.llama.com/

    Build the future of ai with meta llama 3, 2024. URL https://www.llama.com/

  5. [5]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark (published at neurips 2024 track datasets and benchmarks), 202...

  6. [6]

    Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024

    Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, and Yanghua Xiao. Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024. URLhttps: //arxiv.org/abs/2306.05783

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv. org/abs/2110.14168

  8. [8]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

  9. [9]

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023. URLhttps://arxiv.org/abs/2211.12588

  10. [10]

    Neural Arithmetic Logic Units

    Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units, 2018. URL https://arxiv.org/abs/1808.00508

  11. [11]

    Neural GPUs Learn Algorithms

    Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms, 2016. URLhttps://arxiv. org/abs/1511.08228

  12. [12]

    Neural power units, 2020

    Niklas Heim, Tomáš Pevný, and Václav Šmídl. Neural power units, 2020. URL https: //arxiv.org/abs/2006.01681

  13. [13]

    inalu: Improved neural arithmetic logic unit,

    Daniel Schlör, Markus Ring, and Andreas Hotho. inalu: Improved neural arithmetic logic unit,

  14. [14]

    URL https://arxiv.org/abs/2003.07629

  15. [15]

    In: Bouamor, H., Pino, J., Bali, K

    Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 10776–10787, Singa- pore, Dec...

  16. [16]

    Investigating data contamination in modern benchmarks for large language models, 2024

    Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models, 2024. URL https: //arxiv.org/abs/2311.09783

  17. [17]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  18. [18]

    General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024

    Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024. URL https://arxiv.org/abs/2402.15938

  19. [19]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132

  20. [20]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

  21. [21]

    Measuring mathematical problem solving with the math dataset,

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,

  22. [22]

    URL https://arxiv.org/abs/2103.03874

  23. [23]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457

  24. [24]

    On the Measure of Intelligence

    François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/ 1911.01547

  25. [25]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023

    BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id= uyTL5Bvosj

  26. [26]

    LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination- free llm benchmark, 2024. URL https://arxiv.org/abs/2406.19314

  27. [27]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019

  28. [28]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2020

  29. [29]

    Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization, 2020

    Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization, 2020

  30. [30]

    Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki

    Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages, 2020

  31. [31]

    Tjong Kim Sang

    Erik F. Tjong Kim Sang. Introduction to the conll-2002 shared task: Language-independent named entity recognition, 2002

  32. [32]

    David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Ala...

  33. [33]

    Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. Bbq: A hand-built bias benchmark for question answering, 2022. URL https://arxiv.org/abs/2110.08193

  34. [34]

    Le, Ed H

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Chal- lenging big-bench tasks and whether chain-of-thought can solve them, 2022

  35. [35]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task general- ization via natural language crowdsourcing instructions, 2022. URL https://arxiv.org/ abs/2104.08773

  36. [36]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830

  37. [37]

    Adversarial nli: A new benchmark for natural language understanding, 2020

    Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https: //arxiv.org/abs/1910.14599

  38. [38]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020. URL https://arxiv.org/abs/2009.13081

  39. [39]

    AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

  40. [40]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017. URL https: //arxiv.org/abs/1705.03551

  41. [41]

    PIQA: Reasoning about Physical Commonsense in Natural Language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/ 1911.11641

  42. [42]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/ 1907.10641

  43. [43]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018. URL https: //arxiv.org/abs/1809.02789

  44. [44]

    Boolq: Exploring the surprising difficulty of natural yes/no questions,

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions,

  45. [45]

    URL https://arxiv.org/abs/1905.10044

  46. [46]

    CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge, 2019. URL https: //arxiv.org/abs/1811.00937

  47. [47]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958. 43

  48. [48]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

  49. [49]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

  50. [50]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https: //arxiv.org/abs/2306.05685

  51. [51]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911

  52. [52]

    Musr: Testing the limits of chain-of-thought with multistep soft reasoning, 2024

    Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning, 2024. URL https://arxiv.org/ abs/2310.16049

  53. [53]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Assoc...

  54. [54]

    arXiv preprint arXiv:2106.15772

    Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers, 2021. URL https://arxiv.org/abs/ 2106.15772

  55. [55]

    Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  56. [56]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255

  57. [57]

    Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning, 2021. URL https://arxiv.org/abs/2105.04165

  58. [58]

    Breaking language barriers in multilingual mathematical reasoning: Insights and observations

    Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations, 2024. URL https://arxiv.org/abs/2310.20246

  59. [59]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250

  60. [60]

    Know What You Don't Know: Unanswerable Questions for SQuAD

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822

  61. [61]

    The Penn Treebank: Annotating predicate argument structure

    Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994. URL https://aclanthology.org/H94-1020

  62. [62]

    Siva Reddy, Danqi Chen, and Christopher D. Manning. Coqa: A conversational question answering challenge, 2019. URL https://arxiv.org/abs/1808.07042

  63. [63]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031

  64. [64]

    A corpus and evaluation framework for deeper understanding of commonsense stories

    Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and evaluation framework for deeper understanding of commonsense stories, 2016. URL https://arxiv.org/abs/1604.01696

  65. [65]

    The Winograd Schema Challenge

    Hector J. Levesque, Ernest Davis, and L. Morgenstern. The winograd schema challenge. In International Conference on Principles of Knowledge Representation and Reasoning, 2012. URL https://api.semanticscholar.org/CorpusID:116068945

  66. [66]

    QuAC : Question Answering in Context

    Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac : Question answering in context, 2018. URL https://arxiv.org/abs/1808.07036

  67. [67]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations, 2017. URL https://arxiv.org/abs/1704.04683

  68. [68]

    DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903.00161

  69. [69]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments, 2019. URL https://arxiv.org/abs/1805.12471

  70. [70]

    Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference, 2018. URL https://arxiv.org/abs/1704.05426

  71. [71]

    SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, 2017. doi: 10.18653/v1/s17-2001. URL http://dx.d...

  72. [72]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Lan...

  73. [73]

    Automatically constructing a corpus of sentential paraphrases

    William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005). URL https://aclanthology.org/I05-5002

  75. [75]

    URL https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

  76. [76]

    Natural language embedded programs for hybrid language symbolic reasoning

    Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, and James Glass. Natural language embedded programs for hybrid language symbolic reasoning, 2024. URL https://arxiv.org/abs/2309.10814

  77. [77]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.

  78. [78]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  79. [79]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

  80. [80]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...
