PRIMETIME: Limits of LLMs in Temporal Primitives
Pith reviewed 2026-05-22 18:30 UTC · model grok-4.3
The pith
Synthetic generator exposes unreliability of LLM datetime primitives but enables their learning through fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that temporal primitives are individually unreliable in LLMs, with accuracy varying greatly depending on the model and prompt. Using their PRIMETIME synthetic generator to create isolated test cases, they show these issues clearly. The same generator then supplies training data that makes the primitives fully learnable, allowing small quantized LoRA transformers to attain frontier-level accuracy on composed Event Planning tasks.
What carries the argument
PRIMETIME, the synthetic generator that delivers non-conflated datetime exemplars for decompositional evaluation and fine-tuning of parsing and arithmetic primitives.
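To make the idea of non-conflated exemplars concrete, here is a minimal sketch of how a generator might keep the two primitives apart. It is not the released PRIMETIME code: the function names, the three surface formats, and the 1900 to 2100 year range are all assumptions chosen for illustration. Parsing exemplars vary only the surface form, while arithmetic exemplars use canonical ISO 8601 inputs so that no parsing ambiguity is involved.

```python
import random
from datetime import datetime, timedelta

# Hypothetical surface formats (assumption); each keeps seconds so the
# canonical target is fully recoverable from the input string.
FORMATS = ["%Y-%m-%dT%H:%M:%S", "%d %B %Y %H:%M:%S", "%b %d, %Y %I:%M:%S %p"]


def sample_datetime(rng, start_year=1900, end_year=2100):
    """Draw a naive datetime uniformly at one-second resolution (assumed range)."""
    start = datetime(start_year, 1, 1)
    end = datetime(end_year, 12, 31, 23, 59, 59)
    span = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randint(0, span))


def parsing_exemplar(rng):
    """Parsing only: a surface string and its canonical ISO form, no arithmetic."""
    dt = sample_datetime(rng)
    fmt = rng.choice(FORMATS)
    return {"task": "parse", "format": fmt,
            "input": dt.strftime(fmt), "target": dt.isoformat()}


def arithmetic_exemplar(rng):
    """Arithmetic only: canonical ISO input plus an explicit offset."""
    dt = sample_datetime(rng)
    days, hours = rng.randint(-3650, 3650), rng.randint(0, 23)
    target = dt + timedelta(days=days, hours=hours)
    return {"task": "add", "start": dt.isoformat(),
            "delta_days": days, "delta_hours": hours,
            "target": target.isoformat()}
```

Because the random source is passed in explicitly, a seeded `random.Random` makes any generated set reproducible, which is convenient when the same generator feeds both evaluation and fine-tuning.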
If this is right
- Individual primitives show unreliable performance ranging from near-zero to perfect accuracy.
- Primitives are fully learnable when trained on data from the generator.
- Composed Event Planning reaches frontier-level accuracy with small quantized LoRA transformers.
- The generator supports both diagnosis and production-ready improvements in one system.
Where Pith is reading between the lines
- The approach of synthetic diagnosis paired with remediation may generalize to other domains where LLMs show inconsistent reasoning.
- Practitioners could use similar generators to create custom training sets for domain-specific temporal tasks.
- Future work might explore whether these improvements hold when models are tested on naturally occurring temporal data from the web or documents.
Load-bearing premise
The synthetic generator produces non-conflated, uncontaminated datetime exemplars that accurately isolate parsing and arithmetic without introducing artifacts that would not appear in real-world temporal data or tasks.
What would settle it
Fine-tuned models failing to outperform base models on a held-out set of real-world temporal reasoning examples not created by the generator would falsify the general learnability of the primitives.
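The test described above can be sketched as a comparison of exact-match accuracy on a shared held-out set. This is an illustrative sketch only: the helper names are hypothetical, answers are assumed to be ISO 8601 strings, and a raw accuracy gap says nothing about sampling noise, so a real analysis would pair it with a significance test.

```python
from datetime import datetime


def normalize(answer):
    """Canonicalize an ISO 8601 answer; return None if it does not parse."""
    try:
        return datetime.fromisoformat(answer.strip())
    except ValueError:
        return None


def accuracy(predictions, gold):
    """Exact-match accuracy after canonicalization; unparseable answers count as wrong."""
    hits = 0
    for pred, ref in zip(predictions, gold):
        parsed = normalize(pred)
        if parsed is not None and parsed == normalize(ref):
            hits += 1
    return hits / len(gold)


def falsified(base_preds, tuned_preds, gold):
    """True when the fine-tuned model fails to beat the base model on held-out data."""
    return accuracy(tuned_preds, gold) <= accuracy(base_preds, gold)
```

Canonicalizing before comparison keeps a model from being penalized for writing a space instead of `T` between date and time, which would otherwise mix a formatting effect into the arithmetic score.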
Original abstract
This paper introduces PRIMETIME, a synthetic generator that supports both benchmarking and fine-tuning of two primitive operations underlying temporal reasoning in Large Language Models (LLMs): parsing and arithmetic on datetimes. Existing temporal benchmarks assume simplified canonical datetime forms, conflate arithmetic, composition, and world knowledge into a single aggregate score, and offer no direct path to remediation. The first contribution is methodological: the PRIMETIME synthetic generator delivers non-conflated, uncontaminated, and unlimited datetime exemplars that enable a decompositional evaluation strategy for each primitive in isolation. The generator is extensible to support complex datetime tasks and is publicly released, alongside generated benchmarks. The second contribution is diagnostic: under this evaluation strategy, the primitives themselves prove individually unreliable, with per-primitive accuracy ranging from near-zero to perfect across models and prompting conditions. The third contribution is constructive: the same generator used for diagnosis also produces new training exemplars for fine-tuning, and the resulting models show that the primitives are fully learnable and the composed Event Planning task reaches frontier-level accuracy using small quantized LoRA transformers. The broader takeaway is that a single synthetic generator can serve both diagnosis and production-ready deployment. This methodological pattern may apply beyond temporal reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PRIMETIME, a synthetic generator for producing unlimited, non-conflated datetime exemplars that isolate two temporal primitives—parsing and arithmetic—without conflating them with composition or world knowledge. It uses the generator for a decompositional benchmark showing per-primitive LLM accuracies ranging from near-zero to perfect across models and prompts, then re-uses the generator to create fine-tuning data. Small quantized LoRA transformers trained on this data master the primitives and reach frontier-level accuracy on a composed Event Planning task. The generator and benchmarks are publicly released.
Significance. If the central claims hold, the work demonstrates that a single extensible synthetic generator can serve dual purposes of diagnosis and targeted remediation for temporal reasoning weaknesses in LLMs. The public release of the generator and benchmarks is a clear strength supporting reproducibility. The approach of isolating primitives and showing they are learnable via small-scale fine-tuning offers a practical alternative to aggregate benchmarks that mix multiple skills.
major comments (3)
- [§3] §3 (Generator Design): The assertion that PRIMETIME produces 'non-conflated, uncontaminated' exemplars that cleanly isolate parsing and arithmetic lacks a formal specification of the sampling procedure for year ranges, timezone coverage, format variations, and edge-case handling. Without explicit rules or validation against real-world distributions, it is unclear whether the reported near-zero to perfect accuracy ranges and subsequent LoRA fine-tuning success reflect genuine primitive acquisition or generator-specific statistical shortcuts.
- [§4.2] §4.2 (Diagnostic Results): The per-primitive accuracy claims (near-zero to perfect) are presented without error bars, number of evaluation runs, or statistical tests for the extremes. This makes it difficult to determine whether the unreliability diagnosis is robust or sensitive to prompt variations and model-specific behaviors.
- [§5] §5 (Fine-Tuning Experiments): The claim that the composed Event Planning task reaches 'frontier-level accuracy' using small quantized LoRA transformers is not accompanied by direct comparisons to strong baselines (e.g., larger models without primitive-specific fine-tuning or models trained on real temporal corpora). This weakens the constructive contribution that the primitives are 'fully learnable' in a general sense.
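The first comment's concern about edge-case handling can be made concrete with a small set of boundary probes. The sketch below is an assumption about what such probes could look like, not a description of the paper's generator: `boundary_probes` and its default years are hypothetical. It covers only leap-day and year-end rollovers, since Python's naive `datetime` does not represent leap seconds and DST transitions would require timezone-aware values.

```python
import calendar
from datetime import datetime, timedelta


def boundary_probes(years=(1900, 2000, 2023, 2024, 2100)):
    """Return arithmetic probes whose one-second step crosses a calendar boundary."""
    probes = []
    for year in years:
        # Last second of 28 February: leap years roll to 29 Feb, others to 1 Mar.
        # Last second of the year: rolls into 1 January of the next year.
        for month, day in ((2, 28), (12, 31)):
            start = datetime(year, month, day, 23, 59, 59)
            probes.append({
                "start": start.isoformat(),
                "delta_seconds": 1,
                "expected": (start + timedelta(seconds=1)).isoformat(),
                "leap_year": calendar.isleap(year),
            })
    return probes
```

The century years are included deliberately: 1900 and 2100 are not leap years while 2000 is, so a model that has only learned the divisible-by-four rule would fail on two of the three.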
minor comments (2)
- [Abstract / §1] The abstract and introduction would benefit from explicit citations to prior temporal benchmarks (e.g., TimeQA, TempReason) to better situate the decompositional strategy.
- [Figures in §4] Figure captions for benchmark results should include the exact number of test exemplars per primitive and the prompting conditions used.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment in turn below, clarifying our design choices and indicating revisions that will strengthen the presentation without altering the core claims.
Referee: [§3] §3 (Generator Design): The assertion that PRIMETIME produces 'non-conflated, uncontaminated' exemplars that cleanly isolate parsing and arithmetic lacks a formal specification of the sampling procedure for year ranges, timezone coverage, format variations, and edge-case handling. Without explicit rules or validation against real-world distributions, it is unclear whether the reported near-zero to perfect accuracy ranges and subsequent LoRA fine-tuning success reflect genuine primitive acquisition or generator-specific statistical shortcuts.
Authors: We agree that §3 would benefit from greater formalization. The generator isolates primitives by construction: parsing exemplars contain only a datetime string and a target representation with no arithmetic, while arithmetic exemplars provide two datetimes and an operation with no parsing ambiguity. In the revision we will add an explicit algorithmic description of the sampling procedure, including uniform sampling over year ranges 1900–2100, coverage of UTC plus common offsets, controlled format variations (ISO, natural-language, abbreviated), and explicit handling of edge cases (leap seconds, DST transitions, month boundaries). We will also report summary statistics comparing generated distributions to real-world datetime corpora to confirm absence of unintended shortcuts. These additions will make the isolation claim fully verifiable. revision: yes
Referee: [§4.2] §4.2 (Diagnostic Results): The per-primitive accuracy claims (near-zero to perfect) are presented without error bars, number of evaluation runs, or statistical tests for the extremes. This makes it difficult to determine whether the unreliability diagnosis is robust or sensitive to prompt variations and model-specific behaviors.
Authors: The referee correctly notes the absence of variability measures. In the revised manuscript we will report all per-primitive accuracies as means over five independent evaluation runs with different random seeds, include standard deviations as error bars, and apply paired statistical tests (McNemar or Wilcoxon) to confirm that the observed near-zero versus near-perfect differences across models and prompts are significant. This will directly address concerns about robustness to prompt and model variation. revision: yes
Referee: [§5] §5 (Fine-Tuning Experiments): The claim that the composed Event Planning task reaches 'frontier-level accuracy' using small quantized LoRA transformers is not accompanied by direct comparisons to strong baselines (e.g., larger models without primitive-specific fine-tuning or models trained on real temporal corpora). This weakens the constructive contribution that the primitives are 'fully learnable' in a general sense.
Authors: We maintain that the primary constructive result is that primitive-specific synthetic data enables small LoRA adapters to reach frontier-level performance on the composed task, which is a practical and resource-efficient finding. Nevertheless, we acknowledge the value of additional context. In revision we will add a baseline comparison against the same base models without primitive fine-tuning (showing the lift attributable to PRIMETIME data) and, where feasible, against publicly available models fine-tuned on general temporal corpora. These comparisons will be presented as supplementary evidence rather than exhaustive benchmarking, preserving the paper’s focus on the diagnostic-to-remediation pipeline. revision: partial
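The variability reporting promised in the second response can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' analysis code: `summarize` and `mcnemar_exact` are hypothetical names, and the exact two-sided binomial form of McNemar's test is used here in place of any particular statistics package.

```python
from math import comb
from statistics import mean, stdev


def summarize(run_accuracies):
    """Mean and sample standard deviation of accuracy across evaluation runs."""
    return mean(run_accuracies), stdev(run_accuracies)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired per-item correctness flags."""
    # Only discordant pairs carry information about a difference between systems.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    # Under the null hypothesis each discordant pair favours either system
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

McNemar's test suits this setting because both conditions are scored on the same exemplars, so items that both systems get right or both get wrong drop out of the comparison.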
Circularity Check
No significant circularity detected
The paper's derivation introduces an independent synthetic generator for datetime exemplars, applies it to produce diagnostic benchmarks that reveal per-primitive unreliability, and then uses separate exemplars from the same generator for fine-tuning to demonstrate learnability and improved Event Planning performance. This chain does not reduce any result to its inputs by construction: the diagnostic accuracies and post-fine-tuning gains are empirical measurements on held-out data rather than tautological re-statements of the generator's sampling rules or fitted parameters. No equations, self-citations, uniqueness theorems, or ansatzes are shown to create a load-bearing loop, and the generator functions as an external artifact whose outputs serve as falsifiable test cases. The overall methodology remains self-contained against the produced benchmarks.
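The rationale depends on the evaluation exemplars being disjoint from the fine-tuning exemplars even though both come from one generator. The page does not say how that separation is enforced, so the following is only a sketch of one possible guard: `toy_exemplars` is a hypothetical stand-in for the real generator, and filtering on exact string identity is an assumption.

```python
import random
from datetime import datetime, timedelta


def toy_exemplars(seed, n):
    """Stand-in generator (hypothetical): n ISO datetime strings from a seeded RNG."""
    rng = random.Random(seed)
    base = datetime(2000, 1, 1)
    return [(base + timedelta(minutes=rng.randrange(10_000))).isoformat()
            for _ in range(n)]


def disjoint_eval(train, candidates):
    """Keep only evaluation candidates that never appear in the training set."""
    seen = set(train)
    return [c for c in candidates if c not in seen]
```

Using different seeds alone does not guarantee disjoint sets when the sample space is small, which is why the explicit filter is applied after generation.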
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic datetime examples can be generated without contamination or conflation of parsing, arithmetic, and world knowledge.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat ≃ Nat (recovery theorem)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: PRIMETIME synthetic generator delivers non-conflated, uncontaminated datetime exemplars... primitives themselves prove individually unreliable... fine-tuning... Event Planning task reaches frontier-level accuracy
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URL https://openai.com/index/hello-gpt-4o
Hello gpt-4o, . URL https://openai.com/index/hello-gpt-4o
-
[2]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Build the future of ai with meta llama 3, 2024. URL https://www.llama.com/
work page 2024
-
[5]
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark (published at neurips 2024 track datasets and benchmarks), 202...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024
Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, and Yanghua Xiao. Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation, 2024. URLhttps: //arxiv.org/abs/2306.05783
-
[7]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv. org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023. URLhttps://arxiv.org/abs/2211.12588
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units, 2018. URL https://arxiv.org/abs/1808.00508
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[11]
Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms, 2016. URLhttps://arxiv. org/abs/1511.08228
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[12]
Niklas Heim, Tomáš Pevný, and Václav Šmídl. Neural power units, 2020. URL https: //arxiv.org/abs/2006.01681
-
[13]
inalu: Improved neural arithmetic logic unit,
Daniel Schlör, Markus Ring, and Andreas Hotho. inalu: Improved neural arithmetic logic unit,
- [14]
-
[15]
In: Bouamor, H., Pino, J., Bali, K
Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 10776–10787, Singa- pore, Dec...
-
[16]
Investigating data contamination in modern benchmarks for large language models, 2024
Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models, 2024. URL https: //arxiv.org/abs/2311.09783
-
[17]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. General- ization or memorization: Data contamination and trustworthy evaluation for large language models, 2024. URL https://arxiv.org/abs/2402.15938
-
[19]
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[21]
Measuring mathematical problem solving with the math dataset,
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,
-
[22]
URL https://arxiv.org/abs/2103.03874
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[24]
On the Measure of Intelligence
François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/ 1911.01547
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[25]
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023
BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id= uyTL5Bvosj
work page 2023
-
[26]
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination- free llm benchmark, 2024. URL https://arxiv.org/abs/2406.19314
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019
work page 2019
-
[28]
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2020
work page 2020
-
[29]
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization, 2020
work page 2020
-
[30]
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages, 2020
work page 2020
-
[31]
Erik F. Tjong Kim Sang. Introduction to the conll-2002 shared task: Language-independent named entity recognition, 2002
work page 2002
-
[32]
David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Ala...
work page 2021
-
[33]
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. Bbq: A hand-built bias benchmark for question answering, 2022. URL https://arxiv.org/abs/2110.08193
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [34]
-
[35]
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task general- ization via natural language crowdsourcing instructions, 2022. URL https://arxiv.org/ abs/2104.08773
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[37]
Adversarial nli: A new benchmark for natural language understanding, 2020
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https: //arxiv.org/abs/1910.14599
-
[38]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams, 2020. URL https://arxiv.org/abs/2009.13081
-
[39]
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017. URL https: //arxiv.org/abs/1705.03551
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[41]
PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/ 1911.11641
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[42]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/ 1907.10641
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[43]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018. URL https: //arxiv.org/abs/1809.02789
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[44]
Boolq: Exploring the surprising difficulty of natural yes/no questions,
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions,
-
[45]
URL https://arxiv.org/abs/1905.10044
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[46]
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge, 2019. URL https: //arxiv.org/abs/1811.00937
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[47]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958. 43
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[48]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[49]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https: //arxiv.org/abs/2306.05685
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Musr: Testing the limits of chain-of-thought with multistep soft reasoning, 2024
Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning, 2024. URL https://arxiv.org/ abs/2310.16049
-
[53]
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Assoc...
-
[54]
arXiv preprint arXiv:2106.15772
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers, 2021. URL https://arxiv.org/abs/ 2106.15772
-
[55]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...
work page 2019
-
[56]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/ 2310.02255
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning, 2021
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning, 2021. URL https://arxiv.org/abs/2105.04165
-
[58]
Breaking language barriers in multilingual mathematical reasoning: Insights and observations, 2024
Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations, 2024. URLhttps://arxiv.org/abs/2310.20246
-
[59]
Squad: 100,000+ questions for machine comprehension of text, 2016
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606. 05250
work page 2016
-
[60]
Know What You Don't Know: Unanswerable Questions for SQuAD
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[61]
The Penn Treebank: Annotating predicate argument structure
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at 44 Plainsboro, New Jersey, March 8-11, 1994 , 1994. URL https://aclanthology.org/ H94-1020
work page 1994
-
[62]
Siva Reddy, Danqi Chen, and Christopher D. Manning. Coqa: A conversational question answering challenge, 2019. URL https://arxiv.org/abs/1808.07042
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[63]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context, 2016. URL https: //arxiv.org/abs/1606.06031
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[64]
A corpus and evaluation framework for deeper understanding of commonsense stories, 2016
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and evaluation framework for deeper understanding of commonsense stories, 2016. URL https://arxiv.org/abs/1604. 01696
work page 2016
-
[65]
Hector J. Levesque, Ernest Davis, and L. Morgenstern. The winograd schema challenge. In International Conference on Principles of Knowledge Representation and Reasoning , 2012. URLhttps://api.semanticscholar.org/CorpusID:116068945
work page 2012
-
[66]
QuAC : Question Answering in Context
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac : Question answering in context, 2018. URL https://arxiv.org/ abs/1808.07036
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[67]
RACE: Large-scale ReAding Comprehension Dataset From Examinations
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations, 2017. URL https://arxiv.org/abs/1704.04683
-
[68]
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903.00161
- [69]
-
[70]
Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference, 2018. URL https://arxiv.org/abs/1704.05426
-
[71]
SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, 2017. doi: 10.18653/v1/s17-2001. URL http: //dx.d...
-
[72]
Recursive deep models for semantic compositionality over a sentiment treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Lan...
-
[73]
William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005) ,
-
[74]
URL https://aclanthology.org/I05-5002
-
[75]
URL https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
-
[76]
Natural language embedded programs for hybrid language symbolic reasoning
Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, and James Glass. Natural language embedded programs for hybrid language symbolic reasoning, 2024. URL https://arxiv.org/abs/2309.10814
-
[77]
OpenCompass: A universal evaluation platform for foundation models
OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
-
[78]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...
-
[79]
A framework for few-shot language model evaluation
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...
-
[80]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...