
arxiv: 2605.24703 · v1 · pith:6IP2J7LInew · submitted 2026-05-23 · 💻 cs.CL · cs.AI

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

Pith reviewed 2026-06-30 13:09 UTC · model grok-4.3

keywords: time-series question answering, analytical skills, benchmark, large language models, temporal reasoning, TSQA, skill evaluation, agentic framework

The pith

TS-Skill benchmark isolates three temporal skills to show that cross-interval integration remains hardest for non-agent models in time-series QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TS-Skill as a benchmark that decomposes time-series question answering into three composable skills: temporal scale selection, temporal localization, and cross-interval integration. It constructs the benchmark at scale using the SKEvol agentic framework, which generates timestamp-aware questions, builds answers with metadata and code, and applies multi-phase verification plus human curation. Experiments across ten LLMs and TSLMs demonstrate substantial uneven gaps among the skills, with aggregate scores failing to reveal the specific weaknesses. A reader would care because this skill-level view can identify targeted failures in temporal signal handling that current task-based benchmarks miss.

Core claim

The paper claims that existing TSQA benchmarks organized by task types or high-level reasoning obscure the underlying signal-level capabilities, and that TS-Skill's controlled evaluation of SK1, SK2, and SK3 uncovers these failures. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. The benchmark supplies timestamp-aware questions across broad domains with human-validated quality, and the SKEvol framework enables its scalable construction through skill-guided generation and verification.

What carries the argument

The three composable analytical skills—temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3)—that the benchmark uses to isolate and measure distinct temporal reasoning capabilities in TSQA.
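As a rough illustration of what each skill asks of a model, the three operations can be sketched on a toy hourly series. This is a minimal sketch, not the paper's implementation: the synthetic series, the function names `daily_means`, `locate_peak`, and `compare_intervals`, and the choice of mean-based comparison are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

# Toy hourly series (hypothetical data): 7 days, an evening bump at 18:00
# every day, and a raised level on the fifth day.
start = datetime(2022, 7, 11, 0, 0)
series = []
for h in range(7 * 24):
    ts = start + timedelta(hours=h)
    value = 10.0
    if ts.hour == 18:
        value += 5.0  # recurring evening peak
    if h // 24 == 4:
        value += 3.0  # level shift on the fifth day
    series.append((ts, value))


def daily_means(points):
    """SK1-style step: view the signal at daily rather than hourly resolution."""
    by_day = {}
    for ts, v in points:
        by_day.setdefault(ts.date(), []).append(v)
    return {day: mean(vals) for day, vals in sorted(by_day.items())}


def locate_peak(points):
    """SK2-style step: return the timestamp at which the maximum occurs."""
    return max(points, key=lambda p: p[1])[0]


def compare_intervals(points, a, b):
    """SK3-style step: mean of interval b minus mean of interval a.

    Each interval is a (start, end) pair with the end exclusive.
    """
    def interval_mean(lo, hi):
        return mean(v for ts, v in points if lo <= ts < hi)
    return interval_mean(*b) - interval_mean(*a)
```

On this toy series the level shift is easiest to see in the daily view, the single highest reading falls at 18:00 on the shifted day, and comparing the first and fifth days recovers the size of the shift. A benchmark question targeting one skill would, under this reading, require only the corresponding step.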

If this is right

  • Skill-level evaluation can uncover temporal reasoning failures that aggregate TSQA scores obscure.
  • SK3 cross-interval integration remains consistently challenging for non-agent models.
  • Tool-augmented agents obtain a selective advantage specifically on standalone SK3.
  • The SKEvol framework supports controlled, scalable construction of timestamp-aware questions with human-validated quality.
  • Broad domain coverage in the benchmark allows diagnosis of capabilities that generalize beyond single tasks.
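The first point can be made concrete with a small scoring sketch. The per-item results below are hypothetical numbers chosen for illustration, not figures from the paper, and the helper names `aggregate_accuracy` and `per_skill_accuracy` are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical per-item results: (skill tag, answered correctly?).
# These counts are invented for illustration and are not from the paper.
results = (
    [("SK1", True)] * 9 + [("SK1", False)] * 1
    + [("SK2", True)] * 8 + [("SK2", False)] * 2
    + [("SK3", True)] * 3 + [("SK3", False)] * 7
)


def aggregate_accuracy(items):
    """Single pooled score over all items, ignoring skill tags."""
    return sum(int(ok) for _, ok in items) / len(items)


def per_skill_accuracy(items):
    """Accuracy broken down by skill tag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for skill, ok in items:
        totals[skill] += 1
        hits[skill] += int(ok)
    return {skill: hits[skill] / totals[skill] for skill in sorted(totals)}


print(f"aggregate: {aggregate_accuracy(results):.2f}")
for skill, acc in per_skill_accuracy(results).items():
    print(f"{skill}: {acc:.2f}")
```

With these invented counts the pooled score sits at about two thirds, which gives no hint that one skill is far weaker than the other two. The per-skill breakdown exposes that gap directly.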

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the skills prove composable, targeted training or fine-tuning on SK3 alone could raise overall TSQA performance without retraining on every task type.
  • The selective agent advantage on SK3 suggests that explicit tool access for interval comparison may be a practical route to stronger temporal integration.
  • The benchmark's isolation method could be applied to sequential data outside time series to test whether similar skill gaps appear in other domains.
  • Persistent SK3 difficulty for non-agent models implies that pure next-token prediction may lack built-in mechanisms for cross-interval signal synthesis.

Load-bearing premise

The three skills are the primary composable analytical capabilities required for TSQA and the SKEvol-generated questions validly isolate each skill without introducing unintended biases or correlations.

What would settle it

A replication in which all ten models perform equally across SK1-SK3 would falsify the reported uneven gaps. A replication in which tool-augmented agents lose their advantage on standalone SK3 would falsify the selective-advantage claim.

Figures

Figures reproduced from arXiv: 2605.24703 by Dezhi Hong, Gaofeng Dong, Jason Wu, Kang Yang, Liying Han, Mani Srivastava, Mario Berges, Oliver Wang, Ozan Baris Mulayim, Pengrui Quan, Sizhe Ma, Yuyang Yuan.

Figure 1
Figure 1: Three Analytical Skills. SK1 selects the temporal resolution at which a pattern is visible. SK2 localizes the relevant interval. SK3 integrates evidence across separated temporal regions.
Figure 2
Figure 2: Overview of the TS-Skill construction pipeline: domain-context-guided time-series seed …
Figure 3
Figure 3: Agentic QA generation and verification workflow in …
Figure 4
Figure 4: Human-rated acceptability of TS-Skill QA pairs. “Qwen” denotes Qwen3-8B + Qwen2.5-VL. Setting. We conduct human evaluation to validate the quality of automatically accepted TSQA pairs and, at the same time, compare the effect of different LLM/VLM settings for generation and verification. In addition to the released GPT-5.4 setting, we run SKEvol with two alternative settings: Qwen3-8B + Qwen2.5-VL, and G…
Figure 5
Figure 5: Example Human Evaluation Interface. Inter-Rater Reliability of the Human Evaluation. We collect ratings from 10 expert reviewers for the human dataset quality evaluation, assigning 6–10 reviewers to each item. This yields roughly 33 items per generator setting with redundant judgments rather than a single annotator’s opinion. For reliability analysis, we collapse the detailed responses into a binary accept/…
Original abstract

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces TS-Skill, a controlled benchmark for three composable analytical skills in time-series question answering (TSQA): temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). It describes SKEvol, a skill-guided agentic framework combining domain-aware seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase verification, and human curation. Experiments on ten state-of-the-art LLMs and TSLMs are reported to reveal substantial and uneven capability gaps across the skills, with SK3 remaining challenging for non-agent models while tool-augmented agents show selective advantage on standalone SK3. The work argues that skill-level evaluation can uncover temporal reasoning failures obscured by aggregate TSQA scores.

Significance. If the benchmark construction successfully isolates the three skills without systematic confounds or correlations, the work would provide a useful diagnostic lens for TSQA that goes beyond task-level or high-level reasoning categories, potentially informing targeted improvements in temporal grounding for both LLMs and TSLMs.

major comments (2)
  1. [SKEvol construction pipeline (abstract description)] The central claim of uneven capability gaps and selective agent advantage on SK3 depends on SKEvol validly isolating SK1-SK3 without unintended correlations or difficulty confounds. The abstract describes a complex multi-phase verification pipeline but provides no quantitative checks (e.g., cross-skill requirement rates, metadata balance across SK1/SK2/SK3, or inter-rater agreement on skill purity), leaving open the possibility that observed gaps are construction artifacts rather than evidence of distinct capabilities.
  2. [Experiments (abstract description)] The abstract states that experiments on ten models 'reveal substantial and uneven capability gaps' and a 'selective advantage' on SK3, yet supplies no quantitative results, dataset statistics, error analysis, or verification details. Without these, it is impossible to assess whether the data support the stated gaps or the composability assumption.
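One way the first comment's "cross-skill requirement rate" could be phrased is sketched below. This is an assumption about what such a check might look like, not a description of the paper's verification pipeline: the annotation schema (`intended`, `required`), the sample rows, and the function `cross_skill_rate` are all hypothetical.

```python
from collections import Counter

# Hypothetical annotations: the skill each item was generated to test, and the
# set of skills a reviewer judged necessary to answer it. Invented for illustration.
items = [
    {"intended": "SK1", "required": {"SK1"}},
    {"intended": "SK1", "required": {"SK1", "SK2"}},
    {"intended": "SK2", "required": {"SK2"}},
    {"intended": "SK2", "required": {"SK2"}},
    {"intended": "SK3", "required": {"SK3", "SK2"}},
    {"intended": "SK3", "required": {"SK3"}},
]


def cross_skill_rate(rows):
    """Per intended skill, the share of items that also require another skill."""
    total, leaked = Counter(), Counter()
    for row in rows:
        total[row["intended"]] += 1
        if row["required"] - {row["intended"]}:
            leaked[row["intended"]] += 1
    return {skill: leaked[skill] / total[skill] for skill in sorted(total)}


print(cross_skill_rate(items))
```

A rate near zero for every skill would support the claim that each question isolates its intended skill. Uneven rates would suggest that part of any observed performance gap comes from items that quietly demand more than one skill.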

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract should include quantitative support for the construction validity and experimental claims. We have revised the abstract and added explicit validation details to the manuscript to address these points.

Point-by-point responses
  1. Referee: [SKEvol construction pipeline (abstract description)] The central claim of uneven capability gaps and selective agent advantage on SK3 depends on SKEvol validly isolating SK1-SK3 without unintended correlations or difficulty confounds. The abstract describes a complex multi-phase verification pipeline but provides no quantitative checks (e.g., cross-skill requirement rates, metadata balance across SK1/SK2/SK3, or inter-rater agreement on skill purity), leaving open the possibility that observed gaps are construction artifacts rather than evidence of distinct capabilities.

    Authors: We acknowledge that the original abstract omitted these quantitative validation metrics. The full manuscript (Section 3) describes the multi-phase verification pipeline and includes the requested checks: cross-skill requirement rates, metadata balance statistics across SK1/SK2/SK3, and inter-rater agreement on skill purity. To directly address the concern, we have revised the abstract to summarize these results, confirming low unintended correlations and balanced difficulty, thereby supporting that the observed gaps reflect distinct capabilities rather than construction artifacts. revision: yes

  2. Referee: [Experiments (abstract description)] The abstract states that experiments on ten models 'reveal substantial and uneven capability gaps' and a 'selective advantage' on SK3, yet supplies no quantitative results, dataset statistics, error analysis, or verification details. Without these, it is impossible to assess whether the data support the stated gaps or the composability assumption.

    Authors: We agree the abstract was insufficiently quantitative. The full manuscript (Section 5) reports the full experimental results on the ten models, including dataset statistics, per-skill performance breakdowns, error analysis, and verification details supporting the composability assumption. We have revised the abstract to incorporate key quantitative findings on the capability gaps and selective agent advantage on SK3, along with a brief summary of the supporting statistics and analysis. revision: yes

Circularity Check

0 steps flagged

No circularity detected; benchmark paper with no derivations

Full rationale

The paper introduces TS-Skill and SKEvol as an empirical benchmark construction and evaluation framework. No equations, predictions, or first-principles derivations appear in the abstract or described content. The three skills are posited as composable but the construction pipeline is presented as a methodological choice rather than a self-referential definition or fitted input renamed as prediction. Central claims rest on experimental results across models, not on any reduction to inputs by construction. This is a standard benchmark paper whose claims are self-contained against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger left empty.


