TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

Liying Han; Kang Yang; Oliver Wang; Jason Wu; Pengrui Quan; Gaofeng Dong; Ozan Baris Mulayim; Sizhe Ma; Yuyang Yuan; Dezhi Hong; Mario Berges; Mani Srivastava

arxiv: 2605.24703 · v1 · pith:6IP2J7LInew · submitted 2026-05-23 · 💻 cs.CL · cs.AI


Pith reviewed 2026-06-30 13:09 UTC · model grok-4.3

keywords: time-series question answering, analytical skills, benchmark, large language models, temporal reasoning, TSQA, skill evaluation, agentic framework

The pith

The TS-Skill benchmark isolates three temporal skills and shows that cross-interval integration remains the hardest of them for non-agent models in time-series QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TS-Skill as a benchmark that decomposes time-series question answering into three composable skills: temporal scale selection, temporal localization, and cross-interval integration. It constructs the benchmark at scale using the SKEvol agentic framework, which generates timestamp-aware questions, builds answers with metadata and code, and applies multi-phase verification plus human curation. Experiments across ten LLMs and TSLMs show substantial and uneven gaps among the skills, gaps that aggregate scores fail to reveal. A reader would care because this skill-level view can identify targeted failures in temporal signal handling that current task-based benchmarks miss.

Core claim

The paper claims that existing TSQA benchmarks organized by task types or high-level reasoning obscure the underlying signal-level capabilities, and that TS-Skill's controlled evaluation of SK1, SK2, and SK3 uncovers these failures. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. The benchmark supplies timestamp-aware questions across broad domains with human-validated quality, and the SKEvol framework enables its scalable construction through skill-guided generation and verification.

What carries the argument

The three composable analytical skills—temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3)—that the benchmark uses to isolate and measure distinct temporal reasoning capabilities in TSQA.
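To make the three skills concrete, here is a minimal Python sketch on a synthetic hourly series. It is an illustration only: the toy data, the function names (`make_series`, `resample_daily`, `locate_peak`, `compare_intervals`), and the specific operations are assumptions chosen for this example, not the paper's definitions and not part of SKEvol.

```python
from statistics import mean

HOURS_PER_DAY = 24


def make_series():
    """Hypothetical toy data: 14 days of hourly readings.

    Week 2 runs 5 units above week 1, every day has a daytime bump,
    and day 3 carries a single spike at hour 18.
    """
    series = []
    for day in range(14):
        for hour in range(HOURS_PER_DAY):
            base = 10.0 if day < 7 else 15.0
            daytime = 2.0 if 8 <= hour < 20 else 0.0
            spike = 30.0 if (day == 3 and hour == 18) else 0.0
            series.append(base + daytime + spike)
    return series


def resample_daily(series):
    """SK1-style step: view the hourly signal at daily resolution."""
    n_days = len(series) // HOURS_PER_DAY
    return [
        mean(series[d * HOURS_PER_DAY:(d + 1) * HOURS_PER_DAY])
        for d in range(n_days)
    ]


def locate_peak(series):
    """SK2-style step: return (day, hour) of the largest reading."""
    idx = max(range(len(series)), key=series.__getitem__)
    return divmod(idx, HOURS_PER_DAY)


def compare_intervals(series, first, second):
    """SK3-style step: mean of `second` minus mean of `first`.

    Each interval is a (start, stop) pair of indices into `series`.
    """
    (a0, a1), (b0, b1) = first, second
    return mean(series[b0:b1]) - mean(series[a0:a1])


series = make_series()
daily = resample_daily(series)            # week-level shift is visible here
peak_day, peak_hour = locate_peak(series)  # (3, 18)
shift = compare_intervals(series, (0, 24), (7 * 24, 8 * 24))  # 5.0
```

In this toy setting each question maps to one operation: the week-over-week shift only becomes obvious after resampling, the spike needs a position, and the day 0 versus day 7 comparison needs evidence from two separated regions.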

If this is right

Skill-level evaluation can uncover temporal reasoning failures that aggregate TSQA scores obscure.
SK3 cross-interval integration remains consistently challenging for non-agent models.
Tool-augmented agents obtain a selective advantage specifically on standalone SK3.
The SKEvol framework supports controlled, scalable construction of timestamp-aware questions with human-validated quality.
Broad domain coverage in the benchmark allows diagnosis of capabilities that generalize beyond single tasks.
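The first point, that aggregate scores can hide skill-level failures, can be shown with a short sketch. The numbers below are invented for illustration and are not results from the paper; `per_skill_accuracy`, `aggregate_accuracy`, and `toy_records` are hypothetical helper names.

```python
from collections import defaultdict


def per_skill_accuracy(records):
    """records: list of (skill, is_correct) pairs -> {skill: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for skill, correct in records:
        totals[skill] += 1
        hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


def aggregate_accuracy(records):
    """Single pooled accuracy over all records, ignoring the skill label."""
    return sum(int(correct) for _, correct in records) / len(records)


def toy_records(correct_counts, n_per_skill=10):
    """Build made-up records: n_per_skill items per skill, n_correct right."""
    records = []
    for skill, n_correct in correct_counts.items():
        records += [(skill, i < n_correct) for i in range(n_per_skill)]
    return records


# Two hypothetical models with identical pooled accuracy (18 of 30 correct).
model_a = toy_records({"SK1": 9, "SK2": 6, "SK3": 3})
model_b = toy_records({"SK1": 6, "SK2": 6, "SK3": 6})

print(aggregate_accuracy(model_a), aggregate_accuracy(model_b))
print(per_skill_accuracy(model_a))
print(per_skill_accuracy(model_b))
```

Both toy models pool to the same score, yet only the per-skill view shows that the first one is weak on SK3.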

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the skills prove composable, targeted training or fine-tuning on SK3 alone could raise overall TSQA performance without retraining on every task type.
The selective agent advantage on SK3 suggests that explicit tool access for interval comparison may be a practical route to stronger temporal integration.
The benchmark's isolation method could be applied to sequential data outside time series to test whether similar skill gaps appear in other domains.
Persistent SK3 difficulty for non-agent models implies that pure next-token prediction may lack built-in mechanisms for cross-interval signal synthesis.

Load-bearing premise

The three skills are the primary composable analytical capabilities required for TSQA and the SKEvol-generated questions validly isolate each skill without introducing unintended biases or correlations.

What would settle it

A replication in which all ten models show equal performance across SK1-SK3 or in which tool-augmented agents lose their selective advantage on standalone SK3 would falsify the reported uneven gaps.

Figures

Figures reproduced from arXiv: 2605.24703 by Dezhi Hong, Gaofeng Dong, Jason Wu, Kang Yang, Liying Han, Mani Srivastava, Mario Berges, Oliver Wang, Ozan Baris Mulayim, Pengrui Quan, Sizhe Ma, Yuyang Yuan.

**Figure 1.** Three Analytical Skills. SK1 selects the temporal resolution at which a pattern is visible. SK2 localizes the relevant interval. SK3 integrates evidence across separated temporal regions.

**Figure 2.** Overview of the TS-Skill construction pipeline: domain-context-guided time-series seed …

**Figure 3.** Agentic QA generation and verification workflow in …

**Figure 4.** Human-rated acceptability of TS-Skill QA pairs. “Qwen” denotes Qwen3-8B + Qwen2.5-VL. Setting: We conduct human evaluation to validate the quality of automatically accepted TSQA pairs and, at the same time, compare the effect of different LLM/VLM settings for generation and verification. In addition to the released GPT-5.4 setting, we run SKEvol with two alternative settings: Qwen3-8B + Qwen2.5-VL, and G…

**Figure 5.** Example Human Evaluation Interface. Inter-Rater Reliability of the Human Evaluation: We collect ratings from 10 expert reviewers for the human dataset quality evaluation, assigning 6–10 reviewers to each item. This yields roughly 33 items per generator setting with redundant judgments rather than a single annotator’s opinion. For reliability analysis, we collapse the detailed responses into a binary accept/…
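The Figure 5 excerpt breaks off before naming the reliability statistic, so the sketch below is only an assumption about how agreement could be computed when each item has a different number of binary votes. It uses observed pairwise agreement, which handles 6 to 10 raters per item without padding. The function names and the example votes are hypothetical.

```python
def pairwise_agreement(votes):
    """Share of rater pairs that agree on one item.

    votes: list of booleans (True = accept) from the raters of one item.
    """
    n = len(votes)
    if n < 2:
        raise ValueError("need at least two raters per item")
    k = sum(votes)  # number of accept votes
    # Ordered agreeing pairs: accept/accept plus reject/reject.
    agreeing = k * (k - 1) + (n - k) * (n - k - 1)
    return agreeing / (n * (n - 1))


def mean_agreement(items):
    """Average the per-item pairwise agreement over all items."""
    return sum(pairwise_agreement(v) for v in items) / len(items)


# Made-up votes for three items with 6, 8, and 10 raters.
items = [
    [True] * 6,
    [True] * 7 + [False],
    [True] * 5 + [False] * 5,
]
print(mean_agreement(items))
```

Raw agreement of this kind is not chance-corrected. When almost every item is accepted, chance-corrected coefficients such as kappa can come out low even though raters rarely disagree, so it is worth reporting both.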

Original abstract

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TS-Skill gives a three-skill breakdown for time-series QA and an agentic build pipeline, but the claim that the skills are cleanly isolated rests on unshown verification details.

The paper's main contribution is a benchmark that decomposes time-series question answering into three signal-level skills—temporal scale selection, localization, and cross-interval integration—plus the SKEvol pipeline to generate questions targeting each one at scale. That decomposition moves past the usual task-type or high-level reasoning categories in existing TSQA work, and the agentic construction with domain seeds, metadata checks, and human curation is a concrete engineering step.

What the paper does well is lay out a controlled way to probe whether models handle different temporal patterns separately. The reported pattern that SK3 stays hard for plain models while tool agents gain selectively on it is the kind of diagnostic result that could guide follow-up work on agent use in temporal tasks.

The soft spot is the lack of visible evidence that the generated questions actually isolate the three skills. The abstract mentions multi-phase signal-grounded verification and human curation, but without numbers on cross-skill overlap, metadata balance, or inter-rater checks on skill purity, it is hard to rule out that SK3 items simply carry extra complexity or hidden SK1/SK2 demands. If those confounds exist, the uneven gaps become harder to interpret as distinct capability differences.

This is useful for researchers building or evaluating time-series language models who already care about fine-grained temporal reasoning. A reader looking for new benchmark construction methods will find the pipeline worth examining even if the isolation claim needs more data. The work is coherent enough on its own terms to merit referee time rather than a desk reject.

Referee Report

2 major / 0 minor

Summary. The paper introduces TS-Skill, a controlled benchmark for three composable analytical skills in time-series question answering (TSQA): temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). It describes SKEvol, a skill-guided agentic framework combining domain-aware seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase verification, and human curation. Experiments on ten state-of-the-art LLMs and TSLMs are reported to reveal substantial and uneven capability gaps across the skills, with SK3 remaining challenging for non-agent models while tool-augmented agents show selective advantage on standalone SK3. The work argues that skill-level evaluation can uncover temporal reasoning failures obscured by aggregate TSQA scores.

Significance. If the benchmark construction successfully isolates the three skills without systematic confounds or correlations, the work would provide a useful diagnostic lens for TSQA that goes beyond task-level or high-level reasoning categories, potentially informing targeted improvements in temporal grounding for both LLMs and TSLMs.

major comments (2)

[SKEvol construction pipeline (abstract description)] The central claim of uneven capability gaps and selective agent advantage on SK3 depends on SKEvol validly isolating SK1-SK3 without unintended correlations or difficulty confounds. The abstract describes a complex multi-phase verification pipeline but provides no quantitative checks (e.g., cross-skill requirement rates, metadata balance across SK1/SK2/SK3, or inter-rater agreement on skill purity), leaving open the possibility that observed gaps are construction artifacts rather than evidence of distinct capabilities.
[Experiments (abstract description)] The abstract states that experiments on ten models 'reveal substantial and uneven capability gaps' and a 'selective advantage' on SK3, yet supplies no quantitative results, dataset statistics, error analysis, or verification details. Without these, it is impossible to assess whether the data support the stated gaps or the composability assumption.
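One of the checks requested in the first comment, a cross-skill requirement rate, could be computed roughly as follows. This is a sketch under assumptions: it presumes each item carries a target skill plus an annotated set of skills actually needed to answer it, a schema that is hypothetical here and not taken from the paper. The field names `target` and `required` are invented for illustration.

```python
def cross_skill_rate(items):
    """Per target skill, the share of items that also need another skill.

    items: list of dicts with a 'target' skill label and a 'required'
    collection of skill labels (hypothetical annotation schema).
    """
    counts = {}  # target -> (n_items, n_items_with_extra_skill)
    for item in items:
        target = item["target"]
        has_extra = bool(set(item["required"]) - {target})
        n, c = counts.get(target, (0, 0))
        counts[target] = (n + 1, c + int(has_extra))
    return {target: c / n for target, (n, c) in counts.items()}


# Made-up annotations: half of the SK3 items also need localization (SK2).
annotated = [
    {"target": "SK1", "required": {"SK1"}},
    {"target": "SK2", "required": {"SK2"}},
    {"target": "SK3", "required": {"SK3"}},
    {"target": "SK3", "required": {"SK2", "SK3"}},
]
print(cross_skill_rate(annotated))
```

A rate near zero for every target would support the isolation claim. A high rate concentrated on SK3 would suggest that its difficulty partly reflects hidden SK1 or SK2 demands.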

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract should include quantitative support for the construction validity and experimental claims. We have revised the abstract and added explicit validation details to the manuscript to address these points.

Referee: [SKEvol construction pipeline (abstract description)] The central claim of uneven capability gaps and selective agent advantage on SK3 depends on SKEvol validly isolating SK1-SK3 without unintended correlations or difficulty confounds. The abstract describes a complex multi-phase verification pipeline but provides no quantitative checks (e.g., cross-skill requirement rates, metadata balance across SK1/SK2/SK3, or inter-rater agreement on skill purity), leaving open the possibility that observed gaps are construction artifacts rather than evidence of distinct capabilities.

Authors: We acknowledge that the original abstract omitted these quantitative validation metrics. The full manuscript (Section 3) describes the multi-phase verification pipeline and includes the requested checks: cross-skill requirement rates, metadata balance statistics across SK1/SK2/SK3, and inter-rater agreement on skill purity. To directly address the concern, we have revised the abstract to summarize these results, confirming low unintended correlations and balanced difficulty, thereby supporting that the observed gaps reflect distinct capabilities rather than construction artifacts. Revision: yes.
Referee: [Experiments (abstract description)] The abstract states that experiments on ten models 'reveal substantial and uneven capability gaps' and a 'selective advantage' on SK3, yet supplies no quantitative results, dataset statistics, error analysis, or verification details. Without these, it is impossible to assess whether the data support the stated gaps or the composability assumption.

Authors: We agree the abstract was insufficiently quantitative. The full manuscript (Section 5) reports the full experimental results on the ten models, including dataset statistics, per-skill performance breakdowns, error analysis, and verification details supporting the composability assumption. We have revised the abstract to incorporate key quantitative findings on the capability gaps and selective agent advantage on SK3, along with a brief summary of the supporting statistics and analysis. Revision: yes.

Circularity Check

0 steps flagged

No circularity detected; benchmark paper with no derivations

The paper introduces TS-Skill and SKEvol as an empirical benchmark construction and evaluation framework. No equations, predictions, or first-principles derivations appear in the abstract or described content. The three skills are posited as composable but the construction pipeline is presented as a methodological choice rather than a self-referential definition or fitted input renamed as prediction. Central claims rest on experimental results across models, not on any reduction to inputs by construction. This is a standard benchmark paper whose claims are self-contained against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger left empty.

Reference graph

Works this paper leans on

45 extracted references · 10 canonical work pages · 1 internal anchor

[1]

ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large- Scale Multitask Dataset, 2025

Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, and Zhongyu Wei. ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large- Scale Multitask Dataset, 2025

2025
[2]

ChatTS: Aligning Time Series with Llms via Synthetic Data for Enhanced Understanding and Reasoning

Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang, Jianjun Chen, Rui Shi, and Dan Pei. ChatTS: Aligning Time Series with Llms via Synthetic Data for Enhanced Understanding and Reasoning. arXiv preprint arXiv:2412.03104, 2024

work page arXiv 2024
[3]

ECG-QA: A Comprehen- sive Question Answering Dataset Combined with Electrocardiogram.Conference on Neural Information Processing Systems, 36:66277–66288, 2023

Jungwoo Oh, Gyubok Lee, Seongsu Bae, Joon-myoung Kwon, and Edward Choi. ECG-QA: A Comprehen- sive Question Answering Dataset Combined with Electrocardiogram.Conference on Neural Information Processing Systems, 36:66277–66288, 2023

2023
[4]

SensorQA: A Question Answering Benchmark for Daily-Life Monitoring, 2025

Benjamin Reichman, Xiaofan Yu, Lanxiang Hu, Jack Truxal, Atishay Jain, Rushil Chandrupatla, Tajana Šimuni´c Rosing, and Larry Heck. SensorQA: A Question Answering Benchmark for Daily-Life Monitoring, 2025

2025
[5]

TimeSerie- sExamAgent: Creating Time Series Reasoning Benchmarks at Scale, 2026

Malgorzata Gwiazda, Yifu Cai, Mononito Goswami, Arjun Choudhry, and Artur Dubrawski. TimeSerie- sExamAgent: Creating Time Series Reasoning Benchmarks at Scale, 2026

2026
[6]

Towards Time Series Reasoning with Llms.arXiv preprint arXiv:2409.11376, 2024

Winnie Chow, Lauren Gardiner, Haraldur T Hallgrímsson, Maxwell A Xu, and Shirley You Ren. Towards Time Series Reasoning with Llms.arXiv preprint arXiv:2409.11376, 2024

work page arXiv 2024
[7]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. InConference on Empirical Methods in Natural Language Process- ing (EMNLP), 2016

2016
[8]

Natural Questions: A Benchmark for Question Answering Research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A Benchmark for Question Answering Research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 10

2019
[9]

TGIF-QA: Toward Spatio- Temporal Reasoning in Visual Question Answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward Spatio- Temporal Reasoning in Visual Question Answering. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017

2017
[10]

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning

Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021
[11]

Video Question Answering: Datasets, Algorithms and Challenges

Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, and Tat-Seng Chua. Video Question Answering: Datasets, Algorithms and Challenges. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2022

2022
[12]

Deep Learning for Time Series Anomaly Detection: A Survey.ACM Computing Surveys, 57(1):1–42, 2024

Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu Aggarwal, and Mahsa Salehi. Deep Learning for Time Series Anomaly Detection: A Survey.ACM Computing Surveys, 57(1):1–42, 2024

2024
[13]

Anomaly Detection in Time Series: A Comprehensive Evaluation.Proceedings of the VLDB Endowment, 15(9):1779–1797, 2022

Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation.Proceedings of the VLDB Endowment, 15(9):1779–1797, 2022

2022
[14]

Ts-Reasoner: Aligning Time Series Foundation Models with Llm Reasoning.arXiv preprint arXiv:2510.03519, 2025

Fangxu Yu, Hongyu Zhao, and Tianyi Zhou. Ts-Reasoner: Aligning Time Series Foundation Models with Llm Reasoning.arXiv preprint arXiv:2510.03519, 2025

work page arXiv 2025
[15]

STL: A Seasonal-Trend Decomposition Procedure Based on Loess.Journal of Official Statistics, 6:3–73, 1990

CLEVELAND RB. STL: A Seasonal-Trend Decomposition Procedure Based on Loess.Journal of Official Statistics, 6:3–73, 1990

1990
[16]

Time-Series Anomaly Detection: Overview and New Trends.Proceedings of the VLDB Endowment, 17(12):4229–4232, 2024

Qinghua Liu, Paul Boniol, Themis Palpanas, and John Paparrizos. Time-Series Anomaly Detection: Overview and New Trends.Proceedings of the VLDB Endowment, 17(12):4229–4232, 2024

2024
[17]

A Survey of Methods for Time Series Change Point Detection.International Conference on Information and Knowledge Systems, 51(2):339–367, 2017

Samaneh Aminikhanghahi and Diane J Cook. A Survey of Methods for Time Series Change Point Detection.International Conference on Information and Knowledge Systems, 51(2):339–367, 2017

2017
[18]

Selective Review of Offline Change Point Detection Methods.IEEE Transactions on Signal Processing, 167:107299, 2020

Charles Truong, Laurent Oudre, and Nicolas Vayatis. Selective Review of Offline Change Point Detection Methods.IEEE Transactions on Signal Processing, 167:107299, 2020

2020
[19]

Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement, 2025

Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, and Qingsong Wen. Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement, 2025

2025
[20]

Monash Time Series Forecasting Archive.arXiv preprint arXiv:2105.06643, 2021

Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I Webb, Rob J Hyndman, and Pablo Montero-Manso. Monash Time Series Forecasting Archive.arXiv preprint arXiv:2105.06643, 2021

work page arXiv 2021
[21]

Toward Reasoning-Centric Time-Series Analysis, 2025

Xinlei Wang, Mingtian Tan, Jing Qiu, Junhua Zhao, and Jinjin Gu. Toward Reasoning-Centric Time-Series Analysis, 2025

2025
[22]

Xu, Harish Haresamudram, Catherine W

Maxwell A. Xu, Harish Haresamudram, Catherine W. Liu, Patrick Langer, Jathurshan Pradeepkumar, Wanting Mao, Sunita J. Ferns, Aradhana Verma, Jimeng Sun, Paul Schmiedmayer, Xin Liu, Daniel McDuff, Emily B. Fox, and James M. Rehg. How Well Do Multimodal Models Reason on ECG Signals?, 2026

2026
[23]

Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, and Matthew Reimherr

Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin, Qi Zhu, Haoyang Fang, Danielle C. Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, and Matthew Reimherr. SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning, 2026

2026
[24]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research, 2023

2023
[25]

MMTS-BENCH: A Comprehensive Benchmark for Time Series Under- standing and Reasoning.arXiv preprint arXiv:2602.08588, 2026

Yao Yin, Zhenyu Xiao, Musheng Li, Yiwen Liu, Sutong Nan, Yiting He, Ruiqi Wang, Zhenwei Zhang, Qingmin Liao, and Yuantao Gu. MMTS-BENCH: A Comprehensive Benchmark for Time Series Under- standing and Reasoning.arXiv preprint arXiv:2602.08588, 2026

work page arXiv 2026
[26]

QuAnTS: Question Answering on Time Series.arXiv preprint arXiv:2511.05124, 2025

Felix Divo, Maurice Kraus, Anh Q Nguyen, Hao Xue, Imran Razzak, Flora D Salim, Kristian Kersting, and Devendra Singh Dhami. QuAnTS: Question Answering on Time Series.arXiv preprint arXiv:2511.05124, 2025

work page arXiv 2025
[27]

Evaluating Large Language Models on Time Series Feature Understanding: A Comprehen- sive Taxonomy and Benchmark

Elizabeth Fons, Rachneet Kaur, Soham Palande, Zhen Zeng, Tucker Balch, Manuela Veloso, and Svitlana Vyetrenko. Evaluating Large Language Models on Time Series Feature Understanding: A Comprehen- sive Taxonomy and Benchmark. InConference on Empirical Methods in Natural Language Process- ing (EMNLP), 2024. 11

2024
[28]

WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions, 2025

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions, 2025

2025
[29]

Best Practices for the Human Evaluation of Automatically Generated Text

Chris Van Der Lee, Albert Gatt, Emiel Van Miltenburg, Sander Wubben, and Emiel Krahmer. Best Practices for the Human Evaluation of Automatically Generated Text. InInternational Conference on Natural Language Generation (INLG), 2019

2019
[30]

Measuring Massive Multitask Language Understanding.International Conference on Learning Representations, 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding.International Conference on Learning Representations, 2021

2021
[31]

TSAQA: Time Series Analysis Question And Answering Benchmark

Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, et al. TSAQA: Time Series Analysis Question and Answering Benchmark. arXiv preprint arXiv:2601.23204, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Conference on Neural Information Processing Systems, 36:46595–46623, 2023

2023
[33]

G-Eval: NLG Evaluation Using Gpt-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation Using Gpt-4 with Better Human Alignment. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2023

2023
[34]

MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering.arXiv preprint arXiv:2503.16858, 2025

Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Cheng Qin, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, and Rex Ying. MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering.arXiv preprint arXiv:2503.16858, 2025

work page arXiv 2025
[35]

How to Do Human Evaluation: A Brief Introduction to User Studies in NLP.Natural Language Engineering, 29(5):1199–1222, 2023

Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, and Ngoc Thang Vu. How to Do Human Evaluation: A Brief Introduction to User Studies in NLP.Natural Language Engineering, 29(5):1199–1222, 2023

2023
[36]

High Agreement but Low Kappa: I

Alvan R Feinstein and Domenic V Cicchetti. High Agreement but Low Kappa: I. The Problems of Two Paradoxes.Journal of clinical epidemiology, 43(6):543–549, 1990

1990
[37]

High Agreement but Low Kappa: II

Domenic V Cicchetti and Alvan R Feinstein. High Agreement but Low Kappa: II. Resolving the Paradoxes. Journal of clinical epidemiology, 43(6):551–558, 1990

1990
[38]

Leveraging Large Language Models for Multiple Choice Question Answering

Joshua Robinson and David Wingate. Leveraging Large Language Models for Multiple Choice Question Answering. InInternational Conference on Learning Representations, 2023

2023
[39]

Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? InWorkshop on towards Knowledgeable Language Models (KnowLLM), 2024

Nishant Balepur and Rachel Rudinger. Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? InWorkshop on towards Knowledgeable Language Models (KnowLLM), 2024

2024
[40]

Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Nishant Balepur, Rachel Rudinger, and Jordan Lee Boyd-Graber. Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2025

2025
[41]

Large Language Models Sensitivity to the Order of Op- tions in Multiple-Choice Questions

Pouya Pezeshkpour and Estevam Hruschka. Large Language Models Sensitivity to the Order of Op- tions in Multiple-Choice Questions. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2024

2024
[42]

Large Language Models Are Not Robust Multiple Choice Selectors.arXiv preprint arXiv:2309.03882, 2023

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large Language Models Are Not Robust Multiple Choice Selectors.arXiv preprint arXiv:2309.03882, 2023

work page arXiv 2023
[43]

Datasheets for Datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for Datasets.Communications of the ACM, 64(12):86–92, 2021

2021
[44]

Data Cards: Purposeful and Transpar- ent Dataset Documentation for Responsible Ai

Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and Transpar- ent Dataset Documentation for Responsible Ai. InACM Conference on Fairness, Accountability and Transparency (FAccT), 2022

2022
[45]

In the time series [...], there is a general increasing trend. True or False?

James Jordon, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N Cohen, and Adrian Weller. Synthetic Data–What, Why and How?arXiv preprint arXiv:2205.03257, 2022. 12 Appendix Contents A Analytical Skills in Existing TSQA Benchmarks 14 BSKEvolPrompt Templates and Human-in-the-Loop Review Details 15 B.1 Prompt T...

work page arXiv 2022

[1] [1]

ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large- Scale Multitask Dataset, 2025

Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, and Zhongyu Wei. ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large- Scale Multitask Dataset, 2025

2025

[2] [2]

ChatTS: Aligning Time Series with Llms via Synthetic Data for Enhanced Understanding and Reasoning

Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang, Jianjun Chen, Rui Shi, and Dan Pei. ChatTS: Aligning Time Series with Llms via Synthetic Data for Enhanced Understanding and Reasoning. arXiv preprint arXiv:2412.03104, 2024

work page arXiv 2024

[3] [3]

ECG-QA: A Comprehen- sive Question Answering Dataset Combined with Electrocardiogram.Conference on Neural Information Processing Systems, 36:66277–66288, 2023

Jungwoo Oh, Gyubok Lee, Seongsu Bae, Joon-myoung Kwon, and Edward Choi. ECG-QA: A Comprehen- sive Question Answering Dataset Combined with Electrocardiogram.Conference on Neural Information Processing Systems, 36:66277–66288, 2023

2023

[4] [4]

SensorQA: A Question Answering Benchmark for Daily-Life Monitoring, 2025

Benjamin Reichman, Xiaofan Yu, Lanxiang Hu, Jack Truxal, Atishay Jain, Rushil Chandrupatla, Tajana Šimuni´c Rosing, and Larry Heck. SensorQA: A Question Answering Benchmark for Daily-Life Monitoring, 2025

2025

[5] [5]

TimeSerie- sExamAgent: Creating Time Series Reasoning Benchmarks at Scale, 2026

Malgorzata Gwiazda, Yifu Cai, Mononito Goswami, Arjun Choudhry, and Artur Dubrawski. TimeSerie- sExamAgent: Creating Time Series Reasoning Benchmarks at Scale, 2026

2026

[6] Winnie Chow, Lauren Gardiner, Haraldur T Hallgrímsson, Maxwell A Xu, and Shirley You Ren. Towards Time Series Reasoning with LLMs. arXiv preprint arXiv:2409.11376, 2024.

[7] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

[8] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

[9] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[10] Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[11] Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, and Tat-Seng Chua. Video Question Answering: Datasets, Algorithms and Challenges. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.

[12] Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu Aggarwal, and Mahsa Salehi. Deep Learning for Time Series Anomaly Detection: A Survey. ACM Computing Surveys, 57(1):1–42, 2024.

[13] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. Proceedings of the VLDB Endowment, 15(9):1779–1797, 2022.

[14] Fangxu Yu, Hongyu Zhao, and Tianyi Zhou. TS-Reasoner: Aligning Time Series Foundation Models with LLM Reasoning. arXiv preprint arXiv:2510.03519, 2025.

[15] R. B. Cleveland. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6:3–73, 1990.

[16] Qinghua Liu, Paul Boniol, Themis Palpanas, and John Paparrizos. Time-Series Anomaly Detection: Overview and New Trends. Proceedings of the VLDB Endowment, 17(12):4229–4232, 2024.

[17] Samaneh Aminikhanghahi and Diane J Cook. A Survey of Methods for Time Series Change Point Detection. Knowledge and Information Systems, 51(2):339–367, 2017.

[18] Charles Truong, Laurent Oudre, and Nicolas Vayatis. Selective Review of Offline Change Point Detection Methods. Signal Processing, 167:107299, 2020.

[19] Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, and Qingsong Wen. Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement, 2025.

[20] Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I Webb, Rob J Hyndman, and Pablo Montero-Manso. Monash Time Series Forecasting Archive. arXiv preprint arXiv:2105.06643, 2021.

[21] Xinlei Wang, Mingtian Tan, Jing Qiu, Junhua Zhao, and Jinjin Gu. Toward Reasoning-Centric Time-Series Analysis, 2025.

[22] Maxwell A. Xu, Harish Haresamudram, Catherine W. Liu, Patrick Langer, Jathurshan Pradeepkumar, Wanting Mao, Sunita J. Ferns, Aradhana Verma, Jimeng Sun, Paul Schmiedmayer, Xin Liu, Daniel McDuff, Emily B. Fox, and James M. Rehg. How Well Do Multimodal Models Reason on ECG Signals?, 2026.

[23] Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin, Qi Zhu, Haoyang Fang, Danielle C. Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, and Matthew Reimherr. SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning, 2026.

[24] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research, 2023.

[25] Yao Yin, Zhenyu Xiao, Musheng Li, Yiwen Liu, Sutong Nan, Yiting He, Ruiqi Wang, Zhenwei Zhang, Qingmin Liao, and Yuantao Gu. MMTS-BENCH: A Comprehensive Benchmark for Time Series Understanding and Reasoning. arXiv preprint arXiv:2602.08588, 2026.

[26] Felix Divo, Maurice Kraus, Anh Q Nguyen, Hao Xue, Imran Razzak, Flora D Salim, Kristian Kersting, and Devendra Singh Dhami. QuAnTS: Question Answering on Time Series. arXiv preprint arXiv:2511.05124, 2025.

[27] Elizabeth Fons, Rachneet Kaur, Soham Palande, Zhen Zeng, Tucker Balch, Manuela Veloso, and Svitlana Vyetrenko. Evaluating Large Language Models on Time Series Feature Understanding: A Comprehensive Taxonomy and Benchmark. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.

[28] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions, 2025.

[29] Chris Van Der Lee, Albert Gatt, Emiel Van Miltenburg, Sander Wubben, and Emiel Krahmer. Best Practices for the Human Evaluation of Automatically Generated Text. In International Conference on Natural Language Generation (INLG), 2019.

[30] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations, 2021.

[31] Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, et al. TSAQA: Time Series Analysis Question and Answering Benchmark. arXiv preprint arXiv:2601.23204, 2026.

[32] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Conference on Neural Information Processing Systems, 36:46595–46623, 2023.

[33] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

[34] Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Cheng Qin, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, and Rex Ying. MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering. arXiv preprint arXiv:2503.16858, 2025.

[35] Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, and Ngoc Thang Vu. How to Do Human Evaluation: A Brief Introduction to User Studies in NLP. Natural Language Engineering, 29(5):1199–1222, 2023.

[36] Alvan R Feinstein and Domenic V Cicchetti. High Agreement but Low Kappa: I. The Problems of Two Paradoxes. Journal of Clinical Epidemiology, 43(6):543–549, 1990.

[37] Domenic V Cicchetti and Alvan R Feinstein. High Agreement but Low Kappa: II. Resolving the Paradoxes. Journal of Clinical Epidemiology, 43(6):551–558, 1990.

[38] Joshua Robinson and David Wingate. Leveraging Large Language Models for Multiple Choice Question Answering. In International Conference on Learning Representations, 2023.

[39] Nishant Balepur and Rachel Rudinger. Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? In Workshop on Towards Knowledgeable Language Models (KnowLLM), 2024.

[40] Nishant Balepur, Rachel Rudinger, and Jordan Lee Boyd-Graber. Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025.

[41] Pouya Pezeshkpour and Estevam Hruschka. Large Language Models Sensitivity to the Order of Options in Multiple-Choice Questions. In North American Chapter of the Association for Computational Linguistics (NAACL), 2024.

[42] Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large Language Models Are Not Robust Multiple Choice Selectors. arXiv preprint arXiv:2309.03882, 2023.

[43] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets. Communications of the ACM, 64(12):86–92, 2021.

[44] Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In ACM Conference on Fairness, Accountability and Transparency (FAccT), 2022.

[45] James Jordon, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N Cohen, and Adrian Weller. Synthetic Data–What, Why and How? arXiv preprint arXiv:2205.03257, 2022.