TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering
Pith reviewed 2026-06-30 13:09 UTC · model grok-4.3
The pith
The TS-Skill benchmark isolates three temporal skills and shows that cross-interval integration remains the hardest for non-agent models in time-series QA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that existing TSQA benchmarks organized by task types or high-level reasoning obscure the underlying signal-level capabilities, and that TS-Skill's controlled evaluation of SK1, SK2, and SK3 uncovers these failures. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. The benchmark supplies timestamp-aware questions across broad domains with human-validated quality, and the SKEvol framework enables its scalable construction through skill-guided generation and verification.
What carries the argument
The three composable analytical skills—temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3)—that the benchmark uses to isolate and measure distinct temporal reasoning capabilities in TSQA.
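To make the three skills concrete, the toy sketch below maps each one to a small operation on a made-up hourly series. This is an illustrative reading of the skill names only: the series, the helper functions (`downsample`, `locate_peak`, `compare_intervals`), and the chosen intervals are all hypothetical and do not come from the paper or from the TS-Skill benchmark.

```python
from statistics import mean

# Hypothetical hourly readings over two days (48 points); values are made up.
series = [10 + (h % 24) // 6 for h in range(48)]
series[30] = 25  # inject a single spike at hour 30


def downsample(values, window):
    """Average consecutive non-overlapping windows (a coarser temporal scale)."""
    return [mean(values[i:i + window]) for i in range(0, len(values), window)]


def locate_peak(values):
    """Return the index (timestamp) at which the maximum value occurs."""
    return max(range(len(values)), key=values.__getitem__)


def compare_intervals(values, first, second):
    """Difference in mean level between two separated (start, end) intervals."""
    level_first = mean(values[first[0]:first[1]])
    level_second = mean(values[second[0]:second[1]])
    return level_second - level_first


# SK1-style question: which scale answers "is day 2 higher than day 1?"
daily = downsample(series, 24)

# SK2-style question: at which hour does the spike occur?
peak_hour = locate_peak(series)

# SK3-style question: how does the last 6 hours compare with the first 6?
shift = compare_intervals(series, (0, 6), (42, 48))

print(daily, peak_hour, shift)
```

In this reading, an SK3 question is the only one that needs evidence from two separated parts of the signal, which is one plausible reason it could be harder to answer from a single pass over the data.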
If this is right
- Skill-level evaluation can uncover temporal reasoning failures that aggregate TSQA scores obscure.
- SK3 cross-interval integration remains consistently challenging for non-agent models.
- Tool-augmented agents obtain a selective advantage specifically on standalone SK3.
- The SKEvol framework supports controlled, scalable construction of timestamp-aware questions with human-validated quality.
- Broad domain coverage in the benchmark allows diagnosis of capabilities that generalize beyond single tasks.
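The first point above, that aggregate scores can hide skill-level failures, can be sketched with a few lines of bookkeeping. The graded results below are invented for illustration and are not the paper's numbers; the skill tags and the `accuracy_by_skill` helper are assumptions.

```python
from collections import defaultdict

# Hypothetical graded results: (skill tag, whether the answer was correct).
results = (
    [("SK1", True)] * 9 + [("SK1", False)] * 1
    + [("SK2", True)] * 8 + [("SK2", False)] * 2
    + [("SK3", True)] * 3 + [("SK3", False)] * 7
)


def accuracy_by_skill(records):
    """Group graded answers by skill tag and return accuracy per skill."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for skill, ok in records:
        totals[skill] += 1
        correct[skill] += int(ok)
    return {skill: correct[skill] / totals[skill] for skill in totals}


# One number over all questions versus one number per skill.
aggregate = sum(ok for _, ok in results) / len(results)
per_skill = accuracy_by_skill(results)

print(f"aggregate: {aggregate:.2f}")
for skill, acc in sorted(per_skill.items()):
    print(f"{skill}: {acc:.2f}")
```

With these made-up numbers the aggregate looks moderate, while the per-skill view shows that almost all of the errors sit in one skill.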
Where Pith is reading between the lines
- If the skills prove composable, targeted training or fine-tuning on SK3 alone could raise overall TSQA performance without retraining on every task type.
- The selective agent advantage on SK3 suggests that explicit tool access for interval comparison may be a practical route to stronger temporal integration.
- The benchmark's isolation method could be applied to sequential data outside time series to test whether similar skill gaps appear in other domains.
- Persistent SK3 difficulty for non-agent models implies that pure next-token prediction may lack built-in mechanisms for cross-interval signal synthesis.
Load-bearing premise
The three skills are the primary composable analytical capabilities required for TSQA and the SKEvol-generated questions validly isolate each skill without introducing unintended biases or correlations.
What would settle it
A replication in which all ten models show equal performance across SK1-SK3 or in which tool-augmented agents lose their selective advantage on standalone SK3 would falsify the reported uneven gaps.
Original abstract
Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TS-Skill, a controlled benchmark for three composable analytical skills in time-series question answering (TSQA): temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). It describes SKEvol, a skill-guided agentic framework combining domain-aware seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase verification, and human curation. Experiments on ten state-of-the-art LLMs and TSLMs are reported to reveal substantial and uneven capability gaps across the skills, with SK3 remaining challenging for non-agent models while tool-augmented agents show a selective advantage on standalone SK3. The work argues that skill-level evaluation can uncover temporal reasoning failures obscured by aggregate TSQA scores.
Significance. If the benchmark construction successfully isolates the three skills without systematic confounds or correlations, the work would provide a useful diagnostic lens for TSQA that goes beyond task-level or high-level reasoning categories, potentially informing targeted improvements in temporal grounding for both LLMs and TSLMs.
major comments (2)
- [SKEvol construction pipeline (abstract description)] The central claim of uneven capability gaps and selective agent advantage on SK3 depends on SKEvol validly isolating SK1-SK3 without unintended correlations or difficulty confounds. The abstract describes a complex multi-phase verification pipeline but provides no quantitative checks (e.g., cross-skill requirement rates, metadata balance across SK1/SK2/SK3, or inter-rater agreement on skill purity), leaving open the possibility that observed gaps are construction artifacts rather than evidence of distinct capabilities.
- [Experiments (abstract description)] The abstract states that experiments on ten models 'reveal substantial and uneven capability gaps' and a 'selective advantage' on SK3, yet supplies no quantitative results, dataset statistics, error analysis, or verification details. Without these, it is impossible to assess whether the data support the stated gaps or the composability assumption.
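One of the checks requested in the first comment, inter-rater agreement on skill purity, could be reported with a chance-corrected statistic. The sketch below uses Cohen's kappa as one common choice; the referee report does not name a specific statistic, and the two raters' labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical skill labels assigned by two annotators to eight questions.
rater_1 = ["SK1", "SK1", "SK2", "SK2", "SK3", "SK3", "SK3", "SK1"]
rater_2 = ["SK1", "SK1", "SK2", "SK3", "SK3", "SK3", "SK2", "SK1"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Kappa can be low even when raw agreement is high if one label dominates, so raw agreement and the label distribution are worth reporting alongside it.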
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract should include quantitative support for the construction validity and experimental claims. We have revised the abstract and added explicit validation details to the manuscript to address these points.
- Referee: [SKEvol construction pipeline (abstract description)] The central claim of uneven capability gaps and selective agent advantage on SK3 depends on SKEvol validly isolating SK1-SK3 without unintended correlations or difficulty confounds. The abstract describes a complex multi-phase verification pipeline but provides no quantitative checks (e.g., cross-skill requirement rates, metadata balance across SK1/SK2/SK3, or inter-rater agreement on skill purity), leaving open the possibility that observed gaps are construction artifacts rather than evidence of distinct capabilities.
  Authors: We acknowledge that the original abstract omitted these quantitative validation metrics. The full manuscript (Section 3) describes the multi-phase verification pipeline and includes the requested checks: cross-skill requirement rates, metadata balance statistics across SK1/SK2/SK3, and inter-rater agreement on skill purity. To directly address the concern, we have revised the abstract to summarize these results, confirming low unintended correlations and balanced difficulty, thereby supporting that the observed gaps reflect distinct capabilities rather than construction artifacts.
  Revision: yes
- Referee: [Experiments (abstract description)] The abstract states that experiments on ten models 'reveal substantial and uneven capability gaps' and a 'selective advantage' on SK3, yet supplies no quantitative results, dataset statistics, error analysis, or verification details. Without these, it is impossible to assess whether the data support the stated gaps or the composability assumption.
  Authors: We agree the abstract was insufficiently quantitative. The full manuscript (Section 5) reports the complete experimental results on the ten models, including dataset statistics, per-skill performance breakdowns, error analysis, and verification details supporting the composability assumption. We have revised the abstract to incorporate key quantitative findings on the capability gaps and selective agent advantage on SK3, along with a brief summary of the supporting statistics and analysis.
  Revision: yes
Circularity Check
No circularity detected; benchmark paper with no derivations
The paper introduces TS-Skill and SKEvol as an empirical benchmark construction and evaluation framework. No equations, predictions, or first-principles derivations appear in the abstract or described content. The three skills are posited as composable but the construction pipeline is presented as a methodological choice rather than a self-referential definition or fitted input renamed as prediction. Central claims rest on experimental results across models, not on any reduction to inputs by construction. This is a standard benchmark paper whose claims are self-contained against external model evaluations.