TSAQA: Time Series Analysis Question And Answering Benchmark

Baoyu Jing; Boyu Liu; Dongqi Fu; Hanghang Tong; Jiaru Zou; Jingchao Ni; Jingrui He; Lecheng Zheng; Ruizhong Qiu; Sanhorn Chen

arxiv: 2601.23204 · v2 · pith:K5NUTXHOnew · submitted 2026-01-30 · 💻 cs.AI

Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Yuchen Yan, Dongqi Fu, Jingchao Ni, Jingrui He, Hanghang Tong

classification 💻 cs.AI

keywords analysis, series, time, diverse, tasks, temporal, tsaqa, across

Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance, even the best-performing open-source model, LLaMA-3.1-8B, shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Recognition to Understanding: Unlocking Cognitive Time Series Reasoning with LLMs
cs.CL 2026-06 unverdicted novelty 7.0

Introduces the TSCognition benchmark for cognitive time series reasoning tasks and the TSAlign alignment framework, reporting outperformance over LLM, VLM, and time-series baselines on TSCognition and TimerBed with lo...

Harnessing Generalist Agents for Contextualized Time Series
cs.AI 2026-06 unverdicted novelty 6.0

TimeClaw is a framework that augments LLM agents with temporal tools, capability evolution, and episodic memory to enable contextualized time series reasoning, with reported gains on benchmarks across energy, finance,...

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering
cs.CL 2026-05 unverdicted novelty 6.0

Presents TS-Skill benchmark and SKEvol construction framework to diagnose three composable analytical skills in time-series QA across LLMs and TSLMs.

Heterogeneous Scientific Foundation Model Collaboration
cs.AI 2026-04 unverdicted novelty 5.0

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.