TSAQA: time series analysis question and answering benchmark. CoRR, abs/2601.23204

Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Yuchen Yan, Dongqi Fu, Jingchao Ni, Jingrui He, and Hanghang Tong · 2026 · cs.AI · arXiv 2601.23204

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance, even the best-performing open-source model, LLaMA-3.1-8B, shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

cs.CL · 2026-05-23 · unverdicted · novelty 6.0

Presents TS-Skill benchmark and SKEvol construction framework to diagnose three composable analytical skills in time-series QA across LLMs and TSLMs.

Heterogeneous Scientific Foundation Model Collaboration

cs.AI · 2026-04-30 · unverdicted · novelty 5.0

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

citing papers explorer

Showing 2 of 2 citing papers.

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

cs.CL · 2026-05-23 · unverdicted · none · ref 31 · internal anchor

Presents TS-Skill benchmark and SKEvol construction framework to diagnose three composable analytical skills in time-series QA across LLMs and TSLMs.

Heterogeneous Scientific Foundation Model Collaboration

cs.AI · 2026-04-30 · unverdicted · none · ref 123 · internal anchor

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
