arxiv: 2509.05215 · v3 · submitted 2025-09-05 · 💻 cs.CL · cs.LG

BEDTime: A Unified Benchmark for Automatically Describing Time Series

Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen

Pith reviewed 2026-05-18 18:52 UTC · model grok-4.3

keywords: time series, benchmark, multimodal models, description generation, structural properties, robustness, vision language models, univariate series

The pith

A benchmark reveals that vision-language models describe structural features of time series better than dedicated time-series models or language-only approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how well current models can recognize, differentiate, and generate descriptions of univariate time series, which it treats as foundational skills before tackling complex reasoning or question-answering. It introduces the BEDTime benchmark, built from five existing datasets reformatted into three modalities, and tests seventeen state-of-the-art models on these tasks plus robustness checks. The evaluation shows dedicated time-series language models underperform, vision-language models handle the tasks more capably, and language-only methods lag furthest behind. Every tested approach degrades sharply under real-world variations such as noise or shifts in scale. The results therefore question prior claims that complex multimodal time-series models have already mastered basic descriptive capabilities.

Core claim

Successful models for time-series and language should first master recognizing structural properties, differentiating between series, and generating accurate textual descriptions of univariate time series. The BEDTime benchmark, which reformats five datasets across text, image, and time-series modalities, demonstrates that dedicated time-series language models fall short on these tasks despite their design, vision-language models perform more strongly, language-only methods perform worst, and all approaches remain fragile under robustness perturbations.

What carries the argument

The BEDTime benchmark, which evaluates three core tasks (recognition, differentiation, and generation of descriptions) on five reformatted univariate time-series datasets presented in three modalities.
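As a rough illustration of what "three modalities" can mean for one univariate series, the sketch below renders a series as raw values, as a text string, and as a coarse binary raster standing in for a plotted image. The function names, the two-decimal text format, and the raster size are assumptions made for illustration; this summary does not describe the paper's actual reformatting pipeline.

```python
from typing import Dict, List


def to_text(series: List[float], decimals: int = 2) -> str:
    """Text modality: comma-separated values at fixed precision (assumed format)."""
    return ", ".join(f"{x:.{decimals}f}" for x in series)


def to_raster(series: List[float], height: int = 8) -> List[List[int]]:
    """Image modality stand-in: a height x len(series) grid, one lit pixel per step."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    grid = [[0] * len(series) for _ in range(height)]
    for col, x in enumerate(series):
        row = round((x - lo) / span * (height - 1))
        grid[height - 1 - row][col] = 1  # row 0 is the top of the image
    return grid


def to_modalities(series: List[float]) -> Dict[str, object]:
    """Bundle the same series in all three (hypothetical) presentations."""
    return {
        "time_series": list(series),
        "text": to_text(series),
        "image": to_raster(series),
    }
```

A real pipeline would more likely render a line plot with a plotting library; the raster here only keeps the example dependency-free.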

If this is right

Prior claims of strong performance on complex time-series reasoning and cross-modal question answering rest on untested foundational skills.
Vision-language models offer a stronger starting point than time-series-specific architectures for descriptive tasks.
Language-only models require additional mechanisms to capture structural time-series features effectively.
Robustness to noise, scale shifts, and other real-world variations must be improved before any approach can support reliable applications.
Future model development should include explicit evaluation on recognition, differentiation, and generation before scaling to harder tasks.
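The kinds of variation mentioned above (noise, scale shifts, and, per the Figure 2 caption, sequence length and missingness) can be sketched as simple transformations of a series. The function names, parameters, and seeding below are illustrative assumptions, not the paper's robustness protocol.

```python
import random
from typing import List, Optional


def add_gaussian_noise(series: List[float], sigma: float, seed: int = 0) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation sigma to every point."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


def rescale(series: List[float], factor: float, offset: float = 0.0) -> List[float]:
    """Apply an affine change of scale; the shape of the series is unchanged."""
    return [factor * x + offset for x in series]


def drop_values(series: List[float], rate: float, seed: int = 0) -> List[Optional[float]]:
    """Replace each point with None independently with probability `rate`."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else x for x in series]


def truncate(series: List[float], length: int) -> List[float]:
    """Keep only the first `length` points to vary sequence length."""
    return series[:length]
```

A model whose description of `rescale(s, 10.0)` differs from its description of `s` in anything other than magnitude would count as fragile in the sense discussed here.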

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The benchmark could be extended to multivariate series to test whether the same ordering of model families holds when interactions between channels matter.
If vision-language models continue to lead, training pipelines that convert time series into images may become a default route for descriptive and reasoning capabilities.
The fragility finding suggests that current training objectives for time-series language models do not sufficiently penalize sensitivity to small distributional changes.
A practical next step would be to measure whether models that pass BEDTime also improve on downstream tasks such as anomaly explanation or forecasting justification.

Load-bearing premise

That the three tasks of recognizing, differentiating, and generating descriptions of univariate time series are the foundational skills any model claiming advanced time-series reasoning must have mastered first.

What would settle it

A dedicated time-series language model that scores above 80 percent accuracy on all three tasks across the five datasets while maintaining performance under the paper's robustness perturbations would falsify the claim that such models fall short.
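The criterion above can be written as a small check over a table of scores. This is a sketch under stated assumptions: the `(dataset, task)` keying, the `max_drop` tolerance used to interpret "maintaining performance", and the function name are all hypothetical, since the statement does not define how much degradation is acceptable.

```python
from typing import Dict, Tuple

# (dataset name, task name) -> accuracy in [0, 1]
Scores = Dict[Tuple[str, str], float]


def settles_it(
    clean: Scores,
    perturbed: Scores,
    threshold: float = 0.80,
    max_drop: float = 0.05,  # assumed tolerance; not specified in the text
) -> bool:
    """True if every cell is strictly above the threshold and holds up when perturbed."""
    for key, acc in clean.items():
        if acc <= threshold:
            return False  # falls short on at least one dataset/task pair
        if acc - perturbed.get(key, 0.0) > max_drop:
            return False  # degrades too much (or was never tested) under perturbation
    return bool(clean)  # an empty score table settles nothing
```

The caller is responsible for supplying all fifteen cells (five datasets by three tasks); the function only checks the cells it is given.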

Figures

Figures reproduced from arXiv: 2509.05215 by C. Bayan Bruss, Jiaxing Qiu, Medhasweta Sen, Nam Nguyen, Tom Hartvigsen, Zachary Gottesman.

**Figure 2.** Robustness of LLMs and TSLMs to sequence length, missingness, added Gaussian noise

**Figure 3.** Impact of Chain-of-Thought (CoT) prompting on language-only models’ accuracy across

**Figure 4.** Accuracy of LLMs and VLMs on recognition and differentiation tasks across real-world

**Figure 5.** Scaling for different time series lengths

Original abstract

Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question answering. However, they skip foundational evaluations that such complex models should have mastered. So we ask a simple question: How well can recent models describe structural properties of time series? To answer this, we propose that successful models should be able to recognize, differentiate, and generate descriptions of univariate time series. We then create BEDTime, a benchmark to assess these novel tasks, that comprises five datasets reformatted across three modalities. In evaluating 17 state-of-the-art models, we find that (1) surprisingly, dedicated time series-language models fall short, despite being designed for similar tasks, (2) vision language models are quite capable, (3) language only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile to a range of real world robustness tests, indicating directions for future work. Together, our findings critique prior works' claims and provide avenues for advancing multi-modal time series modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BEDTime gives a practical benchmark for basic time series description and shows clear performance gaps, but the claim that these three tasks are foundational feels narrow.

The main thing to know is that this paper builds a benchmark called BEDTime to test whether models can recognize, differentiate, and generate descriptions of univariate time series, then runs 17 models across five reformatted datasets in three modalities. The results show dedicated time series-language models underperform, vision-language models do better than expected, language-only approaches lag, and everything looks fragile under robustness checks. That pattern is the concrete takeaway.

What is new here is the unified three-task framing and the cross-modality setup; prior work on time series reasoning has not focused on these exact description skills in one place. The evaluation itself is straightforward and reports consistent differences, which is useful for anyone trying to ground claims about multi-modal time series models. The paper earns credit for reusing existing datasets rather than inventing new ones and for including robustness tests that highlight real fragility.

The soft spot is the premise that these three tasks must be mastered first by any model claiming time series reasoning or cross-modal QA. The tasks are limited to univariate series and rely on reformatted data, so it is possible that stronger performance on complex downstream tasks could occur without top scores here. If the robustness tests turn out to be sensitive to how the data was prepared, the fragility conclusion would need more support.

This work is for researchers who build or evaluate time series-language models and want a simple diagnostic before claiming advanced reasoning. A reader focused on benchmarking or evaluation gaps will get the most out of it. The empirical results are clear enough that it deserves a serious referee, even if the foundational argument needs tightening. I would send it to review with requests for more detail on the reformatting process and the exact robustness protocol.

Referee Report

2 major / 2 minor

Summary. The paper introduces BEDTime, a benchmark for assessing models' ability to recognize, differentiate, and generate descriptions of univariate time series. It reformats five existing datasets into three modalities and evaluates 17 state-of-the-art models (including dedicated time series-language models, vision-language models, and language-only methods). The central findings are that time series-language models underperform despite their design, vision-language models are relatively capable, language-only approaches perform worst, and all models show fragility under robustness tests. The work uses these results to critique prior multi-modal models for skipping such foundational evaluations before claiming performance on complex tasks like time series reasoning and cross-modal QA.

Significance. If the benchmark construction and evaluations are robust, the work provides a useful standardized testbed for basic time series description skills that could help diagnose limitations in current multi-modal approaches. The empirical comparison across 17 models and the inclusion of real-world robustness tests add concrete evidence on performance gaps and fragility, which may usefully inform future model development even if the foundational premise requires further support.

major comments (2)

[Introduction and §3 (Benchmark Tasks)] The central critique of prior works rests on the claim that recognizing, differentiating, and generating descriptions of univariate time series are foundational prerequisite skills for any model asserting time series reasoning or cross-modal QA capabilities. This assumption is load-bearing but receives limited justification; the manuscript does not provide explicit argument, cognitive/ML references, or evidence showing why strong performance on these specific univariate reformatted tasks is necessary before complex reasoning can succeed. If complex models can achieve higher-level tasks without excelling here, the reported performance gaps would not necessarily invalidate prior claims.
[§4 (Dataset Reformatting and Modalities)] The description of how the five datasets were reformatted across the three modalities lacks sufficient detail on preprocessing steps, preservation of structural properties, and controls for leakage or bias. This is load-bearing for the validity of the cross-model comparisons and the fragility findings, as reformatting choices directly affect what 'structural properties' are being tested.

minor comments (2)

[§5 (Evaluation Metrics)] For the generation task, specify whether LLM-based metrics were used and how circularity was avoided (e.g., via human validation or fixed judges); this would strengthen interpretability of the reported gaps.
[Results tables] Table 2 or equivalent results table: Ensure consistent categorization of the 17 models (TS-LM vs. VLM vs. language-only) with citations and hyperparameter details for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify important areas for clarification and strengthening. We address each major comment below and will incorporate revisions to improve the manuscript.

Referee: [Introduction and §3 (Benchmark Tasks)] The central critique of prior works rests on the claim that recognizing, differentiating, and generating descriptions of univariate time series are foundational prerequisite skills for any model asserting time series reasoning or cross-modal QA capabilities. This assumption is load-bearing but receives limited justification; the manuscript does not provide explicit argument, cognitive/ML references, or evidence showing why strong performance on these specific univariate reformatted tasks is necessary before complex reasoning can succeed. If complex models can achieve higher-level tasks without excelling here, the reported performance gaps would not necessarily invalidate prior claims.

Authors: We appreciate the referee's observation that the foundational premise requires stronger support. Our view is that these tasks capture basic perceptual and descriptive competencies that logically precede higher-order reasoning, analogous to how low-level feature extraction supports complex inference in multimodal systems. In revision we will expand the Introduction and §3 with explicit argumentation, drawing on references from cognitive science regarding hierarchical skill development and from ML literature on the necessity of basic multimodal alignment before advanced reasoning. We will also note that even if not strictly prerequisite, poor performance on these tasks still diagnostically highlights limitations in current models' time-series understanding. Revision: yes.
Referee: [§4 (Dataset Reformatting and Modalities)] The description of how the five datasets were reformatted across the three modalities lacks sufficient detail on preprocessing steps, preservation of structural properties, and controls for leakage or bias. This is load-bearing for the validity of the cross-model comparisons and the fragility findings, as reformatting choices directly affect what 'structural properties' are being tested.

Authors: We agree that greater detail on reformatting is necessary to support the validity of the comparisons and robustness results. In the revised manuscript we will substantially expand §4 to document the full preprocessing pipelines for each dataset and modality. This will include explicit steps for preserving structural properties (e.g., trends, seasonality, anomalies), quantitative checks on property retention, and controls implemented to avoid leakage or systematic bias across reformattings. Where feasible we will add pseudocode or illustrative examples. Revision: yes.
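One way a "quantitative check on property retention" could look, for the trend property only, is to compare the least-squares slope of a series before and after it is written to text and parsed back. This is a hypothetical sketch: the text format, the tolerance `tol`, and all function names are assumptions, and nothing here is taken from the paper's pipeline.

```python
from typing import List


def slope(series: List[float]) -> float:
    """Ordinary least-squares slope of the series against its index (needs >= 2 points)."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def text_round_trip(series: List[float], decimals: int = 2) -> List[float]:
    """Serialize to a comma-separated string at fixed precision, then parse it back."""
    text = ", ".join(f"{x:.{decimals}f}" for x in series)
    return [float(tok) for tok in text.split(", ")]


def trend_retained(series: List[float], decimals: int = 2, tol: float = 0.01) -> bool:
    """True if the slope changes by at most `tol` after the text round trip."""
    recovered = text_round_trip(series, decimals)
    return abs(slope(series) - slope(recovered)) <= tol
```

Analogous checks for seasonality or anomalies would need their own statistics; the slope comparison only covers linear trend.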

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation

This is a benchmark paper that defines three tasks (recognize, differentiate, generate descriptions of univariate time series), reformats five existing datasets into three modalities, and reports empirical performance of 17 models. No derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on direct model evaluations against the new benchmark rather than any reduction of results to the paper's own inputs by construction. The assumption that these tasks are foundational is stated explicitly but does not create circularity in the evaluation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the three proposed tasks are foundational; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The tasks of recognizing, differentiating, and generating descriptions of univariate time series are foundational skills that complex multi-modal models should have mastered before tackling time series reasoning and cross-modal question answering.
Explicitly stated in the abstract as the premise for creating the benchmark.

pith-pipeline@v0.9.0 · 5763 in / 1161 out tokens · 36657 ms · 2026-05-18T18:52:02.310780+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics
cs.AI 2026-04 unverdicted novelty 7.0

LLaTiSA is a vision-language model trained on a new 83k-sample hierarchical time series reasoning dataset that shows superior performance and out-of-distribution generalization on stratified TSR tasks.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Multi-label clinical time- series generation via conditional gan.IEEE Transactions on Knowledge and Data Engineering, 36(4):1728–1740, 2023

Chang Lu, Chandan K Reddy, Ping Wang, Dong Nie, and Yue Ning. Multi-label clinical time- series generation via conditional gan.IEEE Transactions on Knowledge and Data Engineering, 36(4):1728–1740, 2023

work page 2023
[2]

Healthcare utilization after respiratory tuberculosis: a controlled interrupted time series analysis.Clinical Infectious Diseases, 77(6):883–891, 2023

Kamila Romanowski, Michael R Law, Mohammad Ehsanul Karim, Jonathon R Campbell, Md Belal Hossain, Mark Gilbert, Victoria J Cook, and James C Johnston. Healthcare utilization after respiratory tuberculosis: a controlled interrupted time series analysis.Clinical Infectious Diseases, 77(6):883–891, 2023

work page 2023
[3]

treat-all

H Manisha Yapa, Hae-Young Kim, Kathy Petoumenos, Frank A Post, Awachana Jiamsakul, Jan-Walter De Neve, Frank Tanser, Collins Iwuji, Kathy Baisley, Maryam Shahmanesh, et al. Cd4+ t-cell count at antiretroviral therapy initiation in the “treat-all” era in rural south africa: an interrupted time series analysis.Clinical Infectious Diseases, 74(8):1350–1359, 2022

work page 2022
[4]

Recent historic increase of infant mortality in france: A time-series analysis, 2001 to 2019.The Lancet Regional Health–Europe, 16, 2022

Nhung TH Trinh, Sophie de Visme, Jeremie F Cohen, Tim Bruckner, Nathalie Lelong, Pauline Adnot, Jean-Christophe Rozé, Béatrice Blondel, François Goffinet, Grégoire Rey, et al. Recent historic increase of infant mortality in france: A time-series analysis, 2001 to 2019.The Lancet Regional Health–Europe, 16, 2022

work page 2001
[5]

Applying time series analyses on continuous accelerometry data—a clinical example in older adults with and without cognitive impairment.Plos one, 16(5):e0251544, 2021

Torsten Rackoll, Konrad Neumann, Sven Passmann, Ulrike Grittner, Nadine Külzow, Julia Ladenbauer, and Agnes Flöel. Applying time series analyses on continuous accelerometry data—a clinical example in older adults with and without cognitive impairment.Plos one, 16(5):e0251544, 2021

work page 2021
[6]

Application of evolutionary computation for rule discovery in stock algorithmic trading: A literature review

Yong Hu, Kang Liu, Xiangzhou Zhang, Lijun Su, EWT Ngai, and Mei Liu. Application of evolutionary computation for rule discovery in stock algorithmic trading: A literature review. Applied Soft Computing, 36:534–551, 2020

work page 2020
[7]

Robustness of support vector machines in algorithmic trading on cryptocurrency market.Central European Economic Journal, 5(52):186–205, 2018

Robert ´Slepaczuk and Maryna Zenkova. Robustness of support vector machines in algorithmic trading on cryptocurrency market.Central European Economic Journal, 5(52):186–205, 2018

work page 2018
[8]

A comparative study of bitcoin price prediction using deep learning.Mathematics, 7(10):898, 2019

Suhwan Ji, Jongmin Kim, and Hyeonseung Im. A comparative study of bitcoin price prediction using deep learning.Mathematics, 7(10):898, 2019

work page 2019
[9]

Algorithmic financial trading with deep convolutional neural networks: Time series to image conversion approach.Applied Soft Computing, 70:525–538, 2018

Omer Berat Sezer and Ahmet Murat Ozbayoglu. Algorithmic financial trading with deep convolutional neural networks: Time series to image conversion approach.Applied Soft Computing, 70:525–538, 2018

work page 2018
[10]

Profitability of technical stock trading: Has it moved from daily to intraday data?Review of Financial Economics, 18(4):190–201, 2019

Stephan Schulmeister. Profitability of technical stock trading: Has it moved from daily to intraday data?Review of Financial Economics, 18(4):190–201, 2019

work page 2019
[11]

Deep learning for multivariate time series imputation: A survey.arXiv preprint arXiv:2402.04059, 2024

Jun Wang, Wenjie Du, Yiyuan Yang, Linglong Qian, Wei Cao, Keli Zhang, Wenjia Wang, Yuxuan Liang, and Qingsong Wen. Deep learning for multivariate time series imputation: A survey.arXiv preprint arXiv:2402.04059, 2024

work page arXiv 2024
[12]

Are language models actually useful for time series forecasting?Advances in Neural Information Processing Systems, 37:60162–60191, 2024

Mingtian Tan, Mike Merrill, Vinayak Gupta, Tim Althoff, and Tom Hartvigsen. Are language models actually useful for time series forecasting?Advances in Neural Information Processing Systems, 37:60162–60191, 2024

work page 2024
[13]

Jiang, W

Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, and Haifeng Chen. Explainable multi-modal time series prediction with llm-in-the-loop.arXiv preprint arXiv:2503.01013, 2025

work page arXiv 2025
[14]

From news to forecast: Integrating event analysis in llm-based time series forecasting with reflection.Advances in Neural Information Processing Systems, 37:58118–58153, 2024

Xinlei Wang, Maike Feng, Jing Qiu, Jinjin Gu, and Junhua Zhao. From news to forecast: Integrating event analysis in llm-based time series forecasting with reflection.Advances in Neural Information Processing Systems, 37:58118–58153, 2024. 10

work page 2024
[15]

Language models still struggle to zero-shot reason about time series

Mike A Merrill, Mingtian Tan, Vinayak Gupta, Thomas Hartvigsen, and Tim Althoff. Language models still struggle to zero-shot reason about time series. InFindings of EMNLP, 2024

work page 2024
[16]

Hallgrímsson, Maxwell A

Winnie Chow, Lauren Gardiner, Haraldur T. Hallgrímsson, Maxwell A. Xu, and Shirley You Ren. Towards time series reasoning with llms, 2024

work page 2024
[17]

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models.arXiv preprint arXiv:2310.01728, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Temporal data meets llm–explainable financial time series forecasting.arXiv preprint arXiv:2306.11025, 2023

Xinli Yu, Zheng Chen, Yuan Ling, Shujing Dong, Zongyi Liu, and Yanbin Lu. Temporal data meets llm–explainable financial time series forecasting.arXiv preprint arXiv:2306.11025, 2023

work page arXiv 2023
[19]

Chattime: A unified multimodal time series foundation model bridging numerical and textual data.AAAI Conference on Artificial Intelligence, 2025

Chengsen Wang, Qi Qi, Jingyu Wang, Haifeng Sun, Zirui Zhuang, Jinming Wu, Lei Zhang, and Jianxin Liao. Chattime: A unified multimodal time series foundation model bridging numerical and textual data.AAAI Conference on Artificial Intelligence, 2025

work page 2025
[20]

Chatts: Aligning time series with llms via synthetic data for enhanced understanding and reasoning.VLDB, 2025

Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang, Jianjun Chen, Rui Shi, and Dan Pei. Chatts: Aligning time series with llms via synthetic data for enhanced understanding and reasoning.VLDB, 2025

work page 2025
[21]

Merrill, Zack Gottesman, Tim Althoff, David Evans, and Tom Hartvigsen

Mingtian Tan, Mike A. Merrill, Zack Gottesman, Tim Althoff, David Evans, and Tom Hartvigsen. Inferring events from time series using language models, 2025

work page 2025
[22]

Promptcast: A new prompt-based learning paradigm for time series forecasting.IEEE Transactions on Knowledge and Data Engineering, 36(11):6851–6864, 2023

Hao Xue and Flora D Salim. Promptcast: A new prompt-based learning paradigm for time series forecasting.IEEE Transactions on Knowledge and Data Engineering, 36(11):6851–6864, 2023

work page 2023
[23]

Timeseriesexam: A time series understanding exam, 2024

Yifu Cai, Arjun Choudhry, Mononito Goswami, and Artur Dubrawski. Timeseriesexam: A time series understanding exam, 2024

work page 2024
[24]

Evaluating large language models on time series feature understanding: A comprehensive taxonomy and benchmark.arXiv preprint arXiv:2404.16563, 2024

Elizabeth Fons, Rachneet Kaur, Soham Palande, Zhen Zeng, Tucker Balch, Manuela Veloso, and Svitlana Vyetrenko. Evaluating large language models on time series feature understanding: A comprehensive taxonomy and benchmark.arXiv preprint arXiv:2404.16563, 2024

work page arXiv 2024
[25]

Time series language model for descriptive caption generation, 2025

Mohamed Trabelsi, Aidan Boyd, Jin Cao, and Huseyin Uzunalioglu. Time series language model for descriptive caption generation, 2025

work page 2025
[26]

Clasp: Learning concepts for time-series signals from natural language supervision, 2025

Aoi Ito, Kota Dohi, and Yohei Kawaguchi. Clasp: Learning concepts for time-series signals from natural language supervision, 2025

work page 2025
[27]

Can brain signals reveal inner alignment with human languages?, 2024

William Han, Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Douglas Weber, Bo Li, and Ding Zhao. Can brain signals reveal inner alignment with human languages?, 2024

work page 2024
[28]

New linguistic description approach for time series and its application to bed restlessness monitoring for eldercare.IEEE Transactions on Fuzzy Systems, PP:1–1, 01 2021

Carmen Martínez-Cruz, Antonio Rueda, Mihail Popescu, and James Keller. New linguistic description approach for time series and its application to bed restlessness monitoring for eldercare.IEEE Transactions on Fuzzy Systems, PP:1–1, 01 2021

work page 2021
[29]

Monwatch: A fuzzy application to monitor the user behavior using wearable trackers

José María Serrano Chica. Monwatch: A fuzzy application to monitor the user behavior using wearable trackers. In2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, 2020

work page 2020
[30]

A first approach to the generation of linguistic summaries from glucose sensors using gpt-4

Carmen Martínez-Cruz, Juan Gaitán-Guerrero, José Luis López Ruiz, Antonio Rueda, and Macarena Espinilla. A first approach to the generation of linguistic summaries from glucose sensors using gpt-4. InA First Approach to the Generation of Linguistic Summaries from Glucose Sensors Using GPT-4, pages 33–43, 11 2023

work page 2023
[31]

Law, Yvonne Freer, Jim Hunter, Robert H

Andrew S. Law, Yvonne Freer, Jim Hunter, Robert H. Logie, Neil McIntosh, and John Quinn. A comparison of graphical and textual presentations of time series data to support medical decision making in the neonatal intensive care unit.Journal of Clinical Monitoring and Computing, 19(3):183–194, June 2005

work page 2005
[32]

Contextual analysis of financial time series.Mathematics, 13(1):57, 2025

Nadezhda Yarushkina, Aleksey Filippov, and Anton Romanov. Contextual analysis of financial time series.Mathematics, 13(1):57, 2025. 11

work page 2025
[33]

Truth-conditional captioning of time series data

Harsh Jhamtani and Taylor Berg-Kirkpatrick. Truth-conditional captioning of time series data. InEMNLP, 2021

work page 2021
[34]

SUSHI: A Dataset of Synthetic Unichannel Signals Based on Heuristic Implementation (Tiny), September 2024

Yohei Kawaguchi, Kota Dohi, and Aoi Ito. SUSHI: A Dataset of Synthetic Unichannel Signals Based on Heuristic Implementation (Tiny), September 2024

work page 2024
[35]

Ecg-qa: A comprehensive question answering dataset combined with electrocardiogram, 2023

Jungwoo Oh, Gyubok Lee, Seongsu Bae, Joon myoung Kwon, and Edward Choi. Ecg-qa: A comprehensive question answering dataset combined with electrocardiogram, 2023

work page 2023
[36]

Kaplan, Alun D

Tianwei Xing, Luis Garcia, Federico Cerutti, Lance M. Kaplan, Alun D. Preece, and Mani B. Srivastava. Deepsqa: Understanding sensor data via question answering. InIoTDI, pages 106–118. ACM, 2021

work page 2021
[37]

Pixiu: A large language model, instruction data and evaluation benchmark for finance, 2023

Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. Pixiu: A large language model, instruction data and evaluation benchmark for finance, 2023

work page 2023
[38]

Large language models are zero-shot time series forecasters

Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew G Wilson. Large language models are zero-shot time series forecasters. InNeurIPS, 2023

work page 2023
[39]

Large language models for time series: A survey,

Xiyuan Zhang, Ranak Roy Chowdhury, Rajesh K. Gupta, and Jingbo Shang. Large language models for time series: A survey.ArXiv, abs/2402.01801, 2024

work page arXiv 2024
[40]

Empowering time series analysis with large language models: A survey

Yushan Jiang, Zijie Pan, Xikun Zhang, Sahil Garg, Anderson Schneider, Yuriy Nevmyvaka, and Dongjin Song. Empowering time series analysis with large language models: A survey. In International Joint Conference on Artificial Intelligence, 2024

work page 2024
[41]

Learning transferable visual models from natural language supervision, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021

work page 2021
[42]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

work page 2023
[43]

Zhong, W

Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, and Yuxuan Liang. Time-vlm: Exploring multimodal vision-language models for augmented time series forecasting.arXiv preprint arXiv:2502.04395, 2025

work page arXiv 2025
[44]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

work page 2024
[45]

Gemini 2.0 flash

Google. Gemini 2.0 flash. https://cloud.google.com/vertex-ai/docs/ generative-ai/models/gemini-2-flash, 2025. Accessed: May 15, 2025

work page 2025
[46]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

work page 2024
[47]

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...

[48]

An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

[49]

Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

[50]

Carmen Martínez-Cruz, Juan Gaitán-Guerrero, José Luis López Ruiz, Antonio Rueda, and Macarena Espinilla. A first approach to the generation of linguistic summaries from glucose sensors using GPT-4. In Proceedings of the 15th International Conference on Ubiquitous Computing & Ambient Intelligence (UCAmI 2023), volume 842 of Lecture Notes in Networks and Syst...

[51]

Damien Sileo. tasksource: A large collection of NLP tasks with a structured dataset preprocessing framework. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING...

[52]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

[53]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

Appendix A Dataset Description

[Figure: example time series with crowd-written annotations; x-axis: Time Step, y-axis: Value. Panel ID: AD_240, annotation text: "mostly lower than starting position it remains mostly flat in the middle. stays steady in the middle" ...]

• Review the annotation: {description}
• Analyze the time series: {series}
• Determine if the annotation precisely matches the pattern depicted in the time series. Respond with True if the annotation accurately describes the time series. Respond with False if it does not. Avoid providing any additional comments or explanations.

Task 2: Differentiation. Carefully analyze the given time series and choose the single best option that most...

• Read all options before deciding.
• Only output the chosen option, highlighted as A, B, C, or D.
• Avoid adding extra text or explanations.

Time series: {series}
Options:
• A: {option_1}
• B: {option_2}
• C: {option_3}
• D: {option_4}

Task 3: Open Generation. You are tasked with generating a textual description of the visual properties of the provided time series. Please follow these instructions carefully:

• Analyze the given time series data: {series}
• Identify and describe the most prominent visual features or patterns observed in the time series. Consider characteristics such as trends, seasonality, anomalies, or significant changes. Your response should be a concise textual description of the most pronounced visual properties of the time series. Avoid including unnecessary details or unrelated commen...

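As an illustration of how the Task 2 (Differentiation) prompt layout above might be filled in and scored, here is a minimal Python sketch. It is not the authors' code: the names `DIFFERENTIATION_TEMPLATE`, `format_series`, `build_differentiation_prompt`, and `parse_choice` are hypothetical, and serializing the series as comma-separated numbers with two decimals is an assumption covering only a text-style input.

```python
from typing import Optional, Sequence

# Hypothetical template that mirrors the Task 2 layout shown in the prompt text.
DIFFERENTIATION_TEMPLATE = (
    "Time series: {series}\n"
    "Options:\n"
    "• A: {option_1}\n"
    "• B: {option_2}\n"
    "• C: {option_3}\n"
    "• D: {option_4}"
)


def format_series(values: Sequence[float], decimals: int = 2) -> str:
    """Serialize a univariate series as comma-separated text (assumed encoding)."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)


def build_differentiation_prompt(values: Sequence[float], options: Sequence[str]) -> str:
    """Fill the four-option template; reject any other number of options."""
    if len(options) != 4:
        raise ValueError("expected exactly four candidate descriptions")
    return DIFFERENTIATION_TEMPLATE.format(
        series=format_series(values),
        option_1=options[0],
        option_2=options[1],
        option_3=options[2],
        option_4=options[3],
    )


def parse_choice(response: str) -> Optional[str]:
    """Return 'A'..'D' when the reply is a single option letter, else None."""
    cleaned = response.strip().strip(".:()").upper()
    return cleaned if cleaned in {"A", "B", "C", "D"} else None


if __name__ == "__main__":
    prompt = build_differentiation_prompt(
        [1.0, 2.5, 3.0],
        ["rises steadily", "falls steadily", "stays flat", "oscillates"],
    )
    print(prompt)
    print(parse_choice(" b. "))
```

The parser is deliberately strict: a reply that contains more than a single letter yields `None`, which matches the instruction to output only the chosen option and lets malformed replies be counted separately from wrong answers.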