pith. the verified trust layer for science.

arxiv: 2509.05215 · v3 · submitted 2025-09-05 · 💻 cs.CL · cs.LG

BEDTime: A Unified Benchmark for Automatically Describing Time Series

Pith reviewed 2026-05-18 18:52 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords time series · benchmark · multimodal models · description generation · structural properties · robustness · vision language models · univariate series

The pith

A benchmark reveals that vision-language models describe structural features of time series better than dedicated time-series models or language-only approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how well current models can recognize, differentiate, and generate descriptions of univariate time series, which it treats as foundational skills before tackling complex reasoning or question-answering. It introduces the BEDTime benchmark, built from five existing datasets reformatted into three modalities, and tests seventeen state-of-the-art models on these tasks plus robustness checks. The evaluation shows dedicated time-series language models underperform, vision-language models handle the tasks more capably, and language-only methods lag furthest behind. Every tested approach degrades sharply under real-world variations such as noise or shifts in scale. The results therefore question prior claims that complex multimodal time-series models have already mastered basic descriptive capabilities.

Core claim

Successful models for time-series and language should first master recognizing structural properties, differentiating between series, and generating accurate textual descriptions of univariate time series. The BEDTime benchmark, which reformats five datasets across text, image, and time-series modalities, demonstrates that dedicated time-series language models fall short on these tasks despite their design, vision-language models perform more strongly, language-only methods perform worst, and all approaches remain fragile under robustness perturbations.

What carries the argument

The BEDTime benchmark, which evaluates three core tasks (recognition, differentiation, and generation of descriptions) on five reformatted univariate time-series datasets presented in three modalities.
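
As a rough illustration of how the two closed-form tasks could be scored, the sketch below assumes that recognition replies are True/False and differentiation replies are a single option letter from A to D, which is what the paper's prompts ask for. The function names, the parsing rules, and the choice to count unparseable replies as wrong are all hypothetical; this is not the authors' evaluation code.

```python
import re

def parse_recognition(reply):
    """Map a free-text reply to True/False, or None if neither word appears."""
    match = re.search(r"\b(true|false)\b", reply.strip().lower())
    if match is None:
        return None
    return match.group(1) == "true"

def parse_differentiation(reply):
    """Return the first standalone option letter A-D, or None if absent."""
    match = re.search(r"\b([A-D])\b", reply.strip())
    return match.group(1) if match else None

def accuracy(predictions, labels):
    """Fraction of exact matches; unparseable replies (None) count as wrong."""
    if not labels:
        raise ValueError("labels must be non-empty")
    correct = sum(p is not None and p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

The generation task is open-ended and would need a separate text-similarity or judge-based metric, which this sketch does not attempt.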

If this is right

  • Prior claims of strong performance on complex time-series reasoning and cross-modal question answering rest on untested foundational skills.
  • Vision-language models offer a stronger starting point than time-series-specific architectures for descriptive tasks.
  • Language-only models require additional mechanisms to capture structural time-series features effectively.
  • Robustness to noise, scale shifts, and other real-world variations must be improved before any approach can support reliable applications.
  • Future model development should include explicit evaluation on recognition, differentiation, and generation before scaling to harder tasks.
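
The robustness point above can be made concrete with a small sketch of the kinds of perturbation involved. Noise and scale shifts are named in the summary, and missingness appears in the caption of Figure 2. The function names, parameters, and the use of `None` to mark a missing value are assumptions for illustration; the paper's exact perturbation definitions are not given here.

```python
import random

def add_gaussian_noise(series, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each value."""
    return [x + rng.gauss(0.0, sigma) for x in series]

def rescale(series, factor, offset=0.0):
    """Apply an affine scale shift: multiply by factor, then add offset."""
    return [factor * x + offset for x in series]

def drop_values(series, missing_rate, rng):
    """Replace each value with None independently with probability missing_rate."""
    return [None if rng.random() < missing_rate else x for x in series]
```

A robustness check would then compare a model's accuracy on the original series with its accuracy on the perturbed copies, using a seeded `random.Random` instance so that runs are repeatable.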

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to multivariate series to test whether the same ordering of model families holds when interactions between channels matter.
  • If vision-language models continue to lead, training pipelines that convert time series into images may become a default route for descriptive and reasoning capabilities.
  • The fragility finding suggests that current training objectives for time-series language models do not sufficiently penalize sensitivity to small distributional changes.
  • A practical next step would be to measure whether models that pass BEDTime also improve on downstream tasks such as anomaly explanation or forecasting justification.

Load-bearing premise

That the three tasks of recognizing, differentiating, and generating descriptions of univariate time series are the foundational skills any model claiming advanced time-series reasoning must have mastered first.

What would settle it

A dedicated time-series language model that scores above 80 percent accuracy on all three tasks across the five datasets while maintaining performance under the paper's robustness perturbations would falsify the claim that such models fall short.
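
One way to read this criterion operationally is sketched below. The 80 percent threshold comes from the statement above; the `max_drop` tolerance is a hypothetical stand-in for "maintaining performance", which the statement does not quantify, and the dictionary layout is likewise an assumption.

```python
def settles_it(clean, perturbed, threshold=0.80, max_drop=0.05):
    """Check the falsification criterion on per-(task, dataset) accuracies.

    clean, perturbed: dicts mapping (task, dataset) -> accuracy in [0, 1].
    max_drop is an assumed tolerance, not a value from the paper.
    """
    for key, acc in clean.items():
        if acc <= threshold:
            return False  # must score strictly above the threshold everywhere
        if key not in perturbed or acc - perturbed[key] > max_drop:
            return False  # missing or degraded result under perturbation
    return True
```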

Figures

Figures reproduced from arXiv: 2509.05215 by C. Bayan Bruss, Jiaxing Qiu, Medhasweta Sen, Nam Nguyen, Tom Hartvigsen, Zachary Gottesman.

Figure 1: Overview of the benchmark for automatic time series description (…
Figure 2: Robustness of LLMs and TSLMs to sequence length, missingness, added Gaussian noise…
Figure 3: Impact of Chain-of-Thought (CoT) prompting on language-only models' accuracy across…
Figure 4: Accuracy of LLMs and VLMs on recognition and differentiation tasks across real-world…
Figure 5: Scaling for different time series lengths
Original abstract

Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question answering. However, they skip foundational evaluations that such complex models should have mastered. So we ask a simple question: How well can recent models describe structural properties of time series? To answer this, we propose that successful models should be able to recognize, differentiate, and generate descriptions of univariate time series. We then create BEDTime, a benchmark to assess these novel tasks, that comprises five datasets reformatted across three modalities. In evaluating 17 state-of-the-art models, we find that (1) surprisingly, dedicated time series-language models fall short, despite being designed for similar tasks, (2) vision language models are quite capable, (3) language only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile to a range of real world robustness tests, indicating directions for future work. Together, our findings critique prior works' claims and provide avenues for advancing multi-modal time series modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BEDTime, a benchmark for assessing models' ability to recognize, differentiate, and generate descriptions of univariate time series. It reformats five existing datasets into three modalities and evaluates 17 state-of-the-art models (including dedicated time series-language models, vision-language models, and language-only methods). The central findings are that time series-language models underperform despite their design, vision-language models are relatively capable, language-only approaches perform worst, and all models show fragility under robustness tests. The work uses these results to critique prior multi-modal models for skipping such foundational evaluations before claiming performance on complex tasks like time series reasoning and cross-modal QA.

Significance. If the benchmark construction and evaluations are robust, the work provides a useful standardized testbed for basic time series description skills that could help diagnose limitations in current multi-modal approaches. The empirical comparison across 17 models and the inclusion of real-world robustness tests add concrete evidence on performance gaps and fragility, which may usefully inform future model development even if the foundational premise requires further support.

major comments (2)
  1. [Introduction and §3] Introduction and §3 (Benchmark Tasks): The central critique of prior works rests on the claim that recognizing, differentiating, and generating descriptions of univariate time series are foundational prerequisite skills for any model asserting time series reasoning or cross-modal QA capabilities. This assumption is load-bearing but receives limited justification; the manuscript does not provide explicit argument, cognitive/ML references, or evidence showing why strong performance on these specific univariate reformatted tasks is necessary before complex reasoning can succeed. If complex models can achieve higher-level tasks without excelling here, the reported performance gaps would not necessarily invalidate prior claims.
  2. [§4] §4 (Dataset Reformatting and Modalities): The description of how the five datasets were reformatted across the three modalities lacks sufficient detail on preprocessing steps, preservation of structural properties, and controls for leakage or bias. This is load-bearing for the validity of the cross-model comparisons and the fragility findings, as reformatting choices directly affect what 'structural properties' are being tested.
minor comments (2)
  1. [§5] §5 (Evaluation Metrics): For the generation task, specify whether LLM-based metrics were used and how circularity was avoided (e.g., via human validation or fixed judges); this would strengthen interpretability of the reported gaps.
  2. [Results tables] Table 2 or equivalent results table: Ensure consistent categorization of the 17 models (TS-LM vs. VLM vs. language-only) with citations and hyperparameter details for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify important areas for clarification and strengthening. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Introduction and §3] Introduction and §3 (Benchmark Tasks): The central critique of prior works rests on the claim that recognizing, differentiating, and generating descriptions of univariate time series are foundational prerequisite skills for any model asserting time series reasoning or cross-modal QA capabilities. This assumption is load-bearing but receives limited justification; the manuscript does not provide explicit argument, cognitive/ML references, or evidence showing why strong performance on these specific univariate reformatted tasks is necessary before complex reasoning can succeed. If complex models can achieve higher-level tasks without excelling here, the reported performance gaps would not necessarily invalidate prior claims.

    Authors: We appreciate the referee's observation that the foundational premise requires stronger support. Our view is that these tasks capture basic perceptual and descriptive competencies that logically precede higher-order reasoning, analogous to how low-level feature extraction supports complex inference in multimodal systems. In revision we will expand the Introduction and §3 with explicit argumentation, drawing on references from cognitive science regarding hierarchical skill development and from ML literature on the necessity of basic multimodal alignment before advanced reasoning. We will also note that even if not strictly prerequisite, poor performance on these tasks still diagnostically highlights limitations in current models' time-series understanding. revision: yes

  2. Referee: [§4] §4 (Dataset Reformatting and Modalities): The description of how the five datasets were reformatted across the three modalities lacks sufficient detail on preprocessing steps, preservation of structural properties, and controls for leakage or bias. This is load-bearing for the validity of the cross-model comparisons and the fragility findings, as reformatting choices directly affect what 'structural properties' are being tested.

    Authors: We agree that greater detail on reformatting is necessary to support the validity of the comparisons and robustness results. In the revised manuscript we will substantially expand §4 to document the full preprocessing pipelines for each dataset and modality. This will include explicit steps for preserving structural properties (e.g., trends, seasonality, anomalies), quantitative checks on property retention, and controls implemented to avoid leakage or systematic bias across reformattings. Where feasible we will add pseudocode or illustrative examples. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation

full rationale

This is a benchmark paper that defines three tasks (recognize, differentiate, generate descriptions of univariate time series), reformats five existing datasets into three modalities, and reports empirical performance of 17 models. No derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on direct model evaluations against the new benchmark rather than any reduction of results to the paper's own inputs by construction. The assumption that these tasks are foundational is stated explicitly but does not create circularity in the evaluation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the three proposed tasks are foundational; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The tasks of recognizing, differentiating, and generating descriptions of univariate time series are foundational skills that complex multi-modal models should have mastered before tackling time series reasoning and cross-modal question answering.
    Explicitly stated in the abstract as the premise for creating the benchmark.

pith-pipeline@v0.9.0 · 5763 in / 1161 out tokens · 36657 ms · 2026-05-18T18:52:02.310780+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics

    cs.AI 2026-04 unverdicted novelty 7.0

    LLaTiSA is a vision-language model trained on a new 83k-sample hierarchical time series reasoning dataset that shows superior performance and out-of-distribution generalization on stratified TSR tasks.
