TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning

arxiv: 2602.14200 · v5 · submitted 2026-02-15 · 💻 cs.LG


Nicolas Zumarraga, Thomas Kaar, Ning Wang, William Tennien, Alpay Hasanli, Max Rosenblattl, Fan Wu, Kevin Riehl, Maxwell A. Xu, Markus Kreft, Kevin O'Sullivan, Elgar Fleisch, Paul Schmiedmayer, Robert Jakob, Patrick Langer


Pith reviewed 2026-05-15 21:39 UTC · model grok-4.3

classification 💻 cs.LG

keywords time-series · long-context · retrieval benchmark · TSLM · agentic retrieval · temporal reasoning · anomaly detection · question answering


The pith

Existing time-series language models lose accuracy as contexts lengthen to a full day, yet an agentic system with classifier tools recovers performance on nine of ten tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TS-Haystack, a benchmark of ten event-grounded question-answering tasks over time-series data that stretches from 100 seconds to 24 hours. It documents clear degradation in current TSLMs: accuracy falls with longer inputs, some models run out of memory on high-rate signals beyond 100 seconds, and tasks that require precise time-interval answers drop toward zero performance. The authors demonstrate that routing retrieval through specialized time-series classifier tools in an agentic setup matches or exceeds state-of-the-art TSLMs on nine of the ten tasks. This pattern indicates that full-context tokenization is not the only viable route for long temporal reasoning.

Core claim

TS-Haystack shows that TSLMs suffer severe long-context degradation: accuracy declines with context length from 100 seconds to 24 hours, direct-tokenization models encounter memory limits beyond 100 seconds on high-rate signals, and time-interval-grounded tasks approach near-zero accuracy; an agentic retrieval framework that employs specialized time-series classifier tools matches or outperforms existing TSLMs on 9 of 10 tasks.

What carries the argument

The TS-Haystack benchmark of ten event-grounded QA tasks over variable-length time series, paired with an agentic retrieval framework that invokes specialized time-series classifier tools to locate relevant segments before reasoning.
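The retrieval-first idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: `locate_events`, `toy_classifier`, the window size, and the labels are all hypothetical, and the threshold classifier merely stands in for a trained time-series classifier tool. A classifier scans fixed windows, contiguous hits are merged into time spans, and only those spans are passed on to the reasoning step.

```python
from typing import Callable, List, Sequence, Tuple

Window = Tuple[float, float]  # (start_s, end_s)


def locate_events(
    signal: Sequence[float],
    rate_hz: float,
    window_s: float,
    classify: Callable[[Sequence[float]], str],
    target: str,
) -> List[Window]:
    """Scan fixed windows and return merged time spans labelled `target`."""
    step = int(window_s * rate_hz)
    if step <= 0:
        raise ValueError("window_s * rate_hz must cover at least one sample")
    spans: List[Window] = []
    for start in range(0, len(signal), step):
        chunk = signal[start:start + step]
        if classify(chunk) != target:
            continue
        t0 = start / rate_hz
        t1 = (start + len(chunk)) / rate_hz
        if spans and spans[-1][1] == t0:
            spans[-1] = (spans[-1][0], t1)  # extend a contiguous span
        else:
            spans.append((t0, t1))
    return spans


def toy_classifier(chunk: Sequence[float]) -> str:
    """Hypothetical stand-in for a trained classifier: amplitude threshold."""
    mean_abs = sum(abs(x) for x in chunk) / len(chunk)
    return "active" if mean_abs > 0.5 else "rest"


# Toy 60-sample signal at 1 Hz with two "active" bouts.
signal = [0.0] * 20 + [1.0] * 10 + [0.0] * 20 + [1.0] * 5 + [0.0] * 5
spans = locate_events(signal, rate_hz=1.0, window_s=5.0,
                      classify=toy_classifier, target="active")
print(spans)       # [(20.0, 30.0), (50.0, 55.0)]
print(len(spans))  # a counting-style question reads off the span count
```

Under this design the downstream reasoner only ever sees a short list of spans, so its input size depends on the number of detected events rather than on the length of the recording.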

If this is right

Direct tokenization of long time series will remain impractical for many applications due to memory exhaustion and accuracy collapse.
Retrieval-first agentic designs provide a practical path to scale temporal reasoning without loading entire histories.
Tasks that hinge on exact time intervals or anomaly localization are the most sensitive to context growth.
Benchmarking must expand beyond short contexts to expose these limits in future TSLM development.
Tool use focused on time-series classification can substitute for end-to-end long-context modeling in several domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Continuous monitoring systems in medicine or infrastructure may shift toward retrieval agents rather than monolithic models.
Training regimes that interleave retrieval and reasoning steps could reduce the need for ever-larger context windows.
The same length-based degradation may surface in other sequential data types, pointing to a broader architectural pattern.
Adding tasks with irregular sampling rates or missing values would test whether the current performance gap persists.

Load-bearing premise

That the ten tasks capture the essential difficulties of real-world long time-series reasoning, and that the observed accuracy drops are caused primarily by context length rather than by task construction or data traits.
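One way to probe this premise is to report event statistics for each context length. The sketch below is illustrative only: the `density_stats` helper and the toy event lists are assumptions, not taken from the benchmark. It computes events per hour and the fraction of the context covered by events.

```python
from typing import Dict, List, Tuple

Event = Tuple[float, float]  # (start_s, end_s)


def density_stats(events: List[Event], context_s: float) -> Dict[str, float]:
    """Events per hour and fraction of the context occupied by events."""
    covered = sum(end - start for start, end in events)
    return {
        "events_per_hour": len(events) / (context_s / 3600.0),
        "event_time_fraction": covered / context_s,
    }


# Toy samples (made up): the same two events in a 100 s and a 1 h context.
samples = {
    100.0: [(10.0, 14.0), (60.0, 63.0)],
    3600.0: [(120.0, 124.0), (2000.0, 2003.0)],
}
report = {length: density_stats(ev, length) for length, ev in samples.items()}
for length, stats in report.items():
    print(length, stats)
```

Tabulating both numbers per context length shows directly whether event density was held constant or changed along with length.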

What would settle it

A direct TSLM that maintains high accuracy on the same tasks when context lengths are doubled, or an agentic system that loses its advantage on a fresh collection of long time-series tasks with comparable event density.

Figures

Figures reproduced from arXiv: 2602.14200 by Alpay Hasanli, Elgar Fleisch, Fan Wu, Kevin O'Sullivan, Kevin Riehl, Markus Kreft, Max Rosenblattl, Maxwell A. Xu, Nicolas Zumarraga, Ning Wang, Patrick Langer, Paul Schmiedmayer, Robert Jakob, Thomas Kaar, William Tennien.

**Figure 1.** TS-Haystack construction pipeline. Left: A 1 hour window is sampled from a participant, activity needles ranging 1s – 6s are sampled at random. Right: Question-Answer pairs and chain-of-thought rationales are generated from plots and metadata-enriched prompt templates.

**Figure 2.** Scaling behaviors. (a) Capture24 classification: Macro-F1 and Balanced Accuracy increase with context length. (b) TS-Haystack: Retrieval accuracy degrades with context length

**Figure 3.** Task 1: Existence. The model must detect whether a specific activity is present among…

**Figure 4.** Task 2 (Localization). The model must identify and temporally ground a specific activity…

**Figure 5.** Task 3 (Counting). The model must enumerate all occurrences of a specific activity…

**Figure 6.** Task 4 (Ordering). The model must determine the temporal relationship between two…

**Figure 7.** Task 5 (State Query). The model must identify the global activity regime surrounding a…

**Figure 8.** Task 6 (Antecedent). The model must identify the activity immediately preceding a target…

**Figure 9.** Task 7 (Comparison). The model must compare durations across multiple activity bouts or…

**Figure 10.** Task 8 (Multi-Hop). The model must locate an anchor, then count to the…

**Figure 11.** Task 9 (Anomaly Detection). The model must identify cross-regime activity without being…

**Figure 12.** Task 10 (Anomaly Localization). The model must detect and temporally ground cross…

Original abstract

Time Series Language Models (TSLMs) promise reasoning over real-world temporal data, but their ability to retrieve and reason over long time-series remains largely untested. We introduce TS-Haystack, a multi-domain retrieval benchmark with ten event-grounded question-answering tasks over contexts from 100 seconds to 24 hours, spanning direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly detection. Existing TSLMs exhibit severe long-context degradation: accuracy declines with context length, direct-tokenization models run out of memory beyond 100 seconds on high-rate signals, and time-interval-grounded tasks collapse toward near-zero accuracy when increasing the time-series lengths, aligning with existing literature on text and multi-modal long context retrieval. An agentic retrieval framework using specialized time-series classifier tools matches or outperforms SoTA TSLMs on 9 of 10 tasks, highlighting agentic retrieval as a promising approach for long-context TSLMs.
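To give a sense of the scale gap between the shortest and longest contexts, the arithmetic can be sketched as below. The sampling rates are illustrative assumptions, since the text here only says "high-rate signals"; the 864x ratio between 24 hours and 100 seconds holds regardless of rate.

```python
def n_samples(duration_s: float, rate_hz: float) -> int:
    """Number of raw samples in a recording of the given duration."""
    return int(round(duration_s * rate_hz))


SHORT_S = 100            # shortest context in the benchmark, in seconds
LONG_S = 24 * 60 * 60    # longest context: 24 hours = 86,400 seconds
ratio = LONG_S / SHORT_S  # 864.0, independent of sampling rate

# Illustrative rates only; they are not taken from the paper.
for rate_hz in (1.0, 50.0, 100.0):
    short_n = n_samples(SHORT_S, rate_hz)
    long_n = n_samples(LONG_S, rate_hz)
    print(f"{rate_hz:>5.0f} Hz: {short_n:>9,} samples vs {long_n:>11,} samples")

print(f"length ratio: {ratio:.0f}x")
```

Any model that tokenizes the raw signal directly has to absorb this growth in input size, which is consistent with the reported out-of-memory failures on high-rate signals.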

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TS-Haystack gives a new multi-task benchmark for long time-series contexts and shows agentic retrieval holding up better than direct models, but the length-degradation story still needs controls that isolate context from task difficulty.


The paper's core offering is TS-Haystack, a set of ten event-grounded QA tasks that stretch from 100 seconds to 24 hours and cover direct retrieval, temporal reasoning, multi-step inference, and anomaly detection. It reports that existing TSLMs lose accuracy as length grows, hit memory walls on high-rate signals, and collapse on interval-based questions, while an agentic system that calls specialized time-series classifiers matches or beats them on nine of the ten tasks. That comparison is the practical takeaway worth checking first.

Referee Report

2 major / 2 minor

Summary. The paper introduces TS-Haystack, a multi-domain retrieval benchmark with ten event-grounded QA tasks over time-series contexts ranging from 100 seconds to 24 hours. It evaluates existing TSLMs on direct retrieval, temporal reasoning, multi-step reasoning, and anomaly detection, reporting severe accuracy degradation with increasing context length, out-of-memory failures for direct-tokenization models beyond 100 seconds on high-rate signals, and near-zero performance on time-interval tasks at longer scales. The authors further present an agentic retrieval framework using specialized time-series classifier tools that matches or outperforms SoTA TSLMs on 9 of 10 tasks.

Significance. If the performance drops can be rigorously attributed to context length rather than task scaling artifacts, the benchmark would usefully document limitations of current TSLMs and motivate agentic retrieval as a practical alternative for long-context time-series reasoning. The work supplies a concrete, multi-task evaluation suite that could serve as a standard testbed, though its impact depends on the strength of the length-isolation controls.

major comments (2)

[Benchmark construction and experimental setup] The central attribution of accuracy decline to context length (abstract and experimental results) lacks evidence of controls that hold event density, anomaly statistics, or question difficulty fixed while varying context length from 100 s to 24 h. Without such isolation, the observed collapse on time-interval tasks could arise from intrinsic increases in task hardness rather than model limitations, directly undermining the claim that degradation is length-driven.
[Results section] Reported results supply no error bars, statistical significance tests, or data-exclusion rules, leaving the directional claims only partially verifiable and preventing assessment of whether differences between TSLMs and the agentic framework are reliable.

minor comments (2)

[§3] Full task definitions, exact question templates, and precise rules for context sampling across lengths should be moved to the main text or a clearly referenced appendix for reproducibility.
[Notation and task definitions] Notation for time intervals and event grounding is introduced without a consolidated table; a single reference table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the work.


Referee: [Benchmark construction and experimental setup] The central attribution of accuracy decline to context length (abstract and experimental results) lacks evidence of controls that hold event density, anomaly statistics, or question difficulty fixed while varying context length from 100 s to 24 h. Without such isolation, the observed collapse on time-interval tasks could arise from intrinsic increases in task hardness rather than model limitations, directly undermining the claim that degradation is length-driven.

Authors: We agree that rigorous isolation of context length from potential changes in task difficulty is essential. The benchmark construction samples a fixed number of events from the same underlying distributions and pads with background segments drawn from the same source distribution, which was intended to keep event density and anomaly statistics constant; however, we did not explicitly quantify or report difficulty metrics (e.g., question complexity or event-rate statistics) across lengths in the original submission. In the revised manuscript we will add a dedicated subsection detailing these construction choices, report event-density and difficulty statistics for every context length, and include controlled experiments on a subset of tasks in which longer contexts are subsampled to match the event statistics of shorter ones. These additions will allow readers to assess whether the observed degradation is length-driven. revision: yes
Referee: [Results section] Reported results supply no error bars, statistical significance tests, or data-exclusion rules, leaving the directional claims only partially verifiable and preventing assessment of whether differences between TSLMs and the agentic framework are reliable.

Authors: We acknowledge that the original results section omitted error bars, statistical tests, and explicit data-exclusion criteria. In the revised version we will recompute all reported accuracies with standard error bars over multiple random seeds or runs, add paired statistical significance tests (e.g., t-tests or Wilcoxon tests) for all model comparisons, and include a clear statement of data-exclusion rules (e.g., filtering criteria for invalid time-series segments or malformed questions). These changes will make the reliability of the directional claims directly verifiable. revision: yes
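The reliability checks promised in the second response could look roughly like the sketch below. It uses only the standard library, and the per-seed accuracies are made-up placeholders. In place of the t-test or Wilcoxon test named in the rebuttal, it uses an exact paired sign-flip permutation test, a nonparametric relative of those tests; that substitution is an assumption of this sketch.

```python
import itertools
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across runs."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), sem


def paired_sign_flip_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided p-value for a zero mean paired difference."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every sign assignment; feasible for a handful of seeds.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Placeholder per-seed accuracies (not results from the paper).
agent = [0.80, 0.82, 0.79, 0.81, 0.83]
tslm = [0.60, 0.63, 0.58, 0.61, 0.62]

agent_mean, agent_sem = mean_and_sem(agent)
p_value = paired_sign_flip_p(agent, tslm)
print(f"agent: {agent_mean:.3f} +/- {agent_sem:.3f}")
print(f"paired sign-flip p = {p_value:.4f}")
```

With five paired runs the smallest attainable two-sided p-value is 2/32 = 0.0625, so clearing a 0.05 threshold with this test requires at least six seeds.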

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivations or fitted parameters


The paper introduces the TS-Haystack benchmark and reports direct empirical measurements of TSLM performance across context lengths on ten event-grounded tasks. No mathematical derivations, equations, parameter fittings, or self-citation chains are present that reduce any claim to the paper's own inputs by construction. Accuracy degradation and agentic framework comparisons are presented as observed results from the benchmark rather than quantities defined in terms of prior fitted values or self-referential premises. The evaluation is self-contained against external model baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work is an empirical benchmark introduction without explicit derivations or new postulated constructs.



Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 4 internal anchors

[1]

Yifu Cai, Arjun Choudhry, Xing Hu, and Artur Dubrawski

doi: 10.1145/2499621. Yifu Cai, Arjun Choudhry, Xing Hu, and Artur Dubrawski. Timeseriesexam: A time series understanding exam. arXiv preprint arXiv:2410.14752,

work page doi:10.1145/2499621
[2]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

URL https://arxiv.org/abs/2310.01728. Greg Kamradt. Needle in a haystack - pressure testing LLMs. https://github.com/gkamradt/LLMTest_NeedleInAHaystack,

work page internal anchor Pith review arXiv
[5]

Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A

Patrick Langer et al. OpenTSLM: Time-series language models for reasoning over multivariate medical text- and time-series data. arXiv preprint arXiv:2510.02410,

work page arXiv
[6]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models

Published as a workshop paper at ICLR 2026. Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Namber, Tanuja Yu, and Hao Liu. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models. arXiv preprint arXiv:2406.11230,

work page arXiv 2026
[8]

Matthew Willetts, Sven Hollowell, Louis Aslett, Chris Holmes, and Aiden Doherty

URL https://arxiv.org/abs/2506.20093. Matthew Willetts, Sven Hollowell, Louis Aslett, Chris Holmes, and Aiden Doherty. Statistical machine learning of sleep and physical activity phenotypes from sensor data in 96,220 UK Biobank participants. Scientific Reports, 8(1):7961,

work page arXiv
[9]

A Classification Experiment Details. A.1 Experimental Setup. Model architecture. We use the Flamingo variant of OpenTSLM (Langer et al.,

doi: 10.1038/s41598-018-26174-1. A Classification Experiment Details. A.1 Experimental Setup. Model architecture. We use the Flamingo variant of OpenTSLM (Langer et al.,

work page doi:10.1038/s41598-018-26174-1
[10]

Random” indicates uniform subsampling to the budget cap; “full

into train (100), validation (25) and test (26) splits at random. We adopt a curriculum learning approach, training sequentially from shorter to longer contexts. For each context length, we use a maximum sample budget: 80,000 training, 15,000 validation, and 15,000 test samples (see Tables 3 and 4 for the relationship between available and sampled data). ...

work page 2026
[11]

Their benchmark evaluates five tasks: forecasting, classification, anomaly detection, and imputation, with emphasis on limited supervision settings

releases the first open-source foundation models for general-purpose time series analysis, pre-trained on the Time Series Pile, a large multi-domain collection of public time series. Their benchmark evaluates five tasks: forecasting, classification, anomaly detection, and imputation, with emphasis on limited supervision settings. 9 Published as a workshop...

work page 2026
[12]

Their TSQA dataset comprises approximately 200k question-answer pairs across diverse domains

is more closely aligned with our objective, unifying forecasting, imputation, anomaly detection, classification, and open-ended reasoning under a question-answering framework. Their TSQA dataset comprises approximately 200k question-answer pairs across diverse domains. However, Time-MQA does not specifically probe long-context retrieval, the ability to loc...

work page 2023
[13]

02:34:56:789 AM

adapts the paradigm to multimodal retrieval, stress-testing models’ ability to locate target sub-images within large visual contexts. To our knowledge, TS-Haystack is the first benchmark to apply the needle-in-a-haystack methodology to continuous time series data, where the challenge is compounded by the absence of discrete token boundaries and the need ...

work page 2024
[14]

These annotations support training and evaluation of reasoning capabilities beyond direct answer extraction

Chain-of-thought annotations. For each sample, we generate chain-of-thought (CoT) rationales using ChatGPT 4.1 mini (OpenAI, 2023), conditioned on image plots of the accelerometer data along with rich metadata including activity timelines, bout boundaries, and signal statistics. These annotations support training and evaluation of reasoning capabilities be...

work page 2023
[15]

Answer:<your answer>

Results. The classifier achieves an AUC of 0.499 and 0.490 respectively for different context lengths, close to random chance (0.50). This confirms that our mean-shift normalization combined with cosine blending produces insertions that are statistically undetectable, even to a gradient-boosted ensemble operating on raw signal features. The protocol succes...

work page 2026
