TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

Ali Maatouk; Andreas Varvarigos; Austin Feng; Daniela Fernandez; Ioannis Panitsas; Jialin Chen; Jinbiao Wei; Leandros Tassiulas; Rex Ying; Yuwei Guo

arXiv: 2510.06063 · v3 · pith:KHIPXZVFnew · submitted 2025-10-07 · cs.AI · cs.IT · cs.LG · math.IT

Pith reviewed 2026-05-21 20:19 UTC · model grok-4.3

classification cs.AI · cs.IT · cs.LG · math.IT

keywords observability data · time series · 5G network · anomaly detection · root cause analysis · multi-modal models · foundation models · benchmark dataset

The pith

A new 5G observability dataset shows that current time series and multi-modal models struggle with abrupt, noisy, high-variance dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TelecomTS, a large-scale dataset from a real 5G telecommunications network that supplies heterogeneous time series metrics with preserved absolute scale information rather than normalized or anonymized values. These data exhibit zero-inflation, high stochasticity, and minimal temporal structure, unlike smoother domains such as climate or finance. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal models on tasks including anomaly detection, root cause analysis, and multi-modal question-answering demonstrates that existing approaches have difficulty with the abrupt changes and noise typical of system monitoring. The work matters because observability data underpins enterprise system reliability, and the findings point toward the need for models that can directly use raw scale information in practical applications.
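
To make "zero-inflation" and "minimal temporal structure" concrete, the following is a minimal sketch of two simple diagnostics: the share of exactly-zero samples and the lag-1 autocorrelation. The function names and the toy series are illustrative assumptions, not TelecomTS data or the authors' code.

```python
from statistics import fmean


def zero_fraction(series):
    """Share of samples that are exactly zero (a simple zero-inflation indicator)."""
    return sum(1 for x in series if x == 0) / len(series)


def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation; values near or below 0 suggest little smooth structure."""
    mean = fmean(series)
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0  # constant series: report 0 by convention
    numer = sum(a * b for a, b in zip(centered, centered[1:]))
    return numer / denom


# Hypothetical toy series (NOT taken from TelecomTS):
# a bursty, mostly-zero metric and a smooth ramp.
bursty = [0, 0, 0, 1800, 0, 0, 950, 0, 0, 0, 2100, 0]
smooth = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

print("zero fraction:", zero_fraction(bursty), zero_fraction(smooth))
print("lag-1 autocorrelation:", lag1_autocorrelation(bursty), lag1_autocorrelation(smooth))
```

On these toy inputs the bursty series has a high zero fraction and a lower lag-1 autocorrelation than the ramp, which is the qualitative contrast the summary describes.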

Core claim

TelecomTS is a heterogeneous, de-anonymized observability dataset from a 5G network that retains explicit absolute scale in covariates and supports downstream tasks such as anomaly detection, root cause analysis, and multi-modal question-answering; evaluations show that existing time series, language, reasoning, and multi-modal foundation models struggle with its abrupt, noisy, and high-variance dynamics, underscoring the importance of preserving and natively leveraging scale information.

What carries the argument

The TelecomTS dataset supplying raw-scale 5G network metrics and associated multi-modal tasks that expose model limitations on stochastic observability data.

If this is right

Foundation time series models should be designed to accept and use absolute scale information in covariates rather than assuming normalized inputs.
Approaches trained primarily on low-variance domains will likely underperform on high-stochasticity monitoring data without adaptation.
Multi-modal models can now be directly compared on root cause analysis and question-answering using paired time series and textual descriptions from the same system.
Public benchmarks for observability applications must retain raw scale values to remain representative of production environments.
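
The point about retaining raw scale can be illustrated with a small sketch: per-series z-normalization maps two series that differ only in magnitude onto the same values, so a model that sees only normalized input cannot tell them apart. The series and the `z_normalize` helper below are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pstdev


def z_normalize(series):
    """Standard per-series z-normalization (zero mean, unit population std)."""
    mean = fmean(series)
    std = pstdev(series)
    if std == 0:
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]


# Hypothetical buffer-occupancy readings in bytes (NOT from TelecomTS):
# the same shape at two very different absolute levels.
light_load = [10, 20, 10, 30, 10]
heavy_load = [1000, 2000, 1000, 3000, 1000]

norm_light = z_normalize(light_load)
norm_heavy = z_normalize(heavy_load)

# After normalization the two series are indistinguishable.
same_after_norm = all(abs(a - b) < 1e-9 for a, b in zip(norm_light, norm_heavy))
print("identical after normalization:", same_after_norm)
print("raw peaks differ:", max(light_load), "vs", max(heavy_load))
```

Whether a peak of 30 bytes or 3000 bytes counts as anomalous depends on the raw value, which is exactly the information the normalized view discards.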

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Comparable datasets from cloud or IoT monitoring could reveal whether the observed model limitations generalize beyond telecommunications.
Pretraining objectives that explicitly model zero-inflation and abrupt shifts might yield more robust observability-specific models.
Routine preservation of absolute scale during data collection could become standard practice for time series applications in other noisy domains.

Load-bearing premise

The dataset drawn from one 5G network is representative of general enterprise observability data and the selected tasks reflect authentic real-world challenges without selection bias.

What would settle it

A replication showing that current models reach high accuracy on TelecomTS tasks after standard fine-tuning, or that the same performance gap does not appear on other observability datasets, would challenge the central claim.

Figures

Figures reproduced from arXiv: 2510.06063 by Ali Maatouk, Andreas Varvarigos, Austin Feng, Daniela Fernandez, Ioannis Panitsas, Jialin Chen, Jinbiao Wei, Leandros Tassiulas, Rex Ying, Yuwei Guo.

**Figure 1.** An overview of TelecomTS, illustrating its data curation pipeline, covariate characteristics, and the range of supported multi-modal downstream tasks. (1) the lack of publicly available datasets due to the proprietary nature of observability data, (2) anonymization in the few existing datasets, which obscures both the identity of the metrics and vital information such as their absolute scale; and (3) the l…

**Figure 2.** Overview of the 5G wireless network used for data collection: (a) mobile devices used…

**Figure 3.** An overview of the anomalies curation process.

**Figure 4.** An illustrative difference between UCR Archive Anomaly dataset and the anomalies found…

**Figure 5.** An overview of the Q&A dataset. Finally, for an additional multimodal downstream task, we curate a set of Q&A pairs designed to probe the model’s understanding of the time series data. Two families of Q&A are created: the first focuses on qualitative and quantitative aspects to assess a model’s ability to reason about inherent statistical and structural properties of the time series. Particularly, for …

**Figure 6.** Illustration of a failure case that affected all…

**Figure 7.** Forecasting results of the highest-performing model (Informer) highlight key challenges: …

**Figure 8.** Randomly sampled variates from the ETTh1 dataset. Next, we observe the MotorImagery dataset that collects EEG data of imagined body movements using an 8 × 8 platinum electrode grid. Each of the 64 sensors corresponds to a variate, and data is recorded every millisecond. While the variates shown in…

**Figure 9.** Randomly sampled variates from the MotorImagery dataset.

**Figure 10.** Randomly sampled variates from the FRED-MD dataset.

**Figure 11.** Randomly sampled variates from the WTH dataset. In particular, DewPointFarenheit and DryBulbCelsius exhibit strong daily fluctuations, …

**Figure 12.** Randomly sampled sequence from TelecomTS. Panels (x-axis: Time): (a) UL_MCS, (b) Estimated_UL_Buffer, (c) PRBs_DL_Current, (d) DL_BLER, (e) DL_MCS, (f) DL_Protocol (categories: none, UDP, TCP).

**Figure 13.** Randomly sampled sequence from TelecomTS. Additionally, the gNB connects to the 5G core network instance over standard N2/N3 interfaces through a separate 10 Gbps Ethernet backhaul link, enabling full end-to-end standalone operation. A visual overview of the network deployment is provided in…

**Figure 14.** Spatial partitioning of the environment into 3 zones. Network Zoning for Controlled Experiments. To systematically capture KPI variations under diverse radio conditions, the network was deployed in a controlled lab environment covering approximately 70 m². The space was partitioned into three spatial zones (Zone A, Zone B, and Zone C) based on the distance between the UE and the RU. This zoning strategy…

**Figure 15.** Spectrograms illustrating benign and adversarial interference patterns during collection.

**Figure 16.** Number of packets (top) and number of transmitted bytes (bottom) before (a) and after (b)…

**Figure 17.** Examples of anomaly effects under varying function types.

Original abstract

Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets remain underrepresented in public benchmarks due to proprietary restrictions and privacy concerns. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit absolute scale information and provides a diverse suite of downstream tasks, including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal foundation models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics characteristic of observability data. Our experiments further underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical real-world observability applications. The code is available at: https://github.com/Ali-maatouk/TelecomTS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TelecomTS, a large-scale multi-modal observability dataset derived from a single 5G telecommunications network. It provides heterogeneous, de-anonymized covariates retaining absolute scale information and defines downstream tasks including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking of state-of-the-art time series, language, reasoning, and multi-modal foundation models is reported to show that existing approaches struggle with the abrupt, noisy, and high-variance dynamics of observability data, with additional emphasis on the importance of preserving absolute scale.

Significance. If the empirical findings hold, the work is significant for releasing a public dataset that fills a gap in observability benchmarks, which are typically anonymized or normalized and thus limited for tasks requiring scale and noise modeling. The explicit availability of code at the cited GitHub repository supports reproducibility. The focus on scale information could usefully guide future foundation model development for real-world monitoring applications, though the single-network origin constrains broader generalization.

major comments (2)

[Experiments section] The benchmarking claims that models struggle with observability dynamics are presented without sufficient detail on model variants, exact evaluation metrics, hyperparameter choices, or statistical significance tests; this prevents verification of the performance gaps and their attribution to abrupt/noisy characteristics rather than implementation choices.
[Dataset and Tasks sections] The dataset is collected from one 5G deployment; without additional analysis or cross-validation showing that zero-inflation, covariate scales, and variance patterns are representative of general enterprise observability (rather than telecom-specific artifacts), the claim that SOTA models struggle with characteristic observability dynamics rests on an untested proxy assumption.

minor comments (2)

[Abstract] The summary of benchmarking results does not name the specific models or tasks evaluated, reducing standalone clarity.
[Figures] Time-series example plots would benefit from explicit scale annotations and legends to illustrate the absolute-scale preservation emphasized in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and describe the changes we will make to strengthen the manuscript.

Referee: [Experiments section] The benchmarking claims that models struggle with observability dynamics are presented without sufficient detail on model variants, exact evaluation metrics, hyperparameter choices, or statistical significance tests; this prevents verification of the performance gaps and their attribution to abrupt/noisy characteristics rather than implementation choices.

Authors: We agree that the current level of detail is insufficient for independent verification. In the revised manuscript we will expand the Experiments section with (i) an exhaustive table of all model variants including architecture, parameter count, and fine-tuning procedure, (ii) precise definitions and formulas for every evaluation metric, (iii) the full hyperparameter search space and final selected values, and (iv) results of statistical significance tests (bootstrap confidence intervals and paired Wilcoxon tests) that quantify the performance gaps. These additions will allow readers to attribute differences more confidently to data characteristics.

Revision: yes

Referee: [Dataset and Tasks sections] The dataset is collected from one 5G deployment; without additional analysis or cross-validation showing that zero-inflation, covariate scales, and variance patterns are representative of general enterprise observability (rather than telecom-specific artifacts), the claim that SOTA models struggle with characteristic observability dynamics rests on an untested proxy assumption.

Authors: We acknowledge that TelecomTS originates from a single network and that explicit cross-network validation is not feasible with the data we have access to. In the revision we will add a dedicated Limitations subsection that (a) qualifies the generalization claim, (b) cites domain literature indicating that zero-inflation, absolute-scale heterogeneity, and abrupt variance are common across enterprise observability platforms, and (c) positions TelecomTS as an initial public benchmark rather than a definitive universal proxy. We will also soften the language in the abstract and introduction to reflect this scope.

Revision: partial
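
The first response in the simulated rebuttal mentions bootstrap confidence intervals for the performance gaps. As a rough sketch of what such a check could look like, the code below computes a percentile bootstrap interval for the mean paired difference between two models' scores. The function name, the resample count, and the per-window scores are all hypothetical; nothing here reflects results or code from the paper.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-window F1 scores for two models (illustrative only).
model_a = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.73]
model_b = [0.55, 0.54, 0.65, 0.61, 0.57, 0.60, 0.59, 0.66]

low, high = paired_bootstrap_ci(model_a, model_b)
print(f"95% CI for mean difference: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would suggest the gap is unlikely to be a resampling artifact; with only a handful of evaluation windows, such an interval should be read cautiously.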

Circularity Check

0 steps flagged

Dataset release and empirical benchmarking exhibit no circularity

The paper's core contribution is the release of TelecomTS, a new 5G-derived observability dataset, together with downstream tasks (anomaly detection, root cause analysis, multi-modal QA) and benchmarking of existing foundation models. No derivation chain, equations, or fitted parameters are claimed; the reported model struggles are direct empirical observations on the released data rather than reductions to prior fits or self-citations. The work is self-contained against external benchmarks because the dataset and tasks are newly introduced and the evaluation uses standard public models without load-bearing self-referential premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes a new empirical dataset rather than relying on fitted parameters or new theoretical entities; the main premises are domain descriptions of observability data.

axioms (1)

domain assumption: Observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure.
Stated directly in the abstract as distinguishing characteristics of the data domain.

pith-pipeline@v0.9.0 · 5807 in / 1242 out tokens · 41177 ms · 2026-05-21T20:19:42.474983+00:00

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G
cs.AI · 2026-05 · unverdicted · novelty 4.0

The paper envisions AI-native 6G networks anchored by a foundation model and multi-agent systems to shift network management to a unified multi-modal optimization problem.