pith. machine review for the scientific record.

arxiv: 2605.09481 · v1 · submitted 2026-05-10 · 💻 cs.NI

TSNBench: Benchmarking LLM Proficiency in Time-Sensitive Networking

Daniel Bujosa Mateu, Luxi Zhao, Paul Pop, Rubi Debnath, Sebastian Steinhorst, Silviu S. Craciunas

Pith reviewed 2026-05-12 03:47 UTC · model grok-4.3

classification 💻 cs.NI
keywords LLM benchmarking · Time-Sensitive Networking · Worst-Case Delay · Credit-Based Shaper · Cyclic Queuing and Forwarding · Network Calculus · Safety-critical systems · Multiple-choice evaluation limits

The pith

LLMs that pass multiple-choice tests on time-sensitive networking still make large errors when calculating actual network delays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TSNBench to measure how accurately large language models can handle Time-Sensitive Networking, the suite of IEEE 802.1 standards that guarantee bounded latency in systems such as autonomous vehicles and industrial automation. It pairs 939 expert-checked multiple-choice questions on TSN mechanisms with 100 open-ended tasks that require computing the worst-case delay for flows under Credit-Based Shaper and Cyclic Queuing and Forwarding. Models reach 67 to 95 percent accuracy on the questions, yet their delay calculations carry mean absolute percentage errors from about 36 percent for the best model to more than 80 percent for most models, large enough to break real-time guarantees. The authors conclude that multiple-choice formats alone can overstate model readiness for safety-critical networking work.

Core claim

TSNBench shows that current LLMs achieve 67 to 95 percent accuracy on 939 expert-validated multiple-choice questions covering diverse TSN mechanisms, yet fail substantially on 100 open-ended worst-case delay computation tasks. On Credit-Based Shaper cases the best model reaches a mean absolute percentage error of 36.2 percent and most models exceed 80 percent. On Cyclic Queuing and Forwarding the best model reaches 41.8 percent and most models fall between 80 and 100 percent. These deviations are large relative to TSN latency budgets and can produce unsafe network configurations.

What carries the argument

TSNBench benchmark, which combines expert-validated multiple-choice questions with open-ended worst-case delay tasks whose ground-truth values come from a verified network calculus solver for CBS and closed-form mathematical bounds for CQF.
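For intuition about what a closed-form CQF bound can look like, the sketch below uses the commonly cited textbook result that a frame crossing h CQF hops with cycle time T is delayed by at least (h - 1)·T and at most (h + 1)·T. This is an assumption made for illustration: the paper's exact bound is not reproduced on this page, conventions for counting hops differ between sources, and the function name and parameters are hypothetical.

```python
def cqf_delay_bounds_us(hops: int, cycle_us: float) -> tuple[float, float]:
    """Return (min, max) end-to-end delay in microseconds.

    Assumed textbook CQF bound: (hops - 1) * T <= delay <= (hops + 1) * T.
    This is an illustrative assumption, not the bound used in the paper.
    """
    if hops < 1:
        raise ValueError("hops must be at least 1")
    if cycle_us <= 0:
        raise ValueError("cycle_us must be positive")
    return (hops - 1) * cycle_us, (hops + 1) * cycle_us


if __name__ == "__main__":
    # Hypothetical example: 3 hops with a 100 microsecond cycle.
    low, high = cqf_delay_bounds_us(hops=3, cycle_us=100.0)
    print(f"{low:.0f} us <= delay <= {high:.0f} us")
```

Under this assumed bound the worst case depends only on the hop count and the cycle time, which is why a closed form is enough for CQF while CBS needs a solver.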

If this is right

  • LLMs cannot be relied upon to produce delay bounds accurate enough for TSN network configuration without external verification.
  • High scores on multiple-choice TSN questions do not predict success on the quantitative calculations needed for deterministic networking.
  • Safety-critical domains require benchmarks that test application of knowledge rather than recognition of concepts.
  • Errors of this magnitude can cause real-time constraint violations when LLMs are used to design or validate TSN flows.
  • Closed-form bounds and network calculus solvers remain necessary to check LLM outputs in TSN settings.
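The first and last bullets imply a guard: a model-generated delay bound should not be accepted before it is compared with a trusted one. A minimal sketch of such a check, assuming per-flow predictions and solver-verified bounds are available as dictionaries keyed by flow name. The function name, the tolerance parameter, and the sample numbers are hypothetical and are not taken from the paper.

```python
def audit_predictions(predicted_us, verified_us, rel_tol=0.05):
    """Classify each flow's predicted WCD against a verified bound.

    Returns a dict mapping flow name to one of:
    "missing", "unsafe_underestimate", "pessimistic", "ok".
    The 5% default tolerance is an arbitrary illustrative choice.
    """
    report = {}
    for flow, truth in verified_us.items():
        guess = predicted_us.get(flow)
        if guess is None:
            report[flow] = "missing"
        elif guess < truth:
            # A bound below the verified worst case may hide a deadline miss.
            report[flow] = "unsafe_underestimate"
        elif guess > truth * (1.0 + rel_tol):
            # Safe but wasteful: resources would be over-provisioned.
            report[flow] = "pessimistic"
        else:
            report[flow] = "ok"
    return report


if __name__ == "__main__":
    verified = {"F0": 500.0, "F1": 300.0, "F2": 120.0, "F3": 80.0}
    predicted = {"F0": 510.0, "F1": 240.0, "F2": 400.0}
    for flow, verdict in audit_predictions(predicted, verified).items():
        print(flow, verdict)
```

Treating every underestimate as unsafe is a deliberately strict design choice for this sketch; a real deployment would set its own acceptance rule.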

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar overestimation risks likely appear in other quantitative engineering domains that use multiple-choice tests for certification.
  • Hybrid benchmarks that combine questions with simulation or solver verification could become standard for assessing LLM use in regulated systems.
  • One practical path forward is to couple LLMs with formal analysis tools so that generated configurations are automatically checked against delay bounds.
  • The gap between recognition and calculation performance may shrink only after models receive training data that includes many solved open-ended network examples.

Load-bearing premise

The 100 open-ended worst-case delay tasks and their computed ground-truth values are sufficient and representative of real proficiency in applying TSN mechanisms.

What would settle it

A new evaluation in which the same models achieve mean absolute percentage error below 10 percent on a fresh set of TSN topologies and traffic patterns with verified ground truth would indicate that the overestimation finding does not hold.

Figures

Figures reproduced from arXiv: 2605.09481 by Daniel Bujosa Mateu, Luxi Zhao, Paul Pop, Rubi Debnath, Sebastian Steinhorst, Silviu S. Craciunas.

Figure 1: TSNBench keyword-generation pipeline. TSN keywords are extracted from research...
Figure 2: Pipeline of our TSNBench MCQA dataset generator, showing all steps from raw generation...
Figure 3: Pipeline for TSNBench open-ended question formulation by domain experts. Each question...
Figure 5: Reliability plot for o3 and Grok 4.1 Fast (NR). Full reliability analysis are in...
Figure 6: Reliability diagram representing the performance of all 16 state-of-the-art models evaluated...
Figure 7: Performance comparison across MCQA and open-ended WCD computation for all 16...
Figure 8: A sample TSN network with TSN senders, receivers, and TSN switches in the network.
Figure 9: A sample wireless-TSN network with TSN senders, wireless receivers (such as robotic arm...
Figure 10: A simple CBS mechanism with eight queues in the egress port of the switch with different...
Figure 11: TSN output port with two AVB queues employing CBS and one BE queue.
Figure 12: A simple CQF mechanism with eight queues in the egress port of the switch with two...
Figure 13: The Hypercycle, also known as the scheduling cycle of the CQF (400...
Figure 14: In this figure, we showcase the even and the odd queue in CQF architecture and during...
Figure 15: One-switch topology used to evaluate open-ended questions in TSNBench.
Figure 16: Medium-mesh topology used to evaluate open-ended questions in TSNBench.
Figure 17: Ring topology representing the industrial ring network used to evaluate open-ended...
Figure 18: Performance comparison across MCQA and open-ended WCD computation for all 16...
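
Figures 5 and 6 are reliability diagrams, which compare a model's stated confidence with how often it is actually right. The sketch below shows the usual binning behind such a diagram, together with the expected calibration error (ECE) that is often reported alongside it. It assumes each answer comes with a confidence in [0, 1] and a correct or incorrect flag. How the paper scores correctness for numeric WCD answers is not stated on this page, and the function name, bin count, and sample data are hypothetical.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (bins, ece); each bin is (mean_confidence, accuracy, count).

    Equal-width bins over [0, 1]; empty bins are skipped.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    buckets = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        buckets[index].append((conf, bool(ok)))

    total = len(confidences)
    bins, ece = [], 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        bins.append((mean_conf, accuracy, len(bucket)))
        # ECE weights each bin's confidence/accuracy gap by its share of samples.
        ece += len(bucket) / total * abs(accuracy - mean_conf)
    return bins, ece


if __name__ == "__main__":
    # Hypothetical overconfident model: always says 0.9, right half the time.
    bins, ece = reliability_bins([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
    print(bins, round(ece, 3))
```

A perfectly calibrated model would place every bin on the diagonal (accuracy equal to mean confidence) and have an ECE of zero.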
Original abstract

We present TSNBench, the first benchmark for evaluating large language model (LLM) proficiency in Time-Sensitive Networking (TSN), a suite of IEEE 802.1 standards for deterministic communication with bounded latency in safety-critical domains such as autonomous vehicles, aviation, defense, and industrial automation. While LLMs have been extensively evaluated on general knowledge tasks, their capabilities in safety-critical networking domains remain largely unexplored. TSNBench comprises 939 expert-validated multiple-choice questions (MCQs) covering diverse TSN mechanisms, along with 100 open-ended Worst-Case Delay (WCD) computation tasks for Credit-Based Shaper (CBS) and Cyclic Queuing and Forwarding (CQF) across varying network topologies and traffic conditions. MCQ answers are validated by domain experts, and open-ended ground truth WCD values are computed using a verified Network Calculus (NC) solver for CBS and closed-form mathematical upper bounds for CQF. We evaluate 16 LLMs and find that although models achieve 67 to 95% accuracy on MCQs, they fail substantially on open-ended WCD computation. For CBS, only GPT-5 achieves a Mean Absolute Percentage Error (MAPE) of 36.2%, meaning its predicted WCD deviates by 36.2% of the actual TSN flow delay on average, while most models exceed 80%. For CQF, the best model achieves 41.8% MAPE, with most models clustering between 80% and 100%. Such errors are large relative to TSN latency budgets and can lead to violations of real-time constraints and unsafe configurations. TSNBench demonstrates that MCQ benchmarks may overestimate LLM capabilities in safety-critical networking domains.
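
The MAPE figures quoted in the abstract can be read through the standard definition, MAPE = (100 / n) · Σ |predicted_i − actual_i| / |actual_i|. The sketch below assumes that standard definition; the paper's exact aggregation across flows and tasks is not specified on this page. The function name and sample values are illustrative and are not data from the paper.

```python
def mape(predicted, actual):
    """Mean Absolute Percentage Error, in percent, under the standard definition."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical WCD values in microseconds for three flows.
    actual_us = [100.0, 200.0, 400.0]
    predicted_us = [150.0, 100.0, 400.0]
    print(f"MAPE = {mape(predicted_us, actual_us):.1f}%")
```

Because the error is taken as an absolute value, MAPE does not distinguish overestimates from underestimates, although only underestimates of a worst-case delay directly threaten a deadline.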

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TSNBench, the first benchmark for LLM proficiency in Time-Sensitive Networking (TSN). It consists of 939 expert-validated multiple-choice questions (MCQs) covering TSN mechanisms and 100 open-ended Worst-Case Delay (WCD) computation tasks for Credit-Based Shaper (CBS) and Cyclic Queuing and Forwarding (CQF). Ground truth comes from a verified Network Calculus solver for CBS and from closed-form bounds for CQF. Evaluation of 16 LLMs shows 67-95% accuracy on MCQs but large errors on WCD tasks (best MAPE 36.2% for CBS and 41.8% for CQF, with most models in the 80-100% range), leading to the conclusion that MCQ benchmarks may overestimate LLM capabilities in safety-critical networking domains such as autonomous vehicles and industrial automation.

Significance. If the central results hold, the work provides a valuable cautionary demonstration that high MCQ performance does not imply readiness for quantitative, safety-critical applications of TSN standards. The grounding via expert validation of MCQs and use of a verified NC solver plus closed-form bounds is a clear strength, offering reproducible ground truth independent of the models. This could influence benchmark design for AI in deterministic networking and highlight the need for open-ended, tool-assisted evaluations in domains with strict latency bounds.

major comments (2)
  1. [Benchmark construction and evaluation sections] The 100 open-ended WCD tasks (described in the benchmark construction section): these are narrowly scoped to numerical WCD computation for only CBS (NC solver) and CQF (closed-form bounds) across topologies and traffic conditions. Real TSN proficiency requires iterative configuration, trade-off analysis, and integration of additional mechanisms such as TAS, preemption, and PSFP, none of which are tested. The observed MAPE gap (36-100%) could therefore reflect general arithmetic limitations rather than TSN-specific misunderstanding, weakening the generalization that MCQ scores systematically overestimate domain capabilities.
  2. [Abstract] Abstract and evaluation protocol: the exact prompt templates, question generation process, and full evaluation protocol for the open-ended tasks are left unspecified. This makes it impossible to determine whether the high WCD errors stem from conceptual gaps in TSN or from prompt sensitivity, limiting the load-bearing strength of the claim that MCQs overestimate proficiency.
minor comments (2)
  1. [Abstract] The abstract should briefly note the total number of topologies and traffic patterns used in the 100 WCD tasks to allow readers to assess coverage.
  2. [Results section] Minor notation inconsistency: ensure consistent use of MAPE definition across CBS and CQF results to avoid reader confusion on error scaling relative to TSN latency budgets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on TSNBench. We address each major comment point by point below, with clarifications based on the manuscript content and proposed revisions where the comments identify areas for improvement.

  1. Referee: [Benchmark construction and evaluation sections] The 100 open-ended WCD tasks (described in the benchmark construction section): these are narrowly scoped to numerical WCD computation for only CBS (NC solver) and CQF (closed-form bounds) across topologies and traffic conditions. Real TSN proficiency requires iterative configuration, trade-off analysis, and integration of additional mechanisms such as TAS, preemption, and PSFP, none of which are tested. The observed MAPE gap (36-100%) could therefore reflect general arithmetic limitations rather than TSN-specific misunderstanding, weakening the generalization that MCQ scores systematically overestimate domain capabilities.

    Authors: We agree that the open-ended tasks are scoped to CBS and CQF WCD computation and do not cover iterative configuration or mechanisms such as TAS, preemption, or PSFP. These two mechanisms were selected as they represent foundational and widely deployed TSN shapers with established analytical models (verified NC solver for CBS and closed-form bounds for CQF), allowing reproducible ground truth. The tasks still require models to correctly map TSN-specific concepts (e.g., credit parameters, burst sizes, interference patterns across topologies) into the appropriate formulas, which is distinct from generic arithmetic. Nevertheless, we acknowledge the scope limitation weakens broad generalization claims. We will add an explicit Limitations subsection in the revised manuscript discussing the narrow focus and outlining planned extensions to additional mechanisms. revision: partial
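
    To make the mapping from credit parameters and burst sizes into formulas concrete, here is a minimal single-hop sketch built on standard network-calculus results. It assumes the highest-priority CBS class with no higher-priority traffic, a token-bucket arrival curve α(t) = b + r·t for the flows sharing the queue (burst b = sum of frame sizes, rate r = sum of frame size / period), and a rate-latency service curve β(t) = idleSlope·[t − L_max/C]⁺, where L_max is the largest lower-priority frame and C is the link rate. Under these assumptions the delay bound is L_max/C + b/idleSlope. This is not the verified solver the authors use, and all names and numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    frame_bits: float  # frame size in bits
    period_us: float   # sending period in microseconds


def send_slope_mbps(idle_slope_mbps, link_rate_mbps):
    """CBS sendSlope = idleSlope - link rate (negative when idleSlope < link rate)."""
    return idle_slope_mbps - link_rate_mbps


def class_a_delay_bound_us(flows, idle_slope_mbps, link_rate_mbps, max_lower_frame_bits):
    """Single-hop delay bound in microseconds for the assumed top-priority CBS class.

    Units: 1 Mbit/s equals 1 bit per microsecond, so no conversion is needed.
    """
    if not 0 < idle_slope_mbps <= link_rate_mbps:
        raise ValueError("idleSlope must be positive and at most the link rate")
    burst_bits = sum(f.frame_bits for f in flows)               # b in alpha(t)
    rate_mbps = sum(f.frame_bits / f.period_us for f in flows)  # r in alpha(t)
    if rate_mbps > idle_slope_mbps:
        raise ValueError("arrival rate exceeds idleSlope; no finite bound")
    latency_us = max_lower_frame_bits / link_rate_mbps          # L_max / C
    return latency_us + burst_bits / idle_slope_mbps            # T + b / R


if __name__ == "__main__":
    flows = [Flow(4000.0, 1000.0), Flow(8000.0, 2000.0)]
    print(send_slope_mbps(50.0, 100.0))
    print(class_a_delay_bound_us(flows, 50.0, 100.0, 12000.0))
```

    Even this simplified case needs the right choice of burst, rate, and blocking term before any arithmetic starts, which is the distinction from generic arithmetic that the response draws.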

  2. Referee: [Abstract] Abstract and evaluation protocol: the exact prompt templates, question generation process, and full evaluation protocol for the open-ended tasks are left unspecified. This makes it impossible to determine whether the high WCD errors stem from conceptual gaps in TSN or from prompt sensitivity, limiting the load-bearing strength of the claim that MCQs overestimate proficiency.

    Authors: The manuscript provides the full question generation process, expert validation procedure, prompt templates (zero-shot with task-specific instructions), and evaluation protocol (including model settings and MAPE computation) in Sections 3 and 4. However, the abstract does not summarize these details, which reduces clarity. We will revise the abstract to include a concise description of the evaluation protocol and add explicit cross-references to the relevant sections. We will also release the complete prompt set and evaluation code in a public repository to enable direct assessment of prompt sensitivity. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark and conclusions rest on external ground truth

The paper defines TSNBench tasks independently, with MCQ answers expert-validated and open-ended WCD ground truths obtained from a verified external Network Calculus solver (CBS) and closed-form mathematical bounds (CQF). LLM accuracies and MAPE errors are computed against these fixed external references; no parameters are fitted to the LLM outputs, no self-citations supply load-bearing uniqueness theorems, and the performance-gap conclusion does not reduce by construction to any input definition or prior author result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no free parameters, axioms, or invented entities; it relies on established TSN mechanisms, expert domain knowledge for question validation, and pre-existing network calculus solvers for ground truth.

pith-pipeline@v0.9.0 · 5635 in / 1108 out tokens · 52188 ms · 2026-05-12T03:47:18.038172+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1] doi: 10.1109/TNSM.2022.3180160. Luxi Zhao, Yida Yan, and Xuan Zhou. Minimum Bandwidth Reservation for CBS in TSN With Real-Time QoS Guarantees. IEEE Transactions on Industrial Informatics, 20(4):6187–6198, 2024. doi: 10.1109/TII.2023.3342466. 6 Limitations and Broader Impact 6.1 Limitations While TSNBench fills a significant research gap and proposes a...

  2. [2] Additional scheduling mechanisms: TSNBench currently evaluates CBS and CQF. Future versions should extend to TAS and ATS to cover a broader range of the TSN standard suite

  3. [3] Updated MCQA: Our MCQA dataset was developed using open-source research documents. In future work, we will update the dataset with MCQAs formulated directly from TSN standards

  4. [4] You are an expert Time-Sensitive Networking (TSN) orchestrator

    Fine-tuned and domain-adapted models. TSNBench currently evaluates general-purpose LLMs without any TSN-specific fine-tuning. Future versions should benchmark domain-adapted models trained on TSN standards and network calculus literature. 6.3 Broader Impact TSNBench enables the real-time systems community and the machine learning community to objectivel...

  5. [5] Map each egress port’s queues and collect the set of flows traversing from that port, using the given topology, flows, and route of the flow

  6. [6] For each egress port, use the given IdleSlope and then compute the SendSlope

  7. [7] For each flow, construct an arrival curve from its frame size and periodicity

  8. [8] For each port, derive a lower-bounded CBS service curve

  9. [9] Calculate the worst case delay (WCD) in microseconds (µs) for each flow using Network Calculus method

  10. [10] Provide the confidence score between 0.0 and 1.0 from your answers. 1.0 means mathematically or procedurally provable from given info with zero ambiguity. 0.0 means zero confidence. Table 14: CBS Error Analysis Case 1: Lack of Specific Knowledge. (continued) Grok 4.1 Fast (Non-Reasoning) output: F0: 1452.0, F1: 1124.0, F2: 678.0, F3: 1234.0, F4: 1567.0...

  11. [11] Map each egress port’s queues and collect the set of flows traversing that port, using the given topology, flows, and route of the flow

  12. [12] For the entire network, use the given cycle duration and compute the Hypercycle

  13. [13] For each flow, set the offset or the start time of the flow from the sending node as 0

  14. [14] Calculate the worst case delay (WCD) in microseconds (µs) for each flow

  15. [15] Provide the confidence score between 0.0 and 1.0 from your answers. 1.0 means mathematically or procedurally provable from given info with zero ambiguity. 0.0 means zero confidence. Table 15: CQF Error Analysis Case 1: Lack of Specific Knowledge. (continued) Claude Sonnet’s output: F0: 257.72, F1: 206.8, F2: 105.096, F3: 218.704, F4: 253.904, F5: 104.0...
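
The CQF prompt steps in the list above ask for a Hypercycle. A minimal sketch of that step, assuming the common convention that the hypercycle is the least common multiple of the flow periods and that each period is a whole number of CQF cycles. The paper's exact definition is not reproduced on this page, and the function name and sample periods are hypothetical.

```python
from functools import reduce
from math import gcd


def hypercycle_us(periods_us, cycle_us):
    """Return the hypercycle in microseconds as the LCM of integer flow periods.

    Assumption for this sketch: every period is a positive multiple of the cycle.
    """
    if cycle_us <= 0:
        raise ValueError("cycle_us must be positive")
    if not periods_us:
        raise ValueError("at least one flow period is required")
    for period in periods_us:
        if period <= 0 or period % cycle_us != 0:
            raise ValueError("each period must be a positive multiple of the cycle")
    return reduce(lambda a, b: a * b // gcd(a, b), periods_us)


if __name__ == "__main__":
    # Hypothetical flows with periods of 200, 400 and 1000 us on a 100 us cycle.
    hyper = hypercycle_us([200, 400, 1000], cycle_us=100)
    print(hyper, "us =", hyper // 100, "cycles")
```

Integer microseconds are used on purpose, since the least common multiple is not well defined for floating-point periods.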