Matthew Willetts, Sven Hollowell, Louis Aslett, Chris Holmes, and Aiden Doherty
4 Pith papers cite this work.
Citing papers by year: 2026 (4)
Citing papers
- DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
  DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
- CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic
  CAN-QA creates 33,128 QA pairs from CAN traffic logs in 10 categories to test LLMs, which capture patterns but struggle with temporal reasoning and multi-condition inference.
- TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale
  TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
- TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning
  TS-Haystack benchmark shows time-series language models degrade sharply on long contexts while an agentic retrieval system using classifier tools matches or beats them on 9 of 10 tasks.