Matthew Willetts, Sven Hollowell, Louis Aslett, Chris Holmes, and Aiden Doherty

URL: https://arxiv · 2025 · arXiv 2506.20093

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · baseline 1

citation-polarity summary

background 1 · baseline 1

representative citing papers

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.

CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

cs.CR · 2026-04-27 · accept · novelty 7.0

CAN-QA creates 33,128 QA pairs from CAN traffic logs in 10 categories to test LLMs, which capture patterns but struggle with temporal reasoning and multi-condition inference.

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

cs.AI · 2026-04-11 · conditional · novelty 7.0

TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning

cs.LG · 2026-02-15 · unverdicted · novelty 7.0

TS-Haystack benchmark shows time-series language models degrade sharply on long contexts while an agentic retrieval system using classifier tools matches or beats them on 9 of 10 tasks.

citing papers explorer

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

cs.AI · 2026-05-09 · unverdicted · none · ref 40

CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

cs.CR · 2026-04-27 · accept · none · ref 21

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

cs.AI · 2026-04-11 · conditional · none · ref 45

TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning

cs.LG · 2026-02-15 · unverdicted · none · ref 8
