Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Chang Xu; Ming Jin; Shirui Pan; Shiyu Wang; Xiping Liu; Yiji Zhao; Yuxuan Liang; Zhao Tan

arXiv: 2602.17001 · v3 · pith:FRUDRZYFnew · submitted 2026-02-19 · 💻 cs.AI · cs.CL · cs.DB

Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang, Xiping Liu, Shirui Pan, Ming Jin

Pith reviewed 2026-05-21 12:07 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.DB

keywords natural language querying · time series databases · neuro-symbolic methods · search then verify · NLQTSBench · temporal pattern retrieval

The pith

Sonar-TS retrieves time series events by first searching candidates with SQL then verifying them with generated Python programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sonar-TS to let non-expert users pose natural language questions about events, intervals, and summaries in large time series databases. Existing text-to-SQL approaches are not designed for continuous patterns such as shapes or anomalies, while time series models struggle with ultra-long histories. Sonar-TS works by using a feature index to locate candidate windows through SQL queries, then producing Python programs that check those windows directly against the raw data. The authors also release NLQTSBench, which they describe as the first large-scale benchmark for this type of querying. According to the authors, the resulting pipeline handles complex temporal questions where prior methods fall short.

Core claim

Sonar-TS is a neuro-symbolic framework for natural language querying over time series databases that follows a Search-Then-Verify pipeline: a feature index first pings candidate windows via SQL, after which generated Python programs lock onto and verify those candidates against the raw signals. The framework is evaluated on the new NLQTSBench benchmark.

What carries the argument

A Search-Then-Verify pipeline that combines SQL-based candidate retrieval from a feature index with verification by generated Python programs, aimed at morphological intents such as shapes or anomalies.
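The two stages can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the table layout (`features`, `win_start`, `v_min`, `v_max`, `v_mean`), the fixed window size, the range-based SQL filter, and the median-deviation spike check standing in for a generated verification program.

```python
import math
import sqlite3
import statistics

def build_feature_index(signal, window):
    """Offline step: summarize each fixed-size window as one row in a SQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (win_start INTEGER, v_min REAL, v_max REAL, v_mean REAL)"
    )
    rows = []
    for start in range(0, len(signal) - window + 1, window):
        w = signal[start:start + window]
        rows.append((start, min(w), max(w), statistics.fmean(w)))
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
    return conn

def search_candidates(conn, min_range):
    """Search: a cheap SQL filter over the index returns candidate window starts."""
    cur = conn.execute(
        "SELECT win_start FROM features WHERE v_max - v_min >= ? ORDER BY win_start",
        (min_range,),
    )
    return [row[0] for row in cur]

def verify_spike(values, ratio_threshold=4.0):
    """Verify: inspect raw samples; accept if one point sits far from the median."""
    med = statistics.median(values)
    mad = max(statistics.median(abs(v - med) for v in values), 1e-9)
    return max(abs(v - med) for v in values) / mad >= ratio_threshold

# Hypothetical data: a small sine wave, one injected spike, one smooth ramp.
signal = [0.1 * math.sin(i / 5.0) for i in range(1000)]
signal[250] += 5.0                      # spike inside the window starting at 200
for i in range(100):                    # ramp: wide value range but no spike
    signal[500 + i] = 3.0 * i / 99

conn = build_feature_index(signal, window=100)
candidates = search_candidates(conn, min_range=1.0)
matches = [s for s in candidates if verify_spike(signal[s:s + 100])]
print(candidates, matches)
```

In this toy setup the SQL filter passes both the spike window and the ramp window, and only the raw-data check separates them, which is the division of labor the pipeline relies on.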

If this is right

Non-experts can extract specific temporal events from massive records without writing queries or code.
Queries about shapes, anomalies, and other morphological features become practical on histories too long for direct processing.
NLQTSBench supplies a common testbed for measuring progress on natural language time series retrieval.
The hybrid SQL-plus-code design scales verification cost with the number of candidates rather than the full data length.
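The last point can be made concrete with back-of-the-envelope arithmetic. The sizes below are invented for illustration and do not come from the paper; the sketch also assumes one index row per fixed-size window.

```python
def search_then_verify_cost(n_points, window, n_candidates):
    """Return (index rows the SQL search may scan, raw samples the verifier reads)."""
    index_rows = n_points // window      # one summary row per window (assumption)
    raw_points = n_candidates * window   # verification touches only candidate windows
    return index_rows, raw_points

# Hypothetical history: one billion samples, 1,000-sample windows, 50 candidates.
index_rows, raw_points = search_then_verify_cost(1_000_000_000, 1_000, 50)
print(index_rows, raw_points)
```

Under these assumed numbers the verifier reads 50,000 raw samples instead of one billion, and that count grows with the number of candidates, not with the length of the history.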

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage pattern could be adapted to natural language queries over other ordered data such as video frames or sensor streams.
Stronger code-generation models would likely raise verification accuracy on edge-case patterns.
Better feature indexes could shrink the set of candidates passed to the Python stage and lower overall latency.

Load-bearing premise

The Python programs generated on the fly will correctly decide whether each candidate window matches the user's intended continuous pattern even when the full history is extremely long.

What would settle it

A test set from NLQTSBench in which the generated verification programs systematically accept windows that lack the queried shape or anomaly, or reject windows that contain it.
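Such a test amounts to comparing the verifier's accept/reject decisions with ground-truth labels. The sketch below is a generic way to compute the two error rates; the function name and the toy inputs are hypothetical and are not part of NLQTSBench.

```python
def verification_error_rates(decisions, labels):
    """decisions[i]: verifier accepted window i; labels[i]: window truly matches."""
    false_accepts = sum(d and not y for d, y in zip(decisions, labels))
    false_rejects = sum(y and not d for d, y in zip(decisions, labels))
    negatives = sum(not y for y in labels)
    positives = sum(bool(y) for y in labels)
    far = false_accepts / negatives if negatives else 0.0  # false-accept rate
    frr = false_rejects / positives if positives else 0.0  # false-reject rate
    return far, frr

# Toy example: five candidate windows with made-up decisions and labels.
decisions = [True, True, False, False, True]
labels = [True, False, False, True, True]
far, frr = verification_error_rates(decisions, labels)
print(far, frr)
```

A systematically high value of either rate on queries about shapes or anomalies would be the kind of evidence described above.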

Figures

Figures reproduced from arXiv: 2602.17001 by Chang Xu, Ming Jin, Shirui Pan, Shiyu Wang, Xiping Liu, Yiji Zhao, Yuxuan Liang, Zhao Tan.

**Figure 1.** Comparison of querying paradigms. While Text-to-SQL fails to express morphological intents and Time Series Models are limited by context length, Sonar-TS adopts a “Search-Then-Verify” pipeline: it uses SQL to search a symbolic index for candidates and Python to verify them on raw data.

**Figure 2.** The hierarchical taxonomy of tasks in NLQTSBench. The benchmark ranges from Level 1 (Basic Operations), which tests numerical filtering, to Level 2 (Pattern Recognition) for morphological grounding, Level 3 (Semantic Reasoning) for logical composition, and finally Level 4 (Insight Synthesis) for narrative reporting.

**Figure 3.** The overview of the Sonar-TS framework. The workflow is organized into three stages: (1) Offline Data Processing constructs compact multi-scale Feature Tables to serve as a queryable index; (2) Online Querying, where the Task Planner and Code Generator synthesize SQL for rapid candidate search and Python for exact verification, supported by a closed-loop Prompt Cold Start mechanism that evolves analysis in…

**Figure 4.** Case Study. Text-to-SQL (Left) lacks morphological expressivity, and TS Models (Right) fail the logical constraint. Sonar-TS (Middle) succeeds via Search-Then-Verify.
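The paper's discussion of this case study mentions plateau shapes and a “longest” intent, which requires comparing durations rather than only recognizing a shape. A verification program for that kind of query might look like the sketch below. The plateau definition (consecutive samples that change by at most `tol`), the parameter names, and the sample data are all assumptions for illustration, not the program Sonar-TS generates.

```python
def longest_plateau(values, tol=0.05, min_len=3):
    """Return (start, end) of the longest flat run as a half-open interval, or None.

    A run is "flat" here when each sample differs from the previous one by at
    most `tol` (an assumed definition; slow drifts would also qualify).
    """
    best = None
    start = 0
    for i in range(1, len(values) + 1):
        run_ends = i == len(values) or abs(values[i] - values[i - 1]) > tol
        if run_ends:
            length = i - start
            if length >= min_len and (best is None or length > best[1] - best[0]):
                best = (start, i)
            start = i
    return best

# Made-up samples with two plateaus: three 2s and five 5s.
values = [0, 1, 2, 2, 2, 3, 5, 5, 5, 5, 5, 4, 1]
print(longest_plateau(values))
```

The exact comparison of run lengths is the step that a purely local shape recognizer lacks, and it is straightforward to express in code once the candidate span is in hand.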

**Figure 5.** The human verification interface. Annotators inspect both global context and local details (where the injected signal in orange is overlaid on the raw data in blue) to validate the ground truth.

**Figure 6.** Visualization of Multi-Scale SAX Representations. The framework discretizes time series data across hierarchical granularities to support pattern matching at different resolutions: (Top) The Daily View captures high-frequency local fluctuations; (Middle) The Monthly View summarizes intermediate trends; (Bottom) The Yearly View abstracts long-term seasonality. The colored horizontal bars represent the assig…
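For readers unfamiliar with SAX (Symbolic Aggregate approXimation), the standard recipe is: z-normalize the series, average it over equal-length segments, then map each segment mean to a letter using breakpoints that split the standard normal distribution into equal-probability bins. The sketch below follows that generic recipe with a four-letter alphabet; the function name and parameters are illustrative, and the paper's own multi-scale configuration may differ.

```python
import statistics

# Quartile breakpoints of the standard normal distribution (alphabet size 4).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax(values, n_segments, alphabet="abcd"):
    """Return a SAX word of length n_segments for the given series.

    Assumes len(values) is a multiple of n_segments; any remainder is dropped.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0   # guard against a constant series
    z = [(v - mean) / std for v in values]
    seg_len = len(z) // n_segments
    symbols = []
    for k in range(n_segments):
        segment = z[k * seg_len:(k + 1) * seg_len]
        paa = statistics.fmean(segment)              # piecewise aggregate value
        idx = sum(paa > b for b in BREAKPOINTS)      # bin index in 0..3
        symbols.append(alphabet[idx])
    return "".join(symbols)

# The same ramp at two resolutions: a finer and a coarser view.
ramp = list(range(16))
fine_word = sax(ramp, n_segments=4)
coarse_word = sax(ramp, n_segments=2)
print(fine_word, coarse_word)
```

Calling the function with different segment counts gives words at different granularities, which is the idea behind the daily, monthly, and yearly views in the figure.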

Original abstract

Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Sonar-TS offers a practical search-then-verify pipeline for natural language time series queries, but the LLM-generated verification step needs concrete evidence to back the main claims.

Sonar-TS sets up a two-stage system for natural language queries over time series databases. It first runs SQL against a feature index to pull candidate windows, then uses LLM-generated Python code to check those windows against the raw data for patterns like shapes or anomalies. The authors also introduce NLQTSBench as a new benchmark aimed at this specific setting. This is the first systematic framing of NLQ4TSDB as a distinct problem that sits between standard text-to-SQL and time-series modeling.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Sonar-TS, a neuro-symbolic Search-Then-Verify framework for natural language querying over time series databases (NLQ4TSDB). It first applies SQL queries against a feature index to retrieve candidate windows, then uses LLM-generated Python programs to verify those candidates against raw signals for complex morphological intents such as shapes and anomalies. The paper presents NLQTSBench as a new large-scale benchmark and states that experiments demonstrate Sonar-TS successfully handles queries where traditional Text-to-SQL and time-series methods fail, positioning the work as the first systematic study of NLQ4TSDB.

Significance. If the verify stage is shown to scale reliably, Sonar-TS could meaningfully advance accessible querying of massive temporal datasets by non-experts. The introduction of NLQTSBench supplies a concrete evaluation standard that future work can build upon.

major comments (2)

[Abstract] The statement that 'experiments demonstrate effectiveness' and that Sonar-TS 'effectively navigates complex temporal queries where traditional methods fail' is unsupported by any quantitative metrics, error rates, runtime figures, or analysis of the verification programs, leaving the central claim without visible evidence.
[Verification stage description] The claim that LLM-generated Python programs reliably 'lock on' to shapes, anomalies, and other continuous patterns on ultra-long histories is load-bearing for the neuro-symbolic pipeline, yet no success rates, false-negative analysis, or scalability results on raw signals are reported.

minor comments (1)

[Abstract] Consider adding one concrete example query and its expected output to illustrate the morphological intents that defeat pure Text-to-SQL approaches.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, agreeing where additional quantitative support is warranted and outlining the revisions we will make.

Referee: [Abstract] The statement that 'experiments demonstrate effectiveness' and that Sonar-TS 'effectively navigates complex temporal queries where traditional methods fail' is unsupported by any quantitative metrics, error rates, runtime figures, or analysis of the verification programs, leaving the central claim without visible evidence.

Authors: We agree that the abstract would benefit from explicit quantitative support to ground the claims. The full manuscript (Section 5) reports precision/recall metrics, runtime comparisons against Text-to-SQL and time-series baselines, and overall success rates on NLQTSBench, showing Sonar-TS outperforming alternatives on morphological queries. We will revise the abstract to include key figures (e.g., accuracy improvements and failure modes of baselines) while preserving brevity.
revision: yes

Referee: [Verification stage description] The claim that LLM-generated Python programs reliably 'lock on' to shapes, anomalies, and other continuous patterns on ultra-long histories is load-bearing for the neuro-symbolic pipeline, yet no success rates, false-negative analysis, or scalability results on raw signals are reported.

Authors: We acknowledge that dedicated metrics for the verification stage would strengthen the neuro-symbolic claims. While overall pipeline results are presented, we will add a new subsection in the experiments (or appendix) reporting success rates of the LLM-generated verification programs, false-negative analysis on shape/anomaly detection, and scalability tests across varying history lengths on raw signals.
revision: yes

Circularity Check

0 steps flagged

Sonar-TS introduces a new neuro-symbolic Search-Then-Verify framework with no circular derivation

The paper proposes Sonar-TS as an original construction: a feature-index SQL search stage followed by LLM-generated Python verification programs on raw signals. No equations, fitted parameters, or predictions appear. The central claims rest on the empirical performance of this pipeline on the newly introduced NLQTSBench benchmark rather than on any self-referential definitions, self-citation chains, or renamings of prior results. The derivation is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unstated premise that SQL-based feature indexing plus LLM-generated Python verification can be made accurate and efficient for arbitrary morphological queries; no free parameters or invented entities are mentioned.

axioms (2)

domain assumption: Existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies.
Stated directly in the abstract as the motivation for the new approach.
domain assumption: Time series models struggle to handle ultra-long histories.
Stated directly in the abstract as a limitation of prior work.

pith-pipeline@v0.9.0 · 5738 in / 1286 out tokens · 63525 ms · 2026-05-21T12:07:47.548052+00:00 · methodology
