pith. machine review for the scientific record.

cs.DB

Databases

Covers database management, data mining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.

cs.DB 2026-05-14 1 theorem

Benchmark shows top multimodal models lag on e-commerce

OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

Tests of 20 leading models on 6,300 real instances across merchants, customers, and operators reveal modest scores and narrowed gaps, pointing to insufficient e-commerce-specific knowledge in general-purpose models.

LLMs and MLLMs have become indispensable tools across a wide range of applications. E-commerce, however, poses distinctive challenges -- including intricate domain knowledge, long-tail product evidence, heterogeneous visual data, and the interplay among multiple stakeholder roles -- that diverge substantially from the general world knowledge these models are primarily trained on, often causing a notable gap between their open-domain and e-commerce performance. To systematically quantify this gap, we introduce OxyEcomBench, a unified multimodal benchmark comprising approximately 6,300 high-quality instances for real-world bilingual Chinese--English e-commerce. Although several e-commerce benchmarks have been proposed, they typically adopt a single stakeholder perspective, target a narrow set of tasks, or address isolated challenges, making it difficult to holistically assess models' understanding of the full e-commerce pipeline. OxyEcomBench addresses these limitations by jointly covering platform operators, merchants, and customers across 6 capability aspects and 29 tasks, supporting text-only and mixed-modality inputs with single-image, multi-image, single-turn, and multi-turn configurations. All data is sourced from authentic e-commerce platforms and verified by domain experts. The benchmark further adopts a difficulty-aware design with a four-level P0--P3 rubric applied to all 29 tasks whose difficulty admits stable expert consensus, and rigorously prioritizes visually salient multimodal cases in which key evidence resides in images rather than text alone. Evaluations on 20 mainstream LLMs and MLLMs show that even the leading models attain modest performance and that performance gaps narrow on OxyEcomBench, suggesting that insufficient e-commerce-specific knowledge infusion mutes the advantages of advanced general-purpose models in this domain.
cs.DB 2026-05-13 Recognition

Chase termination undecidable even for decidable queries

Will My Favorite Chases Terminate if Evaluating Conjunctive Queries Does? One Does Not Simply Decide This

Rule sets that make conjunctive query entailment decidable still do not let us decide whether standard chase variants terminate or whether treewidths are bounded.

Existential rules are a prominent formalism to enrich a database with knowledge from the domain of interest, but make even basic reasoning tasks on the resulting knowledge base undecidable. To circumvent this, several classes of rules offering various useful properties have been identified. One such class, for instance, contains all sets of rules on which the chase algorithm always terminates, which guarantees the existence of a finite universal model. However, these classes are often abstract rather than concrete: it may be undecidable to check whether a given set of rules belongs to them. Given that the most studied classes of existential rules are designed for reasoning on databases, thus ensuring decidable conjunctive query entailment, we ask: Within a class that supports decidable query entailment, do the usual abstract classes become concrete? We answer in the negative for classes based upon the termination of all classical chase variants and for the bounded treewidth set (BTS) class.
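To make the chase concrete, here is a minimal sketch (an invented encoding, not from the paper) of repeated chase rounds for rules of the shape R(x) -> exists y. S(x, y); this particular rule set terminates because the fresh nulls it introduces never re-trigger a rule body:

```python
def chase(db, rules, max_rounds=10):
    """Tiny chase sketch: facts are tuples like ("R", "a"); each rule is a
    (body_predicate, head_predicate) pair meaning R(x) -> exists y. S(x, y).
    A head atom with a fresh labelled null is added whenever the body matches
    and no head atom with the same first argument exists yet."""
    fresh = 0
    for _ in range(max_rounds):
        new = []
        for body, head in rules:
            for (pred, *args) in db:
                if pred == body and not any(
                        p == head and a[0] == args[0] for (p, *a) in db + new):
                    fresh += 1
                    new.append((head, args[0], f"null{fresh}"))
        if not new:           # no rule applicable: the chase has terminated
            return db
        db = db + new
    return db                 # gave up; the chase may not terminate in general

db = [("R", "a"), ("R", "b")]
rules = [("R", "S")]
result = chase(db, rules)
```

On this input the chase halts after one productive round; the paper's point is that for richer rule sets, deciding whether such a halt ever happens is undecidable even when query entailment is decidable.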
cs.DB 2026-05-13 Recognition

Separating instances pick correct NL2SQL candidate

Data-aware candidate selection in NL2SQL translation via small separating instances

Minimal database examples that distinguish correct SQL candidates from incorrect ones outperform baselines when only two or three candidates exist and no consistency scores are available.

We propose a data-aware candidate selection method for NL2SQL translation based on separating instances and provenance. We implement this approach and evaluate it against three natural baselines on a subset of BIRD-DEV. Experiments show that our method significantly outperforms baselines when only two or three candidates are given and no consistency score is available. The code of our prototype can be found at https://github.com/staskikotx/SISelection
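The core notion, a small database instance on which two candidate queries disagree, can be sketched with sqlite3 (the schema, queries, and rows below are invented for illustration, not taken from the paper):

```python
import sqlite3

def outputs(db_rows, query):
    """Run `query` against a tiny in-memory instance and return its rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE emp(name TEXT, dept TEXT, salary INT)")
    con.executemany("INSERT INTO emp VALUES (?,?,?)", db_rows)
    rows = sorted(con.execute(query).fetchall())
    con.close()
    return rows

def is_separating(instance, q1, q2):
    """An instance separates two candidates iff they disagree on it."""
    return outputs(instance, q1) != outputs(instance, q2)

# Two NL2SQL candidates for "employees in Sales earning over 50k":
q_good = "SELECT name FROM emp WHERE dept='Sales' AND salary>50000"
q_bad  = "SELECT name FROM emp WHERE dept='Sales' OR salary>50000"

# A two-row instance on which the candidates disagree.
inst = [("Ann", "Sales", 40000), ("Bob", "HR", 60000)]
```

A user (or provenance heuristic) who inspects the two result sets on this tiny instance can pick the intended candidate without any model confidence scores.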
cs.DB 2026-05-13 Recognition

Knowledge graphs source optimization problems via queries

Graph-Grounded Optimization: Rao-Family Metaheuristics, Classical OR, and SLM-Driven Formulation over Knowledge Graphs

Sourcing variables and constraints from knowledge graphs reveals data flaws and non-linear objectives that text inputs and linear solvers miss across seven real cases.

We propose graph-grounded optimization: a paradigm in which the decision variables, constraints, and objective coefficients of a real-world optimization problem are sourced from a property knowledge graph (KG) via Cypher queries, rather than supplied as free-form natural-language text or static tabular input. We motivate the paradigm by surveying recent LLM/SLM-driven optimization systems -- OptiMUS, Chain-of-Experts, LLMOPT, OPRO, FunSearch, Eureka -- none of which consume property graphs as the primary input modality. We instantiate the paradigm in the open-source samyama-graph database and evaluate seven real-world public-domain KG-backed problems spanning drug repurposing (245K-node biomedical KG), clinical-trial site selection (7.78M-node trial registry), Indian supply-chain rerouting (5.34M-node OSM road graph), healthcare equity allocation (WHO/GAVI/IHME KG), economic-environmental grid dispatch, antimicrobial-resistance stewardship (NCBI AMRFinderPlus, 10.4K resistance genes), and wildfire evacuation routing (OSM Paradise, CA). We compare a portfolio of Rao-family metaheuristics (BMWR, Jaya, SAMP-Jaya, EHR-Jaya, Rao-1) against Google OR-tools (CP-SAT and GLOP) reference solvers. We find that (i) no single Rao variant dominates: BMWR wins on discrete-with-tradeoff and high-dim-with-hard-constraint problems while Rao-1 wins on continuous low-/mid-dim problems, empirically supporting a portfolio approach; (ii) OR-tools dominates on small linear/MILP-friendly sub-problems but cannot encode the non-linear objectives that emerge in several of the real-world settings; (iii) graph-grounded formulations surface data-quality issues (missing properties, degenerate aggregates) that purely text-formulated optimizations would silently mask
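For readers unfamiliar with the Rao family, a minimal Jaya implementation (the parameter-free update rule these variants build on) looks roughly like this; the sphere objective, bounds, and iteration budget are illustrative choices, not the paper's setup:

```python
import random

def jaya(f, dim, pop_size=20, iters=200, lo=-5.0, hi=5.0, seed=0):
    """Minimal Jaya metaheuristic (minimization): each candidate moves toward
    the current best and away from the current worst, with no
    algorithm-specific tuning parameters beyond population size and budget."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(iters):
        scores = [f(x) for x in pop]
        best = pop[scores.index(min(scores))]
        worst = pop[scores.index(max(scores))]
        for i, x in enumerate(pop):
            cand = [min(hi, max(lo,
                        xj + rng.random() * (best[j] - abs(xj))
                           - rng.random() * (worst[j] - abs(xj))))
                    for j, xj in enumerate(x)]
            if f(cand) <= f(x):      # greedy acceptance of the move
                pop[i] = cand
    return min(pop, key=f)

sphere = lambda x: sum(v * v for v in x)   # toy non-linear objective
sol = jaya(sphere, dim=3)
```

A portfolio approach, as the paper advocates, would run several such variants and keep the best result per problem.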
cs.DB 2026-05-13 Recognition

Replicas detect and repair database corruption without stopping work

PROTECT-DB: Protecting Data using Replicated State Machines: Efficient Corruption Detection & Recovery

Shared transaction logs plus deterministic execution let the system catch attacks and fix data while transactions continue.

Data is critical for the operation of any organization and needs to be protected, especially against attacks that compromise the state of the database. In this paper, we explore an approach based on Byzantine-fault tolerant replicated state machines, built on top of a deterministic extension of PostgreSQL. Each replica deterministically executes transactions recorded in a shared log/blockchain. Our focus is on creating a practical system that is designed for efficient and quick detection of corruption, as well as quick repair concurrent with execution of transactions. We also present a performance study showing the efficiency and practicality of our approach. We believe our work lays the foundations for the practical use of the BFT replicated state machine approach in the context of databases.
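A much-simplified sketch of the detection idea (deterministic replay of a shared log plus state-digest comparison) is below; real BFT deployments need more replicas, signed log entries, and agreement protocols, none of which are modeled here:

```python
import hashlib
import json

class Replica:
    """Deterministically applies a shared transaction log to key-value state."""
    def __init__(self):
        self.state = {}

    def apply(self, txn):
        op, key, val = txn
        if op == "put":
            self.state[key] = val
        elif op == "del":
            self.state.pop(key, None)

    def digest(self):
        # Canonical serialization so honest replicas hash identically.
        blob = json.dumps(self.state, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

log = [("put", "a", 1), ("put", "b", 2), ("del", "a", None)]
replicas = [Replica() for _ in range(3)]
for r in replicas:
    for txn in log:
        r.apply(txn)

replicas[2].state["b"] = 99          # simulate corruption of one replica
digests = [r.digest() for r in replicas]
honest = max(set(digests), key=digests.count)   # majority digest
corrupted = [i for i, d in enumerate(digests) if d != honest]
```

Because execution is deterministic, any divergence in digests pinpoints a corrupted replica, which can then replay the log (or copy state from a healthy peer) while the others keep processing transactions.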
cs.DB 2026-05-12 2 theorems

SHACL-DS validates named graphs faster than standard SHACL

Keeping track of errors: A study of SHACL-DS for RDF dataset validation on the ERA RINF Knowledge Graph

Two migration strategies match baseline results on the ERA RINF graph while adding per-graph provenance and scope control.

SHACL-DS extends SHACL for RDF dataset validation by introducing declarative targeting of named graphs and graph combinations, but has not yet been demonstrated and assessed on a real, large-scale Knowledge Graph (KG). In this paper, we apply the SHACL-DS approach to validate its use on such a KG. We apply SHACL-DS to the European Railway Infrastructure (ERA RINF) KG, a large-scale RDF dataset in which 56 infrastructure managers contribute data to dedicated named graphs. We migrate the ERA-RINF shapes to SHACL-DS using two strategies and evaluate their performance using a TopBraid SHACL-DS implementation developed for this study. We compare the performance against the SHACL approach, which "flattens" all graphs into a single data graph. Both strategies produce the same results and are faster than the SHACL baseline. Not only do we demonstrate that SHACL-DS is at least as expressive as SHACL, but SHACL-DS also allows the validation scope to be declared inside the shapes artefact, enforces triple provenance through GRAPH clauses, enriches validation reports with per-graph annotations, and enables shape organisation across named shapes graphs.
cs.DB 2026-05-12 Recognition

Single GPU kernel fuses IO and query steps for faster analytics

Data Path Fusion in GPU for Analytical Query Processing

Merging data movement, decompression and processing inside one kernel cuts transfers and raises GPU utilization on standard benchmarks.

One major technical challenge for modern analytical database systems is how to leverage GPUs to exploit their massive parallelism and high bandwidth. Yet, existing GPU-driven database engines suffer from inefficiencies caused by frequent host-device interactions and fragmented execution across multiple GPU kernels, limiting their ability to fully utilize GPUs' computational and IO capabilities. This paper proposes Data Path Fusion (DPF), a novel GPU-driven data processing architecture that integrates a sequence of data path operations -- including IOs, decompression, and query operations -- into a single GPU kernel. By fusing the data path, DPF reduces host-device communication overheads and enables more efficient utilization of GPU resources for analytical query workloads. DPF seamlessly integrates GPU-friendly optimization techniques, including type-specific compression/decompression, variable-length attribute support, and state-of-the-art GPU-driven IO mechanisms, to work in concert, enabling efficient end-to-end query execution directly on the GPU. Through extensive experimental evaluation using a prototyped DPF-based GPU-driven database engine (DPFProto) with analytical benchmark workloads, this paper demonstrates that DPF achieves speedups of 2.66 to 6.22 on TPC-H and 3.84 to 16.81 on SSB over the state-of-the-art approach in the representative configuration. Our results show that DPF effectively unlocks the computational and IO potential of modern GPUs, providing a promising direction for next-generation analytical database systems.
cs.DB 2026-05-12 Recognition

Text2Cypher must reason across multiple graph databases

Toward Multi-Database Query Reasoning for Text2Cypher

Single-database systems fail when answers span separate sources; routing, splitting, and merging become necessary steps.

Large language models have significantly improved natural language interfaces to databases by translating user questions into executable queries. In particular, Text2Cypher focuses on generating Cypher queries for graph databases, enabling users to access graph data without query language expertise. Most existing Text2Cypher systems assume a single preselected graph database, where queries are generated over a known schema. However, real-world systems are often distributed across multiple independent graph databases organized by domain or system boundaries, where relevant information may span multiple sources. To address this limitation, we propose a shift from single-database query generation to multi-database query reasoning. Instead of assuming a fixed execution context, the system must reason about (i) relevant databases, (ii) how to decompose a question across them, and (iii) how to integrate partial results. We formalize this setting through a three-phase roadmap: database routing, multi-database decomposition, and heterogeneous query reasoning across database types and query languages. This work provides a structured formulation of multi-database reasoning for Text2Cypher and identifies challenges in source selection, query decomposition, and result integration, aiming to support more realistic and scalable natural language interfaces to graph databases.
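The first phase, database routing, can be sketched as a simple schema-overlap score; this is a strawman heuristic to make the problem concrete (the database names, schema terms, and threshold-free scoring are all invented here, and a real router would use embeddings or an LLM):

```python
def route(question, schemas):
    """Score each database by overlap between question tokens and schema
    terms; return every database with a non-zero score, best first, since a
    single question may span several sources."""
    q = set(question.lower().split())
    hits = {}
    for db, terms in schemas.items():
        overlap = q & {t.lower() for t in terms}
        if overlap:
            hits[db] = len(overlap)
    return sorted(hits, key=hits.get, reverse=True)

schemas = {
    "hr_graph":    ["employee", "manager", "department"],
    "sales_graph": ["order", "customer", "product"],
}
dbs = route("which manager approved the order for customer 42", schemas)
```

Here the question touches both graphs, so the downstream phases must decompose it into one Cypher query per database and merge the partial results.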
cs.DB 2026-05-12 2 theorems

Cloud GPUs speed graph index construction by 9x at 6x lower cost

ScaleGANN: Accelerate Large-Scale ANN Indexing by Cost-effective Cloud GPUs

ScaleGANN divides data for parallel spot GPU processing to handle large datasets affordably.

Graph-based ANNS algorithms have gained increasing research interest and market adoption due to their efficiency and accuracy in retrieval. Existing approaches primarily rely on CPUs for graph index construction and retrieval, but this often requires significant time, especially for large-scale and high-dimensional datasets. Some studies have explored GPU-based solutions. However, GPUs are costly and their limited memory makes handling large datasets challenging. In this paper, we propose a novel end-to-end system ScaleGANN that enables users to efficiently construct graph indexes for large-scale, high-dimensional datasets by leveraging low-cost spot GPU resources in a distributed cloud system. ScaleGANN utilizes the idea of divide-and-merge, with an optimized vector partitioning algorithm to further improve the indexing time and space efficiency while guaranteeing good index quality. Its novel resource allocation strategy realizes multi-GPU indexing parallelism and overall cost-effectiveness for both build and query. In addition, we design a task scheduler and cost model for better spot instance management and evaluation. We tested our system on large real-world datasets. Experimental results show that our approach can significantly accelerate index build time by up to 9x at up to 6x lower cost compared with the state-of-the-art extendable ANNS benchmark DiskANN.
cs.DB 2026-05-11 1 theorem

Krone decomposes logs into entity-action-status units for modular anomaly detection

Detect, Localize, and Explain: Interactive Hierarchical Log Anomaly Analytics with LLM Augmentation

The hierarchical split plus selective LLM calls lets operators localize problems and review explanations without scanning entire sequences.

Logs are ubiquitous in modern systems. Unfortunately, their unstructured nature in flat sequences limits understanding of execution behaviors, hindering effective anomaly diagnosis. To address this, Krone introduces a novel hierarchical log abstraction that transforms flat log sequences into semantically coherent units across entity, action, and status levels. Building on this abstraction, Krone introduces a hierarchical orchestration framework that decomposes flat log sequences into hierarchical execution units and performs modular detection over them. It executes and optimizes the modular detection tasks across levels, enabling precise anomaly detection, localization, and explanation with selective invocation of LLM-based reasoning. In this work, we present Krone-viz, an interactive visualization system based on Krone, which makes hierarchical log analysis interpretable and actionable for software engineers and system operators. Demonstrated on the widely used HDFS benchmark dataset, Krone-viz supports: 1) examining hierarchical decompositions of flat log sequences, 2) inspecting detection results and abnormal segments identified by Krone with LLM-generated explanations, and 3) reusing, reviewing, and revising knowledge generated by LLMs with human-in-the-loop guardrails. The code of Krone-viz is available at https://github.com/LeiMa0324/KRONE_Demo_official, and we deploy a live demo at https://leima0324.github.io/KRONE_Demo_official.
cs.DB 2026-05-11 2 theorems

Personalized privacy cuts infinite stream estimation error by 53.6%

Personalized w-Event Privacy for Infinite Stream Estimation

User-specific privacy levels replace uniform budgets while delivering lower error on continuous data streams.

In applications such as event monitoring, log analysis, and video querying, $w$-event privacy protects individual data within a sliding time window while supporting accurate stream statistics. Existing studies on infinite data streams mainly assume homogeneous privacy requirements for all users, which cannot capture user-specific privacy preferences. This paper studies personalized $w$-event privacy for private data stream estimation. We first design the Personalized Window Size Mechanism (PWSM), which supports personalized privacy requirements at each time slot. Based on PWSM, we propose Personalized Budget Distribution (PBD) and Personalized Budget Absorption (PBA) to estimate streaming statistics under $\boldsymbol{w}$-Event $\boldsymbol{\mathcal{E}}$ Personalized Differential Privacy (($\boldsymbol{w}$, $\boldsymbol{\mathcal{E}}$)-EPDP). PBD guarantees that the budget reserved for the next time step is no smaller than the budget consumed in the previous release, while PBA improves the current budget by absorbing unused budgets from the previous $k$ time slots and borrowing from the next $k$ time slots. We further develop Dynamic Personalized Budget Distribution (DPBD) and Dynamic Personalized Budget Absorption (DPBA), which allow users to dynamically adjust privacy requirements while satisfying $(\tau, \boldsymbol{w}_B, \boldsymbol{w}_F)$-Event $(\boldsymbol{\mathcal{E}}_B, \boldsymbol{\mathcal{E}}_F)$-Personalized Differential Privacy. We prove that all proposed methods achieve the corresponding personalized differential privacy guarantees and derive their error upper bounds. Experiments show that our methods reduce estimation error by at least $53.6\%$ compared with state-of-the-art algorithms.
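To fix intuition, here is the uniform-split baseline the paper's PBD/PBA mechanisms improve on: every slot in a $w$-slot window receives an equal share of the strictest user's budget, and a Laplace-noised count is released per slot (the stream values, user budgets, and window size below are invented for illustration):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def uniform_release(stream, user_eps, w, seed=0):
    """Uniform budget split: each slot in a w-slot window gets eps/w, where
    eps is the smallest (strictest) personalized requirement, so every
    user's guarantee is met at the cost of extra noise for everyone."""
    rng = random.Random(seed)
    eps_slot = min(user_eps.values()) / w
    return [count + laplace_noise(1.0 / eps_slot, rng) for count in stream]

stream = [10, 12, 11, 13, 12, 14]          # true per-slot counts
user_eps = {"alice": 1.0, "bob": 0.5}      # bob wants stronger privacy
released = uniform_release(stream, user_eps, w=3)
```

Forcing everyone down to the strictest budget is exactly the waste that personalized budget distribution and absorption avoid, which is where the reported error reductions come from.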
cs.DB 2026-05-11 Recognition

LLMs fall short on natural language data prep tasks

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

PrepBench evaluates disambiguation, code generation, and workflow translation, showing persistent difficulties for current models.

Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin' Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.
cs.DB 2026-05-11 Recognition

Elastic scheduling meets stream deadlines at lowest cost

Elastic Scheduling of Intermittent Query Processing in a Cluster Environment

Batched intermittent processing on scalable clusters handles concurrent queries and rate changes while using minimal resources.

Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly processing tuples as they arrive significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes meet the deadlines while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark streaming.
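The back-of-envelope sizing decision at the heart of such schemes can be illustrated as follows; this is a deliberately simplified calculation (linear scaling, a single query, known per-node rate), not the paper's scheduling algorithm:

```python
import math

def nodes_needed(pending_tuples, per_node_rate, slack_seconds):
    """Minimum cluster size so a buffered batch finishes within the time
    remaining before the deadline, assuming work splits evenly across nodes."""
    return max(1, math.ceil(pending_tuples / (per_node_rate * slack_seconds)))

# 1M buffered tuples, 50k tuples/s per node, 5 seconds left to the deadline:
n = nodes_needed(1_000_000, 50_000, 5)
```

An elastic scheduler re-evaluates this as input rates vary and new queries arrive, scaling nodes down when a smaller batch (or a later start) still meets the deadline.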
cs.DB 2026-05-11 Recognition

Heavy-light partitioning maintains arbitrary joins under updates

Maintaining Queries under Updates Using Heavy-Light Partitioning of the Input Relations

Update time bounded by maintenance width for any join query, generalizing prior tailored methods and preserving constant-delay output.

We study the classical incremental view maintenance problem: Given a query and a database, maintain the query output under single-tuple updates (inserts or deletes) to the database such that the tuples in the query output can be enumerated with constant delay after any update. We introduce a maintenance approach whose update time matches or improves the best update time reported in prior work. Whereas prior approaches are manually tailored to each of a handful of queries, our approach generalizes to arbitrary join queries. It combines three techniques: delta queries, trees of materialized views, and heavy-light data partitioning. The overall update time incurred by our approach for a given join query is characterized by the maintenance width, a new measure that is parameterized by the heavy-light threshold for data partitioning. We show how to find the threshold that minimizes the maintenance width.
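The heavy-light idea itself is simple to state in code: split each input relation by the frequency of its join key against a threshold, then maintain the heavy and light parts with different strategies. A minimal sketch of the partitioning step (toy relation and threshold; the maintenance machinery built on top is the paper's contribution):

```python
from collections import Counter

def heavy_light(relation, key_index, threshold):
    """Split a relation by join-key degree: keys appearing more than
    `threshold` times go to the heavy part, the rest to the light part."""
    degree = Counter(t[key_index] for t in relation)
    heavy = [t for t in relation if degree[t[key_index]] > threshold]
    light = [t for t in relation if degree[t[key_index]] <= threshold]
    return heavy, light

R = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5)]
heavy, light = heavy_light(R, 0, threshold=2)
```

The maintenance width then measures the worst-case update cost as a function of this threshold, and the paper shows how to pick the threshold minimizing it.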
cs.DB 2026-05-08

SkipDisk hits 63% HNSW latency at 20% memory

Low-Latency Out-of-Core ANN Search in High-Dimensional Space

Dedicated pivots and three-level pruning let a hybrid disk-memory structure match in-memory ANN speeds with far less RAM.

In-memory graph-based approximate nearest neighbor (ANN) search has superior search performance but incurs significant memory footprint. Disk-based methods reduce memory usage but suffer from high disk access latency. A common challenge is how to achieve low-latency search while significantly reducing memory footprint. In this paper, we propose SkipDisk, a disk-memory hybrid ANN search that significantly reduces memory footprint while achieving search latency comparable to or lower than the in-memory method HNSW. By analyzing existing disk-based methods, we observed that disk access remains the primary bottleneck, and existing lower bound based filtering methods are too loose to effectively reduce disk access. Therefore, we design SkipDisk to achieve tight lower bounds with low memory footprint to reduce the search latency. First, we design a dedicated pivot for each point to improve the lower bound of the triangle inequality for effective filtering. We further design an estimation-based approach based on this lower bound. Second, to reduce the memory footprint, we employ a three-level data pruning strategy to preserve informative data in memory. Third, to further reduce search latency, we design an asynchronous I/O strategy based on the decoupling of in-memory search and disk access by storing neighbor nodes in memory. Experiments show that our method achieves 85% of HNSW's latency with approximately 10% of its memory footprint, and 63% of HNSW's latency with a slightly higher memory footprint of around 20%.
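The triangle-inequality filter behind the per-point pivots can be shown in a few lines; the points and pivot placement below are invented to illustrate why a well-placed pivot makes the bound tight (SkipDisk's actual pivot selection and estimation are more involved):

```python
import math

def lower_bound(q, pivot, d_x_pivot):
    """Triangle inequality: d(q, x) >= |d(q, pivot) - d(x, pivot)|.
    With d(x, pivot) precomputed and kept in memory, the disk read for x can
    be skipped whenever this bound exceeds the current search radius."""
    return abs(math.dist(q, pivot) - d_x_pivot)

q = (0.0, 0.0)
pivot = (30.0, 0.0)          # a pivot chosen for x; distance to it is cached
x = (20.0, 0.0)              # the point itself lives on disk
lb = lower_bound(q, pivot, math.dist(x, pivot))
# Here the pivot is collinear with q and x, so the bound equals the true
# distance 20 and a radius-10 search prunes x without touching the disk.
```

A poorly placed pivot (e.g. equidistant from q and x) would give a bound near zero, which is exactly the looseness the dedicated per-point pivots are designed to avoid.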
cs.DB 2026-05-08

Query rewrite rules written once deploy across database engines

An Extensible and Verifiable Language for Query Rewrite Rules

Rulescript uses a verified core language and adapters to let rules run on different systems like CockroachDB without reimplementation.

Logical query plan rewriting transforms a relational database query into an equivalent but more efficient form and is crucial to the performance of database-backed applications. In existing systems, rewrite rules are typically implemented manually, tightly coupled to specific execution engines, and often lack formal correctness guarantees. Consequently, developing a new engine requires reimplementing both legacy and new rules, incurring significant engineering cost, limiting portability, and opening the door to new bugs with each reimplementation. We introduce Rulescript, an engine-agnostic domain-specific language (DSL) for developing query rewrite rules. Rulescript separates rule definition from execution infrastructure via a relational algebra-inspired core language and an explicit decomposition of rules into matching and transformation phases. Developers express rewrites by pattern-matching query plans using Rulescript's core operators and constructing semantically equivalent transformed plans, with all rewrites automatically verified formally to ensure correctness. Rulescript is extensible: users can define custom operators in terms of the core language to capture engine-specific semantics. To integrate with an existing system, developers need only implement a lightweight adapter that maps Rulescript's core and custom operators to the operators implemented in the target engine. We evaluate Rulescript by reimplementing 33 rewrite rules from Apache Calcite and extending the language with several custom operators. To demonstrate portability, we automatically deploy these rules to CockroachDB and Apache Data Fusion, two engines with substantially different backends. Our results show that Rulescript enables a "write once, deploy everywhere" paradigm for query plan rewriting, with minimal effort required to deploy previously written rules on a new data engine.
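The match/transform decomposition can be illustrated with a classic rule, merging adjacent filters; the plan node classes and rule below are a hypothetical sketch in Python, not Rulescript syntax:

```python
from dataclasses import dataclass

# A tiny algebra of logical plan nodes (invented for illustration).
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    pred: str
    child: object

def merge_filters(plan):
    """Rewrite rule: Filter(p, Filter(q, x)) -> Filter(p AND q, x).
    Matching and transformation are deliberately kept as separate phases,
    mirroring the rule decomposition described in the abstract."""
    def matches(node):
        return isinstance(node, Filter) and isinstance(node.child, Filter)

    def transform(node):
        inner = node.child
        return Filter(f"({node.pred}) AND ({inner.pred})", inner.child)

    while matches(plan):
        plan = transform(plan)
    return plan

p = Filter("a > 1", Filter("b < 2", Scan("t")))
merged = merge_filters(p)
```

An engine adapter would then map these abstract Filter/Scan operators onto the target engine's own plan nodes, which is what lets one rule definition serve many backends.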
cs.DB 2026-05-08

Every query reduces to Filter, Aggregate, and Return

Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation

W5H dimensions vary sharply by domain, with healthcare queries dominated by time and people rather than cause or mechanism.

Natural language interfaces to databases have gained popularity, yet the theoretical foundations for evaluating and designing these systems remain underdeveloped. We present QUEST (Query Understanding Evaluation through Semantic Translation), a framework resting on two independently motivated components: the FAR structural invariant, which holds that every well-formed query reduces to Filter, Aggregate, and Return operations; and the W5H dimensional framework, which holds that all filtering criteria map to six semantic dimensions (Who, What, Where, When, Why, and How). Validated across five text-to-SQL datasets (n = 120,464), FAR conformance is universal across all domains and schema types, while W5H dimensional profiles vary substantially. Healthcare queries are strongly concentrated in temporal (WHEN: 80.4%) and person-centric (WHO: 73.0%) dimensions far exceeding general-domain benchmarks, and causal (WHY) and mechanistic (HOW) reasoning are near-zero everywhere, with apparent HOW exceptions reflecting quantitative aggregation rather than genuine procedural reasoning. These results identify a frontier that must be crossed for genuine machine reasoning over structured data.
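The FAR invariant can be made concrete with a toy decomposer for simple SELECT statements; this regex handles only one query shape and is an illustration of the idea, not the QUEST framework's implementation:

```python
import re

def far_decompose(sql):
    """Split a simple single-table SELECT into its FAR components:
    Return (the SELECT list), Filter (WHERE), Aggregate (GROUP BY).
    Returns None-valued components when a clause is absent; only this
    restricted shape is handled."""
    m = re.match(
        r"SELECT\s+(?P<ret>.+?)\s+FROM\s+\S+"
        r"(?:\s+WHERE\s+(?P<filt>.+?))?"
        r"(?:\s+GROUP BY\s+(?P<agg>.+?))?\s*$",
        sql, re.IGNORECASE)
    return {"return": m["ret"], "filter": m["filt"], "aggregate": m["agg"]}

q = "SELECT dept, COUNT(*) FROM emp WHERE salary > 50000 GROUP BY dept"
parts = far_decompose(q)
```

The W5H analysis then classifies each predicate in the Filter component, e.g. "salary > 50000" as a WHAT/HOW-MUCH criterion, which is how the per-domain dimensional profiles in the abstract are computed.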
cs.DB 2026-05-07

Caching cuts redundant CBO calls in cost-based query rewrite

Efficient Cost-Based Rewrite in a Bottom-Up Optimizer

Multi-level reuse of intermediate costs plus upper-bound pruning lets bottom-up optimizers apply cost-dependent rules without exploding running time.

The query optimizer in a Database Management System (DBMS) translates declarative queries into efficient execution plans. Conventional bottom-up optimization consists of two main stages: Query Rewrite (QRW) and Cost-Based Optimization (CBO). However, applying a rewrite rule during QRW may not always be beneficial; the best choice may depend on the (estimated) execution cost of the original and rewritten expressions. Fully exploiting such cost-dependent rules necessitates interleaving QRW with frequent CBO invocations, thereby incurring substantial overhead and often impractical optimization times. To mitigate this inefficiency, we introduce a novel cost-based rewrite framework for bottom-up optimizers. The core of our approach is a multi-level caching mechanism for intermediate CBO results aimed at eliminating redundant computation. Furthermore, we establish and exploit upper cost bounds to intelligently prune the search space during optimization. We also contribute methodological solutions for caching and reusing intermediate plan results within a bottom-up optimizer architecture. The framework has been implemented in the GaussDB optimizer. Experiments show that it significantly reduces overall optimization time, demonstrating the effectiveness of our approach.
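The caching-plus-pruning interplay can be sketched in a few lines; the cost model (expression length) and the two rewrite functions are stand-ins invented for illustration, not GaussDB components:

```python
cost_cache = {}
cbo_calls = 0

def cbo_cost(expr):
    """Stand-in for an expensive CBO invocation (hypothetical cost model:
    shorter expression text = cheaper plan)."""
    global cbo_calls
    cbo_calls += 1
    return len(expr)

def cached_cost(expr):
    """Memoize CBO results so repeated rewrites never re-invoke the CBO."""
    if expr not in cost_cache:
        cost_cache[expr] = cbo_cost(expr)
    return cost_cache[expr]

def best_rewrite(expr, rewrites):
    """Apply a cost-dependent rule only if its estimated cost beats the
    current upper bound (the best cost seen so far)."""
    best, bound = expr, cached_cost(expr)
    for r in rewrites:
        cand = r(expr)
        c = cached_cost(cand)
        if c < bound:            # upper-bound pruning
            best, bound = cand, c
    return best

shorten = lambda e: e.replace("  ", " ")
upper = lambda e: e.upper()
out = best_rewrite("select  *  from t", [shorten, upper, shorten])
```

Here the third rewrite produces an expression already in the cache, so only three CBO calls are made for four cost lookups; the real framework layers this idea across multiple levels of the plan.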
cs.DB 2026-05-07

Hierarchical agents clean messy time series without ground truth

AegisTS: A Hierarchical Agent System with Reinforcement Learning for Multivariate Time Series Data Cleaning

The agents decide issue order and pick cleaning methods using rewards tied to downstream results, boosting quality by up to 96%.

Multivariate time series (MTS) are frequently affected by co-occurring quality issues, such as missing values, outliers, and constraint violations, which significantly undermine downstream analytics. Existing cleaning approaches fix only a limited set of such issues, making them ill-suited for scenarios where multiple quality problems arise simultaneously. Furthermore, these methods commonly depend on the availability of ground truth data or domain-specific rules, both of which are rarely accessible in real-world applications. In this paper, we introduce AegisTS, an agent system with reinforcement learning designed to clean multiple data quality issues in MTS. We cast the cleaning process as a joint optimization problem that simultaneously handles quality issue order and cleaning model selection, allowing efficient navigation of the large space of possible cleaning pipelines. Our framework relies on a hierarchical agent architecture, where a high-level agent determines the order in which data quality issues should be processed, while a low-level agent identifies the most suitable cleaning method for each issue. To guide the agent toward an optimal cleaning pipeline, we propose a dual-stage reward mechanism that couples upstream (cleaning) and downstream performance, enabling effective optimization without relying on ground truth. Our experimental results show that AegisTS consistently outperforms existing methods, achieving up to 96% improvement in data cleaning quality and 27% improvement in downstream performance.
cs.DB 2026-05-06

Database repairs match preferred extensions in SETAFs

Inconsistent Databases and Argumentation Frameworks with Collective Attacks

For denial constraints and local-as-view TGDs, maximal repairs align with preferred argument sets; preprocessing yields a unique stable one.

Figure from the paper
Abstract
The connection between subset-maximal repairs for inconsistent databases involving various integrity constraints and acceptable sets of arguments within argumentation frameworks has recently drawn growing interest. In this paper, we contribute to this domain by establishing a new connection when integrity constraints (ICs) include denial constraints and local-as-view tuple-generating dependencies. It turns out that SET-based Argumentation Frameworks (SETAFs), an extension of Dung's argumentation frameworks (AFs) allowing collective attacks, are needed. It is known that subset-maximal repairs under denial constraints correspond to the naive extensions, which also coincide with the preferred and stable extensions in the resulting SETAFs. Our main findings establish that repairs under the considered fragment of tuple-generating dependencies correspond to the preferred extensions. Moreover, for these dependencies, additional preprocessing allows computing a unique extension that is stable and naive. Allowing both types of constraints breaks this relationship, and even the pre-processing does not help as only preferred semantics captures these repairs. Finally, while it is known that functional dependencies do not require set-based attacks, we prove the same regarding inclusion dependencies. Thus, one can translate inconsistent databases under these restricted classes of ICs to plain AFs with attacks only between arguments.
cs.DB 2026-05-06

ConRAD introduces a framework that applies conformal risk control inside neural graph database query engines

ConRAD: Conformal Risk-Aware Neural Databases

ConRAD uses conformal risk control and a new conformal gate operator to enforce declarative recall guarantees in neural graph queries while maximizing end-to-end precision.

Abstract
Querying incomplete knowledge graphs with neural predictors is powerful but dangerous. Errors compound across multi-hop pipelines with no formal bound on the completeness of results. We introduce ConRAD, the first framework to enforce declarative recall guarantees natively within a neural graph database query engine. Given a user-specified risk budget, ConRAD automatically derives per-operator prediction thresholds that satisfy the recall target with finite-sample, distribution-free statistical validity via Conformal Risk Control, while maximizing end-to-end precision. To scale calibration across multi-operator query topologies, we introduce a quantile-space scalarization that reduces intractable high-dimensional threshold searches to a single parameter. We further design the conformal gate, a novel physical operator that dynamically bypasses neural inference when local graph evidence suffices, eliminating unnecessary model inferences in dense graph regions. Evaluated across three benchmarks and three query topologies, ConRAD strictly satisfies all risk budgets, with empirical recall falling below the target by at most 0.046 across all settings. It reduces neural invocations to zero in near-complete graph regions, and achieves precision that matches or exceeds best-case static baselines that offer no guarantees and require manual threshold search.
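The core calibration step of Conformal Risk Control can be sketched compactly. This is generic single-threshold CRC with a missed-recall loss, not ConRAD's multi-operator, quantile-space calibration; `calibrate_threshold` is a hypothetical helper written for illustration.

```python
def calibrate_threshold(scores, labels, alpha, grid):
    """Pick the highest score threshold whose adjusted calibration risk
    stays within the budget alpha.

    scores[i][j]: predictor score for candidate j of calibration example i.
    labels[i][j]: 1 if candidate j is truly an answer, else 0.
    A candidate is kept when its score >= threshold, so the missed-recall
    loss is monotone nondecreasing in the threshold."""
    n = len(scores)
    best = min(grid)                       # loosest threshold = safe fallback
    for t in sorted(grid):
        losses = []
        for sc, lb in zip(scores, labels):
            relevant = sum(lb)
            if relevant == 0:
                losses.append(0.0)
                continue
            found = sum(1 for s, y in zip(sc, lb) if y and s >= t)
            losses.append(1 - found / relevant)   # fraction of answers missed
        # CRC finite-sample correction: (n * mean_risk + 1) / (n + 1)
        risk = (sum(losses) + 1) / (n + 1)
        if risk <= alpha:
            best = t                       # highest threshold still within budget
    return best
```

With a loss bounded in [0, 1] and monotone in the threshold, this correction gives the distribution-free, finite-sample risk guarantee the abstract refers to.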
cs.DB 2026-05-06

Sliced kd-trees speed up multi-dimensional queries in memory

In-memory Multidimensional Indexing Using the skd-tree

Each node divides space into slices along one dimension, cutting tree depth and enabling constant-SIMD traversal for range and nearest-neighbor queries.

Figure from the paper
Abstract
In this paper, we revisit the problem of indexing multi-dimensional data in memory for the efficient support of multi-dimensional range queries and nearest neighbor queries. This is a classic problem in main-memory databases, where there is a need for indexing multiple columns simultaneously. Established data structures include the R-tree, kd-tree, quad-tree, and grid-based partitioning. More recently, multi-dimensional learned indexes have also been proposed to address this problem. We propose slicing kd-tree (skd-tree), a variant of the kd-tree, where each node partitions the space of its subtree into multiple slices across a single splitting dimension. By compressing the splitters of the partitions and with the help of data-parallelism, we (i) radically reduce the number of levels of the tree and (ii) limit the number of computations required for multi-dimensional range and proximity queries. The nodes of the skd-tree resemble the nodes of a main-memory B+-tree, however, a different dimension is used at each level. Our novel range and kNN algorithms on the skd-tree apply only a small constant number of SIMD instructions at each node during tree traversal. Our contributions also include a novel top-down construction algorithm, different types of inner and leaf nodes that warrant tree balancing, and a novel update algorithm. Our skd-tree achieves strong performance compared to existing methods, according to our experimental evaluation on real and synthetic datasets.
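The slicing idea, many splitters per node along a single dimension so the tree is much shallower than a binary kd-tree, can be sketched without the SIMD machinery (an assumed simplification; the paper's nodes compress the splitters and traverse them with data-parallel instructions):

```python
import bisect

class SkdNode:
    """A multiway kd node: sorted splitters partition this node's region
    into len(splitters) + 1 slices along one dimension."""
    def __init__(self, dim, splitters, children):
        self.dim = dim              # splitting dimension at this level
        self.splitters = splitters  # sorted slice boundaries
        self.children = children    # one child (subtree or leaf list) per slice

def locate(node, point):
    """Descend to the leaf (a plain list of items) whose slice contains
    `point`; each level consults a different dimension."""
    while isinstance(node, SkdNode):
        i = bisect.bisect_right(node.splitters, point[node.dim])
        node = node.children[i]
    return node
```

A range query would instead visit every child slice overlapping the query box; the point-location loop above shows why depth shrinks: a node with k splitters replaces log2(k+1) binary levels.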
cs.DB 2026-05-06

3B model hits 85% Text-to-SQL accuracy using fine-grained rewards

FINER-SQL: Boosting Small Language Models for Text-to-SQL

FINER-SQL provides continuous learning signals from SQL execution to stabilize training of compact models and reduce reliance on large LLMs.

Figure from the paper
Abstract
Large language models have driven major advances in Text-to-SQL generation. However, they suffer from high computational cost, long latency, and data privacy concerns, which make them impractical for many real-world applications. A natural alternative is to use small language models (SLMs), which enable efficient and private on-premise deployment. Yet, SLMs often struggle with weak reasoning and poor instruction following. Conventional reinforcement learning methods based on sparse binary rewards (0/1) provide little learning signal when the generated SQLs are incorrect, leading to unstable or collapsed training. To overcome these issues, we propose FINER-SQL, a scalable and reusable reinforcement learning framework that enhances SLMs through fine-grained execution feedback. Built on group relative policy optimization, FINER-SQL replaces sparse supervision with dense and interpretable rewards that offer continuous feedback even for incorrect SQLs. It introduces two key reward functions: a memory reward, which aligns reasoning with verified traces for semantic stability, and an atomic reward, which measures operation-level overlap to grant partial credit for structurally correct but incomplete SQLs. This approach transforms discrete correctness into continuous learning, enabling stable, critic-free optimization. Experiments on the BIRD and Spider benchmarks show that FINER-SQL achieves up to 67.73% and 85% execution accuracy with a 3B model -- matching much larger LLMs while reducing inference latency to 5.57 s/sample. These results highlight a cost-efficient and privacy-preserving path toward high-performance Text-to-SQL generation. Our code is available at https://github.com/thanhdath/finer-sql.
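The contrast between a sparse 0/1 reward and dense partial credit can be illustrated with a toy operation-overlap reward. This is a simplified stand-in, not FINER-SQL's atomic reward: it counts SQL keywords rather than comparing parsed query structure.

```python
import re
from collections import Counter

# Keyword set used by this toy (an assumption; real systems parse the SQL).
SQL_OPS = {"select", "from", "where", "group", "order", "join",
           "having", "limit", "count", "sum", "avg", "min", "max"}

def extract_ops(sql: str) -> Counter:
    """Multiset of SQL operations appearing in a query string."""
    tokens = re.findall(r"[a-z_]+", sql.lower())
    return Counter(t for t in tokens if t in SQL_OPS)

def atomic_reward(pred_sql: str, gold_sql: str) -> float:
    """Dense partial credit: F1 overlap between operation multisets,
    so a structurally close but incomplete SQL still earns signal."""
    p, g = extract_ops(pred_sql), extract_ops(gold_sql)
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(p.values())
    rec = overlap / sum(g.values())
    return 2 * prec * rec / (prec + rec)

gold = "SELECT country, COUNT(*) FROM users GROUP BY country"
r_partial = atomic_reward("SELECT name FROM users", gold)  # > 0, unlike a 0/1 reward
```

Under a sparse 0/1 execution reward, both a nearly correct query and a nonsense one score 0; a graded overlap like this gives the policy a gradient toward the gold structure.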
cs.DB 2026-05-05

Static checker catches JDBC type errors before runtime

Static Type Checking for Database Access Code

Extending the Checker Framework verifies Java types for prepared statements and result sets on unmodified code.

Figure from the paper
Abstract
JDBC remains a key technology for database access in Java applications. Since the database dictionary and the Java type system have distinct scopes, developers inevitably need to deal with bugs in SQL-to-Java type mappings. We propose an extension of the Java compiler, based on the established Checker Framework, which allows us to bridge this gap. Our approach verifies statically that the correct Java types are used when setting prepared statement parameters or when getting values from result sets. This allows us to lift a practically important class of runtime errors to compile time. Our approach is sound and, therefore, is guaranteed not to produce false negatives. Our prototype implementation also offers a degraded mode for type-checking legacy software, if developers are only interested in a subset of errors. Our experiments show that our approach detects a wide range of type mismatches in real-world application code and can indeed prevent errors which might otherwise surface as runtime errors. From the perspective of the developer, our approach is extremely lightweight: it processes the unmodified Java code, yet developers may add their own annotations. This allows us to perform type-checking even across method boundaries, whereas commercial developer tools are restricted to local checks. Finally, we show that we can type-check real-world JDBC software with reasonable overhead during compilation.
cs.DB 2026-05-05

eBPF scheduler doubles throughput for time-sensitive DB tasks

Unfair by design: eBPF-based scheduling of mixed database workloads

UFS limits background tasks to idle CPUs and shares lock hints to halve tail latency versus standard Linux options in mixed workloads.

Figure from the paper
Abstract
Modern database systems increasingly co-schedule time-sensitive and background tasks. In such mixed workloads, background tasks should ideally utilize only spare CPU capacity without interfering with latency-critical requests. While some database-level solutions address this challenge, many database systems still rely on operating system (OS) schedulers, which, despite supporting priorities, do not reliably isolate high-priority tasks. Furthermore, they remain vulnerable to priority inversion, where preempted background tasks can delay other work. We present UFS, a selectively unfair scheduler implemented as an eBPF-based sched_ext scheduler in the Linux kernel. UFS restricts background tasks to idle CPU capacity and preempts them immediately when time-sensitive tasks arrive. To address priority inversion, UFS incorporates application-level hints via eBPF maps, ensuring that background tasks are not unnecessarily delayed should time-sensitive tasks wait for them to release locks. Our integration of UFS into PostgreSQL demonstrates that, under mixed workloads, UFS improves throughput for time-sensitive tasks by up to 2X, while reducing tail latency by half, compared to existing scheduling options in Linux.
cs.DB 2026-05-05

2-bit vectors build ANN graphs for 16x faster search

QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization

QuIVer performs all topology decisions in quantized space to achieve high recall at tens of thousands of queries per second with minimal memory.

Figure from the paper
Abstract
Approximate nearest neighbor (ANN) graph indices such as HNSW and Vamana construct graph topology in full-precision or high-fidelity quantized metric spaces, using binary quantization (BQ) only as a post-hoc search-time distance estimator. We ask whether BQ can build the graph itself. We present QuIVer (Quantized Index for Vector Retrieval), a training-free ANN graph index that performs edge selection, pruning, and graph navigation entirely in a 2-bit Sign-Magnitude BQ metric space. QuIVer combines: (i) a 2-bit encoding that preserves sign and magnitude strength at 1/12 the memory of float32 vectors; (ii) Vamana alpha-diversity pruning directly on BQ distances; and (iii) symmetric BQ beam search using XOR/AND/Popcount, followed by float32 reranking over a small candidate set. On six embedding datasets spanning 384--3072 dimensions, QuIVer achieves at least 88% Recall@10 at 13--41K multi-threaded QPS with 58--262-second construction and less than 1.3 GB hot memory. At matched recall on Cohere-1M, it outperforms the official DiskANN Rust implementation by 2.5--3.3x, hnswlib by 3.6--4.7x, and FAISS HNSW by 3.8--4.9x in multi-threaded throughput. Controlled experiments on additional datasets, including word vectors, CV features, uniform random vectors, multimodal CLIP embeddings, and low-rank synthetic data, delineate QuIVer's applicability boundary: high recall requires cosine-native distributions with low effective dimensionality, while monotonic recall gains with increasing ef suggest that BQ noise mainly affects navigation efficiency in the tested regimes.
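The sign-plus-magnitude encoding and the XOR/AND/Popcount distance can be sketched with plain Python integers as bitsets. The weighting scheme below (strong-dimension mismatches count twice) is an assumption for illustration, not QuIVer's exact estimator.

```python
def encode_2bit(vec, mag_threshold):
    """Pack a float vector into two bitsets: one sign bit and one
    magnitude bit per dimension (2 bits total per coordinate)."""
    signs = mags = 0
    for i, x in enumerate(vec):
        if x < 0:
            signs |= 1 << i           # sign bit: 1 = negative
        if abs(x) >= mag_threshold:
            mags |= 1 << i            # magnitude bit: 1 = "strong" coordinate
    return signs, mags

def bq_distance(a, b):
    """Count sign disagreements, weighting dimensions where both
    vectors are strong. Uses only XOR, AND, and popcount, mirroring
    symmetric BQ beam search."""
    (sa, ma), (sb, mb) = a, b
    diff = sa ^ sb                    # XOR: mismatched signs
    strong = ma & mb                  # AND: both magnitude bits set
    return bin(diff).count("1") + bin(diff & strong).count("1")

q = encode_2bit([0.9, -1.4, 0.05, 0.8], 0.5)
```

Because codes are 2 bits per dimension, a 128-dimensional vector fits in 32 bytes instead of 512 for float32, which is where the roughly 1/12 hot-memory figure (codes plus graph overhead) comes from.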
cs.DB 2026-05-04 2 theorems

Dual HNSW graphs enable fast search for any Lp metric

U-HNSW: An Efficient Graph-based Solution to ANNS Under Universal Lp Metrics

By generating candidates from L1 and L2 indexes and verifying with early stops, the method cuts query time by orders of magnitude versus LSH

Figure from the paper
Abstract
Approximate nearest neighbor search under universal L_p metrics (ANNS-U-L_p) is an important and challenging research problem, as it requires answering queries under all possible p (0 < p <= 2) values simultaneously without building an index for each possible p value. The state-of-the-art solution, called MLSH, is a Locality-Sensitive Hashing (LSH)-based ANNS method with barely acceptable query performance. In contrast, graph-based ANNS methods, which offer significantly improved query efficiency on the ANNS-L_p problem (with a fixed p value), cannot be naively extended to the ANNS-U-L_p problem. In this paper, we propose U-HNSW, the first graph-based method for ANNS-U-L_p. Our scheme uses HNSW graph indexes built on two base metrics (L_1 and L_2) to generate promising nearest neighbor candidates, and then verifies these candidates with an early-termination strategy that substantially reduces the number of expensive L_p distance computations. Experimental results show that U-HNSW not only achieves up to 2670 times shorter query times than the original MLSH implementation running on a RAM disk (up to 15 times shorter than the idealized MLSH), but also outperforms the original HNSW on the ANNS-L_p problem (with a fixed p value), except for a few special p values.
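The early-termination step can be sketched generically: accumulate |a_i - b_i|^p term by term and abandon a candidate as soon as the partial sum exceeds the p-th power of the current best bound. This is an illustrative sketch; the system's actual verification strategy may differ.

```python
def lp_dist_early(a, b, p, bound):
    """Return the L_p distance between a and b, or None if it provably
    exceeds `bound` (e.g. the current k-th best candidate distance)."""
    limit = bound ** p
    acc = 0.0
    for x, y in zip(a, b):
        acc += abs(x - y) ** p
        if acc > limit:
            return None          # early termination: cannot beat the bound
    return acc ** (1.0 / p)
```

The check is sound because each term |x - y|^p is nonnegative, so the partial sum only grows; exceeding bound^p at any prefix already rules the candidate out without finishing the expensive computation.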
cs.DB 2026-05-04 3 theorems

This paper proposes Action Units as structured extensions to knowledge representations for context-sensitive action

Actionable Understanding: Action Units for Bridging the Knowledge-Action Gap in Post-FAIR Knowledge Infrastructures

Action Units are introduced as typed, composable components in knowledge graphs that encode epistemic, transformational, and intervention operations.

Abstract
Despite unprecedented growth in biodiversity data, a persistent gap remains between what is known and what is acted upon. Existing frameworks such as the FAIR and CLEAR Principles have improved data accessibility and interpretability but do not provide the components required to translate knowledge into context-sensitive action. We argue that closing this knowledge-action gap requires a shift toward statement-centred and action-oriented knowledge infrastructures. We identify a fundamental distinction between actionability as the structural capacity of a representation to support operations and applicability as the epistemic validity of using that knowledge in a specific context. Building on the Semantic Units Framework, we introduce Action Units as structured extensions of plan specifications that make applicability conditions and contextual grounding explicit as first-class typed components. Three types are distinguished, epistemic, transformational, and intervention action units, corresponding to three operation classes that define a minimal operational architecture for actionable knowledge. Action units can also be granularly composed across operation classes, reflecting the cross-class character of real-world knowledge-driven processes. Conditional action units, operationalized as executable IF-THEN structures, enable knowledge graphs to function as graph-native decision-support systems, constituting a transition toward post-FAIR knowledge infrastructures. Applied to biodiversity science, the framework reinterprets documented intervention and epistemic failures as consequences of incomplete action unit structures and constructs worked examples across all three operation classes. We propose the TripleA Principle: Actionability, Applicability, and Auditability, as a guiding framework for next-generation knowledge infrastructure design extending the FAIR and CLEAR Principles.
cs.DB 2026-05-04

Lattice method balances duplication and search for authorized vectors

Don't Be a Pot Stirrer! Authorized Vector Data Retrieval via Access-Aware Indexing

Veda partitions data into role blocks, merges under storage limits, and prunes mixed nodes via distance bounds from pure nodes.

Figure from the paper
Abstract
Vector databases increasingly enforce role-based access control, where each top-k approximate nearest neighbor query must return only vectors the querying role is authorized to access. Two extremes bracket the design space. A single global index built over all vectors avoids duplication but wastes search effort on unauthorized vectors and degrades recall, while an oracle index, built from all vectors authorized to the querying role, searches only authorized vectors but duplicates every shared vector between roles or queries. We present Veda and its efficient variant EffVeda, two indexing strategies built on an access-aware lattice to address access control in vector databases. The methods first partition the dataset into disjoint data blocks by role combination, then leverage the structure of the access-aware lattice to apply copy and merge operations that group co-accessed blocks under a user-specified storage budget. Large nodes in the lattice are then indexed with HNSW, while small nodes are retained for linear scan. To facilitate query processing on the lattice, our methods construct a query plan that selects the minimal set of nodes covering all authorized data for each role. At query time, coordinated search first queries pure (authorized-only) nodes to populate a global top-k heap, then leverages the distance bound of the k-th entry in the heap to prune exploration on impure nodes, avoiding the inflated search that independent per-index execution would require. Evaluations show that our methods deliver higher throughput at high recall while closely tracking the storage budget.
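The pure-then-impure pruning pattern can be sketched with a top-k heap. This is a generic illustration, not Veda's lattice planner; `lower_bound` is a hypothetical per-partition bound assumed to be available (e.g. precomputed or derived from an index).

```python
import heapq

def topk_with_pruning(query_dist, pure, impure, lower_bound, k):
    """pure/impure: lists of partitions (each a list of vectors).
    query_dist(v): exact distance from the query to vector v.
    lower_bound(part): optimistic bound on the best distance in a
    partition. Pure partitions are scanned first to fill the heap."""
    heap = []                             # max-heap of size k via negation
    def offer(d):
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
    for part in pure:                     # authorized-only partitions first
        for v in part:
            offer(query_dist(v))
    for part in impure:                   # mixed partitions, bound-pruned
        if len(heap) == k and lower_bound(part) >= -heap[0]:
            continue                      # pruned: cannot improve the top-k
        for v in part:
            offer(query_dist(v))
    return sorted(-d for d in heap)
```

Scanning pure partitions first makes the k-th best distance tight early, so more impure partitions fail the bound check and are skipped entirely.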
cs.DB 2026-05-04

Five patterns decouple writes from reads in search engines

Write-Read Decoupling in Modern Large-Scale Search Engines: Architectures, Techniques, and Emerging Approaches

Per-field routing lets scalar fields update in O(1) time with immediate visibility while full-text fields use separate paths.

Figure from the paper
Abstract
Large-scale search engines face a fundamental tension: the index must be updated frequently to maintain freshness, yet updates create resource contention that inflates query latency. In the dominant Lucene-based architecture, segment merges triggered by writes compete with concurrent queries for CPU cycles, disk I/O bandwidth, and operating-system page cache -- a problem we term write-read contention. This survey systematically examines the architectural solutions that industry and academia have developed to decouple write pressure from read latency. We identify five principal patterns: (i) node-level read-write separation; (ii) compute-storage separation; (iii) full in-memory indexing; (iv) log-structured write paths; and (v) in-place partial updates. We survey representative systems including Elasticsearch, LinkedIn Galene, Uber Sia, Quickwit, Alibaba Havenask, Algolia, Milvus, and Vespa, and discuss an emerging synthesis -- the ScaleSearch architecture -- that combines compute-storage separation with full in-memory indexing and dedicated write nodes. A key contribution of ScaleSearch is per-field update routing: each field is assigned its own Kafka topic and update path, allowing scalar fields (price, stock, tags) to be updated in place in O(1) RAM with immediate visibility while full-text fields follow the segment-based compute-storage path. We conclude with open challenges in hybrid vector-and-full-text retrieval, serverless deployments, and AI-integrated search.
cs.DB 2026-05-04

Team projects built into database courses raise grades and teamwork scores

Complete Integration of Team Project-based Learning into a Database Syllabus

Four years of data from advanced database subjects show gains in peer reviews, student feedback, and final performance.

Figure from the paper
Abstract
Team project-based learning (TPBL) combines two learning techniques: project-based learning (PBL) and teamwork. This combination leverages the learning outcomes of both methods and places students in a real work situation where they must develop and solve a real project while working as a team. TPBL has been used in two advanced database subjects in Jaume I University (UJI)'s Computer Science degree program. This learning method was used for four years (academic years 2018/19 to 2021/22) with positive outcomes. This study presents the project development, which includes team formation, activities, timetable, and exercised learning competencies (both soft and specific). Further, the project's results were evaluated from three different perspectives: a) teamwork evaluation by teammates, b) students' opinions on the subject and project, and c) subject final grades.
cs.DB 2026-05-04

One abstraction unifies database evolution

Living Databases: A Unified Model for Continuous Schema Evolution, Versioning, and Transformations

Shared primitives add conditional updates, alerts, and declarative control over views and derived models.

Figure from the paper
Abstract
Databases, and datasets more generally, evolve continuously through updates, transformations, versioning, schema changes, streaming operations, and other mechanisms. While prior work has noted connections among some of these areas, they have traditionally been studied in isolation, each with its own abstractions, algorithms, and system implementations. In this paper, we argue for unifying these diverse functionalities under a single abstraction and a common set of computational primitives. We present such an abstraction, powerful enough to encompass existing use cases and to support new ones. Going beyond previous approaches, our framework seamlessly integrates provenance tracking for system-visible operations, conditional propagation of updates, and configurable alerts on change events. It also offers a principled treatment of dependent objects such as views and derived artifacts like machine learning models, by providing declarative mechanisms to control their evolution. Finally, we sketch a prototype implementation in a relational-like database system based on an adaptation of the "Prolly Tree", a Merkle tree-inspired data structure with tunable parameters to meet varying performance requirements, and present some initial experimental results.
cs.DB 2026-05-04

Execution-verified renamings recover Text-to-SQL accuracy on noisy schemas

EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

A greedy pipeline screens names, proposes alternatives, checks them against query runs, and emits non-destructive views that improve models without altering the database.

Figure from the paper
Abstract
Text-to-SQL enables non-expert users to query databases in natural language, yet real-world schemas often suffer from ambiguous, abbreviated, or inconsistent naming conventions that degrade model accuracy. Existing approaches treat schemas as fixed and address errors downstream. In this paper, we frame schema refinement as a constrained optimization problem: find a renaming function that maximizes downstream Text-to-SQL execution accuracy while preserving query equivalence through database views. We analyze the computational hardness of this problem, which motivates a column-wise greedy decomposition, and instantiate it as EGRefine: a four-phase pipeline that screens ambiguous columns, generates context-aware candidate names, verifies them through execution-grounded feedback, and materializes the result as non-destructive SQL views. The pipeline carries two structural properties: column-local non-degradation, ensured by the conservative selection rule in the verification phase, and database-level query equivalence, ensured by the view-based materialization phase. Together they make the resulting refinement safe by construction at the column level, with cross-column and prompt-level interactions handled empirically rather than analytically. Across controlled schema-degradation, real-world, and enterprise benchmarks, EGRefine recovers accuracy lost to schema naming noise where applicable and correctly abstains where the underlying task exceeds current Text-to-SQL capabilities, with refined schemas transferring across model families to enable refine-once, serve-many-models deployment. Code and data are publicly available at https://github.com/ai-jiaqian/EGRefine.
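The column-local non-degradation rule can be sketched as a greedy loop over columns. This is illustrative only: `eval_accuracy` is a hypothetical callback standing in for EGRefine's execution-grounded verification against a Text-to-SQL model.

```python
def refine_schema(columns, candidates, eval_accuracy):
    """For each column, adopt the best candidate rename only if it
    strictly beats keeping the current name (so no single rename can
    degrade accuracy: column-local non-degradation).

    candidates: dict column -> list of proposed replacement names.
    eval_accuracy: dict of renames so far -> measured accuracy (stand-in
    for running the downstream Text-to-SQL model)."""
    renaming = {}
    for col in columns:
        base = eval_accuracy(dict(renaming))       # accuracy with renames so far
        best_name, best_acc = col, base
        for name in candidates.get(col, []):
            acc = eval_accuracy({**renaming, col: name})
            if acc > best_acc:                     # conservative selection rule
                best_name, best_acc = name, acc
        if best_name != col:
            renaming[col] = best_name
    return renaming
```

Materializing the chosen renaming as SQL views (rather than `ALTER TABLE`) is what keeps the refinement non-destructive and query-equivalent for existing workloads.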
cs.DB 2026-05-04

SPARQL multiset patterns match Datalog and relational algebra

Multiset semantics in SPARQL, Relational Algebra and Datalog

The three languages are expressively equivalent for AND, UNION, FILTER, EXCEPT and SELECT under bag semantics.

Figure from the paper
Abstract
The paper analyzes and characterizes the algebraic and logical structure of the multiset semantics for SPARQL patterns involving AND, UNION, FILTER, EXCEPT, and SELECT. To do this, we align SPARQL with two well-established query languages: Datalog and Relational Algebra. Specifically, we study (i) a version of non-recursive Datalog with safe negation extended to support multisets, and (ii) a multiset relational algebra comprising projection, selection, natural join, arithmetic union, and except. We prove that these three formalisms are expressively equivalent under multiset semantics.
cs.DB 2026-05-01

Two-phase sampling cuts online aggregation cost up to 3x

Index-Assisted Stratified Sampling for Online Aggregation

First-phase index samples both seed the answer and tune the strata under the real probe-cost model for the rest of the query.

Figure from the paper
Abstract
Ad-hoc queries over frequently updated data in a flat schema are common in real-time data analysis applications and often require very low latency. Online aggregation achieves this by providing approximate aggregation answers with confidence bound guarantees. It relies on the ability to draw samples online in time linear in the sample size rather than the database size, which can be supported by index-assisted Sampling-based Approximate Query Processing (S-AQP) systems. However, the query latencies of approximate queries in these systems can still suffer from the excessive sampling cost required to achieve a desired confidence bound, due to increased sample sizes for data with high variance in value distribution and selectivity. Classic stratified sampling methods with Neyman allocation can minimize sample size in theory, but several challenges prevent them from being applicable in index-assisted S-AQP systems, including the need for a priori statistics, high optimization cost, and an inaccurate sampling cost model based on sample size. To address this, we design index-assisted stratified sampling for online aggregation, which features a two-phase sampling framework. Samples drawn in the first phase are used both for online aggregation and for optimizing future sampling cost, while the second phase continues the online aggregation using the optimized strata. We prove optimal stratification and sample size allocation strategies for an index-based sampling cost model, and design several greedy and dynamic programming based optimization methods to balance optimization cost against effectiveness in cost reduction. We evaluate our methods on several real-world and synthetic datasets and queries; the results show they consistently achieve good speedups and, in extreme cases, up to 3x and 98708x speedups compared to index-assisted uniform sampling and classic scan-based stratified sampling, respectively.
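For background, classic Neyman allocation (which the paper generalizes to an index-based cost model) assigns sample sizes proportional to stratum size times stratum standard deviation:

```python
def neyman_allocation(sizes, stds, total_samples):
    """Allocate total_samples across strata proportionally to N_h * S_h,
    where N_h is the stratum size and S_h its standard deviation."""
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    return [round(total_samples * w / total) for w in weights]

# Two equal-size strata: the high-variance one gets 10x the samples.
alloc = neyman_allocation([1000, 1000], [10.0, 1.0], 110)
```

This minimizes estimator variance when the cost of drawing a sample is uniform; the paper's point is precisely that index-assisted sampling violates that uniform-cost assumption, which is why it derives allocation for a probe-cost model instead.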
cs.DB 2026-05-01

Tailwind speeds TPC-H queries 1.38x on average

Tailwind: A Practical Framework for Query Accelerators

External planner uses user-defined abstract plans and neural models to apply accelerators transparently in any RDBMS with import/export.

Figure from the paper
Abstract
Relational database management systems (RDBMSes) can process general-purpose queries, but often have lower performance compared to custom-built solutions for specific queries. For example, consider a group-by query over a few known groups (e.g., grouping by country). While an RDBMS would likely use a hash map to do the grouping, a faster method could hard-code the expected groups into the query executor. But such workload-specific techniques, which we call query accelerators, are not widely used in practice because the engineering effort (optimizer and engine changes, potential bugs) does not justify the isolated performance gains (speedup on a single specific query). We propose Tailwind: an external query planner that brings accelerators into any RDBMS that supports data import/export. Users define their accelerators using abstract logical plans (ALPs): a new mostly-declarative abstraction over relational operators built on regular tree expressions. ALPs allow Tailwind to automatically build customized neural network models to estimate when using a particular accelerator is beneficial. At runtime, Tailwind sits atop an RDBMS and transparently rewrites queries to run across one or more accelerators when predicted to be beneficial, falling back to the underlying RDBMS when not. On Redshift and DuckDB with a library of four diverse accelerators, Tailwind accelerates TPC-H queries by 1.38x on average (up to 29x).
cs.DB 2026-04-30

Synthetic databases reveal 3-14 percent drops in text-to-SQL accuracy

SynSQL: Synthesizing Relational Databases for Robust Evaluation of Text-to-SQL Systems

LLM-generated test data conditioned on questions shows models fail more often than static benchmarks indicate.

Evaluating text-to-SQL systems remains largely fragile: correctness is typically judged by executing predicted and gold SQL queries on a single static database, even though the same queries may behave differently under alternative database instances. This raises a broader language modeling question: Can large language models synthesize semantically meaningful, schema-consistent relational data directly from a natural language question? If so, such generation can serve as a controlled mechanism for stress-testing text-to-SQL systems beyond fixed benchmark databases. We introduce SynSQL, a framework that synthesizes test databases conditioned on question-schema alignment rather than gold SQL queries. SynSQL decomposes the task into three stages: (1) schema selection, (2) question-guided data synthesis, and (3) constraint-aware critique with iterative refinement, framing database construction as structured generation under semantic and relational constraints. Across ten text-to-SQL models on Spider, BIRD, and Spider 2.0, SynSQL-generated databases reveal performance drops of 3-14% compared to static evaluation, exposing errors masked by benchmark artifacts. We further analyze generation quality, constraint adherence, and failure modes, highlighting both the promise and limitations of LLMs in structured data synthesis. Our findings position synthetic database generation as a new lens for studying LLM reasoning, controllability, and robustness in structured environments.
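A minimal sketch of the kind of constraint-aware critique stage described above, assuming a toy parent/child schema with a NOT NULL key and a foreign key (the column names are hypothetical, not SynSQL's actual checker):

```python
def check_synthetic_rows(parent_ids, child_rows):
    """Minimal constraint-aware critique: every synthesized child row must
    carry a non-null key (NOT NULL) and reference an existing parent (FK).
    Violations would be fed back for iterative refinement."""
    errors = []
    for i, row in enumerate(child_rows):
        if row["id"] is None:
            errors.append((i, "NOT NULL violated on id"))
        if row["parent_id"] not in parent_ids:
            errors.append((i, "FK violated on parent_id"))
    return errors

parents = {1, 2}
rows = [{"id": 10, "parent_id": 1}, {"id": None, "parent_id": 3}]
print(check_synthetic_rows(parents, rows))  # flags both violations on row 1
```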
cs.DB 2026-04-30

One model unifies table discovery from text and table queries

Unified Data Discovery across Query Modalities and User Intents

UniDisc learns shared representations from lake context to rank tables for varied query types and user goals without heavy labeling.

Data discovery - retrieving relevant tables from a data lake in response to user queries - is a fundamental building block for downstream analytics. In practice, data discovery must support different query modalities, including natural language (NL) statements and tables, and accommodate diverse user intents, ranging from open-ended enrichment to task-driven inference for applications such as table question answering and fact verification. However, most existing methods are designed for a single query modality or a specific user intent, limiting their generalizability. We propose UniDisc, a unified data discovery framework that supports both NL statements and tables as queries and generalizes across diverse user intents without intent-specific representations or relevance modeling. UniDisc learns a common cross-modal representation model that produces unified representations for queries of different modalities and candidate tables, enabling uniform relevance assessment across discovery scenarios. Since learning such a model typically requires large labeled collections of query-table pairs, which are expensive to obtain, UniDisc instead exploits contextual signals naturally available in data lakes. Specifically, it models NL statements and tables as nodes in a heterogeneous graph with multiple edge types, and applies dual-view neighbor aggregation and joint optimization to learn robust, context-aware representations under limited supervision. These representations support flexible relevance estimation during retrieval. Experiments on seven datasets show that UniDisc consistently outperforms strong baselines on both NL- and table-based discovery.
cs.DB 2026-04-30

Graphify turns GraphQL into single optimized Gremlin queries in linear time

Graphify: Automated Synthesis of Type-Safe Graph Backends via O(S) GraphQL-to-Gremlin Transpilation

Automated mapping supports full CRUD, nested predicates, sorting and pagination while enforcing type safety at the API layer.

Graph databases offer unparalleled flexibility for managing interconnected data, yet the lack of strict schema enforcement often leads to runtime uncertainties and complex query development. This paper introduces Graphify, an end-to-end framework that enables developers to visually model graph data schemas and automatically synthesize a fully functional, type-safe backend. This paper proposes a formal mapping of GraphQL artifacts to the Gremlin traversal machine, supporting complete CRUD operations and arbitrarily nested queries. The system generates a transpiler capable of converting complex GraphQL requests into a single, optimized Gremlin query, including advanced features such as nested logical predicates, multi-key sorting, and pagination. At the core of the framework is a recursive state machine algorithm that processes GraphQL Abstract Syntax Trees (ASTs) with linear time complexity $O(S)$ relative to the number of selected fields. This paper demonstrates the practical efficiency and theoretical robustness of the approach through formal complexity analysis and empirical evaluation using the MovieLens 100k dataset. The result is a system that enables the generation of graph interfaces in minutes, bridging the gap between flexible graph storage and type-safe API consumption.
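A toy illustration of a single-pass mapping from a GraphQL selection set to a Gremlin-like traversal string, not Graphify's actual transpiler (the traversal shape and field names are hypothetical):

```python
def transpile(entity, fields):
    """Toy GraphQL-selection-to-Gremlin sketch. Each selected field is
    visited exactly once, so the pass is O(S) in the number of selected
    fields, mirroring the complexity claim above."""
    steps = []
    for f in fields:
        if isinstance(f, tuple):            # nested selection: (edge label, subfields)
            edge, sub = f
            steps.append(f"out('{edge}').{transpile(edge, sub)}")
        else:                               # scalar field
            steps.append(f"values('{f}')")
    return "union(" + ", ".join(steps) + ")"

q = transpile("Movie", ["title", ("rated_by", ["name"])])
print(q)  # union(values('title'), out('rated_by').union(values('name')))
```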
cs.DB 2026-04-30

LLM search aligns pivot table schemas at 88% accuracy

PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search

It works without training on anonymized data from unseen domains and comes with a new real-world benchmark for evaluation.

Pivot tables are ubiquitous in data lakes of modern data ecosystems, making accurate schema matching over pivot tables a key prerequisite for data integration. In this paper, we focus on pivot-table schema matching, a novel joint schema-value matching task. It aims to align schemas between pivot tables and standard relational tables, where a correct match must be semantically consistent at the schema level and compatible at the value level. However, due to the inherent data sensitivity of this task, the prevalence of anonymized data in practice poses significant challenges to matching accuracy and generalization capability. To tackle these challenges, we propose PiLLar, the first framework for pivot-table schema matching. We first formulate PiLLar as an LLM-driven search paradigm that operates with minimal annotated privacy-compliant data, thereby achieving training-free adaptation across diverse domains. Next, we provide a theoretical analysis of the error dynamics of the paradigm to ensure the asymptotic convergence of the proposed method. Furthermore, we introduce a new benchmark, PTbench, derived from four representative real-world domains and constructed by mining unpivot-suitable tables, performing unpivot on semantically coherent attributes, and applying sampling and anonymization. Extensive experiments demonstrate the superiority of PiLLar, which achieves an average accuracy of 87.94% on the correctly predicted matches.
cs.DB 2026-04-30

LLM assistant cuts big data support tickets by 20.8%

SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms

Hierarchical retrieval and automated SOP extraction from tickets let the system handle both routine and complex queries on large platforms.

Big data platforms are widely used in modern enterprises, and an in-production intelligent assistant is increasingly important to help users quickly find actionable guidance and reduce operational burden. While recent LLM+RAG assistants provide a natural interface, they face practical challenges in real deployments: limited scenario coverage across both general consultation and domain-specific troubleshooting workflows, inefficient knowledge access due to inadequate multi-hop retrieval and flat knowledge organization, and high maintenance cost because escalated tickets are unstructured and hard to convert into assistant improvements and reusable SOPs. In this paper, we present SiriusHelper, a deployed intelligent assistant for big data platforms. SiriusHelper serves as a unified online assistant that automatically identifies user intent and routes queries to the right handling path, including dedicated expert workflows for specialized scenarios (e.g., SQL execution diagnosis). To support complex troubleshooting, SiriusHelper combines a DeepSearch-driven mechanism with a priority-based hierarchical knowledge base to enable multi-hop retrieval without context overload, thus improving answer reliability and latency. To reduce expert overhead, SiriusHelper further introduces automated ticket understanding and SOP distillation: it diagnoses the assistant failure reason (e.g., missing knowledge or wrong routing) and extracts domain-specific SOPs to continuously enrich the knowledge base. Experiments and online deployment on Tencent Big Data platform show that SiriusHelper outperforms representative alternatives and reduces online ticket volume by 20.8%.
cs.DB 2026-04-29

Evergreen converts verification of claims in LLM-generated semantic aggregates into…

Evergreen: Efficient Claim Verification for Semantic Aggregates

Evergreen verifies claims from semantic aggregates by compiling them into optimized semantic queries, achieving F1=1.0 at 3.2x lower cost…

With recent semantic query processing engines, semantic aggregation has become a primitive operator, enabling the reduction of a relation into a natural language aggregate using an LLM. However, the resulting semantic aggregate may contain claims that are not grounded in the underlying relation. Verifying such claims is challenging: they often involve quantifiers, groupings, and comparisons over relations that far exceed LLM context windows and require a costly combination of semantic and symbolic processing. We present Evergreen, a system that recasts claim verification as a semantic query processing task with tailored optimizations and provenance capture. Evergreen compiles each claim into a declarative semantic verification query and executes it on the same engine that produced the aggregate. To reduce cost and latency, Evergreen avoids unnecessary LLM calls through verification-aware optimizations (early stopping, relevance sorting, and estimation with confidence sequences) and general-purpose optimizations for semantic queries (operator fusion, similarity filtering, and prompt caching). Each verdict is accompanied by citations that identify a minimal set of tuples justifying the result, with semantics based on semiring provenance for first-order logic. On a benchmark of real-world restaurant review datasets reflecting production-inspired workloads, Evergreen achieves excellent verification quality (F1 = 1.00) with a strong LLM while reducing cost by 3.2x and latency by 4.0x compared to unoptimized verification. Even with a significantly weaker LLM, Evergreen outperforms a strong LLM-as-a-judge baseline in F1 at 48x lower cost and 2.3x lower latency. Relative to a retrieval-augmented agent, Evergreen compares favorably in F1 and latency with similar cost when both use a strong LLM; yet, with a much weaker LLM, it achieves the same F1 at 63x lower cost and 4.2x lower latency.
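The early-stopping and relevance-sorting optimizations can be illustrated with a minimal sketch for a universally quantified claim (the data and predicate are hypothetical; a real system would delegate the per-row check to an LLM or the semantic query engine):

```python
def verify_all(rows, pred, sort_key=None):
    """Early-stopping sketch: to verify 'all rows satisfy pred', scan rows
    (optionally relevance-sorted so likely counterexamples come first) and
    stop at the first violation, returning it as a citation-style witness."""
    ordered = sorted(rows, key=sort_key) if sort_key else rows
    for row in ordered:
        if not pred(row):
            return False, row   # verdict plus a minimal justifying tuple
    return True, None

reviews = [{"stars": 5}, {"stars": 2}, {"stars": 4}]
# Claim: "every review gives at least 3 stars". Sorting ascending puts the
# cheapest potential counterexample first, so only one check is needed.
print(verify_all(reviews, lambda r: r["stars"] >= 3, sort_key=lambda r: r["stars"]))
```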
cs.DB 2026-04-29

CacheRAG turns stateless KGQA planning into cached learning

CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering

Semantic storage and diverse retrieval of past plans reduce hallucinations while keeping expansion costs bounded.

The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA systems act as stateless planners, generating retrieval plans in isolation without exploiting historical query patterns: analogous to a database system that optimizes every query from scratch without a plan cache. This fundamental design flaw leads to schema hallucinations and limited retrieval coverage. We propose CacheRAG, a systematic cache-augmented architecture for LLM-based KGQA that transforms stateless planners into continual learners. Unlike traditional database plan caching (which optimizes for frequency), CacheRAG introduces three novel design principles tailored for LLM contexts: (1) Schema-agnostic user interface: A two-stage semantic parsing framework via Intermediate Semantic Representation (ISR) enables non-expert users to interact purely in natural language, while a Backend Adapter grounds the LLM with local schema context to compile executable physical queries safely. (2) Diversity-optimized cache retrieval: A two-layer hierarchical index (Domain $\rightarrow$ Aspect) coupled with Maximal Marginal Relevance (MMR) maximizes structural variety in cached examples, effectively mitigating reasoning homogeneity. (3) Bounded heuristic expansion: Deterministic depth and breadth subgraph operators with strict complexity guarantees significantly enhance retrieval recall without risking unbounded API execution. Extensive experiments on multiple benchmarks demonstrate that CacheRAG significantly outperforms state-of-the-art baselines (e.g., +13.2% accuracy and +17.5% truthfulness on the CRAG dataset).
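The Maximal Marginal Relevance selection used for diversity-optimized cache retrieval can be sketched as follows (the similarity scores are hypothetical; a real system would compute them from embeddings of cached plans):

```python
def mmr(query_sim, pair_sim, candidates, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick cached examples that are
    relevant to the query yet dissimilar to examples already selected,
    trading off the two via lam."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: lam * query_sim[c]
                   - (1 - lam) * max((pair_sim[(c, s)] for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

query_sim = {"a": 0.9, "b": 0.85, "c": 0.5}          # relevance to the query
pair_sim = {("b", "a"): 0.95, ("c", "a"): 0.1}       # b is a near-duplicate of a
print(mmr(query_sim, pair_sim, ["a", "b", "c"], 2))  # ['a', 'c']: the near-duplicate b is skipped
```

Pure top-k by relevance would return a and b, feeding the LLM two structurally redundant examples; MMR is what mitigates that reasoning homogeneity.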
cs.DB 2026-04-29

Negative patterns raise viral classification accuracy

Mining Negative Sequential Patterns to Improve Viral Genomic Feature Representation and Classification

Mining absence-based subsequences in RNA viral genomes lifts average accuracy by 10 percent over prior negative pattern mining and 25 percent over a positive pattern mining baseline.

Viruses represent the most abundant biological entities on Earth and play a pivotal role in microbial ecosystems, yet, as prominent human pathogens, they are closely linked to human morbidity and mortality. Accurate identification of viral sequences from viral genome sequences is therefore essential, but existing genome-based classification models that largely rely on composition- or frequency-based subsequence features often suffer from limited interpretability and reduced accuracy, particularly on complex or imbalanced datasets. To address these limitations, we propose GeneNSPCla (Genomic Negative Sequential Pattern-based Classification), a novel viral classification framework based on Negative Sequential Patterns (NSPs) that extracts discriminative absence-based features from nucleotide sequences of RNA viral genomes. By transforming these NSPs into numerical feature vectors and integrating them into multiple supervised classifiers, GeneNSPCla effectively captures both presence and absence signals in viral sequences. Furthermore, we propose a negative pattern mining algorithm adapted for processing genomic data, GONPM+, which can discover longer and more biologically meaningful negative sequential patterns. The experimental results demonstrate that the average accuracy of GONPM+ across 8 classifiers has improved by 10.03% compared to the original negative pattern mining algorithm and by 24.75% compared to the positive pattern mining algorithm. These findings highlight the effectiveness of incorporating absence-based sequential information, providing a new and complementary perspective for viral genome analysis and classification.
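A minimal sketch of absence-based (negative-pattern) features, assuming toy patterns rather than ones actually mined by GONPM+:

```python
def contains_subsequence(seq, pattern):
    """True if pattern occurs in seq as a (not necessarily contiguous)
    subsequence. Membership tests on the iterator consume it, so matched
    positions advance monotonically."""
    it = iter(seq)
    return all(ch in it for ch in pattern)

def nsp_features(genome, patterns):
    """Absence-based feature vector: 1 if the pattern is ABSENT from the
    sequence (the negative-pattern signal), else 0."""
    return [0 if contains_subsequence(genome, p) else 1 for p in patterns]

# Hypothetical RNA fragment and candidate patterns.
print(nsp_features("AUGCCGAU", ["AGA", "UUU"]))  # [0, 1]: AGA present, UUU absent
```

These binary vectors are what would then be fed into the downstream supervised classifiers.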
cs.DB 2026-04-29

VisualNeo connects visual queries to Neo4j for graph searches

VisualNeo: Bridging the Gap between Visual Query Interfaces and Graph Query Engines

The system reuses advanced visual tools and routes queries through the Neo4j driver so users without code can explore large graphs.

Visual Graph Query Interfaces (VQIs) empower non-programmers to query graph data by constructing visual queries intuitively. Devising efficient technologies in Graph Query Engines (GQEs) for interactive search and exploration has also been studied for years. However, these two vibrant scientific fields are traditionally independent of each other, creating a vast barrier for users who wish to explore the full-stack operations of graph querying. In this demonstration, we propose VisualNeo, a novel VQI system built upon Neo4j that facilitates efficient subgraph queries over large graph databases. VisualNeo inherits several advanced features from recent VQIs, including data-driven GUI design and canned pattern generation. Additionally, it includes a database manager module so that users can connect to generic Neo4j databases. It performs query processing through the Neo4j driver and provides aesthetic exploration of query results.
cs.DB 2026-04-29

Algorithms hide all sensitive cross-level utility patterns without fakes

Cross-level Privacy Preserving Utility Mining

Min-RF, Max-RF and Best-NSCF use a new dictionary to sanitize databases that carry taxonomies while preserving true high-utility itemsets.

Privacy-preserving utility mining (PPUM) aims to hide sensitive high-utility patterns while preserving the utility of the sanitized database. In practice, however, many datasets are associated with taxonomic information, which makes the identification and processing of generalized items more challenging. To address this, we investigate the cross-level privacy-preserving utility mining (CLPPUM) problem and propose a method for protecting generalized items. Based on different victim item selection strategies, we develop three CLPPUM algorithms: minimum RGISU first (Min-RF), maximum RGISU first (Max-RF), and best NSC first (Best-NSCF). Furthermore, to enable efficient victim item identification, a novel dictionary structure named GI-dic is designed to accelerate the computation of required utility metrics. Experimental results on multiple datasets demonstrate that the proposed algorithms successfully hide all sensitive cross-level high-utility itemsets without introducing artificial itemsets. The results also show that our method performs well on sparse datasets, and both Min-RF and Best-NSCF consistently outperform Max-RF. Overall, Min-RF achieves the best performance, particularly when the minimum utility threshold is low and the dataset is dense.
cs.DB 2026-04-28

Autoencoder rewrites speed hybrid vector queries 2x on average

BoomHQ: Learning to Boost Multiple Hybrid Queries on Vector DBMSs

Modeling vector-scalar correlations and neighborhood selectivities lets the system rewrite queries for large time savings at fixed recall.

Hybrid queries, which combine vector nearest neighbor searches with scalar predicates, represent a fundamental challenge in managing vector databases. Existing methods often restrict the number of vector columns involved or the complexity of scalar predicates, thereby limiting their flexibility in handling diverse query patterns. Moreover, these approaches typically do not fully leverage the correlations between scalar and vector attributes, or the distributional patterns observed from query vector neighborhoods. To address these limitations, we introduce BoomHQ, a learning-based framework to boost multiple hybrid queries on vector DBMSs. First, BoomHQ models the correlation between vector and scalar attributes using an autoencoder-based architecture, which is also friendly to data updates. Second, BoomHQ captures prevailing query patterns, particularly using estimated selectivity of scalar predicates within the neighborhood of a query vector. Guided by these two key features, BoomHQ predicts the execution hints and rewrites the original query into an optimized version. Furthermore, we extend well-known benchmarks by introducing vector and scalar data with inherent correlations to better evaluate query execution. Experimental results demonstrate that for multiple hybrid queries at specified recall thresholds, our method achieves a 2x average and over 25x peak speedup compared to the state-of-the-art. Additionally, BoomHQ shows strong robustness against data updates and consistent optimization effectiveness across three representative vector database systems.
cs.DB 2026-04-28

Sliding window finds dense patterns exactly without gap parameters

Exact Mining of Dense Patterns via Direct Evaluation of Local Interval Frequency Using a Sliding Window

Direct local-frequency evaluation plus anti-monotonicity pruning removes the accuracy trade-off that gap constraints impose on both pattern identification and interval detection.

Accurately extracting patterns that appear frequently only within specific time intervals, together with their dense intervals, is important in many applications such as understanding seasonal demand and detecting anomalous behavior. Frequent itemset mining evaluates support over the entire dataset and therefore cannot detect locally dense patterns. Existing methods for dense pattern mining with interval output estimate dense intervals through occurrence-gap constraints; however, since the gap constraint parameter governs both pattern identification accuracy and interval detection accuracy simultaneously, finding a parameter setting that achieves high accuracy for both objectives is difficult. In this paper, we propose Apriori-window, an exact algorithm that resolves this structural limitation. The proposed method directly evaluates local frequency within a sliding window and thus requires no gap constraint parameter, and it efficiently enumerates dense intervals through anti-monotonicity-based pruning of the search space and stride-skip reduction of the number of window scans. Experiments on three real-world datasets demonstrate that existing methods struggle to simultaneously achieve high accuracy in both pattern identification and dense interval detection, and scalability experiments on synthetic data confirm the practical applicability of the proposed method.
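Direct evaluation of local interval frequency with a sliding window can be sketched as follows. This is a naive version without the paper's anti-monotonicity pruning or stride-skip optimization, and the transactions are hypothetical:

```python
def max_local_support(transactions, itemset, window):
    """Slide a fixed-length window over the transaction sequence and
    return the best in-window support of the itemset, together with the
    interval (start, end) that achieves it. No gap parameter is needed."""
    best, best_iv = 0, None
    hits = [itemset <= t for t in transactions]   # itemset contained in transaction?
    for start in range(len(transactions) - window + 1):
        s = sum(hits[start:start + window])
        if s > best:
            best, best_iv = s, (start, start + window - 1)
    return best, best_iv

tx = [{"a"}, {"a", "b"}, {"a", "b"}, {"b"}, set(), {"a"}]
print(max_local_support(tx, {"a", "b"}, 3))  # (2, (0, 2)): densest interval found
```

Anti-monotonicity enters because a superset's local support in any window can never exceed its subsets', which lets an exact miner prune extensions of itemsets that are already below threshold.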
cs.DB 2026-04-28

IM chat turns natural language into complete data reports

DataClaw: An Autonomous Data Agent with Instant Messaging Integration

The agent plans and runs the full analytical pipeline then delivers results back in the same conversation.

In daily life, there are many scenarios in which people need to tackle data-related tasks, such as filling out forms, analyzing Excel files, and visualizing data reports. However, the tools available for these tasks are often fragmented, requiring users to switch between multiple applications and manually orchestrate steps like data processing, querying, and visualization. Moreover, these tools often assume a certain level of technical proficiency, creating barriers for non-technical users. To facilitate tackling daily data tasks, we present DataClaw, an autonomous data agent that integrates directly into familiar instant messaging (IM) platforms. By simply typing a natural language request in a chat interface, users enable DataClaw to autonomously plan and execute a complete analytical pipeline, delivering insights, charts, and reports directly back into the conversation. Under the hood, DataClaw is powered by a transparent ReAct reasoning engine, a multi-tiered memory system for cross-session context preservation, and a pluggable skill architecture for on-the-fly extensibility. In this demonstration, attendees will interact with DataClaw via standard IM platforms to solve real-world data scenarios, experiencing how it serves as a highly capable personal data assistant.
cs.DB 2026-04-27

SEMA-SQL mixes SQL with LLM reasoning for semantic database queries

SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models

Automating hybrid query generation and batching LLM calls enables handling of tasks like entity matching beyond standard relational limits.

Relational databases excel at structured data analysis, but real-world queries increasingly require capabilities beyond standard SQL, such as semantically matching entities across inconsistent names, extracting information not explicitly stored in schemas, and analyzing unstructured text. While text-to-SQL systems enable natural language querying, they remain limited to relational operations and cannot leverage the semantic reasoning capabilities of modern large language models (LLMs). Conversely, recent semantic operator systems extend relational algebra with LLM-powered operations (e.g., semantic joins, mappings, aggregations), but require users to manually construct complex query pipelines. To address this gap, we present SEMA-SQL, a system that automatically answers natural language questions by generating efficient queries that combine relational operations with LLM semantic reasoning. We formalize Hybrid Relational Algebra (HRA), a declarative abstraction unifying traditional relational operators with LLM user-defined functions (UDFs). The system automates three critical aspects: (1) query generation via in-context learning that produces HRA queries with precise natural language specifications for LLM UDFs, (2) query optimization via cost-based transformations and UDF rewriting, and (3) efficient execution algorithms that reduce LLM invocations by an average of 93% in semantic joins through intelligent batching. Extensive experiments with known benchmarks, and extensions thereof, demonstrate the significant query capability improvements possible with our design.
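The batching idea for reducing LLM invocations in a semantic join can be sketched as follows, with a string-matching stub standing in for the model call (the stub and batch size are illustrative, not SEMA-SQL's actual implementation):

```python
def semantic_join_batched(left, right, llm_batch, batch_size=8):
    """Sketch of batching for an LLM-powered semantic join: instead of one
    LLM call per (l, r) pair, pack many pairs into a single prompt.
    `llm_batch` stands in for a model call labeling each pair True/False."""
    pairs = [(l, r) for l in left for r in right]
    calls, matches = 0, []
    for i in range(0, len(pairs), batch_size):
        chunk = pairs[i:i + batch_size]
        calls += 1                                # one call per batch, not per pair
        for pair, keep in zip(chunk, llm_batch(chunk)):
            if keep:
                matches.append(pair)
    return matches, calls

# Stub "LLM": match iff lowercase-equal (a real system would prompt a model).
stub = lambda chunk: [l.lower() == r.lower() for l, r in chunk]
out, n_calls = semantic_join_batched(["IBM", "Apple"], ["ibm", "APPLE", "MSFT"], stub)
print(out, n_calls)  # six candidate pairs, but only one batched call
```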
cs.DB 2026-04-27

Dataset released for 10,000 early AI agents on Ethereum

A dataset of early blockchain-registered AI agents on Ethereum

Integrates on-chain records and off-chain metadata to enable research on agent reputation and decentralized AI systems.

This study presents a structured dataset of blockchain-registered artificial intelligence agents under the ERC-8004 standard on Ethereum. The dataset integrates on-chain identity records, minting transactions, transfer events, reputation summaries, and individual feedback records, together with resolved off-chain metadata where available. Data were collected from Ethereum mainnet using Web3 RPC queries and processed into tabular form to enable reproducible analysis. The dataset covers 10,000 agents within a defined block range and includes both event-level records and aggregated summaries. It enables empirical research on agent identity formation, reputation systems, service exposure, and early-stage decentralized AI ecosystems. This resource supports studies in blockchain analytics, decentralized trust infrastructure, and the emerging agentic economy.
cs.DB 2026-04-27

Atomic RDF Datasets can serve as standardized messages for streaming

It's Time to Standardize RDF Messages

Making message boundaries explicit in the data model would let IoT streams, archives, and event feeds interoperate without transport-level heuristics.

RDF-based systems increasingly operate in event-driven and streaming settings, where producers and consumers exchange data as discrete units of communication rather than as freely mergeable RDF statements. As existing RDF semantics and tooling do not provide an interoperable notion of what statements belong together as one message, developers often rely on out-of-standard techniques, transport-level assumptions, or heuristics, leading to interoperability problems and inefficiencies. We propose the concept of an RDF Message as an RDF Dataset intended to be interpreted atomically as a single communicative act, laying the foundation for defining RDF Message Streams and RDF Message Logs. The proposal makes message boundaries explicit across serializations, transport, and storage systems, which in turn enables incremental consumption and reproducible replay in use cases such as IoT observations, archived RDF streams, nanopublications, or processing SPARQL CONSTRUCT results. Building on this, RDF Message Profiles, such as Linked Data Event Streams or ActivityStreams, then provide the terms for describing pagination, message structure, ordering, or retention policies. As part of the W3C Community Group on RDF Stream Processing, we are now seeking broader support and comments on the proposal from the Semantic Web community.
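The atomic-message reading of an RDF Dataset can be sketched by grouping quads on graph name, so each named graph travels (and is replayed) as one indivisible unit. The graph IRIs are hypothetical, and a real pipeline would serialize each message with TriG or a similar dataset syntax:

```python
def to_messages(quads):
    """Group (s, p, o, g) quads by graph name g: each named graph becomes
    one message whose triples must be consumed atomically rather than
    merged freely with other statements."""
    msgs = {}
    for s, p, o, g in quads:
        msgs.setdefault(g, []).append((s, p, o))
    return msgs

quads = [
    ("ex:sensor1", "ex:temp", "21.5", "ex:obs-001"),
    ("ex:sensor1", "ex:time", "t0",   "ex:obs-001"),
    ("ex:sensor2", "ex:temp", "19.0", "ex:obs-002"),
]
msgs = to_messages(quads)
print(len(msgs))  # 2 messages: one observation per named graph
```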
cs.DB 2026-04-27

Bounded self-joins make fact relevance as easy as query evaluation

How Hard is it to Decide if a Fact is Relevant to a Query?

Relevance checking drops to NP or LogCFL once repeated relations are forbidden, matching ordinary query evaluation complexity.

We consider the following fundamental problem: given a database D, Boolean conjunctive query (CQ) q, and fact f in D, decide whether f is relevant to q w.r.t. D, i.e., whether f belongs to a minimal subset S of D such that S |= q. Despite being of central importance to query answer explanation, the combined complexity of deciding query relevance has not been studied in detail, leaving open what makes this problem hard, and which restrictions can yield lower complexity. Relevance has already been shown to be harder than query evaluation: namely, $\Sigma^p_2$-complete for CQs, even over a binary signature. We further observe that NP-hardness applies already to (acyclic) chain CQs. Our work identifies self-joins (multiple atoms with the same relation) as the culprit. Indeed, we prove that if we forbid or bound the occurrence of self-joins, then relevance has the same complexity as query evaluation, namely, NP (without structural restrictions) and LogCFL (for bounded hypertreewidth classes). In the ontology setting, we establish an analogous result for ontology-mediated queries consisting of a CQ and DL-Lite_R ontology, namely that relevance is no harder than query answering provided that we bound the interaction width (which generalizes both self-join width and a recently introduced 'interaction-free' condition). Our results thus pinpoint what makes relevance harder than query evaluation and identify natural classes of queries which admit efficient relevance computation.
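The relevance definition above can be spelled out as a brute-force check. This is exponential and purely a specification for tiny databases, with a toy chain CQ standing in for q:

```python
from itertools import chain, combinations

def relevant(D, f, q):
    """Brute-force check of the definition: f is relevant to q iff f belongs
    to some MINIMAL S subset of D with q(S) true, where minimal means no
    proper subset of S already satisfies q."""
    def subsets(xs):
        xs = list(xs)
        return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))
    for S in map(frozenset, subsets(D)):
        if f in S and q(S) and not any(q(T) for T in map(frozenset, subsets(S)) if T < S):
            return True
    return False

# Toy chain CQ over an edge relation: q holds iff S contains edges (x,y),(y,z).
def q(S):
    return any(a[1] == b[0] for a in S for b in S)

D = {("1", "2"), ("2", "3"), ("4", "5")}
print(relevant(D, ("1", "2"), q), relevant(D, ("4", "5"), q))  # True False
```

Here ("1", "2") sits in the minimal witness {("1","2"), ("2","3")}, while ("4","5") appears only in non-minimal satisfying subsets, so it is irrelevant.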
cs.DB 2026-04-27

Unified model pivots database migration across heterogeneous systems

A Model-Driven Approach to Database Migration with a Unified Data Model

Mappings to a common representation replace pairwise transformations and maintain structural and query fidelity in round-trip evaluations.

Database migration is a key task in software modernization, increasingly involving transformations across heterogeneous data models such as relational and NoSQL systems. Existing approaches are typically designed for specific source-target combinations, which limits their applicability in multi-model environments. This paper proposes a generic database migration approach based on the U-Schema unified data model, which acts as a pivot representation. By defining mappings between each data model and U-Schema, the approach reduces the number of required transformations and enables schema conversion across heterogeneous paradigms. Trace information is generated during schema transformation to capture correspondences between source and target elements, and is subsequently used to guide data migration in a decoupled manner. The approach has been implemented and evaluated through experiments covering schema-level validation, data-level semantic preservation, and performance analysis. The results show that the migration pipeline achieves high structural preservation under round-trip reconstruction, produces document schemas consistent with the intended design decisions, and preserves query behavior across a variety of access patterns, including joins, aggregations, and nested structures. Performance results demonstrate the feasibility of the approach for datasets of increasing size. The evaluation focuses on relational-to-document migration using both synthetic datasets and the Northwind benchmark. While this scenario provides a concrete instantiation, the approach is designed to support multiple data models within a unified framework.
cs.DB 2026-04-27

Maximal-clique index speeds filtered nearest-neighbor search

MCI: A Maximal Clique Index for Efficient Arbitrary-Filtered Approximate Nearest Neighbor Search

Compressing dense neighborhoods into cliques cuts space needs and boosts query speed up to tenfold at high recall.

Approximate Nearest Neighbor Search with arbitrary filtering predicates (AFANNS) is essential for modern data applications, yet existing methods often incur substantial storage and computational costs. In this work, we introduce the Maximal Clique Index (\mci), a novel graph-based index designed for robust and efficient AFANNS. The core idea of \mci is to approximate a dense Nearest Neighbor Graph (NNG) through a compact, clique-based representation. We propose two key techniques: (1) Maximal Clique Cover (\mcc), which exploits the geometric transitivity of high-dimensional spaces to encode dense neighborhoods as maximal cliques, achieving an index with high compression and connectivity; and (2) Local Neighborhood Graph Geometric Densification, a strategy that constructs an index approximating a large NNG from a sparse initial NNG, recovers global connectivity by progressively increasing distance thresholds to locally densify the structure. The index is built in a lock-free parallel manner for scalability and queried via a carefully-designed multi-seed strategy to handle fragmented predicate-induced subgraphs. Extensive experiments on 10 datasets show that \mci significantly outperforms state-of-the-art methods by up to one order of magnitude in QPS at high recall while using substantially smaller space, and remains competitive even on range/keyword filtering tasks, demonstrating robust general-purpose performance.
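The compression idea behind a clique cover can be illustrated with a simple greedy variant: each uncovered vertex seeds a clique that is grown among its neighbors, and one clique record then stands in for all the pairwise edges inside it. This greedy routine is an illustrative stand-in, not the paper's MCC construction.

```python
def greedy_clique_cover(adj):
    """Greedy clique cover of an undirected graph given as adjacency sets.

    Each uncovered vertex seeds a clique; a neighbor joins only if it is
    adjacent to every current member, so each group stays a clique. The
    resulting clique list covers all vertices and compresses the dense
    neighborhoods it captures (toy version of the MCC idea).
    """
    covered, cliques = set(), []
    for v in sorted(adj):
        if v in covered:
            continue
        clique = {v}
        for u in sorted(adj[v]):
            if all(u in adj[w] for w in clique):
                clique.add(u)
        covered |= clique
        cliques.append(sorted(clique))
    return cliques

# Triangle {0,1,2} plus a pendant edge 2-3 collapses to two cliques.
cliques = greedy_clique_cover({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}})
```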
cs.DB 2026-04-24

ESPRESSO scales keyword search over Solid pods with privacy

Implementation and Privacy Guarantees for Scalable Keyword Search on SOLID-based Decentralized Data with Granular Visibility Constraints

WebID-scoped indexes and metadata enable efficient ranking, while a threat model limits inference risks from distributed personal data.

In decentralized personal data ecosystems grounded in architectures such as Solid, users retain sovereignty over their data via personal online data stores (pods), hosted on Solid-compliant server infrastructures. In such environments, data remains under the control of pod owners, which complicates search due to distribution across numerous pods and user-specific access constraints. ESPRESSO is a decentralized framework for scalable keyword-based search across distributed Solid pods under user-defined visibility policies. It addresses key challenges of decentralized search by constructing WebID-scoped indexes within pods and employing privacy-aware metadata to enable efficient source selection and ranking across servers. This paper further introduces a formal threat model for ESPRESSO, analysing the security and privacy risks associated with the generation, aggregation, and use of indexes and metadata. These risks include unintended metadata leakage and the potential for adversaries to infer sensitive information about data that resides within personal data stores. The analysis identifies key design principles that limit metadata exposure while mitigating unauthorized inference. The proposed threat model provides a foundation for evaluating privacy-preserving decentralized search and informs the design of systems with stronger privacy guarantees.
cs.DB 2026-04-24

Query algebra and wrappers replace LLM agents for enterprise data

An Alternate Agentic AI Architecture (It's About the Data)

AQL with source-specific wrappers delivers traceable, deterministic access across heterogeneous systems.

For the last several years, the dominant narrative in "agentic AI" has been that large language models should orchestrate information access by dynamically selecting tools, issuing sub-queries, and synthesizing results. We argue this approach is misguided: enterprises do not suffer from a reasoning deficit, but from a data integration problem. Enterprises are data-centric: critical information is scattered across heterogeneous systems (e.g., databases, documents, and external services), each with its own query language, schema, access controls, and performance constraints. In contrast, contemporary LLM-based architectures are optimized for reasoning over unstructured text and treat enterprise systems as either corpora or external tools invoked by a black-box component. This creates a mismatch between schema-rich, governed, performance-critical data systems and text-centric, probabilistic LLM architectures, leading to limited transparency, weak correctness guarantees, and unpredictable performance. In this paper, we present RUBICON, an alternative architecture grounded in data management principles. Instead of delegating orchestration to an opaque agent, we introduce AQL (Agentic Query Language), a small, explicit query algebra - Find, From, and Where - executed through source-specific wrappers that enforce access control, schema alignment, and result normalization. All intermediate results are visible and inspectable. Complex questions are decomposed into structured, auditable query plans rather than hidden chains of LLM calls. Our thesis is simple: enterprise AI is not a prompt engineering problem; it is a systems problem. By reintroducing explicit query structure, wrapper-based mediation, and cost-based optimization, we obtain the breadth of agentic search while preserving traceability, determinism, and trust in enterprise environments.
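The shape of a Find/From/Where algebra executed through wrappers can be sketched in a few lines. The wrapper class and operator signatures below are illustrative assumptions, not RUBICON's actual AQL surface: the point is only that every intermediate result is an ordinary, inspectable value and access control lives in the wrapper, not in a prompt.

```python
# Toy sketch of an explicit Find/From/Where pipeline over a source wrapper.
# Names and interfaces are hypothetical, not the paper's AQL.

class ListWrapper:
    """A 'wrapper' that normalizes one source to rows of dicts and
    enforces a column-level access policy."""
    def __init__(self, rows, allowed_cols):
        self.rows, self.allowed = rows, set(allowed_cols)

    def scan(self):
        for r in self.rows:
            yield {k: v for k, v in r.items() if k in self.allowed}

def From(wrapper):              # source access, mediated by the wrapper
    return list(wrapper.scan())

def Where(rows, pred):          # explicit, inspectable filter step
    return [r for r in rows if pred(r)]

def Find(rows, cols):           # projection of the requested fields
    return [{c: r[c] for c in cols} for r in rows]

hr = ListWrapper(
    [{"name": "Ada", "dept": "eng", "salary": 1}],
    allowed_cols=["name", "dept"],   # salary is masked by the wrapper
)
plan = Find(Where(From(hr), lambda r: r["dept"] == "eng"), ["name"])
```

Because each operator returns a plain list, the intermediate result after every step can be logged and audited, which is the traceability property the paper argues for.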
cs.DB 2026-04-24

SQLyzr adds diverse metrics and realism to text-to-SQL evaluation

A Demonstration of SQLyzr: A Platform for Fine-Grained Text-to-SQL Evaluation and Analysis

Platform replaces single scores with fine-grained reports, real-world workload alignment, and error diagnostics for better model diagnosis.

Text-to-SQL models have significantly improved with the adoption of Large Language Models (LLMs), leading to their increasing use in real-world applications. Although many benchmarks exist for evaluating the performance of text-to-SQL models, they often rely on a single aggregate score, lack evaluation under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text-to-SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, while enabling more realistic evaluation through workload alignment with real-world SQL usage patterns and database scaling. It further supports fine-grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text-to-SQL models. This demonstration showcases these capabilities through an interactive experience. Through SQLyzr's graphical interface, users can customize evaluation settings, analyze fine-grained reports, and explore additional features of the platform. We envision that SQLyzr facilitates the evaluation and iterative improvement of text-to-SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at https://github.com/sepideh-abedini/SQLyzr.
cs.DB 2026-04-23

New framework checks isolation levels without database internals

Making Transaction Isolation Checking Practical

Boomslang parses traces into graphs, uses superpositions for uncertainty, and verifies configs in TiDB and MariaDB that earlier tools could not check.

Checking whether database transactions adhere to isolation levels is a crucial yet challenging problem. We present Boomslang, the first general-purpose checking framework capable of verifying configurations that were previously uncheckable. Boomslang advances beyond prior work in three key aspects: (1) it supports arbitrary operation types provided by modern transactional key-value stores, (2) it requires no knowledge of database internals, and (3) it offers a modular, extensible pipeline amenable to customization and optimizations. Boomslang adopts a front-/back-end separation. As the front-end, it parses a database trace into an Abstract Semantic Graph, which is then lowered -- via semantic analysis -- into a low-level intermediate representation (IR). The back-end converts this IR to a set of constraints for SMT solving. This design is enabled by a key abstraction in the IR, called superpositions, which capture the uncertainty and complexity caused by arbitrary operations and missing information. Our experiments show that with just 271--386 lines of code, the core logic of three prior checkers can be reimplemented as Boomslang modules, achieving comparable or superior performance. Using Boomslang, we also identify a new bug in TiDB, audit the metadata layer of the JuiceFS file system, check vendor-specific behavior in MariaDB, support five previously unchecked isolation levels, and confirm a theoretical result on the correctness of strict serializability.
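At its core, this style of checking reduces a trace to a dependency graph and asks whether the constraints it induces are satisfiable. A heavily simplified stand-in for the back-end check is acyclicity of the transaction dependency graph (serializability), shown below; Boomslang's actual pipeline compiles an IR with superpositions into SMT constraints, which this sketch does not attempt.

```python
def has_cycle(edges):
    """DFS cycle detection on a transaction dependency graph.

    Serializability reduces to acyclicity of this graph; this is a
    simplified stand-in for the constraint checks a real isolation
    checker performs (no arbitrary operations or uncertainty handled).
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def dfs(v):
        color[v] = GRAY
        for w in graph[v]:
            if color[w] == GRAY or (color[w] == WHITE and dfs(w)):
                return True   # back edge found: dependency cycle
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in graph)
```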
cs.DB 2026-04-23

Low-dim stats cut noise in private power-law exponent estimates

Estimating Power-Law Exponent with Edge Differential Privacy

Releasing only the statistics needed for alpha avoids the distortion of full noisy degree distributions under edge DP.

Many real-world graphs have degree distributions that are well approximated by a power-law, and the corresponding scaling parameter $\alpha$ provides a compact summary of that structure which is useful for graph analysis and system optimization. When graphs contain sensitive relationship data, $\alpha$ must be estimated without revealing information about individual edges. This paper studies power-law exponent estimation under edge differential privacy. Instead of first releasing a noisy degree distribution and then fitting a power-law model, we propose privatizing only the low-dimensional sufficient statistics needed to estimate $\alpha$, thereby avoiding the high distortion introduced by traditional approaches. Using these released statistics, we support both discrete approximation and likelihood-based numerical optimization for efficient parameter estimation. We develop edge-DP algorithms for both centralized and local DP models, compare degree release and log-statistic release in the local setting, and evaluate the resulting methods on various graph datasets across multiple privacy budgets and tail-cutoff settings.
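The release-only-the-sufficient-statistic idea is concrete for the continuous power-law MLE: alpha = 1 + n / S with S = sum(log(d_i / d_min)), so it suffices to add Laplace noise to the scalar S rather than to the whole degree histogram. The sensitivity bound below (one edge perturbs two degrees, each shifting its log term by at most log 2) is an illustrative assumption, not the paper's calibration.

```python
import math, random

def dp_power_law_alpha(degrees, eps, d_min=1, rng=random):
    """Estimate alpha by privatizing only S = sum(log(d_i / d_min)).

    Continuous-approximation MLE: alpha = 1 + n / S. Laplace noise is
    added to the single scalar S (assumed sensitivity 2*log(2) per edge;
    illustrative, not the paper's analysis).
    """
    kept = [d for d in degrees if d >= d_min]
    s = sum(math.log(d / d_min) for d in kept)
    scale = 2 * math.log(2) / eps
    # Laplace(scale) sample via inverse CDF of a uniform in (-0.5, 0.5).
    u = rng.random() - 0.5
    noise = -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
    return 1 + len(kept) / max(s + noise, 1e-9)

# With a huge budget the noise vanishes and the plain MLE is recovered.
alpha_hat = dp_power_law_alpha([2] * 100, eps=1e6)
```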
cs.DB 2026-04-23

ML model predicts query slot-time before execution

Pre-Execution Query Slot-Time Prediction in Cloud Data Warehouses: A Feature-Scoped Machine Learning Approach

Pre-execution features alone yield 30-37 percent lower error than baselines on typical cost-significant queries, though long-tail cases stay dominated by unobserved runtime factors.

Cloud data warehouses bill compute based on slot-time consumed. In shared multi-tenant environments, query cost is highly variable and hard to estimate before execution, causing budget overruns and degraded scheduling. Static query-planner heuristics fail to capture complex SQL structure, data skew, and workload contention. We present a feature-scoped machine learning approach that predicts BigQuery slot-time before execution using only pre-execution observable signals: a structured query complexity score derived from SQL operator costs, data volume features from planner estimates and workload metadata, and textual features from query text. We deliberately exclude runtime factors (slot-pool utilization, cache state, realized skew) unknowable at submission. The model uses a HistGradientBoostingRegressor trained on log-transformed slot-time, with a TF-IDF + TruncatedSVD-512 text pipeline fused with numeric and categorical features. Trained on 749 queries across seven deployment environments and evaluated out-of-distribution on 746 queries from two held-out environments, the model achieves MAE 1.17 slot-minutes, RMSE 4.71, and 74% explained variance on the full workload. On cost-significant queries (slot-time >= 0.01 min, N=282) the model achieves MAE 3.10 versus 4.95 for a predict-mean baseline and 4.54 for predict-median, a 30-37% reduction. On long-tail queries (>= 20 min, N=22) the model does not outperform trivial baselines, consistent with the hypothesis that long-tail queries are dominated by unobserved runtime factors outside the current feature scope. A complexity-routed dual-model architecture is described as a practical refinement, and directions for closing the long-tail gap are identified as future work.
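The log-transform pattern at the heart of the training setup is easy to isolate: fit in log1p space so the heavy right tail of slot-times does not dominate the loss, then invert with expm1 at prediction time. The "model" below is a trivial stand-in (the mean in log space), not the paper's HistGradientBoostingRegressor over fused text and numeric features; only the transform/inversion pattern is the point.

```python
import math

def fit_log_mean(slot_minutes):
    """Trivial stand-in regressor: the mean of log1p(slot_time).

    The real model is a gradient-boosted regressor over many features;
    this isolates only the log-target trick.
    """
    logs = [math.log1p(y) for y in slot_minutes]
    return sum(logs) / len(logs)

def predict(log_mean):
    # Invert the training transform to get slot-minutes back.
    return math.expm1(log_mean)

# One extreme query skews the raw mean far more than the log-space fit.
log_mean = fit_log_mean([0.0, 0.0, 0.0, 100.0])
```

Predicting in log space pulls the typical-case prediction well below the raw mean of 25 slot-minutes, which is why such models do better on the bulk of the workload while long-tail outliers remain hard.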
cs.DB 2026-04-23

LLM agent finds minimal data sets for analysis at 83% F1

An Agentic Approach to Metadata Reasoning

It beats baselines by 32 points on real benchmarks and keeps 85% accuracy while avoiding bad tables in noisy data lakes.

As LLM-driven autonomous agents evolve to perform complex, multi-step tasks that require integrating multiple datasets, the problem of discovering relevant data sources becomes a key bottleneck. Beyond the challenge posed by the sheer volume of available data sources, data-source selection is difficult because the semantics of data are extremely nuanced and require considering many aspects of the data. To address this, we introduce the Metadata Reasoner, an agentic approach to metadata reasoning, designed to identify a small set of data sources that are both sufficient and minimal for a given analytical task. The Metadata Reasoner leverages a table-search engine to retrieve candidate tables, and then autonomously consults various aspects of the available metadata to determine whether the candidates fit the requirements of the task. We demonstrate the effectiveness of the Metadata Reasoner through a series of empirical studies. Evaluated on the real-world KramaBench datasets for data selection, our approach achieves an average F1-score of 83.16%, outperforming state-of-the-art baselines by a substantial margin of 32 percentage points. Furthermore, evaluations on a newly-created synthetic benchmark based on the BIRD data lake reveal that the Metadata Reasoner is highly robust against redundant and low-quality tables that may be in the data lake. In this noisy environment, it maintains an average of 85.5% F1-score for selecting the right datasets and demonstrates a 99% success rate in avoiding low-quality data.
cs.DB 2026-04-23

Garfield cuts RFANNS index size 4.4x and raises throughput 120x

A GPU-Accelerated Framework for Multi-Attribute Range Filtered Approximate Nearest Neighbor Search

Cell partitioning with fixed cross edges plus cluster-guided GPU traversal keeps storage linear while reusing candidates for filtered vector

Range-filtered approximate nearest neighbor search (RFANNS) is increasingly critical for modern vector databases. However, existing solutions suffer from severe index inflation and construction overhead. Furthermore, they rely exclusively on CPUs for the heavy indexing and query processing, significantly restricting the throughput due to the limited memory bandwidth and parallelism. In this paper, we present Garfield, a GPU-accelerated framework for multi-attribute range filtered ANNS that overcomes these bottlenecks through designing a lightweight index structure and hardware-aware execution pipeline. Garfield introduces the GMG index, which partitions data into cells and builds local graph indexes. It guarantees linear storage and indexing overhead by adding a constant number of cross-cell edges. For queries, Garfield utilizes a cluster-guided ordering strategy that reorders query-relevant cells, enabling a highly efficient cell-by-cell traversal on the GPU that aggressively reuses candidates as entry points across cells. To handle datasets exceeding GPU memory, Garfield features a cell-oriented out-of-core pipeline. It dynamically schedules cells to minimize the number of active queries per batch and overlaps GPU computation with CPU-to-GPU index streaming. Extensive evaluations demonstrate that Garfield reduces index size by 4.4x, while delivering 119.8x higher throughput than state-of-the-art RFANNS methods.
cs.DB 2026-04-23

First GPU Datalog engine uses WCOJ to avoid memory blowup

Scaling Worst-Case Optimal Datalog to GPUs

Delivers 21x-47x speedups on program-analysis queries that cause binary-join engines to run out of memory.

Datalog is a declarative logic-programming language used for complex analytic reasoning workloads such as program analysis and graph analytics. Datalog's popularity is due to its unique price-point, marrying logic-defined specification with the potential for massive data parallelism. While traditional engines are CPU-based, the memory-bound nature of Datalog has led to increasing interest in leveraging GPUs. These engines beat CPU-based engines by operationalizing iterated relational joins via SIMT-friendly join algorithms. Unfortunately, all existing GPU Datalog engines are built on binary joins, which are inadequate for the complex multi-way queries arising in production systems such as DOOP and ddisasm. For these queries, binary decomposition can incur the AGM bound asymptotic blowup in time and space, leading to OOM failures regardless of join order. Worst-Case Optimal Joins (WCOJ) avoid this blowup, but their attribute-at-a-time intersections map poorly to SIMT hardware under key skew, causing severe load imbalance across Streaming Multiprocessors (SMs). We present SRDatalog, the first GPU Datalog engine based on WCOJ. SRDatalog uses flat columnar storage and two-phase deterministic memory allocation to avoid the OOM failures of binary joins and the index-rebuild overheads of static WCOJ systems. To mitigate skew and hide hardware stalls, SRDatalog further employs root-level histogram-guided load balancing, structural helper-relation splitting, and stream-aligned rule multiplexing. On real-world program-analysis workloads, SRDatalog achieves geometric-mean speedups of 21x to 47x.
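The attribute-at-a-time intersection that distinguishes WCOJ from binary joins can be shown on the classic triangle query Q(a,b,c) = E(a,b), E(b,c), E(a,c): variables are bound one at a time by intersecting the candidates each atom permits, so the large binary-join intermediate E(a,b) x E(b,c) is never materialized. This is a sequential toy of the generic WCOJ pattern, not SRDatalog's GPU implementation.

```python
def triangles(edges):
    """Generic (attribute-at-a-time) join for the triangle query on a
    directed edge set: bind a, then b with E(a,b), then c by intersecting
    the successor sets that E(b,c) and E(a,c) allow."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    out = []
    for a in succ:                            # bind a
        for b in succ[a]:                     # bind b: E(a,b) holds
            # bind c: intersection of candidates from E(b,c) and E(a,c)
            for c in succ.get(b, set()) & succ[a]:
                out.append((a, b, c))
        # note: no (a,b)-pair cross product is ever materialized
    return sorted(out)

found = triangles([(1, 2), (2, 3), (1, 3)])
```

On skewed inputs the intersection sets are exactly where SIMT load imbalance arises, which is the problem the paper's histogram-guided balancing targets.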
cs.DB 2026-04-22

GPU pipeline speeds 3D polyhedral spatial joins by 9x

3DPipe: A Pipelined GPU Framework for Scalable Generalized Spatial Join over Polyhedral Objects

Chunked streaming and multi-level pruning allow processing of datasets larger than GPU memory by overlapping CPU and device work.

Spatial join is a fundamental operation in spatial databases. With the rapid growth of 3D data in applications such as LiDAR-based object detection and 3D digital pathology, there is an increasing need to support spatial join over 3D datasets. However, existing techniques are largely designed for 2D data, leaving 3D spatial join underexplored and computationally expensive. We present 3DPipe, a pipelined GPU framework for scalable spatial join over polyhedral objects. 3DPipe exploits GPU parallelism across both filtering and refinement stages, incorporates a multi-level pruning strategy for efficient candidate reduction, and employs chunked streaming to handle datasets exceeding GPU memory. Its pipelined execution overlaps CPU data preparation, host-device data transfer, and GPU computation to improve throughput. Experiments show that 3DPipe achieves up to 9.0$\times$ speedup over the state-of-the-art GPU solution, TDBase, while maintaining excellent scalability. 3DPipe is open-sourced at https://github.com/lyuheng/3dpipe.
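The filtering stage of any filter-and-refine spatial join rests on a cheap bounding-volume test; in 3D that is axis-aligned bounding-box (AABB) overlap, sketched below. Pairs that pass go on to exact polyhedral refinement (not shown); this CPU one-liner illustrates only the candidate-generation predicate, not 3DPipe's pruning hierarchy or GPU pipeline.

```python
def aabb_overlap(box1, box2):
    """3D axis-aligned bounding-box overlap test.

    Boxes are ((xmin, ymin, zmin), (xmax, ymax, zmax)); two boxes overlap
    iff their intervals overlap on every axis.
    """
    (lo1, hi1), (lo2, hi2) = box1, box2
    return all(lo1[i] <= hi2[i] and lo2[i] <= hi1[i] for i in range(3))
```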
cs.DB 2026-04-22

Online schema alignment recovers full results in decentralized queries

Demonstrating Online Schema Alignment in Decentralized Knowledge Graphs Querying

Dynamic discovery and scoping of alignment rules during link traversal handles vocabulary differences with low overhead.

Decentralized Knowledge Graphs querying enables integrating distributed data without centralization, but is highly sensitive to vocabulary heterogeneity. Query issuers cannot realistically anticipate all vocabulary mismatches, especially when alignment rules are local, scoped, or discovered at runtime. We present an online schema alignment approach for Link Traversal Query Processing (LTQP) that discovers, scopes, and applies alignment rules dynamically during query execution while preserving traversal behavior. This demo paper demonstrates the approach on a decentralized social-media scenario through a web interface built on a Comunica-based LTQP engine. Source code, a CLI, and a reusable library are publicly available. The demonstration shows that online schema alignment recovers complete query results with low overhead, providing a practical foundation for web-scale reasoning in LTQP systems.
cs.DB 2026-04-22

Monotonic embeddings prune more vertices in subgraph matching

LIVE: Learnable Monotonic Vertex Embedding for Efficient Exact Subgraph Matching (Technical Report)

By making dominance inherent, learning optimizes pruning directly and pairs with a lightweight index for large graphs.

Exact subgraph matching is a fundamental graph operator that supports many graph analytics tasks, yet it remains computationally challenging due to its NP-completeness. Recent learning-based approaches accelerate query processing via dominance-preserving vertex embeddings, but they suffer from expensive offline training, limited pruning effectiveness, and heavy reliance on complex index structures, all of which hinder the scalability to large graphs. In this paper, we propose \textit{\underline{L}earnable Monoton\underline{I}c \underline{V}ertex \underline{E}mbedding} (\textsc{LIVE}), a learning-based framework for efficient exact subgraph matching that scales to large graphs. \textsc{LIVE} enforces monotonicity among vertex embeddings by design, making dominance correctness an inherent structural property and enabling embedding learning to directly optimize vertex-level pruning power. To this end, we introduce a query cost model with a differentiable surrogate objective to guide efficient offline training. Moreover, we design a lightweight one-dimensional \textit{iLabel} index that preserves dominance relationships and supports efficient online query processing. Extensive experiments on both synthetic and real-world datasets demonstrate that \textsc{LIVE} significantly outperforms state-of-the-art methods in efficiency and pruning effectiveness.
cs.DB 2026-04-22

Heuristic partitioning cuts multi-tenant query P95 latency from 61 s to 2 s

Heuristic Search Space Partitioning for Low-Latency Multi-Tenant Cloud Queries

Dynamic predicate injection and a two-phase heuristic keep each sub-query's data in memory, raising throughput without schema changes.

Large-scale cloud security platforms must continuously query millions of structured cloud resource records distributed across thousands of tenant accounts. Broad, account-spanning queries saturate database infrastructure, producing P95 latencies exceeding 60 seconds. We identify buffer cache pressure as the dominant latency driver: in a controlled experiment, the same query executing with the same plan completed in 3.7 seconds when its working set was memory-resident and 94 seconds when concurrent load had evicted those pages. No query plan optimization can address this; the only effective intervention is reducing the number of pages each query must touch. We present the Heuristic Search Space Partitioning System (HSSPS), a query-time optimization layer that logically partitions the search space through dynamic predicate injection, without schema modification. A two-phase heuristic engine selects partition key values and scores candidate query plans before execution. A client-side page token maintains cross-partition traversal state without server-side sessions, enabling horizontal scalability. Controlled evaluation across representative query types demonstrates 50-97% P95 latency reduction (95-97% on high-cardinality queries), 8-10x throughput improvement, and 41x reduction in average active sessions. Production rollout across live multi-tenant traffic reduced P95 latency from 61s to 2s across successive releases, sustained over 14,000 eligible queries per measurement window. The technique generalizes to any multi-tenant system where broad queries execute against large shared databases and physical schema modification is impractical.
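The predicate-injection step can be sketched as a query rewrite: one broad statement becomes a sequence of sub-queries whose injected IN-lists each cover a batch of partition-key values, so every sub-query's working set is small enough to stay buffer-cache-resident. The table, column, and batching below are illustrative; HSSPS additionally scores candidate plans and threads a client-side page token across partitions, neither of which is modeled here.

```python
def partitioned_queries(base_sql, key, values, batch_size):
    """Rewrite a broad query into per-partition sub-queries by injecting
    an IN-list predicate over batches of partition-key values (sketch of
    dynamic predicate injection; no plan scoring or paging state)."""
    for i in range(0, len(values), batch_size):
        batch = values[i:i + batch_size]
        ids = ", ".join(repr(v) for v in batch)
        yield f"{base_sql} AND {key} IN ({ids})"

# Hypothetical schema: a broad security query split across tenant batches.
subqueries = list(partitioned_queries(
    "SELECT * FROM resources WHERE severity = 'high'",
    "tenant_id", ["t1", "t2", "t3"], batch_size=2,
))
```

In a real system the values would be bound parameters rather than interpolated strings; they are inlined here only to keep the rewrite visible.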
cs.DB 2026-04-21

Open data model v3 fixes wastewater surveillance data sharing

The Public Health and Environmental Surveillance Open Data Model (PHES-ODM) Version 3: An Open, Relational Data Model and Interoperability Framework for Wastewater Surveillance

New tables for health actions, linkages, and mapping tools standardize data for use across programs and countries.

Wastewater surveillance (WWS) has emerged as a valuable tool for public health surveillance, particularly since the COVID-19 pandemic. Its long-term utility is constrained, however, by fragmented data systems, inconsistent metadata practices, and poor interoperability. The Public Health and Environmental Surveillance Open Data Model (PHES-ODM) was developed as an open, collaborative framework to standardize WWS data and support transparent, ethical data use aligned with FAIR principles. Adopted by the Public Health Agency of Canada and adapted by the EU Sewage Sentinel System, the model is now used in over 25 countries. This paper introduces version 3 of the model, which addresses persistent barriers to interoperability and data utility. Key enhancements include new tables for public health actions, external repository linkages (e.g., GISAID, GenBank), and analytical workflow documentation, as well as support for complex relational linkages across sites, samples, measures, and populations. Tools for mapping across other data formats, including PHA4GE and the US CDC National Wastewater Surveillance System, and for supporting long and wide data formats are also introduced. We compare PHES-ODM against six other WWS data standards across 25 features. Balancing robustness with usability, PHES-ODM v3 provides a scalable, modular infrastructure adaptable to diverse WWS and environmental surveillance programs.
cs.DB 2026-04-20

Branchable databases slow reads up to 4000x as agent branches deepen

BranchBench: Aligning Database Branching with Agentic Demands

Benchmark of five agentic workloads finds no current system avoids major latency penalties for deep speculative exploration.

Branchable databases are evolving from developer tools to infrastructure for agentic workloads characterized by speculative mutations and non-linear state exploration. Traditional RDBMS mechanisms such as nested transactions do not provide the persistent isolation and concurrent branch management required by autonomous agents, and recent "zero-copy" designs make different trade-offs whose impact on agentic workloads remains unclear. To clarify this space, we present BranchBench, a benchmark for evaluating branchable relational DBMSes under agentic exploration. We characterize five representative workloads (agentic software engineering, failure reproduction, data curation, MCTS, and simulation) and design parameterized macrobenchmarks that execute branch-mutate-evaluate loops to reflect these workloads, along with microbenchmarks that isolate branch lifecycle costs. We evaluate state-of-the-art systems including Neon, DoltgreSQL, Tiger Data, Xata, and PostgreSQL baselines, and find a fundamental tension: systems optimized for fast branching suffer up to 5-4000x slower reads as branches deepen, while systems optimized for fast data operations incur 25-1500x higher branch creation and switching latency. Further, no current system supports the representative workloads at scale. These results highlight the need for branch-native DBMSes designed specifically for agentic exploration.
cs.DB 2026-04-20

Flipped indexing delivers 6.5x lower GPU query latency with dynamic updates

FliX: Flipped-Indexing for Scalable GPU Queries and Updates

Warps are assigned to buckets and locate operations via batch binary search instead of traversing a maintained index layer.

GPU-based concurrent data structures (CDSs) achieve high throughput for read-only queries, but efficient support for dynamic updates on fully GPU-resident data remains challenging. Ordered CDSs (e.g., B-trees and LSM-trees) maintain an index layer that directs operations to a data layer (buckets or leaves), while hash tables avoid the cost of maintaining order but do not support range or successor queries. On GPUs, maintaining and traversing an index layer under frequent updates introduces contention and warp divergence. To tackle these problems, we flip the indexing paradigm on its head with FliX, a comparison-based, flipped indexing strategy for dynamic, fully GPU-resident CDSs. Traditional GPU CDSs typically take a batch of operations and assign each operation to a GPU thread or warp. FliX, however, assigns compute (e.g., a warp) to each bucket in the data layer, and each bucket then locates operations it is responsible for in the batch. FliX can replace many index layer traversals with a single binary search on the batch, reducing redundant work and warp divergence. Further, FliX simplifies updates as no index layer must be maintained. In our experiments, FliX achieves 6.5x reduced query latency compared to a leading GPU B-tree and 1.5x compared to a leading GPU LSM-tree, while delivering 4x higher throughput per memory footprint than ordered competitors. Despite maintaining order, FliX also surpasses state-of-the-art unordered GPU hash tables in query and deletion performance, and is highly competitive in insertion performance. In update-heavy workloads, it outperforms the closest fully dynamic ordered baseline by over 8x in insertion throughput while supporting dynamic memory reclamation. These results suggest that eliminating the index layer and adopting a compute-to-bucket mapping can enable practical, fully dynamic GPU indexing without sacrificing query performance.
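The flipped mapping is easiest to see sequentially: instead of each operation traversing an index to find its bucket, each bucket binary-searches the sorted operation batch for the keys in its own range, so one search per bucket replaces one index traversal per operation. The sketch below shows only that mapping; FliX runs it with one warp per bucket on the GPU, which this toy does not model.

```python
import bisect

def flipped_dispatch(buckets, batch_keys):
    """Flipped indexing sketch: each bucket locates its own operations
    in a sorted batch via binary search.

    `buckets` maps a bucket id to its inclusive key range (lo, hi);
    `batch_keys` must be sorted. Returns bucket id -> keys it serves.
    """
    assignment = {}
    for bid, (lo, hi) in buckets.items():
        left = bisect.bisect_left(batch_keys, lo)
        right = bisect.bisect_right(batch_keys, hi)
        assignment[bid] = batch_keys[left:right]
    return assignment

# Two buckets covering [0,9] and [10,19] claim their slices of the batch.
assignment = flipped_dispatch({0: (0, 9), 1: (10, 19)}, [3, 5, 12])
```

Because buckets never consult a shared index layer, updates need no index maintenance, which is the contention the design removes.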
cs.DB 2026-04-20

Policy structure dictates database optimizer plans

Compliance in Databases: A Study of Structural Policies and Query Optimization

Adding structured compliance rules to queries changes execution plans and measured performance in controlled analytical workloads.

Growing privacy regulations and internal governance mandates are driving demand for fine-grained, context-sensitive access control in data management systems. Among competing approaches, content-based access control -- where access decisions depend on the data values referenced by a query -- is becoming particularly prominent, and is supported directly in modern database engines. While simple content-based predicates often incur negligible overhead, increasingly rich policies can interact in subtle ways with query optimization, leading to significant and poorly understood performance variability. This paper investigates this gap by introducing a structural framework and expressive policy grammar for modelling content-based compliance policies and analysing their impact on query planning and execution in database systems. Building on this framework, we augment an analytical benchmark with structured policy workloads, enabling controlled evaluation of enforcement mechanisms and optimization strategies under combined query-policy workloads. Our experimental results show that policy structure has a decisive impact on optimizer behaviour and end-to-end performance, underscoring the need for policy-aware database and optimizer design.
cs.DB 2026-04-20

Agent autonomy pushes humans to supervisor roles in visual analytics

Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows

Survey of 55 systems defines four agent roles and shows how humans shift from operators to strategic supervisors as agents take over roles from planning through context management.

Agentic visual analytics (VA) represents an emerging class of systems in which large language model (LLM)-driven agents autonomously plan, execute, evaluate, and iterate across the full visual analytics pipeline. By shifting users from low-level tool operations to high-level analytical goals expressed through natural language, these systems are fundamentally transforming how humans interact with data. However, the rapid proliferation of such systems in recent years has outpaced our understanding of their design landscape. Two intertwined problems remain open: how do autonomous agents reshape the traditional VA pipeline, and how must human involvement adapt as agent autonomy increases? To address these questions, this paper presents a comprehensive survey of 55 primary agentic VA systems and introduces a co-evolutionary framework. This framework is essential because it jointly analyzes the progression of agent autonomy alongside the necessary shift in human roles from manual operators to strategic supervisors. Within this framework, we define a role-workflow taxonomy that aligns four key agentic roles (PLANNER, CREATOR, REVIEWER, and CONTEXT MANAGER) and maps them onto established VA pipeline stages. Our analysis uncovers recurring trade-offs along three foundational axes: autonomy levels, agentic roles, and the VA workflow. We consolidate these findings into actionable design guidelines and outline future research directions for agentic visual analytics. A web-based interactive browser of our co-evolutionary framework, including the corpus and design guidelines, is available at agenticva.github.io/AgenticVA/.
cs.DB 2026-04-20

Response feedback backpropagates to refine KG-RAG by 7.34%

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

Utility scores from each answer are traced to paths and triplets so the graph updates itself toward higher accuracy on evolving tasks.

Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) has emerged as a promising paradigm for enhancing LLM reasoning by retrieving multi-hop paths from KGs. However, existing KG-RAG frameworks often underperform in real-world scenarios because the pre-captured knowledge dependencies are not tailored to the downstream task or its evolving requirements. These frameworks struggle to adapt to task-specific requirements and lack mechanisms to filter low-contribution knowledge during generation. We observe that feedback on generated responses offers effective supervision for improving KG quality, as it directly reflects user expectations and provides insights into the correctness and usefulness of the output. However, a key challenge lies in effectively linking response-level feedback to triplet-level contribution evaluation and knowledge updates in the KG. In this work, we propose EvoRAG, a self-evolving KG-RAG framework that leverages feedback on generated responses to continuously refine the KG and enhance reasoning accuracy. EvoRAG introduces a feedback-driven backpropagation mechanism that attributes feedback to retrieved paths by measuring their utility for the response and propagates this utility back to individual triplets, supporting fine-grained KG refinements towards more adaptive and accurate reasoning. Through EvoRAG, we establish a closed loop that couples feedback, LLM, and graph data, continuously enhancing the performance and robustness in real-world scenarios. Experimental results show that EvoRAG improves reasoning accuracy by 7.34% over state-of-the-art KG-RAG frameworks. The source code has been made available at https://github.com/iDC-NEU/EvoRAG.
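The attribution step can be sketched as two-level credit assignment: response utility is split across the retrieved paths, then across each path's triplets, updating a running per-triplet score. A toy Python sketch, assuming equal credit sharing and an exponential-moving-average update (illustrative, not EvoRAG's exact rule):

```python
def backpropagate_feedback(paths, utility, triplet_scores, lr=0.5):
    """Toy sketch of feedback-driven backpropagation: spread a
    response-level utility over retrieved paths, then down to each
    path's triplets, nudging a running per-triplet score.
    (Names and update rule are illustrative assumptions.)"""
    share = utility / len(paths)               # equal credit per path
    for path in paths:
        per_triplet = share / len(path)        # equal credit per triplet
        for triplet in path:
            old = triplet_scores.get(triplet, 0.0)
            # move the stored score toward the new evidence (EMA update)
            triplet_scores[triplet] = old + lr * (per_triplet - old)
    return triplet_scores

scores = {}
path_a = (("berlin", "capital_of", "germany"),)
path_b = (("paris", "capital_of", "france"), ("france", "in", "europe"))
backpropagate_feedback([path_a, path_b], utility=1.0, triplet_scores=scores)
print(scores[("berlin", "capital_of", "germany")])   # 0.25
```

Triplets whose scores stay low across many feedback rounds are the natural candidates for pruning, closing the feedback-LLM-graph loop the abstract describes.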
cs.DB 2026-04-17

Small model attention prunes long docs to 10% for big QA

SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing

SAGE uses differential attention to pick relevant chunks, cutting tokens 90% while matching full-context accuracy on benchmarks.

Large language models with long context windows can answer complex questions directly from full-length academic, technical, and policy documents, but passing entire documents is often costly, slow, and can degrade answer quality while increasing the risk of unnecessary data leakage. This paper targets the common setting of answering many heterogeneous questions over long document(s), where fixed position heuristics and standard retrieval-augmented generation (RAG) can fail due to document structure variability and weak query-chunk semantic similarity, which often requires task- and domain-specific tuning of embedding retrievers. We propose Selective Attention-Guided Extraction (SAGE), a training-free, plug-and-play context reduction framework that uses a lightweight local LLM to perform a single prefilling pass and convert language model attention signals into a query-specific relevance heatmap at configurable granularities. SAGE further introduces differential attention strategies to better isolate question-relevant evidence, then selects the top-scoring units under a user-defined token budget and forwards only this reduced context to a downstream LLM for answer generation. SAGE surpasses traditional reduction techniques across multiple long-document QA benchmarks, notably securing a top-4 rank on QuALITY-hard while constrained to a 10% context budget. This enables a 90% reduction in tokens with competitive accuracy, without the need for model fine-tuning or complex calibration.
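The selection stage reduces to budgeted top-k over scored chunks. A minimal sketch, with hand-assigned relevance scores standing in for SAGE's attention-derived heatmap and whitespace word counts standing in for real token counts:

```python
def select_chunks(chunks, scores, token_budget):
    """Sketch of budgeted evidence selection: rank chunks by a relevance
    score, greedily keep the top scorers until the token budget is
    exhausted, and return survivors in original document order."""
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = len(chunks[i].split())          # crude stand-in for a tokenizer
        if used + cost <= token_budget:
            kept.add(i)
            used += cost
    return [chunks[i] for i in sorted(kept)]   # original order for the reader

chunks = ["alpha beta", "gamma delta epsilon", "zeta"]
scores = [0.9, 0.2, 0.8]                       # hypothetical relevance heatmap
print(select_chunks(chunks, scores, token_budget=3))
# ['alpha beta', 'zeta']
```

Only the reduced list is forwarded to the downstream LLM, which is where the 90% token saving comes from; the hard part SAGE contributes is producing the `scores` vector from attention without training.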
cs.DB 2026-04-17

SQL and Python agreement on tiny database picks correct queries

DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency

Training-free method turns candidate selection into execution verification and lifts accuracy by up to 2.2 percentage points on standard benchmarks.

While Large Language Models (LLMs) demonstrate impressive proficiency in generating SQL queries, they fundamentally lack the capability to self-evaluate correctness without an execution oracle. This limitation creates a stark Generation-Selection Gap, where high potential accuracy (Pass@K) fails to translate into execution accuracy (Pass@1). Although supervised verifiers offer mitigation, they incur prohibitive annotation costs and suffer from domain fragility. Consequently, recent research has pivoted to the training-free setting. However, existing methods--such as Self-Consistency or LLM-as-a-Judge--remain hampered by systematic bias (consensus on hallucinations) and symbolic blindness (inability to simulate execution states). We introduce DPC (Dual-Paradigm Consistency), a multi-agent framework that reformulates SQL selection from a probabilistic guessing task on hidden data into a deterministic verification task on visible data. Specifically, DPC employs a SLICER and a TESTER agent to collaboratively construct a Minimal Distinguishing Database (MDD)--an adversarial, fully observable micro-environment engineered to expose logical discrepancies between candidates. To break the self-correction bias, a SOLVER agent then verifies the SQL candidates by cross-referencing their execution against a parallel Python/Pandas solution. By validating execution consistency between declarative (SQL) and imperative (Python) paradigms, DPC robustly discriminates correct logic from systematic hallucinations. Experiments on BIRD and Spider across multiple LLMs demonstrate that our method consistently outperforms existing selection baselines, achieving absolute accuracy improvements of up to 2.2% over strong competitors like Self-Consistency.
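The core idea of cross-paradigm verification can be sketched with SQLite and plain Python standing in for the declarative and imperative paradigms: a candidate is accepted only when its execution on a small, fully visible database matches an independent imperative computation. The micro-database below is hand-built, not the adversarially constructed MDD of the paper:

```python
import sqlite3

def consistent_candidate(candidates, imperative, rows):
    """Sketch of dual-paradigm selection: run each candidate SQL on a
    tiny in-memory database and return the first whose result matches
    an independent imperative (plain-Python) computation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales(region TEXT, amount INT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    expected = imperative(rows)                # imperative reference answer
    for sql in candidates:
        got = conn.execute(sql).fetchall()
        if sorted(got) == sorted(expected):    # order-insensitive comparison
            return sql                         # paradigms agree: accept
    return None

rows = [("east", 10), ("east", 5), ("west", 7)]
candidates = [
    "SELECT region, MAX(amount) FROM sales GROUP BY region",   # wrong logic
    "SELECT region, SUM(amount) FROM sales GROUP BY region",   # correct
]

def total_by_region(rows):
    """Imperative stand-in for the Python/Pandas paradigm."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return list(totals.items())

print(consistent_candidate(candidates, total_by_region, rows))
# the SUM candidate is selected
```

Because both paradigms would have to share the same hallucination to agree on a wrong answer, agreement on visible data is much stronger evidence than majority voting among SQL candidates alone.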
cs.DB 2026-04-17

Four-layer architecture unifies reconciliation and anomaly detection

Data Engineering Patterns for Cross-System Reconciliation in Regulated Enterprises: Architecture, Anomaly Detection, and Governance

The GERA approach adds semantic standards and security controls for regulated firms with siloed systems.

Regulated enterprises in the United States -- banks, telecommunications providers, large technology companies -- operate across heterogeneous systems that were rarely designed to interoperate. ERP platforms, billing engines, supply chain tools, and financial reporting infrastructure coexist within the same organization, but they do not talk to each other well. The resulting fragmentation produces familiar problems: transactions recorded in one system but unreconciled in another, asset inventories drifting from their systems of record, and audit-readiness that depends on manual effort. The PCAOB's 2024 inspection cycle put a number on the consequences: a 39% aggregate Part I.A deficiency rate across all inspected firms. This paper introduces the GERA Framework (Governed Enterprise Reconciliation Architecture) -- a vendor-neutral, four-layer data architecture that integrates deterministic cross-system reconciliation, statistical anomaly detection (baseline Z-Score with robust alternatives), governed semantic standardization, and NIST CSF 2.0-aligned security controls into a single methodology. The architecture spans four layers (ingestion, staging, core models, and semantic serving), following the multi-layer pattern now common in modern data platforms. The patterns are demonstrated through U.S. broadband operations -- where billing reconciliation, inventory aging, and governance are tightly coupled -- and draw on the author's implementation experience across three regulated enterprise environments: a regional bank, a national broadband provider, and a Fortune 500 technology company's central finance organization. This is a practitioner reference -- an architectural framework paper documenting field-tested patterns -- not a controlled experiment or benchmark study. No proprietary systems, datasets, or internal implementations are disclosed.
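The anomaly-detection layer's baseline and robust variants can be sketched directly: a plain Z-score against mean and standard deviation, and a modified Z-score against median and MAD. The thresholds below are conventional defaults, not the framework's tuned settings:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Baseline Z-score flagging, as in GERA's anomaly-detection layer:
    flag values more than `threshold` standard deviations from the mean."""
    mean, stdev = statistics.fmean(values), statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

def mad_outliers(values, threshold=3.5):
    """Robust alternative: modified Z-score via the median absolute
    deviation, which is far less distorted by the very outliers
    it is trying to detect."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

amounts = [100, 102, 98, 101, 99, 500]         # one reconciliation spike
print(zscore_outliers(amounts), mad_outliers(amounts))   # [] [500]
```

On this small sample the plain Z-score misses the spike, because the outlier inflates both the mean and the standard deviation, while the MAD variant flags it; that masking effect is the usual motivation for the robust alternative.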
cs.DB 2026-04-17

PP-FP-tree finds top keyword k-core communities in public-private graphs

Efficient Community Search on Attributed Public-Private Graphs

A public global index plus a compact private tree built from the query node's neighbors enables fast search for the most attribute-matched connected k-core.

Public-private graphs, in which a public network is visible to everyone while each user also holds a small private graph accessible only to that user, are widespread in real-world applications such as social networks and financial networks. Most existing work on community search, which finds a query-dependent community containing a given query node, studies only the public graph, neglecting the privacy issues in public-private networks. However, considering both the public and private attributes of users enables community search to be more accurate, comprehensive, and personalized in discovering hidden patterns. In this paper, we study a novel problem of attributed community search in public-private graphs (ACS-PP), aiming to find a connected k-core community that shares the most keywords with the query node. This problem uncovers structurally cohesive communities, such as interest-based user groups or core teams in collaborative networks. To optimize search efficiency, we propose an integrated scheme that constructs a public global graph index and a private personalized graph index. For the private index, we develop a compact structure, the PP-FP-tree index. The PP-FP-tree is constructed from the public and private neighbors of the query node in the public-private graph, serving as an efficient index to mine frequent node sets that share the most common attributes with the query node. Extensive experiments on real public-private graph datasets validate both the efficiency and quality of our proposed PP-FP search algorithm against existing competitors. A case study on public-private collaboration networks provides insights into the discovery of public-private communities.
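The structural half of the problem is the classic k-core: repeatedly peel away vertices with degree below k until every remaining vertex has at least k neighbours. A minimal Python sketch of that peeling step (the attribute/keyword matching served by the PP-FP-tree is not modelled here):

```python
from collections import defaultdict

def k_core(edges, k):
    """Sketch of k-core peeling: delete vertices of degree < k, updating
    neighbours, until the remaining subgraph is a k-core (possibly empty)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:                             # iterate until a fixed point
        changed = False
        for v in list(adj):
            if len(adj[v]) < k:
                for u in adj.pop(v):           # remove v and its edges
                    adj[u].discard(v)
                changed = True
    return set(adj)                            # surviving vertices

# triangle a-b-c plus a pendant vertex d attached to a
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
print(sorted(k_core(edges, 2)))                # ['a', 'b', 'c']
```

In the attributed setting, this structural filter would run over the union of public and private edges visible to the query node, and candidate cores would then be ranked by shared keywords.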
cs.DB 2026-04-17

RELOAD is a learned query optimizer that reduces individual query performance regressions…

RELOAD: A Robust and Efficient Learned Query Optimizer for Database Systems

Addresses instability and slow training in reinforcement learning methods for database query planning on standard benchmarks.

Recent advances in query optimization have shifted from traditional rule-based and cost-based techniques towards machine learning-driven approaches. Among these, reinforcement learning (RL) has attracted significant attention due to its ability to optimize long-term performance by learning policies over query planning. However, existing RL-based query optimizers often exhibit unstable performance at the level of individual queries, including severe performance regressions, and require prolonged training to reach the plan quality of expert, cost-based optimizers. These shortcomings make learned query optimizers difficult to deploy in practice and remain a major barrier to their adoption in production database systems. To address these challenges, we present RELOAD, a robust and efficient learned query optimizer for database systems. RELOAD focuses on (i) robustness, by minimizing query-level performance regressions and ensuring consistent optimization behavior across executions, and (ii) efficiency, by accelerating convergence to expert-level plan quality. Through extensive experiments on standard benchmarks, including Join Order Benchmark, TPC-DS, and Star Schema Benchmark, RELOAD demonstrates up to 2.4x higher robustness and 3.1x greater efficiency compared to state-of-the-art RL-based query optimization techniques.
