arxiv: 2605.05536 · v1 · submitted 2026-05-07 · 💻 cs.DB

Recognition: unknown

An Extensible and Verifiable Language for Query Rewrite Rules

Alvin Cheung, Shuxian Wang, Sicheng Pan, Vijay Sharma, Wesley Zheng, Zirong Zeng

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:43 UTC · model grok-4.3

classification 💻 cs.DB

keywords query rewrite rulesdomain-specific languagedatabase optimizationformal verificationrelational algebraquery plan transformationengine portability

0 comments

The pith

Rulescript is an engine-agnostic DSL for writing formally verified query rewrite rules that deploy to multiple database engines via lightweight adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Rulescript to address the high cost of reimplementing query rewrite rules for each new database engine. It proposes a domain-specific language based on relational algebra that separates rule writing from engine-specific details. Rules are split into matching and transformation parts and automatically checked for correctness. Only a small adapter needs to be written to connect to a new engine, enabling the same rules to work across systems like CockroachDB and Apache Data Fusion. This approach reduces engineering effort and the risk of bugs from repeated implementations.

Core claim

Rulescript enables developers to define query rewrite rules using a core set of relational algebra operators for pattern matching and plan transformation, with all rules formally verified for equivalence. Custom operators can be added for engine-specific features, and integration with any target system requires only implementing a mapping adapter. The language was used to reimplement 33 rules from Apache Calcite and successfully deployed to two different engines.

What carries the argument

Rulescript, a domain-specific language with a relational algebra core that decomposes rewrite rules into explicit matching and transformation phases, supported by automatic formal verification and extensible custom operators.

If this is right

Rewrite rules written in Rulescript can be deployed to new database engines without rewriting the logic.
Automatic formal verification ensures that each rule produces semantically equivalent query plans.
Custom operators allow capturing semantics unique to specific engines while staying within the verified framework.
Porting requires only a lightweight adapter, as demonstrated by deployments to CockroachDB and Apache Data Fusion.
Existing rules from systems like Apache Calcite can be reimplemented once and reused.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Database engine developers could maintain a shared repository of verified rules to accelerate new system creation.
Similar DSL approaches might apply to other optimization tasks beyond query planning, such as in compilers or dataflow systems.
Reducing reimplementation could lower the barrier for experimenting with new database architectures.

Load-bearing premise

The lightweight adapter correctly translates Rulescript operators to the target engine without semantic errors or breaking the verification guarantees.

What would settle it

A case where a Rulescript-verified rule, after adapter mapping, produces a query result or performance that differs from the original rule in the target engine.

Figures

Figures reproduced from arXiv: 2605.05536 by Alvin Cheung, Shuxian Wang, Sicheng Pan, Vijay Sharma, Wesley Zheng, Zirong Zeng.

**Figure 1.** Figure 1: The full lifecycle of a rewrite rule in RuleScript: Rules specified using the RuleScript DSL (Sec. 3, Sec. 4) can be statically verified for correctness (Sec. 5), and executed in query engines through RuleScript adapters (Sec. 6). core language to capture engine-specific constructs, such as specialized physical operators or proprietary scalar expressions. These custom operators can then be used within re… view at source ↗

**Figure 2.** Figure 2: Syntax of RuleScript patterns. Here 𝑞 ranges over patterns, 𝑝 over predicates, 𝑒 over scalar expressions, and 𝛼 over aggregate expressions. Lambda binders 𝜆𝑥 name the input tuple, making column references explicit. comparison and to contain a distinguished Null value. Type symbols with the same name are considered identical across a rule. Crucially, a type symbol represents the entire tuple type of a plan’… view at source ↗

**Figure 3.** Figure 3: Syntax for custom operator definitions. Here view at source ↗

read the original abstract

Logical query plan rewriting transforms a relational database query into an equivalent but more efficient form and is crucial to the performance of database-backed applications. In existing systems, rewrite rules are typically implemented manually, tightly coupled to specific execution engines, and often lack formal correctness guarantees. Consequently, developing a new engine requires reimplementing both legacy and new rules, incurring significant engineering cost, limiting portability, and every new implementation is an opportunity for introducing new bugs. We introduce Rulescript, an engine-agnostic domain-specific language (DSL) for developing query rewrite rules. Rulescript separates rule definition from execution infrastructure via a relational algebra-inspired core language and an explicit decomposition of rules into matching and transformation phases. Developers express rewrites by pattern-matching query plans using Rulescript's core operators and constructing semantically equivalent transformed plans, with all rewrites automatically verified formally to ensure correctness. Rulescript is extensible: users can define custom operators in terms of the core language to capture engine-specific semantics. To integrate with an existing system, developers need only implement a lightweight adapter that maps Rulescript's core and custom operators to the operators implemented in the target engine. We evaluate Rulescript by reimplementing 33 rewrite rules from Apache Calcite and extending the language with several custom operators. To demonstrate portability, we automatically deploy these rules to CockroachDB and Apache Data Fusion, two engines with substantially different backends. Our results show that Rulescript enables "write once, deploy everywhere" paradigm for query plan rewriting, with minimal effort required to deploy previously written rules on a new data engine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Rulescript gives a verified DSL for writing query rewrites once and porting them via adapters, but the adapters themselves stay unverified and manual.

read the letter

Rulescript is a DSL that lets developers write query rewrite rules against a core relational algebra, split into explicit matching and transformation steps, with automatic formal verification of each rule. Custom operators can be added by defining them in the core language, and a lightweight adapter then maps everything to a target engine's operators. They reimplemented 33 rules from Calcite, added a few custom ones, and deployed the same rules to CockroachDB and Data Fusion with what they describe as minimal extra work. That shows the portability claim in action and directly attacks the repeated engineering cost of reimplementing optimizers for new backends. The verification step at the Rulescript level is a clear practical win for catching rule-level errors early. The main limitation is that verification stops before the adapters. The adapters are hand-coded per engine, and nothing described checks that they preserve the intended semantics or the verified equivalences. A mismatch in how a custom operator is implemented on the target side would break the guarantees without any formal warning. The abstract also gives no numbers on verification runtime, differential testing of the ports, or performance impact of the rewritten plans, so the engineering savings and overhead remain qualitative. This is for database systems people who build or maintain query optimizers and want to share rules across engines without losing correctness checks. A reader who has ported rules between systems will see the value immediately. I would send it to peer review; the approach is concrete enough and the evaluation demonstrates feasibility, even if the adapter gap and missing quantitative results need more detail in revision.

Referee Report

3 major / 1 minor

Summary. The paper introduces Rulescript, an engine-agnostic DSL for query rewrite rules. Rules are expressed via decomposition into pattern-matching and transformation phases over a relational algebra core, with automatic formal verification of correctness. The language is extensible via custom operators defined in terms of the core; integration with a target engine requires only a lightweight adapter that maps core and custom operators. The evaluation reimplements 33 rules from Apache Calcite (with custom operators) and automatically deploys them to CockroachDB and Apache Data Fusion, claiming a 'write once, deploy everywhere' paradigm with minimal porting effort.

Significance. If the central claims hold, the work could meaningfully reduce engineering cost and bug risk when porting or developing rewrite rules for new database engines, while providing formal correctness guarantees that are rare in this domain. The separation of rule logic from engine-specific adapters and the emphasis on verification address a genuine pain point in query optimizer development. However, the absence of concrete evidence on verification methods, adapter semantics preservation, and quantitative outcomes limits the assessed significance in its current form.

major comments (3)

[Evaluation] Evaluation section: the claim of successful deployment of 33 rules to CockroachDB and Data Fusion is presented without any quantitative correctness results (e.g., differential testing outcomes, equivalence checks post-adaptation, or failure rates), which is load-bearing for the portability and 'minimal effort' assertions.
[§3] §3 (Language and Verification): the assertion that 'all rewrites [are] automatically verified formally' provides no description of the verification technique, the properties proven (e.g., equivalence of match+transform), or the handling of custom operators, leaving the verifiable claim without supporting detail.
[Adapter description] Adapter description (near §4): the central 'lightweight adapter' mechanism is stated to preserve semantics and verification guarantees, yet no evidence, proof sketch, or testing is given that adapters avoid semantic mismatches; this directly affects whether Rulescript-level verification extends to deployed rules on target engines.

minor comments (1)

[Abstract] Abstract: 'several custom operators' is mentioned without enumeration or examples; adding a brief list would improve clarity without altering the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key gaps in evidence and detail that weaken the central claims. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of verification, adapters, and evaluation results.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the claim of successful deployment of 33 rules to CockroachDB and Data Fusion is presented without any quantitative correctness results (e.g., differential testing outcomes, equivalence checks post-adaptation, or failure rates), which is load-bearing for the portability and 'minimal effort' assertions.

Authors: We agree that the evaluation lacks quantitative correctness metrics. The current text reports only that the 33 rules were reimplemented and deployed, without reporting equivalence checks, differential testing results, or failure rates across the target engines. In the revised version we will add a dedicated subsection (or table) that presents the outcomes of systematic equivalence verification for the deployed rules, including any discrepancies encountered and the effort required for adapter implementation. This will directly support the portability claims. revision: yes
Referee: [§3] §3 (Language and Verification): the assertion that 'all rewrites [are] automatically verified formally' provides no description of the verification technique, the properties proven (e.g., equivalence of match+transform), or the handling of custom operators, leaving the verifiable claim without supporting detail.

Authors: The manuscript asserts automatic formal verification but does not describe the underlying technique, the precise properties established (e.g., semantic equivalence of the match-and-transform phases), or how custom operators are incorporated into the verification process. We will expand §3 with a clear account of the verification method, the properties proven, and the reduction of custom operators to the core language for verification purposes. revision: yes
Referee: [Adapter description] Adapter description (near §4): the central 'lightweight adapter' mechanism is stated to preserve semantics and verification guarantees, yet no evidence, proof sketch, or testing is given that adapters avoid semantic mismatches; this directly affects whether Rulescript-level verification extends to deployed rules on target engines.

Authors: We acknowledge that the paper describes adapters as lightweight and semantics-preserving without providing a proof sketch, formal argument, or empirical validation that the mappings avoid semantic mismatches. In the revision we will augment the adapter section with a proof sketch showing preservation of core and custom operator semantics, together with a brief report on the concrete adapter implementations and any testing performed for CockroachDB and Data Fusion. revision: yes

Circularity Check

0 steps flagged

No circularity: Rulescript claims rest on new DSL design and reimplementation

full rationale

The paper introduces Rulescript as a new engine-agnostic DSL with explicit core algebra, matching/transformation decomposition, and formal verification of rewrites at the abstract level. Evaluation reimplements 33 existing Calcite rules and ports them via adapters to CockroachDB and Data Fusion. No derivation reduces to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations; the verification and portability claims are supported by the language definition and empirical porting rather than circular reduction to prior results or inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that relational algebra forms an adequate core for expressing all relevant query plans and rewrites, plus the ad-hoc assumption that custom operators and lightweight adapters can fully capture engine-specific semantics without compromising verification.

axioms (1)

domain assumption Relational algebra provides a suitable core language for expressing query plans and rewrite rules across engines
The design explicitly builds on a relational algebra-inspired core language as stated in the abstract.

invented entities (1)

Rulescript DSL with matching/transformation decomposition no independent evidence
purpose: To enable engine-agnostic definition and formal verification of query rewrite rules
New language and decomposition introduced by the paper; no independent evidence provided beyond the abstract description.

pith-pipeline@v0.9.0 · 5592 in / 1397 out tokens · 68289 ms · 2026-05-08T03:43:36.087832+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 4 canonical work pages

[1]

Astrahan, Mike W

Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim N Gray, Patricia P. Griffiths, W Frank King, Raymond A. Lorie, Paul R. McJones, James W. Mehl, et al. 1976. System R: Relational approach to database management.ACM Transactions on Database Systems (TODS)1, 2 (1976), 97–137

1976
[2]

Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J Mior, and Daniel Lemire. 2018. Apache calcite: A foundational framework for optimized query processing over heterogeneous data sources. InProceedings of the 2018 International Conference on Management of Data. 221–230

2018
[3]

Chen and P.S

M.-S. Chen and P.S. Yu. 1993. Combining joint and semi-join operations for dis- tributed query processing.IEEE Transactions on Knowledge and Data Engineering 5, 3 (1993), 534–542. https://doi.org/10.1109/69.224205

work page doi:10.1109/69.224205 1993
[4]

Shumo Chu, Chenglong Wang, Konstantin Weitz, and Alvin Cheung. 2017. Cosette: An Automated Prover for SQL.. InCIDR. 1–7

2017
[5]

Shumo Chu, Konstantin Weitz, Alvin Cheung, and Dan Suciu. 2017. HoTTSQL: Proving query rewrites with univalent SQL semantics.ACM SIGPLAN Notices 52, 6 (2017), 510–524

2017
[6]

Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The snowflake elastic data warehouse. InProceed- ings of the 2016 International Conference on Management of Data. 215–226

2016
[7]

2022.Transpose Semi-join Aggregation Rule

Apache Doris. 2022.Transpose Semi-join Aggregation Rule. https://github.com/ apache/doris/blob/8cb5a4210f794abca49b9aa0355120877aba2f32/fe/fe-core/src/ main/java/org/apache/doris/nereids/rules/rewrite/TransposeSemiJoinAgg.java

2022
[8]

Goetz Graefe. 1995. The cascades framework for query optimization.IEEE Data Eng. Bull.18, 3 (1995), 19–29

1995
[9]

Goetz Graefe and William J McKenna. 1993. The volcano optimizer genera- tor: Extensibility and efficient search. InProceedings of IEEE 9th international conference on data engineering. IEEE, 209–218

1993
[10]

Yannis E Ioannidis and Raghu Ramakrishnan. 1995. Containment of conjunctive queries: Beyond relations as sets.ACM Transactions on Database Systems (TODS) 20, 3 (1995), 288–324

1995
[11]

2024.MySQL Bug #114435: Incorrect query results caused by the subquery optimization

MySQL. 2024.MySQL Bug #114435: Incorrect query results caused by the subquery optimization. https://bugs.mysql.com/bug.php?id=114435

2024
[12]

Hamid Pirahesh, Joseph M Hellerstein, and Waqar Hasan. 1992. Extensible/rule based query rewrite optimization in Starburst.ACM Sigmod Record21, 2 (1992), 39–48

1992
[13]

2015.PostgreSQL Bug #13592: Optimizer throws out join constraint causing incorrect result

PostgreSQL. 2015.PostgreSQL Bug #13592: Optimizer throws out join constraint causing incorrect result. https://www.postgresql.org/message-id/20150826195031. 2091.40681%40wrigleys.postgresql.org

work page arXiv 2015
[14]

Manuel Rigger and Zhendong Su. 2020. Detecting optimization bugs in data- base engines via non-optimizing reference engine construction(ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 1140–1152. https://doi.org/10.1145/3368089.3409710

work page doi:10.1145/3368089.3409710 2020
[15]

P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. InProceedings of the 1979 ACM SIGMOD international conference on Management of data. 23–34

1979
[16]

2022.Push Down Left Semi/Anti-join Rule

Apache Spark. 2022.Push Down Left Semi/Anti-join Rule. https: //github.com/apache/spark/blob/168e5cf57147eed3e789b6d609a60396750094e4/ sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ PushDownLeftSemiAntiJoin.scala

2022
[17]

2025.Optimizer incorrectly push down aggregations to subqueries with distinct and union all.https://sqlite.org/forum/forumpost/a860f5fb2e

SQLite. 2025.Optimizer incorrectly push down aggregations to subqueries with distinct and union all.https://sqlite.org/forum/forumpost/a860f5fb2e

2025
[18]

Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, et al. 2020. Cockroachdb: The resilient geo-distributed sql database. InProceedings of the 2020 ACM SIGMOD international conference on management of data. 1493–1509

2020
[19]

The Apache Calcite Team. 2026. Apache Calcite: RelRule. https: //github.com/apache/calcite/blob/80e717ab2e5f9f6fb23995ab995264f2ec940882/ core/src/main/java/org/apache/calcite/plan/RelRule.java. [Accessed 02-27-2026]

2026
[20]

The Apache Datafusion Team. 2026. Apache DataFusion. https://datafusion. apache.org/. [Accessed 02-27-2026]

2026
[21]

The Apache Datafusion Team. 2026. Apache Datafu- sion: LogicalPlan. https://github.com/apache/datafusion/blob/ 6713439497561fa74a94177e5b8632322fb7cea5/datafusion/expr/src/logical_plan/ plan.rs. [Accessed 02-27-2026]

2026
[22]

The CockroachDB Team. 2026. CockroachDB: Optgen. https://github.com/ cockroachdb/cockroach/tree/58e75c8e68cafc9a34482eb75f05ef51e6d8ca9a/pkg/ sql/opt/optgen. [Accessed 02-27-2026]

2026
[23]

BA Trahtenbrot. 1963. Impossibility of an algorithm for the decision problem in finite classes.Nine Papers on Logic and Quantum Electrodynamics(1963), 1–5

1963
[24]

Transaction Processing Performance Council (TPC). 2020. TPC-H Benchmark Specification, Version 2.17.1. https://www.tpc.org/tpc_documents_current_ versions/pdf/tpc-h_v2.17.1.pdf. [Accessed 30-01-2026]

2020
[25]

Patrick Valduriez and Georges Gardarin. 1984. Join and Semijoin Algorithms for a Multiprocessor Database Machine.ACM Trans. Database Syst.9, 1 (March 1984), 133–161. https://doi.org/10.1145/348.318590

work page doi:10.1145/348.318590 1984
[26]

Shuxian Wang, Sicheng Pan, and Alvin Cheung. 2024. QED: A Powerful Query Equivalence Decider for SQL.Proc. VLDB Endow.17, 11 (2024), 3602–3614

2024
[27]

Qi Zhou, Joy Arulraj, Shamkant B Navathe, William Harris, and Jinpeng Wu
[28]

In2022 IEEE 38th International Conference on Data Engineering (ICDE)

SPES: A symbolic approach to proving query equivalence under bag semantics. In2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2735–2748. 13