pith. machine review for the scientific record.

arxiv: 2604.20145 · v1 · submitted 2026-04-22 · 💻 cs.DB · cs.LG

Recognition: unknown

Pre-Execution Query Slot-Time Prediction in Cloud Data Warehouses: A Feature-Scoped Machine Learning Approach

Prashant Kumar Pathak

Pith reviewed 2026-05-09 23:22 UTC · model grok-4.3

classification 💻 cs.DB cs.LG
keywords query cost prediction · slot-time estimation · cloud data warehouse · machine learning · pre-execution features · BigQuery · SQL complexity · gradient boosting

The pith

A machine learning model predicts BigQuery slot-time before execution using only pre-execution signals from query structure and planner estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that slot-time consumption in cloud data warehouses can be forecast prior to query execution by training on structured complexity scores, data-volume estimates, workload metadata, and query text alone. This matters because accurate advance estimates would reduce budget overruns and allow better scheduling in shared environments where costs vary widely. The approach trains a gradient-boosting regressor on log-transformed targets and fuses numeric features with a reduced text embedding, then evaluates out of distribution across separate deployment environments. Results show useful accuracy on the bulk of queries and a clear improvement over simple baselines for cost-significant ones, yet no gain on the longest-running tail, which the authors attribute to factors outside the chosen feature scope.

Core claim

A HistGradientBoostingRegressor trained on pre-execution features achieves an MAE of 1.17 slot-minutes and 74 percent explained variance across the full test workload, while reducing MAE from 4.95 to 3.10 slot-minutes on queries whose slot-time is at least 0.01 minutes, a 30-37 percent improvement over mean and median baselines; the same model shows no advantage on queries exceeding 20 minutes, consistent with the claim that long-tail behavior is driven by runtime conditions excluded from the feature set.
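The quoted 30-37 percent figure follows directly from the reported MAE values, as a quick arithmetic check shows:

```python
# Quick check of the reported 30-37% MAE reduction on cost-significant
# queries, using the numbers quoted above (all in slot-minutes).
model_mae = 3.10    # feature-scoped model
mean_mae = 4.95     # predict-mean baseline
median_mae = 4.54   # predict-median baseline

reduction_vs_mean = (mean_mae - model_mae) / mean_mae        # ~0.374
reduction_vs_median = (median_mae - model_mae) / median_mae  # ~0.317

print(f"vs mean baseline:   {reduction_vs_mean:.1%}")
print(f"vs median baseline: {reduction_vs_median:.1%}")
```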

What carries the argument

A feature-scoped pipeline that combines a structured query complexity score, planner data-volume estimates, and TF-IDF-plus-TruncatedSVD text features inside a HistGradientBoostingRegressor to predict log-transformed slot-time.

Load-bearing premise

Signals observable before execution, such as query text and planner estimates, contain enough information to predict actual slot-time even when runtime contention and cache state are unknown.

What would settle it

A controlled experiment that adds current slot-pool utilization and cache-state features to the model and measures whether error on the long-tail queries drops substantially below the current baseline performance.

Figures

Figures reproduced from arXiv: 2604.20145 by Prashant Kumar Pathak.

Figure 1. Predicted versus actual slot-time for 746 held-out test queries (log-log).
Figure 3. Prediction residuals across the actual slot-time range.
Original abstract

Cloud data warehouses bill compute based on slot-time consumed. In shared multi-tenant environments, query cost is highly variable and hard to estimate before execution, causing budget overruns and degraded scheduling. Static query-planner heuristics fail to capture complex SQL structure, data skew, and workload contention. We present a feature-scoped machine learning approach that predicts BigQuery slot-time before execution using only pre-execution observable signals: a structured query complexity score derived from SQL operator costs, data volume features from planner estimates and workload metadata, and textual features from query text. We deliberately exclude runtime factors (slot-pool utilization, cache state, realized skew) unknowable at submission. The model uses a HistGradientBoostingRegressor trained on log-transformed slot-time, with a TF-IDF + TruncatedSVD-512 text pipeline fused with numeric and categorical features. Trained on 749 queries across seven deployment environments and evaluated out-of-distribution on 746 queries from two held-out environments, the model achieves MAE 1.17 slot-minutes, RMSE 4.71, and 74% explained variance on the full workload. On cost-significant queries (slot-time >= 0.01 min, N=282) the model achieves MAE 3.10 versus 4.95 for a predict-mean baseline and 4.54 for predict-median, a 30-37% reduction. On long-tail queries (>= 20 min, N=22) the model does not outperform trivial baselines, consistent with the hypothesis that long-tail queries are dominated by unobserved runtime factors outside the current feature scope. A complexity-routed dual-model architecture is described as a practical refinement, and directions for closing the long-tail gap are identified as future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a feature-scoped ML approach to predict BigQuery slot-time before execution using only pre-execution signals: a structured query complexity score from SQL operators, planner data-volume estimates, workload metadata, and TF-IDF+TruncatedSVD textual features from the query. A HistGradientBoostingRegressor is trained on log-transformed targets and evaluated out-of-distribution on 746 queries from two held-out environments after training on 749 queries from seven environments. It reports overall MAE 1.17, RMSE 4.71, and 74% explained variance; on cost-significant queries (slot-time >=0.01 min, N=282) MAE improves to 3.10 versus 4.95 (mean) and 4.54 (median) baselines (30-37% reduction). On long-tail queries (>=20 min, N=22) the model does not beat baselines, which the authors attribute to excluded runtime factors (slot-pool utilization, cache state, realized skew) and flag as future work, along with a complexity-routed dual-model refinement.

Significance. If the long-tail gap can be closed while preserving the pre-execution constraint, the work would provide a practical tool for reducing budget overruns and improving scheduling in multi-tenant cloud data warehouses. The explicit scoping of features to observables at submission time and the out-of-distribution evaluation across deployment environments are strengths that increase confidence in generalization within the current feature scope. The transparent reporting of failure on high-cost queries also usefully bounds the claims.

major comments (3)
  1. [Abstract] The central claim that pre-execution signals suffice for accurate slot-time prediction is undermined by the reported result that the model does not outperform mean/median baselines on long-tail queries (>=20 min, N=22). Because these queries drive the most severe budget overruns and scheduling problems, this is a load-bearing limitation for the use case; the small N=22 also limits statistical reliability of the conclusion that runtime factors are the sole cause.
  2. [Abstract / Methods] No derivation, validation, or sensitivity analysis is provided for the structured query complexity score derived from SQL operator costs, which is presented as a core component of the feature set. Without this, it is impossible to assess whether the score is reproducible or whether it meaningfully captures the intended aspects of query structure.
  3. [Evaluation] No ablation of feature groups (complexity score, planner estimates, metadata, text), no hyperparameter settings for HistGradientBoostingRegressor or TruncatedSVD-512, and no error bars or confidence intervals accompany the MAE/RMSE and 74% explained-variance figures. These omissions make it difficult to judge the robustness of the 30-37% improvement on the N=282 cost-significant queries.
minor comments (1)
  1. [Abstract] The abstract states that a complexity-routed dual-model architecture is described as a practical refinement, but no details on the routing logic, threshold, or performance of the dual model are supplied in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments correctly identify areas where additional detail and analysis would strengthen the manuscript. We address each major comment below, indicating where revisions will be made to improve clarity, reproducibility, and robustness while preserving the paper's core contributions and honest scoping of pre-execution features.

Point-by-point responses
  1. Referee: [Abstract] The central claim that pre-execution signals suffice for accurate slot-time prediction is undermined by the reported result that the model does not outperform mean/median baselines on long-tail queries (>=20 min, N=22). Because these queries drive the most severe budget overruns and scheduling problems, this is a load-bearing limitation for the use case; the small N=22 also limits statistical reliability of the conclusion that runtime factors are the sole cause.

    Authors: We agree this is an important limitation and have already flagged it explicitly in the manuscript as arising from runtime factors outside our pre-execution feature scope. The N=22 reflects the natural rarity of such queries in the production dataset; while this limits statistical power, the result is consistent with our hypothesis and bounds the claims appropriately. The primary use-case improvements (30-37% on cost-significant queries) remain valid. We will expand the abstract and discussion to more prominently qualify the scope, add a brief power analysis note for the tail, and elaborate on the complexity-routed dual-model refinement as a practical path forward. revision: partial

  2. Referee: [Abstract / Methods] No derivation, validation, or sensitivity analysis is provided for the structured query complexity score derived from SQL operator costs, which is presented as a core component of the feature set. Without this, it is impossible to assess whether the score is reproducible or whether it meaningfully captures the intended aspects of query structure.

    Authors: The score is a linear combination of operator counts weighted by standard BigQuery cost heuristics (e.g., join, aggregate, scan costs). We will add a dedicated subsection in Methods with the exact formula, operator mapping table, and rationale. We will also include a sensitivity analysis varying the weights by ±20% and reporting the resulting change in model MAE/RMSE on the held-out set to demonstrate robustness. revision: yes
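A weighted operator-count score of the kind the rebuttal describes could look like the sketch below; the operator set and weights are hypothetical, since the paper's actual mapping and BigQuery cost heuristics are not given in the provided text:

```python
import re

# Hypothetical per-operator weights -- illustrative only, not the
# paper's operator mapping or cost values.
OPERATOR_WEIGHTS = {
    "join": 3.0,
    "group by": 2.0,
    "order by": 1.5,
    "where": 1.0,
    "select": 0.5,
}

def complexity_score(sql: str) -> float:
    """Linear combination of operator counts weighted by per-operator costs."""
    text = sql.lower()
    return sum(
        w * len(re.findall(r"\b" + op + r"\b", text))
        for op, w in OPERATOR_WEIGHTS.items()
    )

# One SELECT, one JOIN, one WHERE, one GROUP BY -> 0.5 + 3.0 + 1.0 + 2.0
q = "SELECT a, COUNT(*) FROM t JOIN u ON t.id = u.id WHERE a > 0 GROUP BY a"
print(complexity_score(q))  # 6.5
```

The proposed ±20% sensitivity analysis would then amount to perturbing `OPERATOR_WEIGHTS` and re-fitting the downstream model.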

  3. Referee: [Evaluation] No ablation of feature groups (complexity score, planner estimates, metadata, text), no hyperparameter settings for HistGradientBoostingRegressor or TruncatedSVD-512, and no error bars or confidence intervals accompany the MAE/RMSE and 74% explained-variance figures. These omissions make it difficult to judge the robustness of the 30-37% improvement on the N=282 cost-significant queries.

    Authors: We will add (1) a feature-group ablation table showing incremental performance when adding complexity score, planner estimates, metadata, and text features; (2) the full hyperparameter configuration for HistGradientBoostingRegressor (learning_rate=0.05, max_depth=6, etc.) and TruncatedSVD (n_components=512, random_state=42); and (3) bootstrap 95% confidence intervals for all reported MAE, RMSE, and explained-variance metrics on both the full and cost-significant subsets. revision: yes
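The bootstrap confidence intervals promised here could be computed along the following lines (synthetic stand-in data, since the paper's predictions are not available in the provided text):

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap CI for MAE, resampling queries with replacement."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    n = len(errors)
    boot_maes = [errors[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)]
    lo, hi = np.quantile(boot_maes, [alpha / 2, 1 - alpha / 2])
    return errors.mean(), lo, hi

# Synthetic stand-in for the 746 held-out queries.
y_true = rng.exponential(scale=2.0, size=746)
y_pred = y_true + rng.normal(0.0, 1.0, size=746)
mae, lo, hi = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE {mae:.2f} slot-min, 95% CI [{lo:.2f}, {hi:.2f}]")
```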

Circularity Check

0 steps flagged

No circularity: standard supervised ML evaluation on held-out data

Full rationale

The paper reports an empirical machine learning regression (HistGradientBoostingRegressor) trained on 749 queries from seven environments and evaluated on 746 queries from two held-out environments. Performance numbers (MAE, RMSE, explained variance, and baseline comparisons) are obtained via standard train/test split and out-of-distribution evaluation; no derivation, equation, or claim reduces by construction to its own fitted inputs or to a self-citation chain. The explicit acknowledgment that long-tail queries are not predicted well because runtime factors were deliberately excluded is transparent rather than circular. No self-definitional, fitted-input-renamed-as-prediction, or uniqueness-imported steps appear in the provided text.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The central claim depends on several modeling choices and domain assumptions whose justification is not supplied in the abstract; these include the log transform, the exact definition of the complexity score, the 512-dimensional SVD truncation, and the premise that runtime factors can be ignored.

free parameters (3)
  • TruncatedSVD components = 512
    Dimensionality reduction parameter for text pipeline
  • Target transformation
    Log transform applied to slot-time before regression
  • HistGradientBoosting hyperparameters
    Learning rate, max depth, and other tuning choices not reported
axioms (1)
  • domain assumption Pre-execution signals suffice for useful slot-time prediction
    Explicitly stated by excluding runtime factors and focusing on submission-time observables

pith-pipeline@v0.9.0 · 5619 in / 1680 out tokens · 56547 ms · 2026-05-09T23:22:55.098354+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 1 canonical work page · 1 internal anchor

  1. [1] R. Marcus et al., “Neo: A learned query optimizer,” PVLDB, vol. 12, no. 11, pp. 1705–1718, 2019.

  2. [2] R. Marcus et al., “Bao: Making learned query optimization practical,” in Proc. ACM SIGMOD, 2021, pp. 1275–1288.

  3. [3] A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper, “Learned cardinalities: Estimating correlated joins with deep learning,” in Proc. CIDR, 2019.

  4. [4] P. G. Selinger et al., “Access path selection in a relational database management system,” in Proc. ACM SIGMOD, 1979, pp. 23–34.

  5. [5] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.

  6. [6] N. Armenatzoglou et al., “Amazon Redshift re-invented,” in Proc. ACM SIGMOD, 2022, pp. 2205–2217.

  7. [7] S. Melnik et al., “Dremel: A decade of interactive SQL analysis at web scale,” PVLDB, vol. 13, no. 12, pp. 3461–3472, 2020.

  8. [8] B. Dageville et al., “The Snowflake elastic data warehouse,” in Proc. ACM SIGMOD, 2016, pp. 215–226.

  9. [9] A. Verbitski et al., “Amazon Aurora: Design considerations for high-throughput cloud-native relational databases,” in Proc. ACM SIGMOD, 2017, pp. 1041–1052.

  10. [10] W. Lang, S. Shankar, J. M. Patel, and A. Kalhan, “Towards multi-tenant performance SLOs,” in Proc. IEEE ICDE, 2012, pp. 702–713.

  11. [11] P. K. Pathak, C. B. Mouleeswaran, and R. T. Repaka, “Heuristic search space partitioning for low-latency multi-tenant cloud queries,” arXiv preprint arXiv:2604.19057, 2026.