arxiv: 2604.28141 · v1 · submitted 2026-04-30 · 💻 cs.DB

Recognition: unknown

Index-Assisted Stratified Sampling for Online Aggregation

Yunnan Yu, Zhuoyue Zhao

Authors on Pith no claims yet

Pith reviewed 2026-05-07 05:38 UTC · model grok-4.3

classification 💻 cs.DB

keywords stratified samplingonline aggregationapproximate query processingindex-assisted samplingtwo-phase samplingsample size allocationNeyman allocation

0 comments

The pith

A two-phase index-assisted stratified sampling method reduces the samples needed for accurate online aggregation by optimizing strata from the first batch of draws.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to lower the latency of approximate answers with confidence bounds for ad-hoc queries on frequently updated flat data. It shows that classic stratified sampling cannot be used directly in index-assisted systems because those systems rely on index-based cost models rather than simple scan costs, and because they cannot afford heavy upfront statistics collection. The solution splits sampling into two phases: the first phase draws a modest number of samples both to start the aggregate estimate and to measure per-stratum variances and selectivities, then uses those measurements to set optimal stratum boundaries and sample allocations for the remaining work under the actual index cost model. If the approach holds, queries that would otherwise need large samples due to high data variance can finish with far fewer total draws while preserving the statistical guarantees.

Core claim

We design index-assisted stratified sampling for online aggregation, which features a two-phase sampling framework. Samples drawn from first phase are used for both online aggregation and optimizing future sampling cost, while the second phase continues the online aggregation using the optimized strata. We prove optimal stratification and sample size allocation strategies for index-based sampling cost model, and design several greedy and dynamic programming based optimization methods to balance optimization cost and effectiveness in cost reduction.

What carries the argument

The two-phase sampling framework that uses first-phase index-assisted samples to estimate stratum variances and selectivities, then applies proven optimal stratification and allocation rules under an index-based sampling cost model before completing the query in the second phase.

If this is right

For data with high variance or low selectivity, the total number of index probes needed to reach a target error bound drops because strata receive sample sizes proportional to their measured standard deviation and cost.
Query latency improves because fewer index accesses are performed overall; the paper reports consistent speedups and peaks of 3x versus index-assisted uniform sampling.
The same first-phase samples serve dual purposes, so the overhead of optimization is paid only once per query rather than requiring separate statistics collection.
The greedy and dynamic-programming planners let users trade a small extra optimization cost for larger reductions in sampling cost depending on query selectivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-use of early samples suggests a general pattern for any sampling-based estimator that can afford a small pilot phase: use the pilot both for the estimate and to tune the sampling plan for the remainder.
Because the cost model is tied to index probes rather than rows scanned, the same two-phase logic could be tested on other index structures such as B-trees or inverted indexes without changing the optimization proofs.
If the first phase is kept small, the method remains compatible with streaming updates that arrive between phases, provided the index can reflect those updates before the second phase begins.

Load-bearing premise

Samples from the first phase give sufficiently accurate estimates of per-stratum variances and selectivities to let the second-phase allocation actually reach the theoretically lower sampling cost without introducing bias into the final aggregate or its confidence bounds.

What would settle it

Execute the two-phase method on a dataset whose value distribution or selectivity changes sharply after the first phase is collected and check whether the final confidence interval still covers the true aggregate at the claimed rate and whether the total sample count is smaller than uniform index sampling would have required.

Figures

Figures reproduced from arXiv: 2604.28141 by Yunnan Yu, Zhuoyue Zhao.

**Figure 1.** Figure 1: Ad-hoc queries over in index-assisted S-AQP system view at source ↗

**Figure 2.** Figure 2: The percentage of cancelled flights and the total flight during the date range. Overall estimator variance is higher than those of any of the three smaller date ranges. However, the time to achieve a given confidence interval in these systems may vary heavily. The root cause is these systems often draw samples uniformly without considering the variations of data distribution and query selectivity within a … view at source ↗

**Figure 4.** Figure 4: Sampling from an aggregate B-tree. Underscored num view at source ↗

**Figure 5.** Figure 5: Union of two strata has a higher LCA height than view at source ↗

**Figure 8.** Figure 8: Greedy initial stratification for query range [8, 24) 6 10 20 26 11 13 19 P3 P3_1 ...... 19 P3_2 P3_3 ...... ...... view at source ↗

**Figure 10.** Figure 10: Partition granularity 𝑑 = 3 on a phase 1 sample with 6 distinct values results in 3 pre-grouped partitions. 𝑆𝑗 = 𝑛0 ∑︁ 𝑡 ∈S0∧𝑡.𝑥<𝐶𝑗 𝑒 (𝑡)P𝑓 (𝑡) 𝑆2𝑗 = ∑︁ 𝑡 ∈S0∧𝑡.𝑥<𝐶𝑗 (𝑛0𝑒 (𝑡)P𝑓 (𝑡) − 𝑆𝑗 /𝑚𝑗) 2 ℎ𝑗 = ∑︁ 𝑡 ∈S0∧𝑡.𝑥<𝐶𝑗 LCA height of t PROPOSITION 4.1. It takes𝑂(1) time to compute any 𝜎 2 [𝐶𝑗 ′,𝐶𝑗) and ℎ[𝐶𝑗 ′,𝐶𝑗), with 𝑂(𝑛0)-time pre-processing and 𝑂(𝐾) space3 . PROOF. Preprocessing and storage of 𝑚, ® 𝑆, ® 𝑆®2… view at source ↗

**Figure 11.** Figure 11: OptiAQP (left) components; (right) query plan (a) (b) view at source ↗

**Figure 13.** Figure 13: Query latency with varying requested confidence interval thus have similar per-sample cost. Greedy performs better than CostOpt and SizeOpt on flight and census dataset, but worse on intel dataset. The reason is Greedy can find a good stratification with lower optimization overhead for flight and census, while it fails to find a good stratification for intel based on the same stopping criteria. Notably, … view at source ↗

**Figure 15.** Figure 15: Speedup over uniform with randomly generated query ranges view at source ↗

**Figure 19.** Figure 19: Varying the initial sample size 𝑛0 for CostOpt good trade-offs and low query latencies. For CostOpt, preprocessing factor 𝑐0 represents the constant overhead of tree traversals and data structure settings. When migrating to a new environment, it requires only a one-time calibration, which is similar to other cost constants in optimizer. Partition granularity 𝑑 directly influences the 𝑂(𝑑 3 ) dynamic prog… view at source ↗

**Figure 16.** Figure 16: Varying parameters for CostOpt 30 200 600 1000 2000 per-stratum sample size ∆n0 4e-05 0.004 0.04 0.4 4 40 stopping threshold τ (%) 25 50 75 100 125 150 175 200 query latency (s) view at source ↗

**Figure 18.** Figure 18: Execution time breakdown for different stratification strategies over TPC-H the two baseline methods Equal and SizeOpt are significantly volatile in speedup and can have larger proportion that causes slow down, because they do not have optimality guarantees. 5.6 Parameter tuning We next investigate the impact of algorithm parameters on CostOpt and Greedy using TPC-H lineitem scale factor 20 with 3 highv… view at source ↗

**Figure 19.** Figure 19: The actual time for first phase includes optimization view at source ↗

read the original abstract

Ad-hoc queries over frequently updated data in a flat schema are common in real-time data analysis applications and often require very low latency. Online aggregation can achieve so by providing approximate aggregation answers with confidence bound guarantees. It relies on the ability to draw samples online in a linear time to sample size rather than database size, which can be supported by index-assisted Sampling-based Approximate Query Processing (S-AQP) systems. However, the query latencies of approximate queries in these systems can still suffer from excessive sampling cost required to achieve a desired confidence bound, due to increased sample size for data with high variance in value distribution and selectivity. Classic stratified sampling methods with Neyman allocation can minimize sample size in theory, but several challenges prevent it from being applicable in index-assisted S-AQP systems, including requiring apriori statistics, high optimization cost, and inaccurate sampling cost model based on sample size. Towards that, we design index-assisted stratified sampling for online aggregation, which features a two-phase sampling framework. Samples drawn from first phase are used for both online aggregation and optimizing future sampling cost, while the second phase continues the online aggregation using the optimized strata. We prove optimal stratification and sample size allocation strategies for index-based sampling cost model, and design several greedy and dynamic programming based optimization methods to balance optimization cost and effectiveness in cost reduction. We evaluate our methods on several real-world and synthetic datasets and queries, and the results show ours consistently achieve good speedup and, in extreme cases, up to 3x speedup and 98708x speedup, when compared to index-assisted uniform sampling and classic scan-based stratified sampling respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The two-phase index-assisted stratified sampler is a practical engineering step for lowering sampling cost in online aggregation, but the adaptive use of phase-1 samples to set strata and allocations for phase 2 risks breaking unbiasedness and standard confidence bounds.

read the letter

The paper's main contribution is a two-phase sampling framework for index-assisted S-AQP systems. Phase-1 samples feed both the running aggregate and the estimation of strata boundaries plus Neyman-style allocations for phase 2, which then finishes the query under a cheaper index-based plan. They prove optimal stratification and allocation rules for their cost model and supply greedy and DP optimizers to keep the tuning overhead manageable. This is new relative to classic stratified sampling, which assumes known strata and variances upfront and uses a sample-size-only cost model that does not match how indexes actually retrieve tuples. The work correctly identifies why those assumptions fail in online, index-driven settings and shows consistent speedups on real and synthetic data, reaching 3x versus index uniform sampling. The much larger gains versus scan-based stratified sampling are expected once indexes are in play. The soft spot is the adaptivity. Because strata and allocations for phase 2 are chosen from the same phase-1 draws that already contribute to the estimate, the overall design is data-dependent. Standard sampling theory says this generally biases the estimator and invalidates the usual variance formula used for confidence intervals unless an explicit correction (post-stratification weights, Horvitz-Thompson adjustment, or conditional inference) is applied. The abstract states proofs of optimality, but those proofs almost certainly assume strata are known in advance; nothing in the description indicates how the paper closes the gap. The extreme reported speedups also need careful experimental controls to rule out post-hoc tuning or dataset-specific effects. This is aimed at the approximate query processing community working on low-latency analytics over large, updating tables. It deserves peer review because the engineering framing and empirical results are concrete enough to be useful even if the statistical details require tightening. I would send it out; referees will focus on whether the bias issue is handled and whether the speedups are reproducible across workloads.

Referee Report

3 major / 3 minor

Summary. The paper proposes index-assisted stratified sampling for online aggregation over frequently updated flat-schema data. It introduces a two-phase framework in which phase-1 samples contribute to the running aggregate estimate while also being used to select strata boundaries and Neyman-style allocations that minimize an index-based sampling cost model for phase 2. The authors prove optimal stratification and allocation strategies under this cost model, present greedy and dynamic-programming algorithms to trade optimization overhead against cost reduction, and report empirical speedups of up to 3× versus index-assisted uniform sampling and up to 98 708× versus scan-based stratified sampling on real and synthetic workloads.

Significance. If the statistical validity of the final estimator is preserved, the work would meaningfully advance practical online aggregation in index-assisted S-AQP systems by removing the need for a priori statistics while still exploiting stratification. The explicit treatment of an index-specific cost model, the accompanying optimality proofs, and the provision of multiple optimization algorithms that balance overhead against benefit constitute clear technical contributions. The reported speedups, if reproducible and free of post-hoc selection, would indicate substantial latency gains for low-latency ad-hoc analytics.

major comments (3)

[two-phase sampling framework] The two-phase sampling framework (abstract and the description of the method) uses the same phase-1 draws both for the aggregate estimate and for data-dependent estimation of strata boundaries and allocations. Standard sampling theory shows that the resulting design is adaptive; the usual Neyman variance formula and the associated confidence intervals are then invalid unless an explicit correction (Horvitz–Thompson weights, post-stratification, or conditional inference) is applied. The manuscript must state precisely how unbiasedness and valid confidence bounds are maintained and must supply the corresponding derivation or reference. This issue is load-bearing for the headline claim of “confidence bound guarantees.”
[proofs of optimal stratification and sample size allocation] The proofs of optimal stratification and sample-size allocation (abstract) are stated for the index-based cost model under the assumption that strata and stratum variances are known a priori. Because the method estimates these quantities from the phase-1 sample, the optimality result does not directly apply. The paper should either (a) prove that the estimated design remains asymptotically optimal or (b) quantify the sub-optimality gap and show that the reported speedups survive this gap.
[experimental evaluation] Table or experimental section reporting the 98 708× speedup versus scan-based stratified sampling: the magnitude is extreme and depends on the interplay between data distribution, selectivity, and the index cost model. The manuscript must document the exact data-exclusion rules, the distribution of per-stratum variances, and the number of strata chosen, and must include an ablation that isolates the contribution of the index cost model versus the adaptive stratification itself.

minor comments (3)

[Abstract] Abstract: “apriori” should be “a priori”; “ours consistently achieve good speedup” should be rephrased for grammatical correctness.
[method description] The free parameters “number of strata” and “stratum boundaries” are listed in the axiom ledger; the manuscript should state explicitly how these are chosen in the reported experiments and whether they are treated as user-specified or internally optimized.
[optimization methods] The optimization-cost versus effectiveness trade-off is mentioned but not quantified; a plot or table showing cumulative optimization time versus total query latency for the greedy and DP variants would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and valuable suggestions. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed, we outline the changes to be made in the revised manuscript.

read point-by-point responses

Referee: [two-phase sampling framework] The two-phase sampling framework (abstract and the description of the method) uses the same phase-1 draws both for the aggregate estimate and for data-dependent estimation of strata boundaries and allocations. Standard sampling theory shows that the resulting design is adaptive; the usual Neyman variance formula and the associated confidence intervals are then invalid unless an explicit correction (Horvitz–Thompson weights, post-stratification, or conditional inference) is applied. The manuscript must state precisely how unbiasedness and valid confidence bounds are maintained and must supply the corresponding derivation or reference. This issue is load-bearing for the headline claim of “confidence bound guarantees.”

Authors: We appreciate the referee highlighting this critical statistical issue. The design is adaptive, and the manuscript's current presentation assumes the standard Neyman allocation applies directly to the combined sample. To correct this, we will revise the paper to clarify that the final estimator is unbiased because it is a stratified sample with fixed (though estimated) strata, using the combined phase-1 and phase-2 observations. For variance estimation and confidence intervals, we will incorporate a post-stratification adjustment and use the estimated stratum variances from the full sample, with a reference to literature on inference in adaptive sampling designs (such as conditional inference given the phase-1 data). We will add a formal derivation in an appendix showing that the bias is zero and the variance estimator is consistent. This addresses the validity of the confidence bounds. We will make these changes in the revised version. revision: yes
Referee: [proofs of optimal stratification and sample size allocation] The proofs of optimal stratification and sample-size allocation (abstract) are stated for the index-based cost model under the assumption that strata and stratum variances are known a priori. Because the method estimates these quantities from the phase-1 sample, the optimality result does not directly apply. The paper should either (a) prove that the estimated design remains asymptotically optimal or (b) quantify the sub-optimality gap and show that the reported speedups survive this gap.

Authors: We thank the referee for this precise comment. The proofs establish the optimal strata boundaries and Neyman-style allocation that minimize the index-based sampling cost when the stratum variances are known. Since the method uses phase-1 samples to estimate these, the achieved allocation is approximate. In the revision, we will add a subsection analyzing the asymptotic optimality: as the phase-1 sample size n1 grows (with n1 a fixed fraction of total n), the estimated variances converge in probability to the true ones by the law of large numbers, and thus the selected strata and allocations converge to the optimal, implying asymptotic optimality of the cost. Additionally, we will report new experiments comparing the estimated design to an oracle design with true variances, showing that the speedup gap is small (e.g., less than 10% loss) and the reported speedups largely survive. This will be included in the experimental evaluation section. revision: yes
Referee: [experimental evaluation] Table or experimental section reporting the 98 708× speedup versus scan-based stratified sampling: the magnitude is extreme and depends on the interplay between data distribution, selectivity, and the index cost model. The manuscript must document the exact data-exclusion rules, the distribution of per-stratum variances, and the number of strata chosen, and must include an ablation that isolates the contribution of the index cost model versus the adaptive stratification itself.

Authors: We agree that the large speedup figure warrants detailed documentation to allow readers to understand the conditions and to verify the contributions. In the revised manuscript, we will expand the experimental section as follows: include a table or paragraph detailing the data-exclusion rules (e.g., removal of records with null values or extreme outliers beyond 3 standard deviations); provide supplementary figures or tables showing the per-stratum variance distributions for key datasets and queries; specify the number of strata used in each experiment and the method for choosing it (e.g., via the optimization algorithm); and add an ablation study with four variants: (i) index-assisted uniform sampling, (ii) adaptive stratification with sample-size cost model, (iii) fixed strata with index cost model, and (iv) our full method. This will isolate the impact of the index cost model versus the adaptive stratification. We believe these additions will make the results more transparent and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: optimality proof is independent of phase-1 estimates

full rationale

The paper derives and proves optimal stratification and Neyman-style allocation under an explicit index-based sampling cost model; this is a mathematical result on the model parameters (assumed known). The two-phase framework then applies those derived optima by estimating strata boundaries and allocations from phase-1 samples, which also contribute to the running aggregate. This is an implementation choice, not a reduction of the proof itself to fitted quantities by construction. The greedy and dynamic-programming methods are algorithmic approximations that trade off computation against the proven optimum; they do not redefine the optimum in terms of the same data. No equation or claim equates a prediction to its own input, and no self-citation chain is invoked to justify the core result. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the accuracy of the index-based sampling cost model and the assumption that first-phase samples suffice for reliable stratum optimization. No new physical entities are postulated.

free parameters (2)

number of strata
Chosen or optimized during the first phase; exact selection rule not specified in abstract.
stratum boundaries
Determined from first-phase samples; treated as fitted parameters in the optimization.

axioms (2)

domain assumption Index access cost is linear in the number of sampled rows and can be modeled independently of data values.
Invoked when proving optimal allocation strategies for the index-based cost model.
domain assumption First-phase samples are representative enough to optimize second-phase allocation without bias.
Required for the two-phase framework to improve over uniform sampling.

pith-pipeline@v0.9.0 · 5583 in / 1500 out tokens · 52654 ms · 2026-05-07T05:38:31.986374+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 17 canonical work pages

[1]

Borui Cai, Yong Xiang, Longxiang Gao, He Zhang, Yunfeng Li, and Jianxin Li

2008. Data Expo 2009: Airline on time data. https://doi.org/10.7910/DVN/ HG7NV7

work page doi:10.7910/dvn/ 2008
[2]

Phillips

Peyman Afshani and Jeff M. Phillips. 2019. Independent Range Sampling, Revis- ited Again. InSoCG 2019. 4:1–4:13

2019
[3]

Peyman Afshani and Zhewei Wei. 2017. Independent Range Sampling, Revisited. InESA 2017. 3:1–3:14

2017
[4]

Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. InEuroSys. 29–42

2013
[5]

Billingsley

Patrick. Billingsley. 2012.Probability and measure(anniversary edition. ed.). Wiley, Hoboken, N.J. 12

2012
[6]

Peter Bodik, Wei Hong, Carlos Guestrin, Sam Madden, Mark Paskin, and Romain Thibaux. 2004. Intel Lab Data. https://db.csail.mit.edu/labdata/labdata.html

2004
[7]

Surajit Chaudhuri, Gautam Das, and Vivek Narasayya. 2007. Optimized stratified sampling for approximate query processing.ACM Trans. Database Syst.32, 2 (June 2007), 9–es. https://doi.org/10.1145/1242524.1242526

work page doi:10.1145/1242524.1242526 2007
[8]

Bolin Ding, Silu Huang, Surajit Chaudhuri, Kaushik Chakrabarti, and Chi Wang
[9]

InSIGMOD ’16

Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee. InSIGMOD ’16. 679–694
[10]

Rong Gu, Han Li, Haipeng Dai, Wenjie Huang, Jie Xue, Meng Li, Jiaqi Zheng, Haoran Cai, Yihua Huang, and Guihai Chen. 2023. ShadowAQP: Efficient Approximate Group-by and Join Query via Attribute-Oriented Sample Size Allo- cation and Data Generation.Proc. VLDB Endow.16, 13 (Sept. 2023), 4216–4229. https://doi.org/10.14778/3625054.3625059

work page doi:10.14778/3625054.3625059 2023
[11]

Peter J. Haas. 1997. Large-Sample and Deterministic Confidence Intervals for Online Aggregation. InSSDBM ’97. 51–63

1997
[12]

J. M. Hellerstein, P. J. Haas, and H. J. Wang. 1997. Online Aggregation. InProc. ACM SIGMOD International Conference on Management of Data

1997
[13]

Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kris- tian Kersting, and Carsten Binnig. 2020. DeepDB: Learn from Data, Not from Queries!Proc. VLDB Endow.13, 7 (mar 2020), 992–1005. https: //doi.org/10.14778/3384345.3384349

work page doi:10.14778/3384345.3384349 2020
[14]

D. G. Horvitz and D. J. Thompson. 1952. A generalization of sampling without replacement from a finite universe.J. Amer. Statist. Assoc.47 (1952), 663–685

1952
[15]

Xiaocheng Hu, Miao Qiao, and Yufei Tao. 2014. Independent Range Sampling. InPODS ’14. 246–255

2014
[16]

Saehan Jo and Immanuel Trummer. 2020. BitGourmet: Deterministic Approxima- tion via Optimized Bit Selection. In10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2020/papers/p5-jo-cidr20.pdf

2020
[17]

Saehan Jo and Immanuel Trummer. 2020. Demonstration of BitGourmet: Data Analysis via Deterministic Approximation. InProceedings of the 2020 Interna- tional Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pot- tinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and H...

work page doi:10.1145/3318464.3384709 2020
[18]

Srikanth Kandula. 2020. A Parallel Zipf-skewed Data Generator for TPC-H Benchmark. Retrieved July, 2023 from https://github.com/SrikanthKandula/ tpch_dbgen_zipf_skew

2020
[19]

Srikanth Kandula, Anil Shanbhag, Aleksandar Vitorovic, Matthaios Olma, Robert Grandl, Surajit Chaudhuri, and Bolin Ding. 2016. Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters. InSIGMOD ’16. 631–646

2016
[20]

Ron Kohavi. 1996. Census Income. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5GP7S

work page doi:10.24432/c5gp7s 1996
[21]

Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander Join: Online Aggre- gation via Random Walks. InProceedings of the 2016 International Conference on Management of Data(San Francisco, California, USA)(SIGMOD ’16). ACM, New York, NY , USA, 615–629. https://doi.org/10.1145/2882903.2915235

work page doi:10.1145/2882903.2915235 2016
[22]

Xi Liang, Stavros Sintos, Zechao Shang, and Sanjay Krishnan. 2021. Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing. InProceedings of the 2021 International Conference on Management of Data (Virtual Event, China)(SIGMOD ’21). Association for Computing Machinery, New York, NY , USA, 1129–1141. https://doi.org/10.1145/3...

work page doi:10.1145/3448016.3457277 2021
[23]

Qingzhi Ma and Peter Triantafillou. 2019. DBEst: Revisiting Approximate Query Processing Engines with Machine Learning Models. InSIGMOD ’19. 1553–1570

2019
[24]

F. Olken. 1993.Random Sampling from Databases. Ph.D. Dissertation. University of California at Berkeley

1993
[25]

Yongjoo Park, Barzan Mozafari, Joseph Sorenson, and Junhao Wang. 2018. VerdictDB: Universalizing Approximate Query Processing. InSIGMOD ’18. 1461–1476

2018
[26]

Jinglin Peng, Dongxiang Zhang, Jiannan Wang, and Jian Pei. 2018. AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics. InSIGMOD ’18. 1477–1492

2018
[27]

Rumbaugh and Dong Xie

Douglas B. Rumbaugh and Dong Xie. 2023. Practical Dynamic Extension for Sampling Indexes.Proc. ACM Manag. Data1, 4, Article 254 (dec 2023), 26 pages. https://doi.org/10.1145/3626744

work page doi:10.1145/3626744 2023
[28]

Korth, and S

Avi Silberschatz, Henry F. Korth, and S. Sudarshan. 2020.Database System Concepts, Seventh Edition. McGraw-Hill Book Company. https://www.db- book.com/

2020
[29]

Yufei Tao. 2022. Algorithmic Techniques for Independent Query Sampling. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Princi- ples of Database Systems(Philadelphia, PA, USA)(PODS ’22). Association for Computing Machinery, New York, NY , USA, 129–138. https://doi.org/10.1145/ 3517804.3526068

work page arXiv 2022
[30]

Congying Wang, Nithin Sastry Tellapuri, Sphoorthi Keshannagari, Dylan Zinsley, Zhuoyue Zhao, and Dong Xie. 2023. Approximate Queries over Concurrent Updates.Proc. VLDB Endow.16, 12 (aug 2023), 3986–3989. https://doi.org/10. 14778/3611540.3611602

work page arXiv 2023
[31]

Phillips, Michael Matheny, and Feifei Li

Dong Xie, Jeff M. Phillips, Michael Matheny, and Feifei Li. 2021. Spatial Independent Range Sampling. InProceedings of the 2021 International Con- ference on Management of Data(Virtual Event, China)(SIGMOD ’21). Asso- ciation for Computing Machinery, New York, NY , USA, 2023–2035. https: //doi.org/10.1145/3448016.3452806

work page doi:10.1145/3448016.3452806 2021
[32]

Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2020. NeuroCard: One Cardinality Estimator for All Tables. Proc. VLDB Endow.14, 1 (sep 2020), 61–73. https://doi.org/10.14778/3421424. 3421432

work page doi:10.14778/3421424 2020
[33]

Youngs and Elliot M

Edward A. Youngs and Elliot M. Cramer. 1971. Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms.Technometrics13, 3 (1971), 657–665. https://doi.org/10.1080/00401706.1971.10488826

work page doi:10.1080/00401706.1971.10488826 1971
[34]

Fangyuan Zhang, Mengxu Jiang, and Sibo Wang. 2023. Efficient Dynamic Weighted Set Sampling and Its Extension.Proc. VLDB Endow.17, 1 (Sept. 2023), 15–27. https://doi.org/10.14778/3617838.3617840

work page doi:10.14778/3617838.3617840 2023
[35]

Zhuoyue Zhao, Dong Xie, and Feifei Li. 2022. AB-Tree: Index for Concur- rent Random Sampling and Updates.Proc. VLDB Endow.15, 9 (may 2022), 1835–1847. https://doi.org/10.14778/3538598.3538606

work page doi:10.14778/3538598.3538606 2022
[36]

Yuxuan Zhu, Tengjun Jin, Stefanos Baziotis, Chengsong Zhang, Charith Mendis, and Daniel Kang. 2025. PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees.Proc. ACM Manag. Data3, 3, Article 198 (June 2025), 28 pages. https://doi.org/10.1145/3725335 13

work page doi:10.1145/3725335 2025