Recognition: no theorem link
Elastic Scheduling of Intermittent Query Processing in a Cluster Environment
Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3
The pith
Scheduling the intermittent (batched) processing of stream windows on elastic clusters meets all deadlines at minimum cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In an elastic cluster environment, scheduling the batched processing of stream windows allows all deadlines to be met while using the minimum number of nodes, and the same schedules can accommodate concurrent queries, newly arriving queries, and variations in input rates.
What carries the argument
Scheduling algorithms that decide when each batch runs and how many nodes to provision or release, so that deadlines are met while cost is minimized.
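The paper does not reproduce its algorithms here, but the core trade-off can be sketched: given the work accumulated in a window and a per-node processing rate, the scheduler picks the smallest node count that finishes before the deadline, and defers the batch as late as that node count allows. A minimal illustration, with all names (`min_nodes`, `latest_start`, `per_node_rate`) hypothetical and a linear-scaling assumption that the paper's actual model may refine:

```python
import math

def min_nodes(work_tuples, per_node_rate, window_end, deadline):
    # Seconds available between the window closing and the deadline.
    budget = deadline - window_end
    # Smallest node count whose combined throughput clears the batch
    # in time, assuming throughput scales linearly with nodes.
    return math.ceil(work_tuples / (per_node_rate * budget))

def latest_start(work_tuples, per_node_rate, nodes, deadline):
    # Starting any later than this leaves too little time to finish.
    return deadline - work_tuples / (per_node_rate * nodes)
```

Deferring to `latest_start` is what lets the cluster release nodes between batches, which is where the cost saving over eager processing comes from.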
If this is right
- All queries in the system meet their deadlines under the proposed schedules.
- Resource usage and therefore cost stay at the lowest level needed to satisfy the deadlines.
- Multiple concurrent queries are handled together without one query causing another to miss its deadline.
- New queries that arrive while others are running can be integrated into the schedule on the fly.
- Changes in input arrival rates trigger schedule adjustments that still meet deadlines and cost targets.
Where Pith is reading between the lines
- In pay-per-use clouds the method could translate directly into lower bills by releasing nodes between batches.
- The same scheduling logic could be adapted to other batch-oriented frameworks that support dynamic node counts.
- If scaling is not instantaneous in practice, the minimum cost would rise and a buffer of extra nodes might be required.
- Testing the scheduler with real measured scaling latencies would show how much the cost advantage shrinks.
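The scaling-latency concern above can be made concrete. Under the paper's instant-scaling premise, a node is billed only while it processes a batch; with a nonzero provisioning delay, nodes must be requested early and the warm-up interval is billed too. A toy cost comparison, with hypothetical names and a simple per-node-second billing assumption (real cloud billing granularity differs):

```python
def cost_instant(nodes, batch_duration, price):
    # Idealized cost under the instant-scaling premise:
    # nodes are billed only for the batch itself.
    return nodes * batch_duration * price

def cost_with_latency(nodes, batch_duration, scaling_latency, price):
    # Nodes must be requested scaling_latency seconds early, so the
    # warm-up interval is billed even though no tuples are processed.
    return nodes * (batch_duration + scaling_latency) * price
```

For short batches the warm-up term dominates, which is exactly why measured scaling latencies would shrink the cost advantage.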
Load-bearing premise
The cluster can add or remove nodes instantly and at zero cost, and processing times, costs, and input rates can be predicted accurately enough for the scheduler to guarantee deadlines without extra resources.
What would settle it
An experiment in which node scaling incurs measurable delay or cost, or in which actual processing times deviate from the model's predictions, and the proposed scheduler then misses one or more deadlines.
Original abstract
Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly processing tuples as they arrive significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark streaming.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes scheduling schemes for batched (intermittent) processing of stream queries over windows in an elastic parallel cluster environment. The schemes dynamically scale nodes up or down to meet specified deadlines at minimum cost while handling multiple concurrent queries, new query arrivals, and input rate variations. The approach is implemented on Apache Spark in the AWS EMR environment and evaluated on TPC-H and Yahoo Streaming datasets, with results claiming significant outperformance over fixed-node configurations and Spark Streaming baselines.
Significance. If the central claims hold under realistic conditions, the work would meaningfully extend intermittent query processing to elastic cloud settings, offering practical value for cost-sensitive stream analytics. The implementation on a real system (Spark/EMR) and handling of dynamic scenarios are strengths. However, the evaluation lacks error bars, detailed cost models, and full controls, which weakens support for the minimum-cost and deadline-guarantee assertions.
major comments (2)
- [Abstract] Abstract: The central claim that 'Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost' is load-bearing for the contribution but depends on unverified assumptions of instantaneous cost-free node scaling and exact a-priori knowledge of per-tuple processing times, per-node costs, and future input rates. The AWS EMR implementation does not provide instantaneous scaling, yet no sensitivity analysis or robustness experiments against scaling latency or estimation errors appear in the evaluation.
- [Evaluation] Evaluation section: The reported outperformance lacks error bars, statistical significance tests, and explicit controls for how costs are measured or how the scheduler obtains accurate runtime models under rate variations and new query arrivals. This makes it impossible to fully assess whether the schemes achieve true minimum cost while guaranteeing deadlines, as deviations from the assumptions would force either missed deadlines or over-provisioning.
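The robustness gap flagged above is easy to probe in simulation: schedule the latest start the runtime estimate allows, then perturb the actual processing time and count deadline misses. A minimal sketch (names and the uniform error model are illustrative, not the paper's):

```python
import random

def simulate_miss_rate(budget, est_time, error_frac, trials=10_000, seed=0):
    # Schedule the latest start the estimate allows (slack = budget -
    # est_time), draw actual processing times uniformly within
    # +/- error_frac of the estimate, and count deadline overruns.
    rng = random.Random(seed)
    misses = sum(
        est_time * (1 + rng.uniform(-error_frac, error_frac)) > budget
        for _ in range(trials)
    )
    return misses / trials
```

With zero slack, roughly half of all batches miss under any symmetric error; the experiment the referee asks for would measure how much slack (and therefore extra cost) is needed to drive the miss rate to zero.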
minor comments (2)
- [Abstract] The abstract contains a minor grammatical issue ('ensure to meet' should be rephrased for clarity).
- [Evaluation] Figure captions and axis labels in the experimental results could be clarified to explicitly show cost units and deadline margins.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and indicate planned revisions to strengthen the presentation of assumptions, robustness, and evaluation rigor.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost' is load-bearing for the contribution but depends on unverified assumptions of instantaneous cost-free node scaling and exact a-priori knowledge of per-tuple processing times, per-node costs, and future input rates. The AWS EMR implementation does not provide instantaneous scaling, yet no sensitivity analysis or robustness experiments against scaling latency or estimation errors appear in the evaluation.
Authors: We appreciate the referee's careful scrutiny of the central claim. Our scheduling formulation explicitly models scaling as occurring between batches with low overhead relative to deadlines, consistent with the intermittent processing model; the AWS EMR implementation uses standard APIs for scaling, and our experiments confirm deadline satisfaction in practice. We agree that explicit robustness checks are valuable. In the revision we will add a dedicated subsection with sensitivity experiments that inject scaling latencies and introduce controlled errors in processing-time and rate estimates, showing that the schemes still meet deadlines with only modest cost increases. Revision: yes.
-
Referee: [Evaluation] Evaluation section: The reported outperformance lacks error bars, statistical significance tests, and explicit controls for how costs are measured or how the scheduler obtains accurate runtime models under rate variations and new query arrivals. This makes it impossible to fully assess whether the schemes achieve true minimum cost while guaranteeing deadlines, as deviations from the assumptions would force either missed deadlines or over-provisioning.
Authors: We concur that greater statistical transparency and clearer controls will help readers evaluate the minimum-cost and deadline guarantees. The revised evaluation will include error bars (standard deviation over repeated runs) on all reported metrics, together with statistical significance tests comparing our algorithms against the baselines. We will also expand the experimental-setup subsection to detail the exact AWS cost-accounting method and to describe the online runtime-model update procedure. Additional controlled experiments will be added that vary input rates and inject new queries while logging model accuracy, demonstrating adaptation behavior and continued deadline compliance. Revision: yes.
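The statistical reporting promised in the rebuttal amounts to standard machinery: per-run means with standard deviations for error bars, and a two-sample test for the baseline comparisons. A minimal sketch using only the standard library (function names are illustrative; the authors may use a different test):

```python
import math
import statistics

def summarize(runs):
    # Mean and sample standard deviation, i.e. the error bar
    # for one configuration over repeated runs.
    return statistics.mean(runs), statistics.stdev(runs)

def welch_t(a, b):
    # Welch's t statistic for two independent samples of run costs
    # (unequal variances allowed); compare against a t distribution
    # to get a p-value.
    va, vb = statistics.variance(a), statistics.variance(b)
    num = statistics.mean(a) - statistics.mean(b)
    return num / math.sqrt(va / len(a) + vb / len(b))
```

Reporting the t statistic alongside the raw per-run numbers would let readers judge whether the claimed outperformance exceeds run-to-run noise.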
Circularity Check
No circularity: algorithmic schemes with external experimental validation
Full rationale
The paper proposes scheduling algorithms for elastic batched processing in a cluster, claims deadline guarantees and minimum cost under explicit assumptions (instantaneous scaling, accurate models), and validates them via implementation on Apache Spark with TPC-H and Yahoo Streaming datasets against external baselines such as fixed-node execution and Spark streaming. No derivation chain, equation, or central claim reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations; the results are presented as properties of the new algorithms tested on independent data.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Nodes can be scaled up or down with predictable cost and negligible overhead.
- domain assumption: Query processing times and input rates are known or estimable in advance.
Reference graph
Works this paper leans on
- [1] Benjamin Berg, Mor Harchol-Balter, Benjamin Moseley, Weina Wang, and Justin Whitehouse. 2020. Optimal resource allocation for elastic and inelastic jobs. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures. 75–87.
- [2] Saranya Chandrasekaran and S. Sudarshan. 2024. Scheduling of intermittent query processing. In International Database Engineered Applications Symposium. Springer, 363–376.
- [3] Dazhao Cheng, Yuan Chen, Xiaobo Zhou, Daniel Gmach, and Dejan Milojicic. Adaptive scheduling of parallel jobs in Spark Streaming. In IEEE INFOCOM. 1–9.
- [5] Stratos Dimopoulos, Chandra Krintz, and Rich Wolski. 2017. Justice: A Deadline-Aware, Fair-Share Resource Allocator for Implementing Multi-Analytics. In IEEE CLUSTER. 233–244.
- [6] Xiaofei Hou, T. K. Ashwin Kumar, Johnson P. Thomas, and Hong Liu. 2017. Dynamic deadline-constraint scheduler for Hadoop YARN. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDC...
- [7] Botong Huang, Lianggui Weng, Wei Chen, Kai Zeng, Yihui Feng, Bolin Ding, Jingren Zhou, Zuozhi Wang, and Chen Li. 2025. Agamotto: Scheduling of Deadline-Oriented Incremental Query Execution under Uncertain Resource Price. Proceedings of the VLDB Endowment 18, 6 (2025), 1852–1864.
- [8]
- [9] Zhengyu Ou, Ge Yu, Yaxin Yu, Shanshan Wu, Xiaochun Yang, and Qingxu Deng. Tick Scheduling: A Deadline Based Optimal Task Scheduling Approach for Real-Time Data Stream Systems. In Advances in Web-Age Information Management, WAIM, Vol. 3739. 725–730.
- [11] Zechao Shang et al. 2020. CrocodileDB: Efficient Database Execution through Intelligent Deferment. In Conf. on Innovative Data Systems Research, CIDR.
- [12] Subhajit Sidhanta, Wojciech Golab, and Supratik Mukhopadhyay. 2016. OptEx: A deadline-aware cost optimization model for Spark. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 193–202.
- [13] Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Botong Huang, Boyuan Zhang, Igor Belianski, Ling Geng, Matt Uhlar, et al. 2025. Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables. In Companion of the 2025 International Conference on Management of Data. 622–634.
- [14] Dixin Tang et al. 2020. CrocodileDB in Action: Resource-Efficient Query Execution by Exploiting Time Slackness. Proc. VLDB Endow. 13, 12 (2020), 2937–2940.
- [15] Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2019. Intermittent Query Processing. Proc. VLDB Endow. 12, 11 (2019), 1427–1441.
- [16] Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2020. Thrifty Query Execution via Incrementability. In ACM SIGMOD. ACM, 1241–1256.
- [17] Vinod Kumar Vavilapalli et al. 2013. Apache Hadoop YARN: yet another resource negotiator. In ACM Symposium on Cloud Computing, SOCC. 5:1–5:16. https://doi.org/10.1145/2523616.2523633
- [18] Guolu Wang, Jungang Xu, Renfeng Liu, and Shanshan Huang. 2018. A Hard Real-time Scheduler for Spark on YARN. In IEEE/ACM CCGRID. 645–652.
- [19] Zuozhi Wang et al. 2020. Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing. In ACM SIGMOD. 2797–2800.
- [20] Jie Zhu, Xiaoping Li, Rubén Ruiz, and Xiaolong Xu. 2018. Scheduling stochastic multi-stage jobs to elastic hybrid cloud resources. IEEE Transactions on Parallel and Distributed Systems 29, 6 (2018), 1401–1415.