pith. machine review for the scientific record.

arxiv: 2605.08601 · v1 · submitted 2026-05-09 · 💻 cs.DB

Recognition: no theorem link

Elastic Scheduling of Intermittent Query Processing in a Cluster Environment

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 💻 cs.DB
keywords elastic scheduling · intermittent processing · stream queries · deadline constraints · cost minimization · batch processing · cluster scaling

The pith

Scheduling intermittent batched stream processing on elastic clusters meets deadlines at minimum cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops scheduling methods for handling streams of tuples over fixed windows by processing them intermittently in batches on a cluster that can scale nodes up or down. This reduces overall resource use compared to continuous processing while still delivering results by the deadline after each window ends. The schemes also manage multiple queries running at the same time, the arrival of new queries, and changes in how fast data arrives. Implementation on Spark in a cloud environment with standard benchmarks shows the approach uses fewer resources than fixed-size clusters or standard streaming methods while still meeting all deadlines. A reader would care because many applications need timely answers from data streams without paying for resources that sit idle between batches.

Core claim

In an elastic cluster environment, scheduling the batched processing of stream windows allows all deadlines to be met while using the minimum number of nodes, and the same schedules can accommodate concurrent queries, newly arriving queries, and variations in input rates.

What carries the argument

Scheduling algorithms that decide the timing of batch processing and the number of nodes to provision or release in order to satisfy both deadline constraints and cost minimization.
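
The trade-off these algorithms navigate can be sketched in a few lines. The toy model below is not the paper's scheduler; the throughput, window, and deadline figures are invented, and it bakes in the premises of instantaneous scaling and exactly known processing rates:

```python
import math

def min_nodes(tuples_in_window, per_node_throughput,
              window_end, deadline, batch_start):
    """Fewest nodes that finish one window's tuples by window_end + deadline,
    given that processing starts at batch_start, nodes scale instantly, and
    per-node throughput is known exactly."""
    available = (window_end + deadline) - batch_start
    if available <= 0:
        raise ValueError("batch starts after the deadline")
    return math.ceil(tuples_in_window / (per_node_throughput * available))

# Deferring the batch shrinks the time left, so a late start needs a burst
# of many nodes while an early start needs only a few kept busy longer.
late = min_nodes(1_000_000, 5_000, window_end=60, deadline=30, batch_start=80)   # 20
early = min_nodes(1_000_000, 5_000, window_end=60, deadline=30, batch_start=30)  # 4
```

In node-seconds the late burst costs 20 × 10 = 200 while the early start costs 4 × 60 = 240, because four nodes stretched over a long interval sit partly idle; this rounding effect is one reason deferred batching on an elastic cluster can undercut eager processing.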

If this is right

  • All queries in the system meet their deadlines under the proposed schedules.
  • Resource usage and therefore cost stay at the lowest level needed to satisfy the deadlines.
  • Multiple concurrent queries are handled together without one query causing another to miss its deadline.
  • New queries that arrive while others are running can be integrated into the schedule on the fly.
  • Changes in input arrival rates trigger adjustments that still keep deadlines and cost targets.
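
The concurrent-query point has a simple arithmetic core: on an elastic cluster, queries whose busy periods interleave can share nodes. A minimal illustration with invented per-slot demands (not the paper's workload or algorithm):

```python
def peak_nodes(demands):
    """Compare the peak node count when queries share one elastic cluster
    (peak of the per-slot sums) against giving each query its own
    peak-sized cluster (sum of the per-query peaks)."""
    shared = max(sum(slot) for slot in zip(*demands))
    dedicated = sum(max(d) for d in demands)
    return shared, dedicated

# Two hypothetical queries whose busy slots interleave:
q1 = [4, 0, 4, 0]   # nodes q1 needs in each time slot
q2 = [0, 3, 0, 3]   # nodes q2 needs in each time slot
shared, dedicated = peak_nodes([q1, q2])   # (4, 7)
```

Sharing can never need more nodes than dedicated clusters, and when batch timings interleave it needs strictly fewer, which is the headroom a joint schedule can exploit.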

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In pay-per-use clouds the method could translate directly into lower bills by releasing nodes between batches.
  • The same scheduling logic could be adapted to other batch-oriented frameworks that support dynamic node counts.
  • If scaling is not instantaneous in practice, the minimum cost would rise and a buffer of extra nodes might be required.
  • Testing the scheduler with real measured scaling latencies would show how much the cost advantage shrinks.

Load-bearing premise

The cluster can add or remove nodes instantly and at zero cost, and processing times, costs, and input rates can be predicted accurately enough for the scheduler to guarantee deadlines without extra resources.

What would settle it

An experiment in which node scaling incurs measurable delay or cost, or in which actual processing times deviate from the model's predictions, and the proposed scheduler then misses one or more deadlines.
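
Such an experiment amounts to one check: size the cluster under the instant-scaling premise, then inject a provisioning delay and see whether the deadline survives. A hypothetical sketch with invented numbers, assuming a linear processing model:

```python
import math

def misses_deadline(work, nodes, per_node_throughput, slack, scale_latency):
    """True if a schedule sized under the instant-scaling premise misses
    its deadline once provisioning actually takes scale_latency seconds."""
    finish = scale_latency + work / (nodes * per_node_throughput)
    return finish > slack

# Size the cluster assuming zero scaling latency, then inject a delay.
work, throughput, slack = 1_000_000, 5_000, 60
nodes = math.ceil(work / (throughput * slack))   # 4 nodes, no latency budgeted
assert not misses_deadline(work, nodes, throughput, slack, scale_latency=0)
assert misses_deadline(work, nodes, throughput, slack, scale_latency=20)
```

Under this toy model a 20-second provisioning delay alone is enough to break a schedule that was exactly feasible with instant scaling, which is precisely the failure mode the falsification experiment would probe.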

Figures

Figures reproduced from arXiv: 2605.08601 by Saranya Chandrasekaran, S. Sudarshan.

Figure 1. Components of Custom Scheduler
Figure 2. TPC-Q1 processing duration for different configurations
Figure 3. TPC-Q1 processing duration (for 4500 files) for different configurations
Figure 4. Variable Input Rate Profiles and Node Requirements
Figure 5. Release of nodes experiment
read the original abstract

Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly processing tuples as they arrive significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark streaming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes scheduling schemes for batched (intermittent) processing of stream queries over windows in an elastic parallel cluster environment. The schemes dynamically scale nodes up or down to meet specified deadlines at minimum cost while handling multiple concurrent queries, new query arrivals, and input rate variations. The approach is implemented on Apache Spark in the AWS EMR environment and evaluated on TPC-H and Yahoo Streaming datasets, with results claiming significant outperformance over fixed-node configurations and Spark Streaming baselines.

Significance. If the central claims hold under realistic conditions, the work would meaningfully extend intermittent query processing to elastic cloud settings, offering practical value for cost-sensitive stream analytics. The implementation on a real system (Spark/EMR) and handling of dynamic scenarios are strengths. However, the evaluation lacks error bars, detailed cost models, and full controls, which weakens support for the minimum-cost and deadline-guarantee assertions.

major comments (2)
  1. [Abstract] The central claim that 'Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost' is load-bearing for the contribution but depends on unverified assumptions of instantaneous, cost-free node scaling and exact a-priori knowledge of per-tuple processing times, per-node costs, and future input rates. The AWS EMR implementation does not provide instantaneous scaling, yet no sensitivity analysis or robustness experiments against scaling latency or estimation errors appear in the evaluation.
  2. [Evaluation] The reported outperformance lacks error bars, statistical significance tests, and explicit controls for how costs are measured or how the scheduler obtains accurate runtime models under rate variations and new query arrivals. This makes it impossible to fully assess whether the schemes achieve true minimum cost while guaranteeing deadlines, as deviations from the assumptions would force either missed deadlines or over-provisioning.
minor comments (2)
  1. [Abstract] The abstract contains a minor grammatical issue ('ensure to meet' should be rephrased for clarity).
  2. [Evaluation] Figure captions and axis labels in the experimental results could be clarified to explicitly show cost units and deadline margins.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and indicate planned revisions to strengthen the presentation of assumptions, robustness, and evaluation rigor.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost' is load-bearing for the contribution but depends on unverified assumptions of instantaneous, cost-free node scaling and exact a-priori knowledge of per-tuple processing times, per-node costs, and future input rates. The AWS EMR implementation does not provide instantaneous scaling, yet no sensitivity analysis or robustness experiments against scaling latency or estimation errors appear in the evaluation.

    Authors: We appreciate the referee's scrutiny of the assumptions behind the central claim. Our scheduling formulation explicitly models scaling as occurring between batches with low overhead relative to deadlines, consistent with the intermittent processing model; the AWS EMR implementation uses standard APIs for scaling, and our experiments confirm deadline satisfaction in practice. We agree that explicit robustness checks are valuable. In the revision we will add a dedicated subsection with sensitivity experiments that inject scaling latencies and introduce controlled errors in processing-time and rate estimates, showing that the schemes still meet deadlines with only modest cost increases. revision: yes

  2. Referee: [Evaluation] The reported outperformance lacks error bars, statistical significance tests, and explicit controls for how costs are measured or how the scheduler obtains accurate runtime models under rate variations and new query arrivals. This makes it impossible to fully assess whether the schemes achieve true minimum cost while guaranteeing deadlines, as deviations from the assumptions would force either missed deadlines or over-provisioning.

    Authors: We concur that greater statistical transparency and clearer controls will help readers evaluate the minimum-cost and deadline guarantees. The revised evaluation will include error bars (standard deviation over repeated runs) on all reported metrics, together with statistical significance tests comparing our algorithms against the baselines. We will also expand the experimental-setup subsection to detail the exact AWS cost-accounting method and to describe the online runtime-model update procedure. Additional controlled experiments will be added that vary input rates and inject new queries while logging model accuracy, demonstrating adaptation behavior and continued deadline compliance. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic schemes with external experimental validation

full rationale

The paper proposes scheduling algorithms for elastic batched processing in a cluster, claims deadline guarantees and minimum cost under explicit assumptions (instantaneous scaling, accurate models), and validates them via implementation on Apache Spark with TPC-H and Yahoo Streaming datasets against external baselines such as fixed-node execution and Spark streaming. No derivation chain, equation, or central claim reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations; the results are presented as properties of the new algorithms tested on independent data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the approach rests on domain assumptions about elasticity and accurate cost modeling rather than explicit free parameters or new entities.

axioms (2)
  • domain assumption Nodes can be scaled up or down with predictable cost and negligible overhead
    Required for the elastic scheduling to achieve minimum cost while meeting deadlines
  • domain assumption Query processing times and input rates are known or estimable in advance
    Needed to decide batching and scaling decisions ahead of deadlines

pith-pipeline@v0.9.0 · 5452 in / 1235 out tokens · 49876 ms · 2026-05-12T01:06:49.683422+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. [1]

    Benjamin Berg, Mor Harchol-Balter, Benjamin Moseley, Weina Wang, and Justin Whitehouse. 2020. Optimal resource allocation for elastic and inelastic jobs. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures. 75–87

  2. [2]

    Saranya Chandrasekaran and S. Sudarshan. 2024. Scheduling of intermittent query processing. In International Database Engineered Applications Symposium. Springer, 363–376

  3. [3]

    Dazhao Cheng, Yuan Chen, Xiaobo Zhou, Daniel Gmach, and Dejan Milojicic. Adaptive scheduling of parallel jobs in Spark Streaming. In IEEE INFOCOM. 1–9

  4. [4]

    Stratos Dimopoulos, Chandra Krintz, and Rich Wolski. 2017. Justice: A Deadline-Aware, Fair-Share Resource Allocator for Implementing Multi-Analytics. In IEEE CLUSTER. 233–244

  5. [5]

    Xiaofei Hou, TK Ashwin Kumar, Johnson P Thomas, and Hong Liu. 2017. Dynamic deadline-constraint scheduler for Hadoop YARN. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDC...

  6. [6]

    Botong Huang, Lianggui Weng, Wei Chen, Kai Zeng, Yihui Feng, Bolin Ding, Jingren Zhou, Zuozhi Wang, and Chen Li. 2025. Agamotto: Scheduling of Deadline-Oriented Incremental Query Execution under Uncertain Resource Price. Proceedings of the VLDB Endowment 18, 6 (2025), 1852–1864

  7. [7]

    M. Islam, S. Karunasekera, and R. Buyya. 2017. dSpark: Deadline-Based Resource Allocation for Big Data Applications in Apache Spark. In IEEE International Conference on e-Science (e-Science). 89–98

  8. [8]

    Zhengyu Ou, Ge Yu, Yaxin Yu, Shanshan Wu, Xiaochun Yang, and Qingxu Deng. Tick Scheduling: A Deadline Based Optimal Task Scheduling Approach for Real-Time Data Stream Systems. In Advances in Web-Age Information Management, WAIM, Vol. 3739. 725–730

  9. [9]

    Zechao Shang et al. 2020. CrocodileDB: Efficient Database Execution through Intelligent Deferment. In Conf. on Innovative Data Systems Research, CIDR

  10. [10]

    Subhajit Sidhanta, Wojciech Golab, and Supratik Mukhopadhyay. 2016. OptEx: A deadline-aware cost optimization model for spark. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 193–202

  11. [11]

    Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Botong Huang, Boyuan Zhang, Igor Belianski, Ling Geng, Matt Uhlar, et al. 2025. Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables. In Companion of the 2025 International Conference on Management of Data. 622–634

  12. [12]

    Dixin Tang et al. 2020. CrocodileDB in Action: Resource-Efficient Query Execution by Exploiting Time Slackness. Proc. VLDB Endow. 13, 12 (2020), 2937–2940

  13. [13]

    Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2019. Intermittent Query Processing. Proc. VLDB Endow. 12, 11 (2019), 1427–1441

  14. [14]

    Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2020. Thrifty Query Execution via Incrementability. In ACM SIGMOD. ACM, 1241–1256

  15. [15]

    Vinod Kumar Vavilapalli et al. 2013. Apache Hadoop YARN: yet another resource negotiator. In ACM Symposium on Cloud Computing, SOCC. 5:1–5:16. https://doi.org/10.1145/2523616.2523633

  16. [16]

    Guolu Wang, Jungang Xu, Renfeng Liu, and Shanshan Huang. 2018. A Hard Real-time Scheduler for Spark on YARN. In IEEE/ACM CCGRID. 645–652

  17. [17]

    Zuozhi Wang et al. 2020. Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental Computing. In ACM SIGMOD. 2797–2800

  18. [18]

    Jie Zhu, Xiaoping Li, Rubén Ruiz, and Xiaolong Xu. 2018. Scheduling stochastic multi-stage jobs to elastic hybrid cloud resources. IEEE Transactions on Parallel and Distributed Systems 29, 6 (2018), 1401–1415