pith. machine review for the scientific record.

cs.DC

Distributed, Parallel, and Cluster Computing

Covers fault-tolerance, distributed algorithms, stability, parallel computation, and cluster computing. Roughly includes material in ACM Subject Classes C.1.2, C.1.4, C.2.4, D.1.3, D.4.5, D.4.7, E.1.

cs.DC 2026-05-14 Recognition

Heterogeneous solvers up to 32% faster than GPU-only for big matrices

Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL

Splitting CG and Cholesky work across CPU and GPU with SYCL beats pure-GPU code on NVIDIA, AMD and Intel systems.

Abstract
Many important real-world applications, such as System Identification with Gaussian Processes, involve solving linear systems with symmetric positive-definite matrices. The iterative CG method and direct solvers based on the Cholesky decomposition are two popular methods that can be applied in this case. Since often very large systems have to be solved when dealing with such real-world scenarios, GPUs are commonly used to accelerate the computations. However, homogeneous approaches that only leverage the GPU in the system do not take full advantage of the often powerful CPUs located in modern HPC systems. In this work, we present multi-vendor, heterogeneous implementations of the CG method and the Cholesky decomposition that leverage the CPU and GPU of a heterogeneous system simultaneously using SYCL. Furthermore, we compare their runtime behavior to traditional, homogeneous approaches. The results show that for large matrices, our heterogeneous implementation is up to 32 percent faster for the CG method and up to 29 percent faster for the Cholesky decomposition compared to the corresponding GPU-only implementations. In addition, for large matrices, our heterogeneous implementation of the Cholesky decomposition can achieve at least 12 percent faster runtimes across several systems with GPUs from NVIDIA, AMD, and Intel.
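
The paper's implementations are written in SYCL; as a rough illustration of the core idea of splitting the dominant matrix-vector product across two processors, here is a minimal NumPy sketch of conjugate gradient whose matvec is computed as two row blocks. The function names and the split point are made up for this example and are not the paper's code.

```python
import numpy as np

def split_matvec(A_top, A_bot, x):
    """Row-split matrix-vector product. In the paper's setting the two row
    blocks would be computed concurrently on the CPU and the GPU; here both
    simply run on the host to show where the split sits."""
    return np.concatenate([A_top @ x, A_bot @ x])

def cg(A, b, split_row, tol=1e-8, max_iter=1000):
    """Plain conjugate gradient with the matvec expressed as two row blocks."""
    A_top, A_bot = A[:split_row], A[split_row:]
    x = np.zeros_like(b)
    r = b - split_matvec(A_top, A_bot, x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = split_matvec(A_top, A_bot, p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

n = 200
M = np.random.rand(n, n)
A = M @ M.T + n * np.eye(n)          # symmetric positive definite test matrix
b = np.random.rand(n)
x = cg(A, b, split_row=n // 3)       # e.g. a third of the rows to the "CPU"
print(np.allclose(A @ x, b, atol=1e-5))
```
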
cs.DC 2026-05-13 2 theorems

Decoupled compression speeds GPU collectives up to 9.65x

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

Placing quantization at the interface and entropy coding inside NCCL primitives allows overlap with communication on scientific and training workloads

Abstract
Collective communication is a major bottleneck for multi-node GPU workloads in scientific computing and distributed deep learning, especially when inter-node bandwidth is limited. Although NCCL provides optimized GPU-centric collectives, large messages can still dominate end-to-end performance. Existing compression-enabled collective libraries either rely on MPI-based stacks that cannot fully exploit NCCL, omit entropy coding, or tightly couple full compressors with communication primitives, limiting compression ratio, flexibility, and communication-computation overlap. This paper presents NCCLZ, a compression-enabled GPU collective library that decouples quantization and entropy coding and integrates them at different layers of the stack. NCCLZ places quantization at the interface, embeds entropy coding into NCCL primitives, uses a lightweight device-side selector to choose coding strategies, and overlaps compression with communication to reduce exposed overhead. Experiments on scientific datasets, training gradients, and synthetic workloads show up to 9.65x speedup over NCCL and up to 3.34x improvement over prior compression-assisted collective libraries.
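
A minimal sketch of the decoupling the abstract describes, assuming a simple uniform quantizer at the "interface" and zlib as a stand-in entropy coder operating on chunks; NCCLZ's actual GPU kernels, chunk sizes, and device-side selector are not reproduced here.

```python
import zlib
import numpy as np

def quantize(x, bits=8):
    """Uniform quantization at the collective interface (illustrative only)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1) or 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Receiver-side reconstruction after decoding."""
    return q.astype(np.float32) * scale + lo

def entropy_code_chunks(q, chunk_bytes=1 << 16):
    """Entropy-code fixed-size chunks; in NCCLZ this stage lives inside the
    NCCL primitives so coding of chunk i can overlap sending of chunk i-1."""
    buf = q.tobytes()
    return [zlib.compress(buf[i:i + chunk_bytes])
            for i in range(0, len(buf), chunk_bytes)]

grad = np.random.randn(1 << 20).astype(np.float32)
q, lo, scale = quantize(grad)
chunks = entropy_code_chunks(q)
print(f"compression ratio vs fp32: {grad.nbytes / sum(len(c) for c in chunks):.2f}x")
```
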
cs.DC 2026-05-13 Recognition

Per-head adaptive blocks improve sparse attention accuracy by up to 5.43%

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

Attention heads differ in granularity needs, allowing targeted block sizing to recover accuracy lost by uniform methods without slowing down inference

Abstract
As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has shown that attention computation is dominated by a small subset of tokens. This motivates block sparse attention methods that partition the KV cache into fixed-size blocks and selectively compute attention over those blocks exhibiting high importance. However, these methods assign a uniform block size across all attention heads, implicitly assuming homogeneous behavior throughout the model. Our analysis reveals that this assumption is flawed: attention heads exhibit widely varying sensitivity to block granularity, and uniformity leads to suboptimal accuracy. We present AB-Sparse, a training-free algorithm-system co-designed framework that improves accuracy while preserving throughput. AB-Sparse introduces lightweight adaptive block size allocation across attention heads to improve accuracy. To compensate for the additional memory overhead, it further employs lossless block centroid quantization. In addition, custom GPU kernels are developed to support efficient execution with variable block sizes. Evaluation results demonstrate that AB-Sparse achieves an accuracy improvement of up to 5.43% over existing block sparse attention baselines without throughput overhead.
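
A small NumPy sketch of per-head block selection under an adaptive block size, to make the idea concrete; the scoring rule, keep ratio, and block sizes below are illustrative and are not AB-Sparse's allocation algorithm.

```python
import numpy as np

def select_blocks(scores, block_size, keep_ratio=0.25):
    """Pick the most important KV blocks for one head.
    scores: per-token importance estimates for this head's current query."""
    n = len(scores)
    n_blocks = int(np.ceil(n / block_size))
    pad = n_blocks * block_size - n
    padded = np.pad(scores, (0, pad), constant_values=-np.inf)
    block_scores = padded.reshape(n_blocks, block_size).max(axis=1)
    k = max(1, int(keep_ratio * n_blocks))
    return np.sort(np.argsort(block_scores)[-k:])

# Uniform baselines give every head the same block size; an adaptive scheme
# lets a granularity-sensitive head use small blocks and a coarse head large ones.
scores = np.random.rand(4096)
print(select_blocks(scores, block_size=16))    # fine-grained head
print(select_blocks(scores, block_size=128))   # coarse head
```
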
cs.DC 2026-05-13 2 theorems

Power capping leaves LLM decode energy untouched

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

Decode draws 137-300 W on 700 W GPUs as memory saturates first; clock locking recovers up to 32% of decode energy instead.

Abstract
Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms -- GQA, MLA, Gated DeltaNet, and Mamba2 -- on NVIDIA H200, decode draws only 137--300\,W on a 700\,W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32\% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
cs.DC 2026-05-13 2 theorems

Overlays trade reliability against overhead for AI agent discovery

Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum

Benchmarks on 4096 nodes show how Chord, Pastry, and Kademlia perform under stationary and churn conditions across edge-to-cloud environments

Abstract
Agentic systems deployed across the compute continuum need discovery mechanisms that remain effective across cloud, edge, and intermittently connected domains. In some emerging agentic architectures, decentralized discovery is already an active design direction, placing DHT-based lookup on the path toward agent directories. This paper studies the trade-offs among major structured-overlay families for agent discovery, comparing Chord, Pastry, and Kademlia as candidate indexing substrates within a shared control-plane framework. Using a benchmark subset centered on a 4096-node stationary comparison and a representative 4096-node churn benchmark, the paper characterizes how discovery reliability, startup behavior, and control-plane overhead vary across these overlays. The goal is to clarify the operating points they expose for agent discovery across edge-to-cloud environments.
cs.DC 2026-05-13 2 theorems

GraphFlash hits 127x speedup in serverless graph processing

GraphFlash: Enabling Fast and Elastic Graph Processing on Serverless Infrastructure

Shared storage and targeted optimizations let graph workloads scale without fixed clusters.

Abstract
Graph processing systems are essential for analyzing large-scale data with complex relationships, yet most existing frameworks rely on statically provisioned clusters, resulting in poor elasticity and inefficient resource utilization under dynamic workloads. Serverless computing offers automatic scaling and fine-grained billing, but existing serverless graph systems suffer from performance limitations due to inefficient state management and high communication overhead through external storage. We present GraphFlash, a fast and elastic graph processing framework built on serverless infrastructure. GraphFlash adopts a subgraph-centric programming model and leverages shared external storage for coordination and communication, enabling stateless, fine-grained function execution. It supports two execution modes: rotating mode for resource-constrained environments and pinned mode for higher performance when resources are sufficient. To address serverless limitations, GraphFlash introduces system-level optimizations, including partition-aware key aggregation, intra-function partition co-location, and superstep-aware activation. Across multiple graph algorithms and datasets, GraphFlash outperforms existing serverless-compatible systems by up to 127x in execution time and reduces resource consumption by up to 98% under higher-resource configurations, while matching the performance of traditional distributed frameworks on large workloads. Even with limited resources, it achieves up to 48x speedup and 99.97% cost reduction over prior serverless solutions, demonstrating that GraphFlash makes serverless graph processing practical and performant.
cs.DC 2026-05-13 Recognition

NAVIS speeds on-SSD vector inserts up to 2.74x

NAVIS: Concurrent Search and Update with Low Position-Seeking Overhead in On-SSD Graph-Based Vector Search

Three mechanisms cut position-seeking costs during concurrent searches and updates

Abstract
On-disk graph-based vector search (GVS) has become the dominant approach for serving large-scale vector databases at high recall, but prior systems struggle to sustain concurrent search and update throughput on high-dimensional workloads. We find the main cause of this in position seeking, a full graph traversal that every update performs to locate neighbors before linking the new vector into the graph. Position seeking is fundamentally heavier than a search query, and its cost is further amplified by two systemic limitations of current GVS systems, packed layouts that couple every edge fetch to a full vector load, and a static entrance graph whose entry points drift away from newly inserted regions as updates accumulate. We present NAVIS, an on-SSD GVS system that drives down position-seeking overhead through (i) a layout-supported selective vector read that breaks the packed-page coupling without losing its locality benefits, (ii) a dynamic lightweight entrance graph update mechanism that reuses traversal information already produced by concurrent updates, and (iii) an entrance graph-aware edgelist cache that concentrates capacity on high-reuse paths near refreshed entry points. Across multiple large-scale high-dimensional benchmarks, NAVIS enhances average insertion throughput by up to 2.74x and average concurrent search throughput by up to 1.37x while reducing average search latency by up to 25.26%.
cs.DC 2026-05-13 Recognition

Off-chain twins let DeFi agents simulate trades without waiting for blocks

State Twins: An Off-Chain Substrate for Agentic Reasoning over Decentralized Finance Protocols

Formalizing AMM pools as dynamical systems yields replicas that support instant forking and replay while bounding divergence from the live chain.

Abstract
We introduce the State Twin: a typed, in-memory, replayable replica of an on-chain automated market maker (AMM) pool that serves as a substrate for agentic reasoning over decentralized finance (DeFi) protocols. Agentic DeFi stacks today couple reasoning to chain time, since every "what if?" query incurs a new RPC read or a real transaction, so the agent's effective action space is bounded by block confirmation latency and gas. We argue this coupling is a structural problem rather than a performance one, and that the missing layer is an off-chain substrate that preserves the protocol's exact mathematics while admitting the operations on-chain state cannot: forking, replay, branching, counterfactual rollout. We formalize each AMM family (Uniswap V2, V3, Balancer, Stableswap) as a discrete-time controlled dynamical system, prove a quantitative fidelity bound on the divergence between twin and chain, and give the open architecture used in DeFiPy v2, an open-source Python toolkit that ships the State Twin substrate and a reference Model Context Protocol server exposing typed analytical primitives as LLM tools. The same primitive (i.e., one Python class, one calling pattern) serves a notebook quant, a backtest, and an LLM agent without modification. We close with a fork-and-evaluate worked example: a single live RPC read seeds N independent in-memory twins under distinct price-shock scenarios, in sub-second wall-clock time. The contribution is the substrate, not a particular agent: a specification of what an agentic DeFi substrate must look like.
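
As a rough illustration of a "twin" for the simplest AMM family, here is a constant-product (Uniswap-V2-style) pool as a small in-memory object with an exact swap step and a fork operation. The class and method names are hypothetical and are not the DeFiPy v2 API; the swap formula is the standard x*y=k rule with a basis-point fee.

```python
from copy import deepcopy
from dataclasses import dataclass

@dataclass
class ConstantProductTwin:
    """In-memory replica of a Uniswap-V2-style pool: the state is the pair of
    reserves, and swap() is the exact discrete-time update."""
    reserve_in: float
    reserve_out: float

    def swap(self, amount_in: float, fee_bps: int = 30) -> float:
        """Apply one swap; returns amount_out under the x*y=k rule with a fee."""
        amount_in_with_fee = amount_in * (10_000 - fee_bps)
        amount_out = (amount_in_with_fee * self.reserve_out) / (
            self.reserve_in * 10_000 + amount_in_with_fee)
        self.reserve_in += amount_in
        self.reserve_out -= amount_out
        return amount_out

    def fork(self) -> "ConstantProductTwin":
        """Branch a counterfactual copy without touching the 'live' state."""
        return deepcopy(self)

# One RPC read would seed the reserves; every what-if then runs on a fork.
pool = ConstantProductTwin(reserve_in=1_000_000.0, reserve_out=500.0)
for shock in (1_000.0, 10_000.0, 100_000.0):
    print(shock, pool.fork().swap(shock))
print(pool.reserve_in)   # unchanged: 1000000.0
```
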
cs.DC 2026-05-13 Recognition

Storage offloading breaks memory wall for full-graph GNN training

GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading

GriNNder achieves up to 9.78x speedup on a single GPU, with throughput comparable to distributed systems

Abstract
Full-graph training of graph neural networks (GNNs) is widely used as it enables direct validation of algorithmic improvements by preserving complete neighborhood information. However, it typically requires multiple GPUs or servers, incurring substantial hardware and inter-device communication costs. While existing single-server methods reduce infrastructure requirements, they remain constrained by GPU and host memory capacity as graph sizes increase. To address this limitation, we introduce GriNNder, which is the first work to leverage storage devices to enable full-graph training even with limited memory. Because modern NVMe SSDs offer multi-terabyte capacities and bandwidths exceeding 10 GB/s, they provide an appealing option when memory resources are scarce. Yet, directly applying storage-based methods from other domains fails to address the unique access patterns and data dependencies in full-graph GNN training. GriNNder tackles these challenges by structured storage offloading (SSO), a framework that manages the GPU-host-storage hierarchy through coordinated cache, (re)gather, and bypass mechanisms. To realize the framework, we devise (i) a partition-wise caching strategy for host memory that exploits the observation on cross-partition dependencies, (ii) a regathering strategy for gradient computation that eliminates redundant storage operations, and (iii) a lightweight partitioning scheme that mitigates the memory requirements of existing graph partitioners. In experiments performed over various models and datasets, GriNNder achieves up to 9.78x speedup over state-of-the-art baselines and throughput comparable to distributed systems, enabling previously infeasible large-scale full-graph training even on a single GPU.
cs.DC 2026-05-12 2 theorems

Chunked prefetching speeds DiT steps up to 1.28x with 49% less GPU memory

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

Communication-aware runtime hides prefetch latency behind compute and collectives on PCIe nodes, exposing a tunable memory-speed tradeoff.

Abstract
Layerwise offloading reduces the GPU memory footprint of large diffusion transformer (DiT) inference by prefetching upcoming layers from host memory, but its effectiveness hinges on hiding prefetch latency behind per-layer computation. This assumption breaks down when the per-GPU compute workload is small. Moreover, on PCIe-only nodes, prefetch and inter-GPU collective communications such as all-reduce and all-to-all contend on the shared PCIe path, exposing prefetch latency even when compute would otherwise hide it. We revisit layerwise offloading as a co-scheduling problem between prefetch and communication, guided by a first-order analytical model that predicts when prefetch can be hidden by computation. Building on this model, we design ChunkFlow, a communication-aware, chunk-granular offloading runtime that adaptively yields to collective communication and smoothly trades GPU memory for prefetch volume. On three representative diffusion transformers running on two H100 GPUs over PCIe with Ulysses sequence parallelism, ChunkFlow delivers up to 1.28x step-time speedup over SGLang's existing layerwise offloading, reduces peak GPU memory by up to 49% over the no-offload baseline at near-identical step time once the workload is large enough, and exposes a tunable memory-latency tradeoff that recovers near-zero step-time overhead in the small-workload regime.
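
A back-of-the-envelope version of the kind of first-order check the abstract mentions, with illustrative numbers only; the paper's analytical model and ChunkFlow's chunk-granular scheduling are more involved than this sketch.

```python
def prefetch_hidden(layer_bytes: float,
                    pcie_bw_gbps: float,
                    layer_compute_ms: float,
                    collective_ms: float = 0.0) -> bool:
    """Prefetching the next layer is hidden if its host-to-GPU copy fits in the
    current layer's compute time, minus the window in which the shared PCIe
    link is busy with collectives. All inputs are illustrative."""
    copy_ms = layer_bytes / (pcie_bw_gbps * 1e9) * 1e3
    return copy_ms <= max(layer_compute_ms - collective_ms, 0.0)

# A 600 MB layer over ~20 GB/s effective PCIe needs about 30 ms of compute to hide.
print(prefetch_hidden(600e6, 20, layer_compute_ms=35, collective_ms=3))   # True
print(prefetch_hidden(600e6, 20, layer_compute_ms=15, collective_ms=3))   # False
```
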
cs.DC 2026-05-12 Recognition

Chakra standardizes graph traces for AI workload benchmarking

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

A portable format encodes operations, dependencies, timing, and constraints to support simulators and hardware co-design across vendors.

Abstract
The fast pace of artificial intelligence~(AI) innovation demands an agile methodology for observation, reproduction and optimization of distributed machine learning~(ML) workload behavior in production AI systems and enables efficient software-hardware~(SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace~(ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra ETs collected on production AI clusters and demonstrate value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including but not limited to NVIDIA, AMD, Meta, Keysight, HPE, and Scala, to name a few.
cs.DC 2026-05-12 Recognition

Directed graphs support Byzantine consensus only under specific connectivity

Byzantine Consensus in Directed Graphs with Message Authentication

With message authentication, exact agreement works in synchronous cases and approximate agreement in asynchronous cases precisely when the directed graph satisfies the identified connectivity conditions.

Abstract
We consider the problem of reaching consensus in communication networks that are modeled by directed graphs. We assume the existence of a message authentication mechanism (such as digital signatures) to verify the integrity of messages. We identify the necessary and sufficient conditions on the directed communication graph for the following problems to be solvable: (i) exact consensus in synchronous systems; and (ii) approximate consensus in asynchronous systems.
cs.DC 2026-05-12 2 theorems

ReCoVer preserves exact training trajectory after GPU losses

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

Constant microbatch counts per iteration deliver 2.23x higher throughput than restart methods when 256 GPUs fail during the run.

Abstract
Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) fault-tolerant collectives that isolate faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) a versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks with up to 512 GPUs; ReCoVer successfully preserves the training trajectory of a failure-free reference despite 256 GPU losses spread across the run. Compared with checkpoint-and-restart baselines, ReCoVer demonstrates $2.23\times$ higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as training continues.
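
The stated invariant, keeping the per-iteration microbatch count constant, can be illustrated in a few lines: hand the failed replicas' quota to the survivors. This is a sketch of quota redistribution under that invariant, not ReCoVer's scheduler.

```python
def redistribute_microbatches(total_microbatches: int, survivors: list[int]) -> dict[int, int]:
    """Keep the per-iteration microbatch count constant by handing the failed
    replicas' quota to the survivors; remainders go to the first few survivors."""
    base, rem = divmod(total_microbatches, len(survivors))
    return {rank: base + (1 if i < rem else 0)
            for i, rank in enumerate(survivors)}

# 64 microbatches over 8 data-parallel replicas; two replicas fail mid-run.
print(redistribute_microbatches(64, survivors=list(range(8))))       # 8 each
print(redistribute_microbatches(64, survivors=[0, 1, 2, 4, 6, 7]))   # 11,11,11,11,10,10
```
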
cs.DC 2026-05-12 2 theorems

ShardTensor scales SciML to arbitrary spatial resolutions

ShardTensor: Domain Parallelism for Scientific Machine Learning

Domain parallelism shards input data so models can train and infer on larger problems without hardware caps or accuracy loss.

Abstract
Scientific Machine Learning (SciML) faces unique challenges for extreme-resolution data, with mitigations that often fail to scale or degrade the accuracy of trained models. While some specialized methods have achieved remarkable results in training models or performing inference on massive spatial datasets with bespoke techniques, there is no generalized framework for parallelization over input data below batch size one per device. In this work we introduce ShardTensor: a novel paradigm of domain parallelism that enables flexible scaling of input data to arbitrary sizes. By decoupling the spatial dimensionality of input data from hardware constraints, ShardTensor enables scientific machine learning workloads to reach new levels of high fidelity training and inference. We demonstrate both strong and weak scaling of workloads during training and inference, showing improved latency with strong scaling and demonstrating the capacity to process higher data sizes with weak scaling. Additionally, we demonstrate multiple dimensions of parallelization, removing barriers to SciML on extreme-scale inputs.
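
A minimal NumPy illustration of domain parallelism for one spatial axis: split a field across hypothetical devices and exchange one halo row so a local stencil can run per shard. This sketches the concept only and is not the ShardTensor API.

```python
import numpy as np

def shard_with_halo(field, n_shards, halo=1):
    """Split one spatial axis across 'devices' and attach halo rows from the
    neighbours, so each shard can apply a local stencil without seeing the
    whole field."""
    chunks = np.array_split(field, n_shards, axis=0)
    padded = []
    for i, c in enumerate(chunks):
        top = chunks[i - 1][-halo:] if i > 0 else np.zeros((halo,) + c.shape[1:])
        bot = chunks[i + 1][:halo] if i < n_shards - 1 else np.zeros((halo,) + c.shape[1:])
        padded.append(np.concatenate([top, c, bot], axis=0))
    return padded

def local_stencil(shard):
    """3-point average along the sharded axis; the slicing drops the halo rows."""
    return (shard[:-2] + shard[1:-1] + shard[2:]) / 3.0

field = np.random.rand(1024, 256)                 # one spatial axis, one feature axis
shards = shard_with_halo(field, n_shards=4)       # 4 hypothetical devices
result = np.concatenate([local_stencil(s) for s in shards], axis=0)
print(result.shape)                               # (1024, 256)
```
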
cs.DC 2026-05-12 Recognition

GCC 15 outperforms LLVM 21 in four of six RISC-V vector apps

Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors

Microbenchmarks on real hardware show predication and stride loads limit current compilers, with GCC ahead except in matrix multiplies.

Abstract
The RISC-V Vector Extension~(RVV) is a cornerstone for supporting compute throughput in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV~1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware. Leveraging the assembly benchmarks, we find that predication overhead and stride loads pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC~15 and LLVM~21 autovectorization in HPC and ML proxy applications. GCC~15 outperforms LLVM~21 in four out of six applications. LLVM~21 only outperforms GCC~15 in SGEMM and DGEMM, driven by more aggressive instruction reduction confirmed through validated \texttt{perf} counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to optimal. To study RVV support for a product-level application, we also evaluate the state-vector quantum simulator, Google's Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing the immaturity of current RVV compilers for complicated memory access patterns.
cs.DC 2026-05-12 3 theorems

Edge micro-agent fixes failures safely with no destructive actions

An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum

Causal graphs and uncertainty checks let it repair at 62% accuracy in 3ms or escalate when unsure, preventing harm in ambiguous edge faults.

Abstract
Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either due to a lack of causal awareness or acting under high epistemic uncertainty, risking destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault's Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier. Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate, while maintaining 62.0% repair accuracy and a 3ms mean time to repair.
cs.DC 2026-05-12 2 theorems

Mutable membership lets MoE survive rank faults without restarts

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

Targeted repairs for reachability and coverage replace full-instance downtime with short pauses that restore near-normal speed in under a minute.

Abstract
Mixture-of-Experts (MoE) serving relies on wide expert parallelism (EP) to aggregate the memory capacity and bandwidth of many GPUs within one inference instance. This efficiency comes with a systems cost: every decoding step depends on token dispatch and combination across all active EP ranks, so even one rank failure can disrupt the entire service. Existing EP stacks handle such failures poorly because they treat membership as a fixed configuration established at initialization. The same rank set determines communicator state, expert placement, and the routing metadata baked into CUDA execution graphs, leaving the system with no way to shrink around a failure while keeping the instance valid. This paper argues that partial-failure tolerance should instead be formulated as a live EP validity problem. We present EEP, a communication and runtime substrate that represents membership as explicit, mutable runtime state. EEP repairs the specific state invalidated by a fault: it restores peer reachability without rebuilding the communication substrate, repairs lost expert coverage through a bandwidth-aware hierarchy, and reintegrates repaired ranks without forcing healthy ranks to recapture their CUDA graphs. We implement EEP in an EP serving stack integrated with SGLang and evaluate it under steady-state serving, failure recovery, and rank reintegration. The results show that explicit mutable membership preserves the steady-state fast path, staying within 4.4% of a fixed-membership DeepEP baseline under static serving, while turning a local rank fault from whole-instance downtime into two bounded interruptions. On a single-rank failure workload, EEP incurs an 11s recovery pause and an 8s reintegration pause, and restores throughput to within 95% of the pre-fault level within 52s, whereas a fixed-membership full-restart baseline remains unavailable until 348s.
cs.DC 2026-05-12 Recognition

Maestro cuts GPU use by 40% for compound LLM training

Accelerating Compound LLM Training Workloads with Maestro

Section graphs let each workload part pick its own settings; wavefront scheduling overlaps execution despite input-driven changes.

Abstract
Compound LLM training workloads, such as knowledge distillation and multimodal LLM (MLLM) training, are gaining prominence. These typically comprise heterogeneous components differing in parameter scale, execution mode (forward-only or full forward-backward), and sequence length. Moreover, component activation can be data-dependent: in MLLM training, modality-specific parts activate only when inputs contain corresponding modalities, causing dynamic computational paths and irregular runtime workloads. Conventional frameworks, designed for monolithic models, cannot handle the dual heterogeneity: static (across components) and dynamic (runtime). By enforcing one-size-fits-all training configurations across components and ignoring input-induced variations, they suffer suboptimal throughput and poor GPU utilization. In this paper, we introduce Maestro, a section-centric training framework that addresses both challenges. Maestro first restructures the workload into a coarse-grained section graph. Each section independently configures its parallelism strategy, micro-batch size, and data-parallel degree, enabling fine-grained, component-aware resource allocation to tackle static heterogeneity. To tackle runtime irregularity, Maestro introduces a wavefront scheduling algorithm that dynamically reorders input samples to orchestrate concurrent section execution while preserving cross-section data dependencies. This maximizes inter-section parallelism and minimizes stalls, boosting hardware utilization. Deployed in production for millions of GPU hours, Maestro reduces GPU consumption by ~40% on key workloads, including knowledge distillation and MLLM training, validating its real-world impact.
cs.DC 2026-05-12 Recognition

BitTorrent warm-up keeps FL source attribution near random guessing

Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning

FLTorrent keeps attribution near neighborhood levels with 6-10% overhead for large models over 100-500 peers.

Abstract
Traditional federated learning (FL) relies on a central aggregator server, which can create performance bottlenecks and privacy risks. Decentralized mix-and-forward designs remove the server, but repeated local mixing can attenuate global information under heterogeneity and exposes peer-to-peer neighborhoods as a privacy attack surface. To preserve FedAvg-style aggregation semantics (over updates reconstructable by the round deadline) while scaling dissemination, we present FLTorrent, a BitTorrent-based dissemination layer for serverless FL with a short warm-up. Warm-up hardens within-round source unlinkability -- a dissemination-layer goal orthogonal to content protections (e.g., DP or secure aggregation) -- via (i) pre-round obfuscation, (ii) randomized lags, and (iii) coordination-only non-owner-first scheduling (tracker off the data path), before switching to vanilla BitTorrent swarming. We upper-bound the per-transfer attribution posterior by the fraction of owner chunks in a sender's eligible cover set, and derive a tighter high-probability bound that improves with early non-owner mass. A simple heuristic, GreedyFastestFirst, attains approximately 92% of a bandwidth-optimal max-flow upper bound, while warm-up remains a stable approximately 12% share of a round across 100--500 peers. Under an observation-only local adversary, FLTorrent drives attribution success close to neighborhood-level random guessing for typical nodes, improves with network size, and remains robust under collusion. In LLM-scale stress tests (Gemma-7B, DeepSeek-R1-14B, Qwen2.5-32B, and Llama-3.3-70B) over 7--10 Gbps access links, FLTorrent adds only approximately 6--10% end-to-end overhead relative to BitTorrent-only. Overall, FLTorrent shows that within-round unlinkability and BitTorrent-level efficiency can co-exist with predictable, low overheads at scale.
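
The stated attribution bound is easy to transcribe directly; the helper below is a literal reading of it with hypothetical variable names.

```python
def attribution_bound(owner_chunks_in_cover: int, cover_set_size: int) -> float:
    """Upper bound on the adversary's per-transfer attribution posterior: the
    fraction of the sender's eligible cover set made up of the owner's chunks."""
    if cover_set_size == 0:
        return 1.0
    return owner_chunks_in_cover / cover_set_size

# Before warm-up a sender may only hold its own chunks; as non-owner chunks
# arrive, the bound drops toward neighbourhood-level guessing.
print(attribution_bound(owner_chunks_in_cover=20, cover_set_size=20))    # 1.0
print(attribution_bound(owner_chunks_in_cover=20, cover_set_size=400))   # 0.05
```
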
cs.DC 2026-05-12 2 theorems

Hierarchical RL cuts edge latency 28 percent while saving energy

HiRL: Hierarchical Reinforcement Learning for Coordinated Resource Management in Heterogeneous Edge Computing

It splits power decisions from task placement and coordinates them through queue states to handle mobility and varying loads.

Abstract
Edge computing faces unprecedented resource orchestration challenges from multi-dimensional heterogeneity across device architectures, diverse task requirements in CPU-intensive, GPU-intensive, I/O-intensive, and dynamic network conditions. The edge environments demand real-time task processing within strict energy budgets, yet conventional approaches struggle with mixed continuous-discrete optimization while meeting deadline and energy constraints. This paper presents HiRL, a hierarchical reinforcement learning framework that decomposes complex resource orchestration into coordinated power control and task allocation decisions. Our approach separates continuous power management using the Twin Delayed Deep Deterministic Policy Gradient (TD3) and discrete task placement using Double Deep Q-Network (DDQN), unified through a coordination engine with five-dimensional queue state representation. We propose a heterogeneous assessment of resource compatibility with deadline-oriented prioritization and failure-penalized adaptive sampling to enhance decision quality under resource constraints. To improve practical applicability, the framework models comprehensive system dynamics including device mobility, queue congestion patterns, infrastructure heterogeneity, and priority-sensitive scheduling demands. Experimental results show that HiRL achieves effective latency-energy trade-offs with 28% latency reduction compared to Single-DDQN and maintains nearly 100% task completion rates under all load conditions. Compared to baseline algorithms, HiRL reduces energy consumption by up to 51% under low load while achieving 24% better latency performance than static optimization approaches under high load, establishing effective resource orchestration in heterogeneous edge environments.
cs.DC 2026-05-12 Recognition

CPU radix sort cuts bandwidth use by 6x on large data

FractalSortCPU: Bandwidth-Efficient Compressed Radix Sort on CPU

Compressed histograms and fully parallel key updates remove bucketing and pre-processing for 512 MB to 32 GB datasets.

Abstract
Cloud database systems, particularly their middleware and query execution layers, use sorting as a core operation in query processing, indexing and join execution. Distribution-dependence and limited parallelism are key issues inherent in state-of-the-art radix sort, which is preferred for large datasets due to performance advantages over comparison-based algorithms. Multi-pass bucketing, stochastic sampling and dependence graph structures are common solutions to these problems that incur the cost of data pre-processing and an increased memory footprint; hence, they are less appropriate for large-scale workloads common in cloud environments. In-place radix sort schemes increase the number of passes as precision increases, which negatively impacts latency. Our work solves these problems by introducing a CPU-adapted histogram compression scheme for radix sorting of arbitrary-precision keys implemented on the CPU for increased accessibility, providing state-of-the-art execution time, while limiting histogram growth. Fully parallel key-based histogram updates eliminate the need for input bucketing and data pre-processing, further lowering latency, mitigating distribution-dependence and reducing complexity. With a parallelized sorting architecture utilizing SIMD-accelerated operations for low latency, the algorithm demonstrates improvement over the state-of-the-art on the CPU, GPU, and FPGA by 6x, 3x and 2.5x in bandwidth efficiency on 512MB to 32GB data sets at 16-bit precision.
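
For orientation, this is the textbook histogram-based LSD radix pass that such schemes start from, in plain NumPy; FractalSortCPU's compressed, fully parallel histogram is the paper's contribution and is not reproduced here.

```python
import numpy as np

def radix_pass(keys, shift, bits=8):
    """One LSD radix pass: build a histogram of the current digit, turn it into
    an exclusive prefix sum, then scatter keys to their buckets (stable)."""
    radix = 1 << bits
    digits = ((keys >> shift) & (radix - 1)).astype(np.int64)
    hist = np.bincount(digits, minlength=radix)
    offsets = np.cumsum(hist) - hist            # exclusive prefix sum
    out = np.empty_like(keys)
    pos = offsets.copy()
    for k, d in zip(keys, digits):
        out[pos[d]] = k
        pos[d] += 1
    return out

keys = np.random.randint(0, 1 << 16, size=1000, dtype=np.uint32)
for shift in (0, 8):                            # 16-bit keys, two 8-bit passes
    keys = radix_pass(keys, shift)
print(bool(np.all(keys[:-1] <= keys[1:])))      # True
```
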
cs.DC 2026-05-12 Recognition

Amortized protocol makes async BRB messages linear in size

Amortized Asynchronous Byzantine Reliable Broadcast with Optimal Resilience

After initial rounds build guarantees, each new broadcast needs one round and O(n |m|) total cost while keeping n/3 fault tolerance.

Abstract
Byzantine Reliable Broadcast (BRB) is a fundamental primitive in distributed computing and cryptographic systems. Reducing the communication complexity of BRB protocols remains an important research direction. However, most work focuses on synchronous networks, with limited attention to the more challenging setting of network \textit{asynchrony}. Achieving sub-quadratic communication for asynchronous BRB typically requires probabilistic approaches that sacrifice optimal $f=\frac{n}{3}$ resilience. In this work, we present a multi-shot BRB algorithm for asynchronous networks that maintains optimal resilience through an underutilized technique: \textit{amortization}. Our protocol structures BRB across multiple rounds, where each round provides incremental additive guarantees. Once these initial rounds complete, each subsequent BRB instance requires only a single additional round. This amortization strategy achieves asymptotic optimal $O(n|m|)$ message complexity when messages are sufficiently large, with $\Omega(n)$ round complexity in the worst case. Under favorable conditions, an optimistic delivery path reduces the round complexity to $\Omega(1)$.
cs.DC 2026-05-12 Recognition

Vehicle screening plus federated segmentation cuts pothole data volume

Edge-Cloud Collaborative Pothole Detection via Onboard Event Screening and Federated Temporal Segmentation

Onboard filters send only candidate events while a cloud model marks exact pothole boundaries across distributed vehicle fleets.

Abstract
Road potholes threaten driving safety and increase infrastructure maintenance costs, while large-scale and timely pothole detection remains challenging in urban road networks. Vehicle-mounted vibration sensing offers a low-cost and scalable solution, however, continuous transmission of raw acceleration streams causes high communication overhead. Also, vibration patterns induced by potholes are often confused with those caused by manholes, speed bumps, and other local road structures. To address these challenges, this paper proposes an edge-cloud collaborative pothole detection framework based on onboard vibration event screening and federated temporal segmentation. At the vehicle side, a Gaussian Mixture Model (GMM)-based module adaptively models background vibration and screens candidate abnormal events from continuous acceleration streams. The onboard module acts as a lightweight high-recall filter and uploads only compact candidate event segments with their contextual information. At the server side, pothole detection is formulated as a point-wise temporal segmentation task. A 1D Attention U-Net is developed to distinguish potholes from vibration-similar road events by capturing multi-scale temporal features and preserving event boundary information. Furthermore, the model is trained under a federated learning framework to exploit distributed multi-vehicle data while accommodating non-IID vehicle data distributions. Experiments on multi-vehicle vibration sensing data show that the proposed framework reduces unnecessary data transmission from smooth road segments and improves fine-grained pothole detection under both centralized and federated settings.
cs.DC 2026-05-12 2 theorems

Brokerless data plane delivers consistent batches for AI training

Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training

Object-store approach adds atomic visibility and recovery while exceeding Kafka in speed and isolation on large workloads.

Abstract
Modern Large Foundation Model (LFM) training has transformed the data pipeline from a static ingestion layer into a dynamic component that must co-evolve with the training process. Existing systems are ill-equipped: colocated dataloaders offer no failure isolation, while message queue-based disaggregated dataloaders operate on a record/offset abstraction that cannot express the batch-level semantics required by distributed training. We present Lakestream, a brokerless, object-store-native training data plane with three key properties. First, it introduces the Transactional Global Batch (TGB), which builds on lakehouse-style ACID storage semantics and extends them with training-specific consistency, including atomic all-rank batch visibility, a globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. Second, it realizes recovery and retention directly in the storage layer, by inlining producer state in the manifest and tying reclamation to distributed checkpoint state. Third, its Decentralized Adaptive Commit (DAC) algorithm sustains stable ingestion throughput as the manifest grows, without any inter-producer communication. Evaluations on large-scale multimodal pre-training and SFT workloads using 64 GPUs show that Lakestream outperforms colocated dataloader throughput while providing full failure isolation, outperforms Apache Kafka in ingestion throughput, and achieves lower consumer read latency than Kafka.
cs.DC 2026-05-12 2 theorems

Ordered agents let population protocols recognize unambiguous star-free languages

Population Protocols over Ordered Agents

The immediate-observation variant restricted to order predicates exactly matches this language class with links to logic and automata.

Abstract
Population protocols are a distributed computation model in which a collection of anonymous, finite-state agents interact in randomly chosen pairs and update their states according to a fixed transition function. The computation is defined by the eventual stabilization of the population to a consensus that represents the output. In practice, it is natural to allow each agent to carry a unique identifier and compare it with that of another agent before interacting. We model this extension by having agents be totally ordered and interactions between two agents to be fireable only if their pair of identifiers falls in some condition set. For instance, $\mathsf{PP}[<]$ allows for two agents to interact only if the first one appears before the second one. We study population protocols over ordered agents $\mathsf{PP}[N]$ where $N$ is a set of predicates available to restrict transition firing. We also study $\textsf{IO-PP}[N]$, the immediate observation fragment of $\mathsf{PP}[N]$ where only one agent changes state per interaction. Our main result is that $\textsf{IO-PP}[<]$ recognizes exactly the unambiguous star-free languages, which admits many other characterizations, such as two-variable first-order logic or two-way deterministic partially-ordered automata. We also provide a logic and an automaton model that fits in $\mathsf{PP}[<]$. We further show that if the successor predicate appears in a set $N$ of $\mathsf{NSPACE}(n)$-computable predicates, then $\textsf{IO-PP}[N]=\mathsf{PP}[N]=\mathsf{NSPACE}(n)$. Finally, we investigate the problem of deciding whether a given population protocol always stabilizes to a consensus. While this problem is decidable for unordered population protocols, we show that this is undecidable already for $\mathsf{PP}[<]$ and $\textsf{IO-PP}[+1]$, but conditionally decidable for $\textsf{IO-PP}[<]$.
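
A toy simulation may help fix the model: agents carry identifiers, an interaction fires only if the observed agent precedes the observer, and only the observer changes state (immediate observation). The update rule below is an illustration chosen for simplicity, not one of the paper's constructions.

```python
import random

def io_pp_less_than(inputs, steps=20_000, seed=0):
    """Agents are indexed 0..n-1 in order; agent j may observe agent i only if
    i < j, and only the observer updates (immediate observation). The rule
    'copy the observed state' makes every agent converge to the first agent's
    input under a fair random scheduler."""
    state = list(inputs)
    rng = random.Random(seed)
    n = len(state)
    for _ in range(steps):
        j = rng.randrange(1, n)      # observer
        i = rng.randrange(0, j)      # observed agent, guaranteed i < j
        state[j] = state[i]          # the observed agent never changes
    return state

print(io_pp_less_than(["b", "a", "a", "c", "a", "b"]))
# With enough steps every agent reports "b", the first agent's input, an
# outcome that only makes sense because the agents are ordered.
```
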
cs.DC 2026-05-11 3 theorems

Cascade labels 8.6M orbital sequences for anomaly detection

Multi-Tier Labeling and Physics-Informed Learning for Orbital Anomaly Detection at Scale

Transformer trained on 232M TLE records reaches 55% maneuver and 63% decay recall as a high-recall triage tool

Abstract
Detecting orbital anomalies, such as maneuvers, atmospheric decay, and attitude upsets, across the rapidly growing population of low-Earth-orbit (LEO) satellites is a prerequisite for collision avoidance, decay forecasting, and conjunction screening. The bottleneck is not modeling capacity but labels: there is no public ground-truth corpus of orbital anomalies, manual review does not scale to approximately 10^4 active satellites, and pure rule-based detectors trade recall for precision so aggressively that they are blind to most behavioral anomalies. We present a multi-tier labeling cascade that composes three weak supervision sources of increasing fidelity: a fast physics rule set (rule_v1), an Interacting Multiple Model Unscented Kalman Filter (IMM-UKF) bank, and a supplemental-element calibration step (supGP), to produce labels at a scale unavailable from any single source. Applied to 232M Two-Line Element (TLE) records spanning 60 years, the cascade yields 8.6M labeled sequences of length 50 (430M timesteps) over 11 features that include explicit time encoding and full mean-element state. On overlapping satellites, IMM-UKF surfaces 42.6x more anomalies than rule_v1 alone. We train a 6.5M-parameter Transformer in two stages, achieving a maneuver recall of 55.4% and decay recall of 62.8% on a held-out test set. An ablation on the time-delta feature alone yields a 107% relative improvement in decay recall. We frame the resulting model as a high-recall triage classifier whose role is to surface candidate events for downstream filtering, not to issue final attributions, and discuss the path toward a Neural-ODE-based orbital world model.
cs.DC 2026-05-11 Recognition

Cloud trace decomposition predicts performance at 2% error

Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study

Hybrid and automatic methods uncover hidden weekly and quarterly cycles in serverless data, cutting latency variability by 60% on AWS.

Abstract
Cloud performance fluctuates due to factors such as resource contention and workload changes. These factors can be short-term, seasonal, or long-term. Their effects are often intertwined in performance traces, making performance management difficult. Prior work on cloud performance engineering used time-series decomposition to separate these factors. However, existing approaches rely on basic decomposition methods that may miss key variation patterns and fail on traces with complex or intermittent patterns, limiting their usefulness across diverse cloud deployments. To address this limitation, we propose two time-series decomposition techniques for cloud performance engineering: a hybrid/manual method and a fully automatic method. Through a case study of 11 serverless functions, we show that both approaches can successfully and consistently reveal trends and seasonal cycles, such as weekly and quarterly patterns, which are otherwise obscured. As an evaluation and application of the decomposition, we used the decomposed components to predict future performance, yielding mean absolute percentage error (MAPE) values of only 1.8\% (hybrid) and 2.1\% (automatic), significantly outperforming basic time-series methods and deep learning. We further show that decomposition insights can guide practical resource allocation. Using decomposition-informed scaling on AWS, we reduced latency variability by over 60\% and maximum latency by 10\%. Similar experiments on benchmarks on AWS confirmed that seasonal patterns and performance gains generalize beyond our case study. Notably, our findings demonstrate that even a single performance trace contains rich actionable information for guiding cloud management decisions.
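
As a generic illustration of the decomposition idea (not the paper's hybrid or automatic method), an off-the-shelf STL decomposition already separates a trend and a weekly cycle and supports a naive component-wise forecast:

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

# Synthetic daily latency trace: slow drift + weekly cycle + noise.
rng = np.random.default_rng(0)
days = np.arange(365)
trace = 100 + 0.05 * days + 8 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 2, 365)

# STL separates trend and weekly seasonality; the decomposed components can
# then drive a simple component-wise forecast for the coming week.
res = STL(trace, period=7, robust=True).fit()
forecast_next_week = res.trend[-1] + res.seasonal[-7:]
print(np.round(forecast_next_week, 1))
```
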
cs.DC 2026-05-11 Recognition

Adaptive DNN splits cut energy by 27-36% on real edge-cloud hardware

Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum

Periodic network checks let layers move between Raspberry Pi, laptop, and cloud to beat fixed partitions on latency as well as energy

Abstract
In recent years, the use of artificial intelligence on resource-constrained IoT devices has grown significantly. However, existing approaches to DNN partitioning and offloading across the edge-cloud continuum typically rely on static methods that ignore runtime dynamics. Furthermore, they are often evaluated in simulated environments rather than on real hardware. To address this gap, we propose a framework that dynamically splits neural network layers across the heterogeneous continuum. The framework profiles the model at startup, measures network link conditions between nodes, and periodically re-evaluates the partition to adapt to environmental changes. We created a physical testbed comprising a Raspberry Pi as the edge device, a laptop as the fog node, and a high-performance desktop PC as the cloud. We evaluated the framework over three widely adopted convolutional neural networks: VGG16, AlexNet, and MobileNetV2. Our results show that the framework achieves reductions in energy and end-to-end latency of 27.09--35.82% and 6.34--22.92%, respectively, compared to a static partitioning baseline. These findings confirm the superiority of adaptive over static partitioning.
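
A minimal sketch of the kind of split-point re-evaluation such a framework performs: exhaustively score each layer boundary from profiled per-layer times and the measured link bandwidth. The numbers and the single-split assumption are illustrative, not the paper's exact cost model.

```python
def best_split(edge_ms, cloud_ms, act_bytes, link_bps):
    """Layers [0, k) run on the edge device, layers [k, n) in the cloud, and
    act_bytes[k] is whatever must cross the link at that point (act_bytes[0]
    is the raw input size, act_bytes[n] the final output size)."""
    n = len(edge_ms)
    candidates = []
    for k in range(n + 1):
        transfer_ms = act_bytes[k] * 8 / link_bps * 1e3
        candidates.append((sum(edge_ms[:k]) + transfer_ms + sum(cloud_ms[k:]), k))
    return min(candidates)

edge_ms   = [12, 15, 30, 45, 60]                        # per-layer ms on the Pi
cloud_ms  = [1, 1, 2, 3, 4]                             # per-layer ms on the desktop
act_bytes = [600_000, 200_000, 150_000, 80_000, 40_000, 0]
best_ms, k = best_split(edge_ms, cloud_ms, act_bytes, link_bps=50e6)
print(k, round(best_ms, 1))   # 1 54.0 -> keep one layer on the edge for this profile
```

The search would be re-run whenever the periodic link measurement changes, which is the adaptivity the abstract describes.
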
cs.DC 2026-05-11 Recognition

Air quality sensors detect cooking at 99.68 percent accuracy on-device

PoHAR: Understanding Hyperlocal Human Activities with Pollution Sensor Networks

Low-cost pollution monitors share data conflict-free, cluster affected nodes without labels, and classify hyperlocal activities locally with microsecond-level latency.

Abstract
Low-cost air quality sensors are becoming ubiquitous in our daily lives as public awareness of air pollution continues to grow, and people take measures to monitor and improve the air they breathe indoors. Besides the standard operation of these sensors, fluctuations in environmental parameters can be leveraged to understand human behavior and activities in indoor spaces. Unlike traditional audio-visual, Radio Frequency, and inertial sensors, air quality sensors are easily scalable to a household, privacy-preserving, and more economical. Such distributed sensor networks must jointly make decisions to monitor indoor occupants for downstream smart home and healthcare applications. However, due to low processing power, memory, and energy, they often struggle to maintain distributed data consensus and identify activity-affected sensor groups for accurate on-device inference. In this paper, we propose the PoHAR framework, which implements: (i) a conflict-free replicated data primitive for data sharing, (ii) a hierarchical clustering scheme for ESP32 devices to detect activity-affected sensor groups with a self-supervised distance metric, and (iii) a leader-based group inference with off-the-shelf ML classifiers, enabling the sensor network to collaboratively detect hyperlocal indoor activities. Our extensive experiments demonstrate on-device activity detection, achieving 97.41% accuracy for indoor activity and 99.68% for cooking activity, using off-the-shelf ML models with latency below 34 microseconds.
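
A generic example of a conflict-free replicated primitive of the sort the framework relies on, here a last-writer-wins map whose merge is order-insensitive; PoHAR's actual ESP32 primitive is only described at this level in the abstract.

```python
def merge_lww(local: dict, remote: dict) -> dict:
    """State-based last-writer-wins map: each key holds (timestamp, value) and
    merge keeps the newer entry, so replicas converge regardless of how many
    times or in what order they exchange state."""
    merged = dict(local)
    for key, (ts, val) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged

a = {"kitchen/pm2.5": (1700000020, 41.0), "hall/co2": (1700000010, 620)}
b = {"kitchen/pm2.5": (1700000025, 55.5), "hall/voc": (1700000012, 0.3)}
print(merge_lww(a, b) == merge_lww(b, a))   # True (no timestamp ties here)
```
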
cs.DC 2026-05-11 2 theorems

ATLAS cuts GNN inference time 12-30x for billion-edge graphs

ATLAS: Efficient Out-of-Core Inference for Billion-Scale Graph Neural Networks

Broadcast-based streaming replaces gather operations to handle graphs and features too large for memory using sequential disk reads.

Abstract
Graph Neural Network (GNN) inference on billion-scale graphs is critical for domains like fintech and recommendation systems. Full-graph inference on these large graphs can be challenging due to high communication costs in distributed settings and high I/O costs in disk-backed Out-of-Core (OOC) settings. Existing OOC systems, operating across disk and memory, primarily focus on GNN training and perform poorly for full-graph inference due to massive read amplification, irregular I/O, and memory pressure. We present ATLAS, a disk-based GNN inference framework that enables efficient full-graph, layer-wise inference on graphs whose topologies, features and intermediate embeddings exceed the available memory on single machines. ATLAS replaces gather-based execution with a broadcast-based model that enables sequential, single-pass streaming reads of features and embeddings per layer. A tiered memory-disk hierarchy with minimum-pending-message eviction, graph reordering and a GPU-accelerated pipeline sustains high throughput within $128$ GiB RAM and $2$ TiB SSD. Across out-of-core graphs with up to $4$B edges and $550$ GiB features and multiple GNN architectures, ATLAS improves end-to-end inference time by $\approx12$--$30\times$ over State-of-the-Art (SOTA) OOC baselines on a single workstation, while remaining within $\approx5\%$ when features fit in memory.
0
0
cs.DC 2026-05-11 2 theorems

Multi-metric detection catches all GPU failures in 504-GPU LLM run

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

63-node B200 cluster data shows combined signals detect issues early with low false positives while NFS saturation limits checkpoint speed

Figure from the paper full image
abstract click to expand
Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we perform three quantitative analyses yielding four findings. First, statistical analysis over 751 Prometheus metrics and 10 XID-identified GPU failures achieves a 10/10 detection rate (2/10 pre-XID) at ~0.84 false positives per day. No single metric is consistently dominant across failure types, motivating a multi-signal detection strategy. Second, profiling 523 checkpoint events along the GPU VRAM to NFS path attributes the "bandwidth paradox" (1.4-10.4% utilization of 200 Gbps RoCE) to saturation of the 128-slot NFS RPC layer. Third, multi-node failure response shows concentrated exclusions (top 3 of 63 nodes account for >50% of all exclusions) and an auto-retry chain success rate of 33.3% over 12 chains (73 attempts), 2.7x the 12.5% manual recovery rate; the median retry interval is 11 min (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.
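Since no single metric dominates, the report argues for combining signals. One simple way to do that, shown purely as an illustration rather than the paper's detector, is a rolling z-score per metric with a vote threshold:

```python
import numpy as np

def multi_signal_alerts(metrics, window=30, z_thresh=4.0, min_votes=2):
    """metrics: (time, num_metrics) array of Prometheus-style samples.
    Flags a timestep when at least `min_votes` metrics deviate strongly
    from their trailing window (illustrative, not the paper's detector)."""
    t_len, _ = metrics.shape
    alerts = []
    for t in range(window, t_len):
        hist = metrics[t - window:t]
        mu, sigma = hist.mean(axis=0), hist.std(axis=0) + 1e-9
        z = np.abs(metrics[t] - mu) / sigma
        if (z > z_thresh).sum() >= min_votes:
            alerts.append(t)
    return alerts

x = np.random.randn(200, 5)
x[150, :3] += 10.0             # injected multi-metric anomaly
print(multi_signal_alerts(x))  # should include timestep 150
```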
0
0
cs.DC 2026-05-11 Recognition

Kernel-level splits let networked MCUs run large CNNs

Split CNN Inference on Networked Microcontrollers

Distributing weights and activations across devices cuts peak RAM per unit while keeping latency practical.

Figure from the paper full image
abstract click to expand
Running deep neural networks on microcontroller units (MCUs) is severely constrained by limited memory resources. While TinyML techniques reduce model size and computation, they often fail in practice due to excessive peak Random Access Memory (RAM) usage during inference, dominated by intermediate activations. As a result, many models remain infeasible on standalone MCUs. In this work, we present a fine-grained split inference system for networked MCUs that enables collaborative inference of Convolutional Neural Networks (CNN) models across multiple devices. Our key insight is that breaking the memory bottleneck requires splitting inference at sub-layer granularity rather than at layer boundaries. We reinterpret pre-trained models to enable kernel-wise and neuron-wise partitioning, and distribute both model parameters and intermediate activations across multiple MCUs. A lightweight, resource-aware coordinator orchestrates the inference across MCU devices with heterogeneous resources. We implement the proposed system on a real testbed and evaluate it on up to 8 MCUs using MobileNetV2, a representative CNN model. Our experimental results show that CNN models infeasible on a single MCU can be executed across networked MCUs, reducing the per-MCU peak RAM usage while maintaining the practical end-to-end inference latency. All the source code of this work can be found here: https://github.com/shashsuresh/split-inference-on-MCUs.
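Kernel-wise splitting assigns disjoint output channels of a convolution to different MCUs, so each device stores only its slice of weights and activations, and concatenating the slices reproduces the full layer output. A numpy sketch of that equivalence (naive convolution, hypothetical partitioning, not the paper's runtime):

```python
import numpy as np

def conv2d(x, w):
    """Naive valid convolution: x (C_in, H, W), w (C_out, C_in, kh, kw)."""
    c_out, c_in, kh, kw = w.shape
    H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    y = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                y[o, i, j] = np.sum(x[:, i:i+kh, j:j+kw] * w[o])
    return y

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)

full = conv2d(x, w)
# Kernel-wise split: MCU 0 holds output channels 0-1, MCU 1 holds channels 2-3.
part0 = conv2d(x, w[:2])
part1 = conv2d(x, w[2:])
assert np.allclose(full, np.concatenate([part0, part1], axis=0))
```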
0
0
cs.DC 2026-05-11 2 theorems

Consistency models collapse into three entangled constraints

Light Cone Consistency: Toward a Unified Theory of Consistency in Message-Passing Systems

Every active system must relax causal closure, fork resolution, or timeliness to escape the impossibility triangle.

abstract click to expand
Every distributed system -- databases, networks, postal services, CPU caches -- is a message-passing system. Every message-passing system is a growing causal log observed by a set of observers. We present Light Cone Consistency (LCC), a framework that describes every known consistency model as a configuration of three constraints on each observer's visible sub-DAG: causal closure $C(\mathrm{deps})$, fork resolution $O(\pi)$, and timeliness $R(\delta)$, plus an orthogonal return-value function $F$. We map 85 configurations, covering all 50+ named models from Viotti and Vukolic's taxonomy, with caveats for fork-based and probabilistic models. We show that three impossibility results of distributed computing -- CAP, FLP, and AFC -- each constrain exactly one pair of parameters, and prove they are minimal and independent. Our central result is the observation that these three constraints are fully entangled: violation of any one surface cascades to the other two, because restoring any parameter requires messages -- and those messages are subject to all three constraints. The three parameters and their pairwise impossibility surfaces form a fully connected triangle. Every distributed system must exit the triangle by relaxing at least one parameter. The triangle activates only when the system is in use: $C \neq \mathrm{none}$, $O \neq \mathrm{trivial}$, or $R \neq \mathrm{absent}$ each introduces a constraint that exposes the system to the surfaces. A system that demands nothing -- or writes far slower than its propagation delay -- is trivially linearizable. We identify open problems including a conjectured fourth surface (log locality), undiscovered constraints, and the universality of the safety-liveness fork as the consequence of crossing any boundary.
0
0
cs.DC 2026-05-11 Recognition

System achieves up to 7.57x faster dynamic multimodal LLM training

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

Decoupled encoder and LLM parallelism plus adaptive balancing maintain high efficiency as data lengths and modality mixes vary in production

abstract click to expand
As the foundational component of versatile AI applications, training a multimodal large language model (MLLM) relies on multimodal datasets with dynamic modality mixture proportions and sample length distributions. However, existing MLLM systems remain inefficient under dynamic workloads, due to statically coupled decisions of resource allocation and model parallelization between encoders and the LLM backbone. This paper presents MegaScale-Omni, an industrial-grade MLLM training system tailored for dynamic workload adaptation and hyper-scale deployment. MegaScale-Omni is built upon the training scheme of encoder-LLM multiplexing with three key innovations: (1) Decoupled parallelism strategies with long-short sequence parallelism for encoders to process variable-length samples, and full-fledged 5D parallelism for the LLM backbone, both organized under a communication-efficient parallelization layout. (2) Unified encoder-LLM representations for flexible, extensible colocation, and a new paradigm of encoder-LLM joint pipeline with workload resilience. (3) Workload balancing techniques via decentralized grouped reordering in data loaders and adaptive resharding from encoder to LLM ranks. MegaScale-Omni is deployed as the foundation of our in-house large-scale MLLM training tasks with thousands of GPUs. Our experimental results demonstrate $1.27\times$-$7.57\times$ throughput improvement under production-grade dynamic workloads, as compared to four state-of-the-art systems.
0
0
cs.DC 2026-05-11 Recognition

Basic Verkle trees cost more than Merkle trees

TS-Verkle: A TypeScript Native Verkle Library With On-chain Verifier

TypeScript library and Solidity verifier show unoptimized Verkle setups exceed Merkle costs despite smaller proofs.

Figure from the paper full image
abstract click to expand
Blockchain systems face significant scalability challenges due to growing data volumes and increasing transaction demands, necessitating more efficient data structures and verification mechanisms. Verkle trees, a novel data structure combining the efficiency of Merkle trees with the compactness of vector commitments, have gained attention for their potential to optimize blockchain storage and improve scalability. However, their practical implementation, especially at the smart contract level, has remained unexplored. To address these challenges, we present TS-verkle, the first known TypeScript-native implementation of Verkle trees designed for web3 backend compatibility, coupled with a corresponding on-chain verifier written in Solidity. Our work bridges this gap by providing a concrete implementation of Verkle trees and demonstrating their feasibility for on-chain verification. While previous literature suggests Verkle trees should outperform Merkle trees due to their succinct proof size, our empirical evaluation reveals that basic implementations of Verkle trees actually incur higher costs than Merkle trees without advanced optimization techniques. This finding represents a crucial insight for blockchain developers and researchers considering Verkle tree adoption. The paper discusses implementation strategies and performance characteristics while exploring implications for scaling and data availability in decentralized blockchain systems.
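For context on the baseline side of the comparison, a Merkle commitment hashes leaves pairwise up to a single root, and an inclusion proof is the O(log n) path of sibling hashes. A minimal Python sketch of the standard construction (not TS-verkle's code):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute a Merkle root; odd levels duplicate the last node (one common convention)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"tx1", b"tx2", b"tx3", b"tx4"])
print(root.hex())
# Proof size grows as O(log n) sibling hashes per leaf, which is the cost
# Verkle trees aim to shrink with vector commitments.
```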
0
0
cs.DC 2026-05-11 Recognition

Generative model compresses Earth data by up to 10,000 times

Transforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10,000x Data Reduction

Learning from historical archives turns massive observations into task-adaptive compressed representations.

Figure from the paper full image
abstract click to expand
Earth observation is becoming one of the largest data-producing activities in science, yet current pipelines still treat compression as a storage and transmission tool rather than a new way to use data. We present a generative compression framework that learns from historical Earth observation archives and enables on-demand 100x to 10,000x data reduction across downstream tasks. Unlike general visual data, Earth observation repeatedly measures the same evolving planet, making historical-prior learning feasible for extreme compression. To realize this paradigm, we train large generative compression models at exascale on the LineShine Armv9 CPU supercomputer, with co-optimization across model design, kernels, memory hierarchy, runtime, and parallelism. Our implementation sustains 1.54 EFLOP/s and peaks at 2.16 EFLOP/s in end-to-end training. This work shows that historical-prior generative compression can turn Earth observation data into an active, task-adaptive foundation for acquisition, delivery, storage, and scientific use.
0
0
cs.DC 2026-05-11 2 theorems

Same code runs in abstract rounds and real sockets for distributed algorithms

QUANTAS 2: An Abstract, Concrete and Byzantine Simulator

QUANTAS 2 adds concrete network mode and composable fault strategies so researchers can explore then validate without rewriting anything.

Figure from the paper full image
abstract click to expand
We present QUANTAS 2: a new distributed algorithm simulator and quantitative performance analysis tool. We use the original QUANTAS as a foundation. QUANTAS 2 can perform fast abstract exploration, concrete validation, and adversarial fault injection while preserving a compact implementation model for distributed algorithm researchers. The original QUANTAS was designed as an abstract, round-based simulator, which allows researchers to separate algorithmic behavior from the artifacts of a particular operating system, network stack, or physical deployment. QUANTAS 2 extends that design in two directions. First, QUANTAS 2 supports a concrete socket-based execution mode, allowing the same algorithm implementations and JSON experiment descriptions to run across local or distributed computers. Second, QUANTAS 2 adds a reusable Byzantine-fault interface in which Byzantine behavior is encoded as a composable fault strategy that substitutes for correct sends, receives, and local computation. This allows researchers to simulate crash, equivocation, selfish-mining, and other adversarial behaviors without rewriting the simulated algorithm. We demonstrate the resulting platform on blockchain, consensus, distributed hash table, and reliable data link algorithms. We perform parasite-chain sweeps for proof-of-work blockchains, PBFT equivocation experiments, Raft crash experiments, and Chord/Kademlia scale experiments over both abstract and concrete modes.
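The fault interface can be pictured as wrappers that replace the correct send/receive/compute actions and compose with one another. A toy Python sketch of that composition pattern (hypothetical class names, not the QUANTAS 2 API):

```python
import random

# Toy sketch of composable fault strategies substituting for correct actions
# (hypothetical names; the QUANTAS 2 interface differs in detail).

class CorrectSend:
    def send(self, peer, msg):
        return [(peer, msg)]                 # deliver the message unchanged

class CrashAfter:
    def __init__(self, inner, crash_round):
        self.inner, self.crash_round, self.round = inner, crash_round, 0
    def send(self, peer, msg):
        self.round += 1
        return [] if self.round > self.crash_round else self.inner.send(peer, msg)

class Equivocate:
    def __init__(self, inner):
        self.inner = inner
    def send(self, peer, msg):
        # Send a conflicting value to roughly half the peers, chosen at random.
        forged = dict(msg, value="conflicting")
        return self.inner.send(peer, forged if random.random() < 0.5 else msg)

# Strategies compose: a node that equivocates and later crashes.
strategy = CrashAfter(Equivocate(CorrectSend()), crash_round=10)
print(strategy.send("node-3", {"type": "vote", "value": "A"}))
```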
0
0
cs.DC 2026-05-11 Recognition

Concurrent RL fine-tunes match single-task quality at 4.3x efficiency

MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service

LoRA sharing and decoupled async stages let 32 tasks run together while cutting end-to-end time 85 percent.

Figure from the paper full image
abstract click to expand
Reinforcement Learning from Verifiable Rewards (RLVR) has significantly improved the reasoning capabilities of large language models (LLMs), particularly in multi-turn agentic settings involving environment interaction like tool use. However, fine-tuning such models remains prohibitively expensive due to high computational requirements, limiting accessibility. We propose MARLaaS (Multi-tenant Asynchronous RL as a Service), a system for concurrent RL fine-tuning across multiple users and tasks. Our approach is based on two key ideas: (1) sharing a base model across tenants using lightweight LoRA adapters, and (2) a disaggregated asynchronous architecture that decouples rollout generation, environment interaction, and policy training into independently scheduled stages. This design enables tasks to progress through the RL pipeline at their own pace in an event-driven manner, reducing cross-task interference, idle time, and end-to-end latency. In multi-task settings (we report up to 32 concurrent tasks), MARLaaS achieves single-task state-of-the-art performance while improving accelerator utilization by up to 4.3x and reducing end-to-end training time by 85%.
0
0
cs.DC 2026-05-11 Recognition

Block-level sharding scales context parallelism to 256 GPUs

Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

Flexible communication and bin-packing balance short and long sequences, lifting attention MFU by 1.13x to 2.21x.

Figure from the paper full image
abstract click to expand
Context parallelism (CP) has been widely adopted to support the growing context length in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length from training datasets, resulting in suboptimal performance. These methods often over-shard short sequences, leading to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, causing workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. Instead of relying on rigid communication topologies such as ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and balanced workload distribution. Extensive evaluations show that FCP attains near-linear scalability on up to 256 NVIDIA GPUs, with 1.13x-2.21x improvement in the attention MFU.
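Block-level sharding cuts every sequence into fixed-size blocks and then packs blocks from long and short sequences onto workers so per-worker token counts stay balanced. A greedy sketch of that packing step (illustrative, not FCP's scheduler):

```python
import heapq

def pack_blocks(seq_lengths, block_size, num_workers):
    """Greedily assign fixed-size sequence blocks to the least-loaded worker.
    Returns per-worker token loads and block placements (illustrative only)."""
    blocks = []
    for sid, length in enumerate(seq_lengths):
        for start in range(0, length, block_size):
            blocks.append((sid, start, min(block_size, length - start)))
    blocks.sort(key=lambda b: -b[2])              # largest blocks first
    heap = [(0, w) for w in range(num_workers)]   # (current load, worker)
    heapq.heapify(heap)
    placement = {w: [] for w in range(num_workers)}
    for blk in blocks:
        load, w = heapq.heappop(heap)
        placement[w].append(blk)
        heapq.heappush(heap, (load + blk[2], w))
    loads = {w: sum(b[2] for b in blks) for w, blks in placement.items()}
    return loads, placement

loads, _ = pack_blocks([8192, 512, 512, 4096, 1024], block_size=1024, num_workers=4)
print(loads)   # per-worker token counts stay close to each other
```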
0
0
cs.DC 2026-05-11 Recognition

Dooly reuses LLM op profiles across configs to cut costs 56%

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

One inference pass labels dimensions to build a reusable database, holding simulation error to 5-8% MAPE on time metrics.

Figure from the paper full image
abstract click to expand
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
0
0
cs.DC 2026-05-11 Recognition

Stencil kernels run up to 342x faster on wafer-scale engine

Stencil Computations on Cerebras Wafer-Scale Engine

CStencil uses on-chip SRAM and mesh interconnect to remove memory bottlenecks that limit GPUs for scientific workloads.

Figure from the paper full image
abstract click to expand
Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional High-Performance Computing architectures like GPUs, struggling against the "Memory Wall". Simultaneously, the rise of AI-oriented hardware, such as the Cerebras Wafer-Scale Engine, offers massive core parallelism and high-bandwidth on-chip memory, though typically optimized for lower-precision workloads. This work investigates the viability of bridging this divergence by mapping stencil algorithms onto the Cerebras WSE-3. The study introduces CStencil, a novel framework designed to implement two-dimensional stencil computations on the WSE-3. To ensure a rigorous and fair performance evaluation, the research also adapts ConvStencil, a state-of-the-art GPU stencil solver, porting it from its original double-precision design to single-precision for execution on an NVIDIA A100 GPU. Experimental results show that the WSE-3's distributed SRAM and mesh interconnect effectively eliminate the off-chip memory bottlenecks common in GPU implementations. CStencil achieves speedups of up to 342x over the adapted ConvStencil version. A roofline model analysis further confirms that CStencil saturates the available compute and memory resources, demonstrating that the WSE dataflow architecture can be successfully repurposed for traditional scientific algorithms. These findings highlight the potential of the WSE-3 to deliver hardware utilization levels unattainable on conventional systems, offering a promising path toward overcoming the memory limitations of current HPC architectures.
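For readers outside the stencil world, a 2D 5-point stencil updates each interior point from itself and its four neighbors, performing only a handful of flops per memory access, which is why it is memory-bound on conventional hardware. A numpy reference of the kernel itself (the mathematical definition, not CStencil's WSE code):

```python
import numpy as np

def jacobi_5pt(u, iters=1, c=0.2):
    """Reference 2D 5-point stencil sweep; boundary values are held fixed.
    This is just the kernel definition, not the WSE-3 implementation."""
    u = u.copy()
    for _ in range(iters):
        u[1:-1, 1:-1] = c * (u[1:-1, 1:-1] + u[:-2, 1:-1] + u[2:, 1:-1]
                             + u[1:-1, :-2] + u[1:-1, 2:])
    return u

grid = np.random.rand(512, 512).astype(np.float32)
out = jacobi_5pt(grid, iters=10)
# Roughly 5 flops per point against ~5 loads and 1 store: low arithmetic
# intensity, hence the memory-bound behavior on off-chip-DRAM architectures.
```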
0
0
cs.DC 2026-05-11 Recognition

175B models trained at 10% peak FLOPs with standard parallelism

A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Tensor, pipeline and sharded data parallelism yield 93% weak scaling on 128 nodes using unmodified software.

Figure from the paper full image
abstract click to expand
Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources, e.g. a one-trillion-parameter GPT-style model requires an estimated 120 million exaflops. This challenge necessitates efficient distributed training strategies on cutting-edge High-Performance Computing (HPC) infrastructure. In this work, we explore the SuperMUC-NG Phase 2 (SMNG-P2) system at the Leibniz Supercomputing Centre (LRZ) in Garching, Germany, equipped with Intel Data Center GPU Max 1550 accelerators to extract the necessary computational power. We enable and investigate a comprehensive recipe of parallel training techniques, including tensor parallelism, pipeline parallelism, and sharded data parallelism, essential for facilitating the training of LLMs up to 175 billion-parameter scale on SMNG-P2. Through empirical assessment and extensive hyperparameter tuning, we analyze the complex interplay among these techniques and determine their impact on GPU computational efficiency. We identify an optimized combined strategy that yields high throughput and enables the efficient training of LLMs of varying sizes. Specifically, for the 175B model, we achieved per-tile throughput of 10% of theoretical peak per-tile bf16 FLOPs, employing an out-of-the-box publicly available software stack, utilizing standard distributions without further modification. This approach ensures broad accessibility, as our methodology can be replicated by any user on SMNG-P2 system without need for porting or specialized software engineering. Furthermore, we achieved 93% weak scaling efficiency and strong scaling efficiency of 82% on 128 nodes. This scalable recipe provides a crucial blueprint for efficiently utilizing advanced exascale systems for next-generation foundational model development.
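The quoted compute figure is consistent with the common ~6·N·T FLOPs rule of thumb for dense transformer training; a quick check, where the 20-trillion-token count is an assumption chosen to match the stated estimate:

```python
# Back-of-the-envelope check of the quoted "120 million exaflops" figure using
# the standard ~6 * N * T FLOPs rule of thumb for dense transformer training.
# The 20-trillion-token count is an assumption (roughly the common ~20
# tokens-per-parameter heuristic) chosen to reproduce the quoted estimate.

params = 1e12            # one-trillion-parameter GPT-style model
tokens = 20e12           # assumed training tokens
flops = 6 * params * tokens
exaflops = flops / 1e18
print(f"{flops:.2e} FLOPs ~= {exaflops / 1e6:.0f} million exaflops")  # ~120 million
```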
0
0
cs.DC 2026-05-11 Recognition

Wormhole stencil kernels match CPU speed but lose to transfers

Stencil Computations on Tenstorrent Wormhole

Axpy version uses less energy on large inputs while profiling isolates PCIe and setup costs as the main gap.

Figure from the paper full image
abstract click to expand
As investment in AI-focused accelerators grows and their deployment in supercomputing facilities expands, understanding whether these architectures can efficiently support traditional scientific kernels is critical for the future of High-Performance Computing. We investigate the mapping of 2D 5-point stencil computations onto the Tenstorrent Wormhole, a RISC-V AI dataflow accelerator. We develop two heterogeneous implementations: Axpy, which decomposes the stencil into element-wise submatrix operations, and MatMul, which reformulates it as a matrix multiplication. While the CPU baseline remains 3x faster end-to-end, profiling reveals that the isolated Wormhole kernel is competitive with CPU execution, with the gap driven by PCIe transfers, device initialization, and host-side preprocessing. Despite slower runtime, Axpy achieves lower energy consumption than the CPU baseline for large inputs. Through detailed profiling and theoretical analysis, we identify key architectural and software limitations of the current platform and outline concrete hardware and software directions that could make AI accelerators competitive for HPC workloads.
0
0
cs.DC 2026-05-11 Recognition

HexiSeq trains long-context LLMs 1.36x faster on mixed GPU clusters

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

Asymmetric partitioning of sequences and heads to match device capabilities delivers gains on both real and simulated heterogeneous setups.

abstract click to expand
Long-context training of large language models (LLMs) is commonly distributed with Context Parallelism (CP) and Head Parallelism (HP), but existing training systems largely assume homogeneous GPU meshes. This paper extends CP and HP to heterogeneous GPU clusters with mixed GPU models and non-uniform network bandwidths, a common setting in production training. We introduce HexiSeq, a system that supports fully asymmetric CP--HP partitioning by assigning sequence shards and attention heads according to device compute, memory, and communication capabilities. We formalize heterogeneous CP--HP allocation as a constrained optimization problem and develop an efficient hierarchical scheduler for finding optimal schedules. We evaluate HexiSeq against state-of-the-art CP and HP baselines on both real and simulated heterogeneous clusters. Across models from 3B to 70B parameters and context lengths up to one million tokens, HexiSeq improves throughput by $1.11\times$ on average and up to $1.19\times$ on mixed H100--A100 testbeds, and by $1.36\times$ on average and up to $1.72\times$ in simulations with 32--128 GPUs spanning up to four GPU models. On FLOP-comparable pairs against homogeneous clusters, HexiSeq reaches throughput close to the strongest homogeneous baseline, showing that heterogeneous clusters can be used efficiently for long-context LLM training.
0
0
cs.DC 2026-05-11 2 theorems

Hierarchical agents lift AI-RAN SLO fulfillment to 90%

Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN

An LLM placement layer plus fast convex allocation meets deadlines while the critic blocks costly migrations.

Figure from the paper full image
abstract click to expand
AI-RAN consolidates AI services and Radio Access Network (RAN) functions onto a unified, GPU-accelerated infrastructure at the network edge. However, compute sharing between real-time RAN functions and highly heterogeneous AI services requires coordination of scheduling decisions at mismatched timescales, and placement adaptation may require service migration across nodes with non-negligible interruptions. This paper proposes a hierarchical agentic framework (HAF) for compute sharing in AI-RAN that combines a large language model (LLM)-based agent for slow-timescale placement of AI services and RAN functions with a closed-form, deadline-aware convex algorithm for fast-timescale GPU/CPU allocation. The LLM agent is further equipped with a predictive critic that filters out migrations when the induced service interruption outweighs the expected service-level objective (SLO) benefit. Experimental results show that HAF reaches 90.0% overall SLO fulfillment, a 20.5% improvement over the strongest baseline, and raises AI service request fulfillment from 51% to 85.3%. Further evaluations show that HAF retains its advantage under diverse load conditions, while the critic consistently improves SLO fulfillment across multiple open-source LLM agents.
0
0
cs.DC 2026-05-11 Recognition

RcLLM cuts TTFT 1.31x-9.51x for generative recommendation

RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching

Beyond-prefix block caching plus selective attention lets LLMs serve real-time personalized outputs from long prompts

Figure from the paper full image
abstract click to expand
Large Language Models (LLMs) are transforming recommendation from ranking into a generative task, but industrial deployment remains limited by the high latency of processing long, personalized prompts. Standard prefix caching provides limited benefit because reuse in recommendation workloads is often non-contiguous across user histories and item contexts. We present RcLLM, a distributed inference system for generative recommendation with Beyond-Prefix KV Caching. RcLLM decomposes prompts into reusable blocks and supports large item catalogs with a stratified distributed storage design: compact user-history caches are replicated for zero-latency retrieval, while massive item caches are sharded using similarity-aware placement. To reduce redundant quadratic attention computation, RcLLM combines an affinity-based global scheduler that improves data locality with a selective attention mechanism that corrects approximation errors. Experiments on real-world datasets show that RcLLM reduces Time-To-First-Token (TTFT) by 1.31x-9.51x compared with state-of-the-art prefix caching systems, enabling real-time serving with negligible impact on recommendation accuracy.
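Beyond-prefix caching keys KV blocks by the content of each prompt segment rather than by its position, so a user-history or item block can be reused even when it is not a prefix of the current prompt. A toy content-addressed lookup sketch (hypothetical, not RcLLM's storage layer; real non-prefix reuse also needs the paper's selective attention correction):

```python
import hashlib

class BlockKVCache:
    """Toy content-addressed KV-block cache (illustrative, not RcLLM's design).
    Blocks are keyed by a hash of their tokens, so a reusable segment hits the
    cache even when it is not a prefix of the current prompt."""
    def __init__(self):
        self.store = {}          # block_hash -> precomputed "KV" payload
    def _key(self, tokens):
        return hashlib.sha1(" ".join(tokens).encode()).hexdigest()
    def lookup_or_compute(self, blocks, compute_kv):
        hits, out = 0, []
        for block in blocks:
            k = self._key(block)
            if k not in self.store:
                self.store[k] = compute_kv(block)   # expensive prefill path
            else:
                hits += 1
            out.append(self.store[k])
        return out, hits

cache = BlockKVCache()
prompt_a = [["user", "history", "A"], ["item", "42"], ["query", "1"]]
prompt_b = [["item", "42"], ["user", "history", "A"], ["query", "2"]]  # reordered
cache.lookup_or_compute(prompt_a, compute_kv=lambda b: f"kv({b})")
_, hits = cache.lookup_or_compute(prompt_b, compute_kv=lambda b: f"kv({b})")
print(hits)   # 2: non-prefix blocks still reuse their cached KV
```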
0
0
cs.DC 2026-05-11 2 theorems

MERBIT speeds irregular SpMV 27 percent on GPUs

MERBIT: A GPU-Based SpMV Method for Iterative Workloads

Merge-path and bit-fields balance nonzeros and memory traffic for repeated graph matrix work.

Figure from the paper full image
abstract click to expand
Sparse Matrix-Vector Multiplication (SpMV) is the cornerstone in many iterative workloads, including large-scale graph analytics and sparse iterative solvers. Accelerating SpMV on real-world graphs remains challenging due to highly irregular sparsity patterns. In this paper, we propose MERBIT, a GPU SpMV method designed for repeated SpMV on irregular, graph-like sparse matrices, with PageRank as a representative motivating workload. MERBIT combines two key ideas from existing GPU SpMV methods. At the global level, it uses merge-path partitioning to balance work over nonzeros and row boundaries. At the local level, it encodes each merge-path segment using a compact bit-field descriptor. MERBIT improves workload balance and promotes coalesced memory access for both matrix loading and output writes; moreover, three optimization strategies are incorporated to further enhance performance. Experiments on 50 large irregular datasets demonstrate that MERBIT outperforms competitive baselines, including cuSPARSE, Ginkgo, and academic approaches, achieving average speedups of 1.27 and 1.25 over cuSPARSE COO in single and double precision, respectively.
0
0
cs.DC 2026-05-11 Recognition

Future-state scheduler cuts LLM workflow makespan by 32 percent

FATE: Future-State-Aware Scheduling for Heterogeneous LLM Workflows

By scoring placements on both immediate cost and induced downstream state, FATE beats round-robin and classical DAG methods on real-DAG and controlled prefix-reuse benchmarks.

Figure from the paper full image
abstract click to expand
Large language model (LLM) applications are increasingly executed as heterogeneous multi-stage workflows rather than isolated inference calls. In these workflow directed acyclic graphs (DAGs), scheduling decisions affect not only the currently ready stage, but also the execution state inherited by downstream stages, including model residency, parent-output locality, prefix reuse, and future device reachability. Existing serving and DAG-scheduling policies mainly optimize immediate queue state, placement cost, or reuse signals in isolation, which can fragment useful state and increase end-to-end latency. We present FATE, a future-state-aware scheduler for heterogeneous LLM workflows. FATE combines a CP-SAT-backed frontier planner, horizon-aware candidate scoring, bounded multi-device shard execution, and state-conditional cost estimation. Rather than solving a monolithic full-DAG problem, FATE repeatedly plans over the current ready frontier and scores assignments by both immediate cost and the downstream state they induce. Across real-DAG and controlled prefix-reuse benchmarks, FATE outperforms practical heuristics, classical DAG scheduling, and proxy adaptations of recent workflow-serving policies. On the real-DAG benchmark, it achieves normalized makespan and normalized P95 latency of 0.675 and 0.677, reducing them by 32.5% and 32.3% over RoundRobin and by 8.9% and 8.8% over the strongest non-FATE baseline. Mechanism analysis and ablations show that these gains arise from jointly preserving multiple dimensions of future execution state rather than prefix reuse alone. These results indicate that future-state preservation should be treated as a first-class scheduling objective for heterogeneous LLM workflow serving.
0
0
cs.DC 2026-05-08 Recognition

Hardware usage metrics match Kripke kernel to RAJA proxy

On Similarity of Computational Kernels in our Codes and Proxies

New scores based on resource patterns automate checks on whether benchmarks represent real HPC codes on CPU and GPU systems.

Figure from the paper full image
abstract click to expand
As high-performance computing (HPC) systems rapidly evolve, with increasing on-node parallelism and widespread use of accelerators, understanding how the code maps to hardware is essential for reaching optimal performance. Benchmarks are commonly used for early assessment of emerging architectures (as well as for informing the design of future hardware), but it is often unknown how well the benchmarks represent the performance characteristics of simulation codes. Existing methods for evaluating how well our benchmarks represent our HPC codes are manual, labor intensive, and challenging to scale to many benchmarks. In this paper, we propose performance similarity metrics based on how the code uses the compute hardware. We define and characterize two broad categories of kernels that exhibit similar performance characteristics. We evaluate the pairwise similarity metrics on kernels in the Kripke proxy application and the RAJA Performance Suite, using both a CPU-only system and a GPU-accelerated system. We validate that our similarity metrics correctly match a kernel in the Kripke proxy app to a kernel in the RAJA Performance Suite. Our proposed similarity metrics enable assessment of the similarity of computational kernels in our codes and the proxy applications we use to represent the codes.
0
0
cs.DC 2026-05-08 2 theorems

Per-step slack regulator raises LLM goodput 1.77x

Regulating Branch Parallelism in LLM Serving

Extra output branches admitted only when their added latency fits current batch slack, keeping SLOs above 95%.

Figure from the paper full image
abstract click to expand
Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place. We call the excess step latency caused by admitted branches the branch externality and show that the safe width depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace. We introduce TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch's current slack budget. Per-step regulation is practical because branch-level scheduling decouples compute from memory: branches share the request's prefix KV, so expanding or contracting width requires no memory reclamation. On Qwen3-32B, TAPER improves goodput by $1.77\times$ over IRP-Off and by $1.48\times$ over IRP-Eager, while maintaining over $95\%$ SLO attainment.
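The admission rule in outline: estimate the extra step latency each additional branch would add, then admit branches only while the batch's accumulated slack covers that externality. A sketch under placeholder cost and slack models (not TAPER's estimators):

```python
def admit_branches(candidate_branches, slack_ms, cost_ms_per_branch):
    """Admit extra decode branches only while their predicted added step latency
    (the 'branch externality') fits in the batch's current slack budget.
    Cost and slack models here are placeholders, not TAPER's estimators."""
    admitted, budget = [], slack_ms
    # Cheapest externality first, so slack is spent where it buys the most width.
    for branch in sorted(candidate_branches, key=cost_ms_per_branch):
        cost = cost_ms_per_branch(branch)
        if cost <= budget:
            admitted.append(branch)
            budget -= cost
        else:
            break
    return admitted

branches = [{"id": i, "ctx_len": 1024 * (i + 1)} for i in range(6)]
est = lambda b: 0.05 + b["ctx_len"] / 40000          # toy externality model (ms)
print(admit_branches(branches, slack_ms=0.4, cost_ms_per_branch=est))  # admits 3
```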
0
0
cs.DC 2026-05-08

Traces reveal LLM setups 3x slower on identical hardware

CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure

A benchmark that records full execution traces exposes why some configurations underperform despite better overlap or hardware.

Figure from the paper full image
abstract click to expand
Evaluative claims about LLM infrastructure -- ``workload X is fastest on hardware Y with software Z'' -- depend on a complex configuration space spanning hardware accelerators, interconnect bandwidth, software frameworks, parallelism plans, and communication libraries. Current infrastructure evaluation benchmarks publish a small set of end-to-end numbers that do not explain why one configuration outperforms another. We present CCL-Bench, a trace-based benchmark that addresses the limitations of existing benchmarks by recording reusable evidence for every ML workload. Each contributed data point in CCL-Bench packages an execution trace, a YAML workload card, and the launch scripts. We have developed a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from this evidence. Using CCL-Bench, we surface three claims that summary-statistic benchmarks cannot support: (i) higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, (ii) doubling TPU interconnect bandwidth yields a much higher end-to-end improvement in step time than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can run up to 3$\times$ slower than the best-tuned configuration on a peer framework on identical hardware.
0
0
cs.DC 2026-05-08

Serving GPUs accelerate agentic RL rollouts up to 3.3x

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

ROSE safely reuses spare capacity in production clusters for faster multi-turn agent training without breaking service guarantees

Figure from the paper full image
abstract click to expand
Agentic reinforcement learning (RL) has emerged as a key driver for improving the multi-step reasoning and tool-use capabilities of LLMs. However, its efficiency is bottlenecked by long-tail rollouts with multi-turn environment interactions, making static GPU provisioning a poor fit: overprovisioning wastes GPUs on stragglers, while underprovisioning increases contention and slows training. We observe that production serving clusters routinely leave substantial GPU compute and memory headroom. Based on this observation, we argue for cooperative elasticity: opportunistically repurposing underutilized serving GPUs to execute rollouts. Realizing cooperative elasticity is non-trivial because it must preserve serving Service Level Objectives (SLOs) under bursty traffic and minimize communication overhead. To address these challenges, we present ROSE, a cooperative, resource-elastic post-training system that safely harvests idle compute and memory on serving GPUs to accelerate agentic RL rollouts. ROSE consists of three components: (1) an SLO-safe co-serving executor that improves rollout throughput while preserving serving SLOs through efficient GPU memory and compute sharing; (2) a cross-cluster weight transfer engine that leverages weight shards and sparsity for fast weight synchronization across clusters; and (3) an elastic rollout scheduler that dynamically provisions cooperative capacity and routes trajectory rollouts across dedicated rollout GPUs and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves average end-to-end throughput by 1.20-3.31 x compared with state-of-the-art resource-fixed and elastic baselines.
0
0
cs.DC 2026-05-08

AD replaces finite differences in INLA for 4-8x gradient speedups

ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations

Structure-exploiting multi-GPU backward pass cuts energy 5-8x and enables reliable runs on models with 1.9 million latent variables.

Figure from the paper full image
abstract click to expand
Spatio-temporal Bayesian inference drives environmental and health sciences using latent Gaussian models. Integrated Nested Laplace Approximations (INLA) enable inference for these models at HPC scale but rely on derivative-based optimization over $d$ hyperparameters. State-of-the-art INLA implementations approximate derivatives via central finite differences (FD), requiring $2d{+}1$ evaluations. These evaluations are embarrassingly parallel, but total work and energy grow with $d$, limiting time-to-solution under fixed budgets. Reverse-mode automatic differentiation (AD) computes exact gradients independently of $d$, but its efficient application to INLA's structured-sparse kernels is an open challenge. We present ADELIA, the first AD-enabled INLA implementation with a structure-exploiting multi-GPU backward pass leveraging model sparsity. We evaluate ADELIA on ten benchmark models, including real-world air-pollution monitoring. We achieve $4.2$--$7.9\times$ per-gradient speedups and reliable convergence on production-scale models with up to 1.9M latent variables, where FD struggles. Even when scaled to 16--32 GPUs to match ADELIA's wall-clock time, FD consumes $5$--$8\times$ more energy.
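The cost argument is that central finite differences need on the order of 2d+1 objective evaluations per gradient over d hyperparameters, while reverse-mode AD returns the exact gradient from a constant number of passes. A tiny numpy illustration of the evaluation count on a toy quadratic (not ADELIA's INLA kernels):

```python
import numpy as np

def fd_gradient(f, x, eps=1e-6):
    """Central finite differences: 2 evaluations per dimension (plus one more if
    the center value is also needed), so cost grows linearly with d."""
    g = np.zeros_like(x)
    calls = 0
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        calls += 2
    return g, calls

# Toy objective standing in for an INLA-style criterion over d=5 hyperparameters.
A = np.diag(np.arange(1.0, 6.0))
f = lambda x: 0.5 * x @ A @ x
x0 = np.ones(5)

g_fd, n_calls = fd_gradient(f, x0)
g_exact = A @ x0   # what a reverse-mode AD backward pass returns exactly, once
print(n_calls, np.max(np.abs(g_fd - g_exact)))   # 10 calls; tiny FD error
```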
0
0
cs.DC 2026-05-08

ResiHP lifts LLM training speed 1–4× under real GPU failures

ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

Workload-aware detection plus dynamic group resizing keeps hybrid-parallel training efficient when devices slow or drop out.

Figure from the paper full image
abstract click to expand
Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To respond, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments show that ResiHP improves training throughput by 1.04-4.39$\times$ compared with state-of-the-art resilient training systems under diverse failure scenarios in a 256-GPU cluster.
0
0
cs.DC 2026-05-08 Recognition

TACO toolsuite verifies threshold automata for distributed algorithms

TACO: A Toolsuite for the Verification of Threshold Automata

Implements three decidable model checkers and two semi-decision procedures to check fault-tolerant algorithms automatically.

Figure from the paper full image
abstract click to expand
We present TACO, a toolsuite for the development and automatic verification of fault-tolerant and threshold-based distributed algorithms. Our toolsuite implements three approaches for model checking threshold automata in different decidable fragments known from the literature and two semi-decision procedures going beyond these decidable fragments. Moreover, TACO is a modular, extensible, and well-documented framework for developing algorithms and tools for threshold automata. We present important features, give an overview of the implemented algorithms, and evaluate their performance experimentally.
0
0
cs.DC 2026-05-08

Router cuts data-parallel imbalance in LLM clusters

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

Asymmetric scoring of request assignments prevents the slowest worker from setting the pace at every decode step.

Figure from the paper full image
abstract click to expand
Data-parallel (DP) load balancing has emerged as a first-order bottleneck in large-scale LLM serving. When a model is sharded across devices via tensor parallelism (TP) or expert parallelism (EP) and replicated across many DP workers, every decode step ends in a synchronization barrier whose latency is set by the most heavily loaded worker; even modest persistent imbalance across DP workers compounds, step after step, into a substantial fraction of wasted compute. The problem is hard for reasons specific to LLM decoding: assignments are sticky (migrating KV caches has a high cost), per-request loads grow over time, arrivals are non-stationary, and the router must decide within a sub-100\,ms decode budget over hundreds of waiting requests and tens of workers. We present \textbf{BalanceRoute}, a family of practical online routing algorithms that target this bottleneck. The first, \textbf{BR-0}, requires no prediction infrastructure and uses a piecewise-linear F-score that captures the sharp asymmetry between admissions that fill safe margin and those that overflow into the envelope; a two-stage decomposition keeps per-step cost compatible with millisecond-scale scheduling. The second, \textbf{BR-H}, generalizes BR-0 with a short, constant lookahead $H$ and a lightweight termination-classifier interface, extending the F-score to a horizon-discounted form. We deploy BalanceRoute on a 144-NPU cluster and evaluate against vLLM baselines on both a proprietary production trace and the public Azure-2024 trace. Across both workloads, BalanceRoute substantially reduces average DP imbalance and improves end-to-end serving throughput.
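The core of the F-score is asymmetry: filling a worker's safe margin is cheap, while pushing it past the envelope is penalized steeply. A toy routing sketch with such a piecewise-linear score (placeholder constants, not BR-0's calibration):

```python
def f_score(load, request_cost, safe_capacity, overflow_penalty=8.0):
    """Piecewise-linear cost of admitting a request onto a worker: filling safe
    margin costs its size; any part spilling past the envelope is penalized
    much more steeply (constants are placeholders, not BR-0's calibration)."""
    new_load = load + request_cost
    within = min(new_load, safe_capacity) - min(load, safe_capacity)
    overflow = max(new_load - safe_capacity, 0) - max(load - safe_capacity, 0)
    return within + overflow_penalty * overflow

def route(request_cost, worker_loads, safe_capacity):
    scores = [f_score(l, request_cost, safe_capacity) for l in worker_loads]
    return scores.index(min(scores))

loads = [900, 400, 980]              # e.g. KV-cache tokens per DP worker
print(route(request_cost=150, worker_loads=loads, safe_capacity=1000))  # -> worker 1
```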
0
0
cs.DC 2026-05-08 2 theorems

Automated low-complexity matrix multiplies beat hardware peaks

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

FalconGEMM selects and optimizes lower-complexity algorithms to deliver 8-18 percent gains over cuBLAS on GPUs and CPUs for LLM tasks.

Figure from the paper full image
abstract click to expand
Peak breaking Matrix Multiplication is a promising technique to improve the performance of DL, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. There are three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model to select the optimal strategy based on matrix shapes and hardware profiles. Extensive evaluation is conducted on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM succeeds in delivering peak breaking performance and outperforms GEMM libraries (e.g., cuBLAS, CUTLASS, Intel MKL, etc) by 7.59%-17.85% and LCMA competitors like AlphaTensor by 12.41%-55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.
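A lower-complexity matrix multiplication algorithm trades extra additions for fewer multiplications; the classic example is Strassen's seven-multiplication scheme, shown here for one level of recursion. This only illustrates the LCMA family FalconGEMM deploys, not its generated kernels:

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's algorithm: 7 block multiplies instead of 8
    (illustrates the LCMA idea; FalconGEMM's generated kernels differ)."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(strassen_one_level(A, B), A @ B)
```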
0
0
cs.DC 2026-05-08 Recognition

MoE cuts relay buffers with direct expert-window access

Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

On pooled high-bandwidth memory, placing tokens straight into remote expert windows trims dispatch latency and widens practical serving headroom.

Figure from the paper full image
abstract click to expand
Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including counts, offsets, and synchronization metadata. We instantiate the design as two schedules for the main phases of MoE inference: a prefill schedule with richer planning state for throughput-oriented execution, and a compact decode schedule for latency-sensitive execution. Experiments on Ascend-based MoE workloads show reduced dispatch and combine latency in both settings. At the serving level, the implementation improves time to first token (TTFT), preserves competitive time per output token (TPOT), and enlarges the feasible scheduling space under practical latency constraints. These results indicate that, on platforms with globally addressable device memory, reducing intermediate buffering and output restoration around expert execution is an effective direction for accelerating MoE inference.
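Relay-buffer-free dispatch boils down to computing, per expert, how many tokens it will receive and at what offsets, then writing each token directly into its slot in the destination expert's window. A numpy sketch of that counts-and-offsets bookkeeping (illustrative; the Ascend implementation does this over globally pooled HBM rather than a local array):

```python
import numpy as np

# Illustrative dispatch without a relay buffer: tokens are written straight into
# per-expert windows using counts and offsets; only lightweight control state
# (counts, offsets, slot indices) is kept.

num_experts, hidden = 4, 8
tokens = np.random.rand(10, hidden).astype(np.float32)
expert_ids = np.array([2, 0, 1, 2, 3, 0, 2, 1, 0, 3])     # router output, top-1

counts = np.bincount(expert_ids, minlength=num_experts)    # tokens per expert
offsets = np.zeros(num_experts, dtype=int)
offsets[1:] = np.cumsum(counts)[:-1]                        # start slot per expert

windows = np.zeros((counts.sum(), hidden), dtype=np.float32)
slot = offsets.copy()
src_slot = np.empty(len(tokens), dtype=int)                 # remembered for combine
for i, e in enumerate(expert_ids):
    windows[slot[e]] = tokens[i]                            # direct placement
    src_slot[i] = slot[e]
    slot[e] += 1

# "Combine" reads each token's result back from the same slot, so no
# reordering or restoration buffer is needed either.
restored = windows[src_slot]
assert np.allclose(restored, tokens)
```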
0
0
cs.DC 2026-05-08 2 theorems

Differential privacy keeps edge ML fast and harder to steal

A Privacy-Preserving Machine Learning Framework for Edge Intelligence: An Empirical Analysis

Real tests find DP matches plaintext speeds with accuracy tradeoffs, while SMC depends on bandwidth and FHE slows by 1000 times.

abstract click to expand
As Edge Intelligence (EI) becomes increasingly prevalent in domains such as smart healthcare, manufacturing, and critical infrastructure, ensuring data privacy while maintaining system efficiency is a growing challenge. This paper presents a new privacy-preserving machine learning (PPML) framework tailored for EI applications, including a four-layer system architecture and training and inference algorithms. We focus on three leading approaches: Differential Privacy (DP), Secure Multi-party Computation (SMC), and Fully Homomorphic Encryption (FHE), and assess their impact on key performance metrics, including model accuracy, response time, and energy consumption. Results from real implementation and extensive trace-based simulations of inference tasks show that DP generally preserves throughput and latency close to plaintext baselines, while accuracy drops with model complexity (up to 35 percent on AlexNet and under 18 percent on LeNet for FordA). SMC performance is driven by communication; network bandwidth and round complexity determine end-to-end latency. For AlexNet, increasing link capacity from 250 Mbps to 500 Mbps reduces latency by about 30 percent. FHE is highly sensitive to model structure and numerical precision bit width, with tighter parameters imposing substantial compute overhead; we observe roughly a 1000 times increase in response time compared to DP. Beyond efficiency, DP shifts the privacy-utility-extractability frontier by reducing the attacker's data efficiency in black-box model stealing, whereas SMC and FHE, while protecting inputs and parameters during inference, require complementary output controls to achieve similar resistance to extraction. These findings provide critical insights into the trade-offs between privacy, performance, and resource efficiency in edge computing scenarios.
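A minimal sketch of why DP adds little runtime overhead relative to SMC and FHE: the mechanism is just clipping plus Gaussian noise on a vector, a few cheap array operations. The clip norm and noise multiplier below are illustrative assumptions, not values from the paper.

# Gaussian mechanism sketch; parameters are assumptions for illustration only.
import numpy as np

def gaussian_mechanism(vec, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(vec)
    clipped = vec * min(1.0, clip_norm / (norm + 1e-12))              # bound sensitivity
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=vec.shape)
    return clipped + noise                                            # negligible extra latency

grad = np.random.default_rng(1).normal(size=1000)
private_grad = gaussian_mechanism(grad)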
0
0
cs.DC 2026-05-08

LLM priors raise DRL task offloading success by over 17%

LLM-Enhanced Deep Reinforcement Learning for Task Offloading in Collaborative Edge Computing

Structured prompts and outcome reflection let the language model guide reinforcement learning to manage unpredictable node failures on low-power edge devices.

Figure from the paper full image
abstract click to expand
Collaborative edge computing uses edge nodes in different locations to execute tasks, necessitating dynamic task offloading decisions to maintain low latency and high reliability, especially under unpredictable node failures. Although deep reinforcement learning (DRL) and large language models (LLMs) have shown promise for task offloading, DRL often suffers from high sample inefficiency and local optima, whereas LLMs struggle with real-time decision-making. To address these limitations, we propose LeDRL, a hybrid decision framework that couples a lightweight LLM with self-attention-enhanced DRL for real-time task offloading. LeDRL constructs structured, context-aware prompts capturing node status, task semantics, and link dynamics to derive high-level strategy priors. These are selectively processed by a self-attention-based alignment module for context-aware policy optimization. A reflective evaluator distills semantic feedback from past trajectories to guide future prompts, enabling more informative and temporally generalizable LLM queries. Extensive experiments show that LeDRL outperforms baselines in task success rate, convergence speed, and real-time responsiveness across diverse network scales, achieving over 17% improvement in success rate. Furthermore, we deploy LeDRL on Jetson-based edge devices using our prototype system CoEdgeSys, demonstrating its robustness and feasibility under resource constraints. Our code is available at: https://github.com/GalleyG5/LeDRL.git.
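A hypothetical sketch of the data flow only: an LLM-derived strategy prior over candidate nodes is blended with the DRL policy's logits before acting. LeDRL uses a self-attention alignment module for this; the fixed blending weight and score arrays below are simplified stand-ins, not the paper's method.

# Fusing an LLM strategy prior with DRL policy logits (simplified stand-in).
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def blended_offloading_policy(drl_logits, llm_prior_scores, alpha=0.3):
    """Mix DRL logits with LLM-derived priors over candidate edge nodes.

    drl_logits:        (k,) scores from the RL policy network
    llm_prior_scores:  (k,) scores parsed from the LLM's structured reply
    alpha:             weight on the prior (assumed, not from the paper)
    """
    probs = (1 - alpha) * softmax(drl_logits) + alpha * softmax(llm_prior_scores)
    return probs / probs.sum()

drl = np.array([1.2, 0.4, -0.5, 0.9])     # four candidate nodes
prior = np.array([0.0, 2.0, -1.0, 0.5])   # LLM prefers node 1
action = int(np.argmax(blended_offloading_policy(drl, prior)))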
0
0
cs.DC 2026-05-08

MLA cache recovers 83% tokens despite position shifts

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

Irminsul hashes content directly and rotates only a 64-dim key slice to deliver 63% prefill energy savings on agentic traffic.

Figure from the paper full image
abstract click to expand
Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention, deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free $c_{KV}$ and a 64-dim $k_r$ correctable in closed form; this structure motivates content-addressed caching as a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a $\delta$-rotation rule for $k_r$. We evaluate three native MLA-MoE deployments - DeepSeek-V2-Lite (16B/2.4B), Kimi Moonlight-16B-A3B, and JoyAI-Flash (48B/3B) - with output-consistency on all three and recovery measured on the two endpoints; Irminsul recovers up to ~83% of prompt tokens above exact-prefix on agentic traffic while delivering 63% prefill energy savings per cache hit. We argue that content-addressed caching belongs in the serving stack as a first-class primitive, not a retrofit over prefix matching.
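A minimal sketch of a closed-form delta-rotation for a RoPE-encoded key slice: a cached rotary part computed at position p_old can be reused at p_new by rotating each 2-dim pair by (p_new - p_old) * theta_i, while the position-free latent needs no correction at all. The 64-dim size matches MLA's rotary part, but the RoPE base and layout below are illustrative assumptions, not Irminsul's code.

# Delta-rotation of a cached rotary key slice, assuming standard RoPE pairing.
import numpy as np

def rope_rotate(k_r, position, base=10000.0):
    """Apply RoPE to k_r (shape (d,)) at an integer position."""
    d = k_r.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)          # (d/2,) per-pair frequencies
    ang = position * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = k_r[0::2], k_r[1::2]
    out = np.empty_like(k_r)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def delta_rotate(k_r_cached, p_old, p_new, base=10000.0):
    """Correct a cached rotary key from p_old to p_new without the raw key."""
    return rope_rotate(k_r_cached, p_new - p_old, base)

rng = np.random.default_rng(0)
k_raw = rng.normal(size=64)
cached = rope_rotate(k_raw, 17)           # stored when the chunk sat at position 17
reused = delta_rotate(cached, 17, 403)    # same content reused at position 403
assert np.allclose(reused, rope_rotate(k_raw, 403), atol=1e-9)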
0
0
cs.DC 2026-05-08

Digital twin framework cuts data center power use with predictions

A Scalable Digital Twin Framework for Energy Optimization in Data Centers

IoT sensors and LSTM forecasts support real-time energy management and better PUE in tested setups.

Figure from the paper full image
abstract click to expand
This study proposes a scalable Digital Twin framework for energy optimization in data centers. The framework integrates IoT-based data acquisition, cloud computing, and machine learning techniques to enable real-time monitoring, forecasting, and intelligent energy management. A controlled small-scale data center environment was developed to monitor variables such as power consumption, temperature, and computational workload. Long Short-Term Memory (LSTM) models were employed to predict energy demand and support operational decision-making. Experimental results demonstrated improvements in energy efficiency, including reductions in power consumption and enhancements in Power Usage Effectiveness (PUE). Despite being evaluated in a constrained environment, the proposed framework demonstrates strong potential as a scalable and cost-effective solution for sustainable data center management.
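A hypothetical sketch of the forecasting piece only: an LSTM that predicts the next power reading from a short window of sensor history. The window length, model size, training loop, and synthetic telemetry below are assumptions for illustration, not the paper's setup.

# Next-step power forecasting with a small LSTM (illustrative sketch).
import torch
import torch.nn as nn

class PowerForecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # predict the next step

# Synthetic stand-in for IoT power telemetry: a slow cycle plus noise.
t = torch.arange(0, 500, dtype=torch.float32)
series = 200 + 50 * torch.sin(t / 24) + 5 * torch.randn_like(t)

window = 24
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

model = PowerForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(50):                        # short demo loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()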
0
0
cs.DC 2026-05-08

EdgeServing cuts SLO violations for multi-DNN edge serving

EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge

A stability score and early-exit choices expand the scheduler's options to lower deadline misses and P95 latency on shared GPUs.

Figure from the paper full image
abstract click to expand
As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario, where each scheduling decision affects the tail latency of all concurrent queues. Existing schedulers rely on local heuristics and fail to capture this global impact, while GPU spatial-sharing approaches sacrifice latency predictability. In this paper, we propose EdgeServing, a deadline-aware multi-DNN serving system for edge devices. EdgeServing adopts time-division GPU sharing with early-exit inference for high inference predictability, and introduces a stability score to quantify how each candidate scheduling decision impacts the future queue status. At runtime, it cohesively selects the model, exit point, and batch size to minimize predicted system-wide SLO impact. Experimental results on multiple hardware platforms show that EdgeServing consistently outperforms representative baselines in both SLO violation ratio and P95 latency, enabled by early-exit mechanism, which expands the scheduling action space under tight latency constraints.
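A hedged sketch of the selection loop such a scheduler implies: enumerate (exit point, batch size) candidates, estimate how much each would delay every concurrent queue, and pick the feasible candidate with the smallest predicted system-wide impact. The occupancy table and scoring rule are assumptions, not EdgeServing's actual stability score.

# Candidate selection under a toy system-wide impact score.
# assumed GPU occupancy (ms) for one batch: (exit_point, batch_size) -> ms
OCCUPANCY_MS = {(1, 1): 4, (1, 4): 9, (2, 1): 7, (2, 4): 15, (3, 1): 11, (3, 4): 24}

def pick_candidate(queue_slack_ms, deadline_ms):
    """queue_slack_ms: remaining slack of each concurrent queue's head request."""
    best, best_score = None, float("inf")
    for (exit_pt, batch), occ in OCCUPANCY_MS.items():
        if occ > deadline_ms:                # would miss this request's own deadline
            continue
        # predicted system-wide impact: slack lost by the other queues, plus per-request cost
        score = sum(max(0.0, occ - slack) for slack in queue_slack_ms) + occ / batch
        if score < best_score:
            best, best_score = (exit_pt, batch), score
    return best

print(pick_candidate(queue_slack_ms=[12, 30, 6], deadline_ms=20))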
0
0
cs.DC 2026-05-07

Dynamic tensor parallelism raises LLM goodput up to 5.3x

Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

Nitsum reconfigures parallelism degree and GPU splits at runtime to meet mixed latency and throughput targets on fixed hardware.

Figure from the paper full image
abstract click to expand
LLM serving is increasingly multi-tenant: the same deployment must handle latency-critical interactive requests and more relaxed background workloads under a fixed GPU budget. This creates a tiered-SLO setting where maximizing overall goodput (requests that satisfy both TTFT and TPOT targets) is challenging because workload mix, request lengths, and load intensity vary over time. Existing systems mainly optimize request-level controls (e.g., queuing and batching) while keeping execution configuration largely static, which limits adaptation under multi-tier contention. We present Nitsum, a distributed LLM serving system that treats tensor parallelism (TP) as a first-class runtime control surface rather than a static deployment choice. Nitsum jointly optimizes TP level, prefill/decode GPU split, and request scheduling. To make frequent TP adaptation practical, Nitsum introduces TP-aware weight reuse and fast KV migration. Experiments on real traces and targeted microbenchmarks show that Nitsum improves SLO-compliant goodput over SoTA by up to 5.3 times.
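A toy sketch of treating tensor parallelism as a runtime knob: given tiered TTFT/TPOT targets, pick the (TP degree, prefill/decode split) whose modeled latencies meet both tiers and whose modeled goodput is highest. The candidate table and latency numbers are assumptions, not Nitsum's planner.

# Runtime selection of a TP configuration from a modeled candidate set.
CANDIDATES = [
    # (tp_degree, prefill_gpus, decode_gpus, ttft_ms, tpot_ms, req_per_s)
    (2, 2, 6, 950, 45, 38),
    (4, 4, 4, 520, 55, 31),
    (8, 4, 4, 310, 70, 24),
]

def pick_config(ttft_slo_ms, tpot_slo_ms):
    feasible = [c for c in CANDIDATES if c[3] <= ttft_slo_ms and c[4] <= tpot_slo_ms]
    if not feasible:
        return max(CANDIDATES, key=lambda c: c[5])   # best effort if nothing fits
    return max(feasible, key=lambda c: c[5])          # maximize modeled goodput

print(pick_config(ttft_slo_ms=600, tpot_slo_ms=60))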
0
0
cs.DC 2026-05-07

Nine-dimension model explains root causes in five of twelve DeFi incidents

Toward a Risk Assessment Framework for Institutional DeFi: A Nine-Dimension Approach

Extending prior taxonomies with composability, comprehension debt, and temporal dynamics captures systemic events that six-dimension methods miss.

abstract click to expand
Decentralized finance (DeFi) protocols now intermediate over USD 100 billion in value, including regulated stablecoins and tokenized assets deployed as collateral, yet no widely adopted framework operationalizes risk assessment at the rigor institutional adoption demands. Existing approaches emphasize protocol-specific parameter optimization or conceptual taxonomies without providing explainable, composability-aware, and structurally independent assessment methodologies. We propose a nine-dimension DeFi risk assessment framework extending the six-dimension taxonomy introduced by Moody's Analytics and Gauntlet with three novel dimensions: composability risk, comprehension debt, and temporal risk dynamics. We additionally introduce a transparency confidence modifier separating assessment reliability from risk severity. The framework is grounded in structural analysis of protocol dependencies conducted through an ontology-based protocol intelligence infrastructure covering more than 8,000 DeFi protocols. We retrospectively analyze 12 major DeFi-related incidents from 2024-2026 representing approximately USD 2.5 billion in direct losses. Five of the 12 incidents require at least one novel dimension for complete root-cause characterization, including the two highest-systemic-impact events in the dataset.
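A hedged sketch of scoring along nine dimensions with a transparency confidence modifier kept separate from severity, as the abstract describes. The dimension labels for the prior taxonomy and the weights are illustrative assumptions, not the framework's values.

# Nine-dimension severity score plus a separate confidence value (illustrative).
DIMENSIONS = [
    "market", "credit", "liquidity", "smart_contract", "governance", "oracle",   # assumed labels for the prior six
    "composability", "comprehension_debt", "temporal_dynamics",                  # the three added dimensions
]

def assess(scores, weights, transparency):
    """scores: dimension -> severity in [0, 1]; transparency in [0, 1] scales
    confidence in the assessment, not the risk severity itself."""
    severity = sum(weights[d] * scores[d] for d in DIMENSIONS) / sum(weights.values())
    return {"severity": round(severity, 3), "confidence": round(transparency, 3)}

example = {d: 0.2 for d in DIMENSIONS}
example["composability"] = 0.9            # e.g. a dependency-chain failure
weights = {d: 1.0 for d in DIMENSIONS}
print(assess(example, weights, transparency=0.6))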
0
0
cs.DC 2026-05-07

Resource model lifts MoE training efficiency 2-3.5X

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

By quantifying memory, compute and communication needs, Piper selects pipelined schedules that cut idle time on large HPC clusters.

Figure from the paper full image
abstract click to expand
Frontier models increasingly adopt Mixture-of-Experts (MoE) architectures to achieve large-model performance at reduced cost. However, training MoE models on HPC platforms is hindered by large memory footprints, frequent large-scale communication across heterogeneous networks, and severe workload imbalance. To characterize these challenges, we develop a mathematical model that quantifies memory, compute, and communication requirements for MoE configurations under various parallelization schemes, verified through micro-benchmarking, code instrumentation, and hardware profiling. Our analysis identifies performance bottlenecks: all-to-all latency at scale from expert parallelism, insufficient compute-communication overlap, low GPU utilization from imbalanced skinny GEMMs, and the absence of platform-aware hybrid parallelization strategies. To address these, we introduce Piper, a framework that leverages resource modeling to identify efficient training strategies for MoE models on target HPC platforms, applying pipeline parallelism with optimized schedules. Piper achieves 2-3.5X higher MFU than state-of-the-art frameworks such as X-MoE, and a novel all-to-all algorithm delivers 1.2-9X the bandwidth of the vendor implementation.
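A back-of-the-envelope sketch of the kind of resource model described: given an MoE configuration and a parallel layout, estimate per-GPU expert memory and per-layer all-to-all volume. The formulas are deliberately simplified assumptions (no activation memory, no overlap), not Piper's verified model.

# Simplified MoE memory and all-to-all volume estimates.
def moe_estimates(hidden, ffn_hidden, n_experts, top_k, tokens_per_gpu,
                  ep, tp, bytes_per_param=2):
    # expert weights: two projections per expert, sharded over EP and TP
    expert_params = n_experts * 2 * hidden * ffn_hidden
    expert_mem_gb = expert_params * bytes_per_param / (ep * tp) / 1e9
    # dispatch + combine all-to-all: each routed token copy crosses EP ranks twice
    a2a_bytes = 2 * tokens_per_gpu * top_k * hidden * bytes_per_param
    return expert_mem_gb, a2a_bytes / 1e9

mem_gb, a2a_gb = moe_estimates(hidden=4096, ffn_hidden=14336, n_experts=64,
                               top_k=2, tokens_per_gpu=8192, ep=8, tp=2)
print(f"expert memory/GPU ~{mem_gb:.1f} GB, all-to-all/layer ~{a2a_gb:.2f} GB")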
0
0
cs.DC 2026-05-07

DPU offload delivers 1.55x speedup when memory-to-comm ratio is high

Communication Offloading on SmartNIC DPUs: A Quantitative Approach

Buddy engine frees host CPU cycles in five applications but reveals 625x DRAM traffic spike without cache-access hardware.

Figure from the paper full image
abstract click to expand
SmartNIC Data Processing Units (DPUs) offer a promising solution for saving high-end CPU resources by offloading tasks to programmable cores near the network interface. In this work, we explore the feasibility of SmartNIC DPUs in supporting an asynchronous communication model called "fire-and-forget", particularly its core message routing service. We design a communication offloading engine called Buddy that decouples communication tasks from the application process. Buddy runs flexibly on SmartNIC DPUs such as the Nvidia BlueField-3 DPU and generic x86 CPUs. Our evaluation results in five applications identify the memory-to-communication ratio as a key predictor of the offloading performance. Host-dominated workloads, such as Quicksilver and Sparse Matrix Transpose, achieved up to 1.55x speedup with communication offloaded to the DPU. We further identify a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU, highlighting a critical need in future SmartNIC designs.
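A sketch in the spirit of the memory-to-communication ratio as an offload predictor: profile bytes touched on the host versus bytes sent over the network, and offload message routing only when the host side dominates. The threshold and profile numbers below are illustrative assumptions, not measurements from the paper.

# Offload decision driven by a memory-to-communication ratio (illustrative).
def should_offload_to_dpu(host_mem_bytes, comm_bytes, threshold=4.0):
    ratio = host_mem_bytes / max(comm_bytes, 1)
    return ratio >= threshold, ratio

apps = {
    "quicksilver_like":   (80e9, 6e9),    # host-dominated: good offload candidate
    "halo_exchange_like": (5e9, 12e9),    # communication-dominated: keep on the host CPU
}
for name, (mem, comm) in apps.items():
    offload, r = should_offload_to_dpu(mem, comm)
    print(f"{name}: ratio={r:.1f} offload={offload}")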
1 0
0
cs.DC 2026-05-07

Satellite AI cuts delays 32 percent with model collaboration

Delay-Aware Large-Small Model Collaboration over LEO Satellite Networks

Small models run locally on remote sensing satellites while large models run on computing satellites, coordinated by smart offloading and routing.

Figure from the paper full image
abstract click to expand
In this paper, we introduce a delay-aware large-small model collaboration scheme for low Earth orbit (LEO) satellite networks, which can balance the computational load among satellites and the communication load across inter-satellite links. Specifically, computationally resource-constrained remote sensing satellites are responsible for data collection and local processing using small models, while collaborating with computing satellites that provide large model processing. To minimize the service delay, we formulate a joint optimization problem for offloading decision and routing strategy design, which is transformed into a decentralized partially observable Markov decision process. To solve the problem, we develop a multi-agent reinforcement learning (MARL)-based algorithm with offline policy training and online bisection search. The offline-trained policy determines routing strategies, while online bisection search iteratively adjusts the offloading decisions. Simulation results demonstrate that the proposed scheme can reduce the service delay by up to 31.85% compared with the benchmarks.
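A hedged sketch of the online bisection idea: with routing fixed by the trained policy, adjust the offloading fraction until local small-model delay and remote large-model delay (link plus queueing) balance. The delay curves below are toy monotone models, not the paper's formulation.

# Bisection over an offloading fraction x in [0, 1] (illustrative).
def local_delay(x, local_rate=40.0, load=50.0):
    return (1 - x) * load / local_rate                      # shrinks as more work is offloaded

def remote_delay(x, link_ms=0.12, remote_rate=120.0, load=50.0):
    return x * load / remote_rate + x * load * link_ms      # grows with the offloaded share

def bisect_offload_fraction(tol=1e-4):
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if local_delay(mid) > remote_delay(mid):            # still local-bound: offload more
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = bisect_offload_fraction()
print(f"offload fraction ~{x:.3f}, balanced delay ~{local_delay(x):.2f}")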
0
0
cs.DC 2026-05-07

CCL-D pinpoints slow and hang anomalies in 4000-GPU clusters within 6 minutes

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

The probe and analyzer combination reduces diagnosis time from hours or days to minutes by monitoring cross-layer metrics in real time.

Figure from the paper full image
abstract click to expand
As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root-cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
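An illustrative sketch of the analyzer's core question: given per-rank timings for the same collective, which rank is the straggler? A robust median/MAD z-score is used here as a stand-in; the real system correlates cross-layer metrics and handles hangs via timeouts, which this toy omits.

# Straggler-rank detection from per-rank collective timings (illustrative).
import statistics

def find_slow_ranks(per_rank_ms, z_threshold=6.0):
    med = statistics.median(per_rank_ms)
    mad = statistics.median([abs(t - med) for t in per_rank_ms]) or 1e-9
    return [r for r, t in enumerate(per_rank_ms)
            if 0.6745 * (t - med) / mad > z_threshold]

timings = [12.1, 11.9, 12.3, 12.0, 58.7, 12.2, 11.8, 12.1]   # rank 4 is slow
print(find_slow_ranks(timings))   # -> [4]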
0
0
cs.DC 2026-05-07 3 theorems

Adaptive HBM split cuts recommender P99 latency 24-38%

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

PPO controller tracks the optimal embedding-KV cache ratio at 32 μs latency, delivering 93-99% SLO satisfaction on a 32-node production cluster without H2D refill traffic on the critical path.

Figure from the paper full image
abstract click to expand
Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30\% latency improvement unrealized. While online reallocation is required to close this gap, naive approaches introduce H2D refill traffic on the critical path, causing P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves $32\,\mathrm{\mu s}$ decision latency while staying within 0.024-0.029 of the offline-optimal ratio; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies under heterogeneous allocations. Evaluations on three production-scale datasets over a 32-node A100 cluster show that HELM reduces P99 latency by 24-38\% over the best static policy and achieves 93.5-99.6\% SLO satisfaction across Steady, Trend, and Burst workloads, significantly outperforming state-of-the-art baselines without sacrificing throughput.
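A hedged sketch of EMB-KV-aware routing: score each node by KV-cache residency for the request's session, embedding locality for its items, and current load, then send the request to the best-scoring node. The weights and node snapshot are illustrative assumptions, not HELM's scheduler.

# Routing score combining KV residency, embedding locality, and load.
def route(request, nodes, w_kv=0.5, w_emb=0.3, w_load=0.2):
    def score(node):
        kv_hit = 1.0 if request["session"] in node["resident_sessions"] else 0.0
        emb_hit = len(request["items"] & node["hot_embeddings"]) / max(len(request["items"]), 1)
        return w_kv * kv_hit + w_emb * emb_hit - w_load * node["load"]
    return max(nodes, key=score)["name"]

nodes = [
    {"name": "n0", "resident_sessions": {"s1"}, "hot_embeddings": {1, 2, 3}, "load": 0.7},
    {"name": "n1", "resident_sessions": set(),  "hot_embeddings": {2, 3, 4}, "load": 0.2},
]
req = {"session": "s1", "items": {2, 3, 9}}
print(route(req, nodes))   # prefers n0, which holds the session's KV cache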
0
0
cs.DC 2026-05-06

Coral cuts multi-LLM serving costs by up to 2.79x on mixed GPUs

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

Joint optimization of allocation and strategies adapts to demand shifts for higher goodput when hardware is limited.

Figure from the paper full image
abstract click to expand
The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\times$ over the best baseline, and delivers up to 2.39$\times$ higher goodput under scarce resource availability.
0
0
cs.DC 2026-05-06

Serverless orchestration breaks in LEO continua

Orchestrating Serverless Applications in the Edge Cloud Space Continuum: What Breaks and What is Next?

Ten assumptions fail under time-varying contacts and resource limits, requiring a new architecture for edge-space-cloud execution

abstract click to expand
Serverless computing has matured into an effective execution model for edge-cloud environments, enabling function-level decomposition, demand-driven scaling, and workflow execution across stable, well-provisioned infrastructure. This success motivates extending it to the edge-cloud-space continuum, where Low Earth Orbit (LEO) constellations are increasingly explored as distributed compute substrates. However, existing serverless orchestration is not directly applicable in this setting, where LEO systems impose time-varying contact graphs, intermittent link availability, and strict feasibility constraints on energy, memory, communication, and operational cost. This article identifies ten broken assumptions in existing serverless orchestration and organizes them into three core challenges: spatiotemporal execution over dynamic graphs, constraint-aware function placement and scaling, and correctness and progress under decentralized and delayed state. It then proposes an architecture that enables robust and efficient serverless execution across the continuum, grounded in these challenges and demonstrated through a representative flood-response use case.
0
0
cs.DC 2026-05-06

ClusterLess cuts edge workflow times by up to 40%

ClusterLess: Deadline-Aware Serverless Workflow Orchestration on Federated Edge Clusters

A coordination layer across federated clusters lifts deadline satisfaction from under 50% to over 90% for concurrent serverless workloads.

Figure from the paper full image
abstract click to expand
The recent convergence of edge computing, serverless execution, and Kubernetes (K8s)-based container orchestration has enabled the processing of application workflows close to data sources. While effective within a single edge cluster, existing schemes do not generalize to federated multi-edge environments, where multiple workflows execute concurrently under strict end-to-end (E2E) deadline constraints. This paper introduces ClusterLess, a deadline-aware serverless workflow orchestration method for federated multi-edge K8s clusters. ClusterLess manages the E2E lifecycle of workflow execution, including dependency analysis, execution mode selection, and resource-aware placement. To this end, it integrates structured intra-cluster orchestration with a leader-selected, super-master-driven inter-cluster coordination layer, determining where and how each workflow function should be executed across the federated edge clusters. We implement ClusterLess using OpenFaaS as the serverless execution substrate and Argo for workflow management, and deploy it on a realistic testbed of six edge clusters comprising 64 heterogeneous edge nodes. Experimental results with concurrent serverless workflows, spanning 18 workload configurations across different input sizes and deadline classes, show that ClusterLess reduces workflow completion time by up to 40%, increases deadline satisfaction from below 50% to over 90%, and confines deadline violations to single-digit seconds compared to four baseline methods.
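A minimal sketch of deadline-aware placement across federated clusters: for each workflow function, estimate completion time on every candidate cluster (queueing plus execution plus inter-cluster transfer) and pick the fastest option that still leaves slack for the remaining stages. The estimates and the greedy strategy are assumptions, not ClusterLess's coordinator.

# Deadline-aware function placement across federated clusters (illustrative).
def place_function(fn_ms, clusters, remaining_ms, deadline_left_ms):
    feasible = []
    for c in clusters:
        eta = c["queue_ms"] + fn_ms / c["speed"] + c["transfer_ms"]
        if eta + remaining_ms <= deadline_left_ms:
            feasible.append((eta, c["name"]))
    if not feasible:                       # no cluster meets the deadline: minimize lateness
        return min(clusters, key=lambda c: c["queue_ms"] + fn_ms / c["speed"] + c["transfer_ms"])["name"]
    return min(feasible)[1]

clusters = [
    {"name": "edge-A", "queue_ms": 40, "speed": 1.0, "transfer_ms": 0},
    {"name": "edge-B", "queue_ms": 5,  "speed": 0.6, "transfer_ms": 25},
]
print(place_function(fn_ms=60, clusters=clusters, remaining_ms=120, deadline_left_ms=300))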
0
0
cs.DC 2026-05-06

Control plane unifies physical neural networks across materials

phys-MCP: A Control Plane for Heterogeneous Physical Neural Networks

It registers diverse material-based computers as standard resources while preserving their speed, reset behavior, and plasticity for edge-to-cloud workflows.

Figure from the paper full image
abstract click to expand
Physical neural networks (PNNs) embed computation directly in material dynamics, including molecular, chemical, biological, photonic, memristive, and mechanical substrates. They are attractive for edge computing, especially at the extreme edge, where computation can be placed at the interface to sensing, actuation, or the physical process itself. However, PNNs are difficult to integrate into edge-cloud software stacks because each substrate exposes distinct interfaces, timing behavior, observability limits, and lifecycle requirements. This paper argues that the missing systems component is a common control plane for heterogeneous PNNs. We present phys-MCP, a substrate-aware orchestration architecture that exposes physical neural substrates as discoverable and invocable resources for edge, fog, and cloud workflows, while preserving their possible placement at the extreme edge. phys-MCP defines a capability model, lifecycle semantics, telemetry interfaces, and digital-twin bindings that retain substrate-specific properties such as latency, resetability, plasticity, and I/O modality. We instantiate the architecture through a prototype with three representative backend classes, an HTTP-backed execution path, and an integrated Cortical Labs adapter exposing a wetware-facing API path through the same control model. The evaluation combines controlled experiments on representative backends with end-to-end validation of the Cortical Labs path. Results show descriptor-portable integration across heterogeneous backends, improved runtime-aware matching over simpler baselines, telemetry-aware recovery under representative faults, successful execution against the API-backed wetware path, and small local control-path overhead. Overall, results provide prototype-level evidence that substrate-aware control can span heterogeneous physical AI resources, twin-backed backends, and a wetware-facing API path.
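A hedged sketch of what a substrate capability descriptor and registry could look like: each physical backend advertises latency, resetability, plasticity, and I/O modality, and the control plane matches a workload request against registered descriptors. The field names and matching rule are illustrative, not phys-MCP's schema.

# Capability descriptor and a toy matching control plane (illustrative).
from dataclasses import dataclass

@dataclass
class SubstrateDescriptor:
    name: str
    modality: str                  # e.g. "photonic", "memristive", "wetware"
    latency_ms: float
    resettable: bool
    plastic: bool                  # whether the substrate adapts online
    io_channels: int

class ControlPlane:
    def __init__(self):
        self.registry: list[SubstrateDescriptor] = []

    def register(self, desc: SubstrateDescriptor):
        self.registry.append(desc)

    def match(self, max_latency_ms: float, needs_reset: bool):
        candidates = [d for d in self.registry
                      if d.latency_ms <= max_latency_ms
                      and (d.resettable or not needs_reset)]
        return min(candidates, key=lambda d: d.latency_ms, default=None)

cp = ControlPlane()
cp.register(SubstrateDescriptor("photonic-0", "photonic", 0.05, True, False, 16))
cp.register(SubstrateDescriptor("wetware-0", "wetware", 20.0, False, True, 8))
print(cp.match(max_latency_ms=1.0, needs_reset=True))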
0

browse all of cs.DC → full archive · search · sub-categories