hub Canonical reference

Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects · 2014 · physics.flu-dyn · arXiv 1406.2024

Canonical reference. 92% of citing Pith papers cite this work as background.

36 Pith papers citing it

Background 92% of classified citations

abstract

We propose a Hilfer advection-diffusion equation of order $0<\alpha<1$ and type $0\leq\beta\leq1$, and find its power series solution using the variational iteration method. The power series solutions are expressed in a form that is easy to implement numerically, and in some particular cases they are expressed in terms of the Mittag-Leffler function. We prove absolute convergence of the power series solutions and discuss the sensitivity of the solutions to changes in the parameter values. For power-law initial conditions, we show that the Hilfer advection-diffusion PDE gives the same solutions as the Caputo and Riemann-Liouville advection-diffusion PDEs. To leading order, the ratio of the fractional solution to the non-fractional solution at a given time $t$ increases rapidly with $\alpha$ for $\alpha > 0.7$, but for $\alpha<0.7$ this factor is only weakly sensitive to $\alpha$. We also show that the truncation errors that arise when a partial sum is used as the approximate solution decay exponentially fast with the number of terms $n$. For $\alpha< 0.7$ the number of terms needed is weakly sensitive to the accuracy level and to the fractional order, $n\approx 20$; for $\alpha>0.7$ the required number of terms increases rapidly with both the accuracy level and the fractional order $\alpha$.
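
The abstract's remarks on Mittag-Leffler solutions and on truncating the series after $n$ terms can be illustrated with a minimal sketch. It is not the paper's variational iteration scheme. It only evaluates partial sums of the standard one-parameter Mittag-Leffler function $E_\alpha(z)=\sum_{k\ge 0} z^k/\Gamma(\alpha k+1)$ and counts how many terms are needed to reach a tolerance. The function names `mittag_leffler_partial` and `terms_needed`, the 400-term reference sum, and the sample values of `z`, `alpha`, and `tol` are all assumptions chosen for illustration.

```python
import math


def mittag_leffler_partial(z: float, alpha: float, n_terms: int) -> float:
    """Sum the first n_terms terms of E_alpha(z) = sum_k z**k / Gamma(alpha*k + 1)."""
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if z == 0.0:
        # Only the k = 0 term is nonzero when z = 0.
        return 1.0 if n_terms > 0 else 0.0
    total = 0.0
    log_abs_z = math.log(abs(z))
    for k in range(n_terms):
        # Work in log space so Gamma(alpha*k + 1) cannot overflow for large k.
        log_magnitude = k * log_abs_z - math.lgamma(alpha * k + 1.0)
        sign = -1.0 if (z < 0.0 and k % 2 == 1) else 1.0
        total += sign * math.exp(log_magnitude)
    return total


def terms_needed(z: float, alpha: float, tol: float,
                 reference_terms: int = 400, max_terms: int = 400):
    """Return the first n whose partial sum is within tol of a long reference sum."""
    reference = mittag_leffler_partial(z, alpha, reference_terms)
    for n in range(1, max_terms + 1):
        if abs(mittag_leffler_partial(z, alpha, n) - reference) <= tol:
            return n
    return None  # tolerance not reached within max_terms


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for alpha in (0.3, 0.5, 0.7, 0.9):
        print(alpha, terms_needed(z=1.0, alpha=alpha, tol=1e-8))
```

For $\alpha=1$ the series reduces to the exponential, which gives a simple sanity check of the partial sums. The term counts this sketch prints concern the Mittag-Leffler series alone and should not be read as reproducing the paper's reported values.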

citation-role summary

background 12 baseline 1

citation-polarity summary

background 12 baseline 1

representative citing papers

Concepts in Practice: C++ MPI Bindings for the HPC Ecosystem. From a Standardizable Core to a Composable Interface

cs.DC · 2026-06-08 · unverdicted · novelty 7.0

This work provides a concrete layered C++ MPI binding using C++20 concepts, with a core extensible layer and adapters for GPU and portability libraries, backed by an open-source implementation.

Iceberg Beyond the Tip: Co-Compilation of a Quantum Error Detection Code and a Quantum Algorithm

quant-ph · 2025-04-29 · unverdicted · novelty 7.0

Co-optimization of flexible Iceberg error-detection gadgets with QAOA via tree search improves success probability and post-selection on Quantinuum H2-1 hardware up to 34 algorithmic qubits.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

Unfolding an Atomistic World: Atomistic Simulation of Reactor Pressure Vessel Steel Across Year-and-Meter Scales

cs.DC · 2026-04-27 · unverdicted · novelty 7.0 · 2 refs

AtomWorld enables the first direct atomistic simulation of RPV steel at year-and-meter scales, handling ten-quintillion-atom systems and simulating one service year in 1.71 days with 92-97% scaling efficiency on leadership supercomputers.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

cs.PF · 2026-04-11 · unverdicted · novelty 7.0 · 5 refs

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

Multi-Agent Orchestration for High-Throughput Materials Screening on a Leadership-Class System

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

A planner-executor multi-agent system using gpt-oss-120b and Parsl orchestrates scalable high-throughput MOF screening on the Aurora supercomputer with low overhead.

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

cs.DC · 2026-04-08 · unverdicted · novelty 7.0 · 2 refs

NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

General circuit mapping algorithm for neutral atom quantum computers

quant-ph · 2026-06-18 · unverdicted · novelty 6.0

A graph-theoretic nonlinear integer program solved via genetic algorithm reduces qubit transfers in neutral atom quantum circuit compilation compared to prior zoned-architecture compilers.

Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs

cs.DC · 2026-06-01 · unverdicted · novelty 6.0 · 2 refs

A communication-aware multi-GPU distribution approach for tensor network contraction reports 7-173x extra speedup over slicing on 8 H100 GPUs and 42x to 67,869x on 1024 GPUs.

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.

Co-Design Optimization for Data Center Cooling System via Digital Twin

eess.SY · 2026-05-15 · accept · novelty 6.0

A three-layer co-design optimization using digital twins and surrogate modeling for CDU partitioning and flow control in HPC cooling plants achieves 35.48% annual energy savings, nearly matching the current Frontier design while reducing assignment sensitivity by 93%.

CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

cs.DC · 2026-02-25 · unverdicted · novelty 6.0

CCCL delivers 1.34-1.94x faster cross-node GPU collectives via CXL memory pooling than 200 Gbps InfiniBand RDMA, with 1.11x LLM training speedup and 2.75x hardware cost reduction.

EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows

cs.SE · 2026-02-25 · unverdicted · novelty 6.0

EditFlow reconstructs temporal developer editing flows from code changes to benchmark and optimize AI code edit recommenders so they align with natural incremental reasoning rather than static snapshots.

SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication

cs.DC · 2025-12-23 · unverdicted · novelty 6.0

SHIRO achieves geometric mean speedups of 221.5x to 8.8x over four baselines in distributed SpMM on up to 128 GPUs by exploiting sparsity patterns and two-tier network topologies.

PICO: Performance Insights for Collective Operations

cs.DC · 2025-08-22 · unverdicted · novelty 6.0

PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44% end-to-end training time reductions in simulator replays.

Tensor-Parallel Emulation of Quantum Circuits with Block-Cyclic Distributed Matrix Product States

cs.DC · 2025-05-09 · unverdicted · novelty 6.0

Presents a tensor-parallel distributed MPS method with block-cyclic partitioning and pivoted QR that emulates Google's RCS benchmark at bond dimension 16384 on 32 nodes, claiming three orders of magnitude better accuracy than prior methods.

Fast MoE Inference via Predictive Prefetching and Expert Replication

cs.LG · 2026-05-12 · conditional · novelty 6.0

Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.

Stencil Computations on Cerebras Wafer-Scale Engine

cs.DC · 2026-05-08 · unverdicted · novelty 6.0

CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

cs.DC · 2026-05-06 · unverdicted · novelty 6.0

HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

cs.CE · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.

Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs

cs.DC · 2026-04-17 · unverdicted · novelty 6.0

A new distributed framework for graph transformer training auto-selects parallel strategies and optimizes sparse operations to deliver up to 6x speedup on 8 GPUs and 78% memory reduction.

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

cs.DC · 2026-06-10 · unverdicted · novelty 5.0

Eidola is a gem5 extension that emulates cycle-level peer-to-peer GPU writes via real-application timing profiles to simulate traffic and synchronization in multi-GPU AI systems.

citing papers explorer

Showing 11 of 11 citing papers after filters.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding cs.DC · 2026-05-12 · unverdicted · none · ref 7
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
Unfolding an Atomistic World: Atomistic Simulation of Reactor Pressure Vessel Steel Across Year-and-Meter Scales cs.DC · 2026-04-27 · unverdicted · none · ref 47 · 2 links
AtomWorld enables the first direct atomistic simulation of RPV steel at year-and-meter scales, handling ten-quintillion-atom systems and simulating one service year in 1.71 days with 92-97% scaling efficiency on leadership supercomputers.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding cs.PF · 2026-04-11 · unverdicted · none · ref 40 · 5 links
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining cs.DC · 2026-04-08 · unverdicted · none · ref 18 · 2 links
NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods cs.DC · 2026-04-02 · unverdicted · none · ref 31
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
Fast MoE Inference via Predictive Prefetching and Expert Replication cs.LG · 2026-05-12 · conditional · none · ref 23
Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
Stencil Computations on Cerebras Wafer-Scale Engine cs.DC · 2026-05-08 · unverdicted · none · ref 18
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving cs.DC · 2026-05-06 · unverdicted · none · ref 12
HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.
Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels cs.CE · 2026-04-20 · unverdicted · none · ref 46 · 2 links
A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.
Preserving Clusters in Error-Bounded Lossy Compression of Particle Data cs.LG · 2026-04-20 · unverdicted · none · ref 27
A clustering-aware correction algorithm using spatial partitioning and projected gradient descent preserves single-linkage clusters in lossy-compressed particle data while keeping competitive compression ratios.
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning eess.SY · 2026-04-08 · unverdicted · none · ref 30
High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
