hub

Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna · 2023 · arXiv 7527.2023

18 Pith papers cite this work. Polarity classification is still indexing.

18 Pith papers citing it

read on arXiv browse 18 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

cs.AR · 2026-02-16 · unverdicted · novelty 8.0

TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.

PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer

cs.DC · 2026-06-05 · unverdicted · novelty 7.0

PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.

Bridge: Optimizing Collective Communication Schedules in Reconfigurable Networks with Reusable Subrings

cs.NI · 2026-05-12 · conditional · novelty 7.0

Bridge reduces All-to-All completion time by typically 3x to 10x and improves AllReduce by up to 6.6x over Ring by reusing optical subrings across multiple steps in reconfigurable networks.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge

cs.AR · 2025-09-09 · unverdicted · novelty 7.0

FlexiFlow optimizes carbon footprint for item-level intelligence on flexible electronics by modeling lifetime variation, delivering 1.62X microarchitectural and 14.5X algorithmic reductions plus a 30.9 kHz tape-out.

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

cs.AR · 2026-05-21 · unverdicted · novelty 6.0

ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

cs.DC · 2026-05-15 · conditional · novelty 6.0

PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up to 52x energy efficiency differences.

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

cs.DC · 2026-04-13 · unverdicted · novelty 6.0

R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.

DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

cs.AR · 2026-04-06 · conditional · novelty 6.0

DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.

ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling

cs.DC · 2026-06-09 · unverdicted · novelty 5.0

ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

cs.CR · 2026-05-19 · unverdicted · novelty 5.0

Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

cs.DC · 2026-05-16 · unverdicted · novelty 5.0 · 2 refs

Charon is a unified modular simulator that predicts LLM training and inference performance with under 5.35% error and identifies throughput improvements over baselines in a real deployment case.

Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML

cs.DC · 2026-04-19 · unverdicted · novelty 5.0

Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.

Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads

cs.DC · 2026-06-08 · unverdicted · novelty 4.0

A method using shared-memory occupancy shaping and elevated communication priority achieves up to 25.5% faster multi-GPU ML execution on NVIDIA and AMD GPUs.

Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training

cs.PF · 2026-05-18 · unverdicted · novelty 3.0

Discrete-event simulation finds optimal 10-100 km separation between AI clusters where hollow-core fiber provides 25% higher compute-communication overlap in geo-distributed data-parallel training.

Evaluating SYCL as a Unified Programming Model for Heterogeneous Systems

cs.DC · 2026-04-17 · unverdicted · novelty 3.0

Current SYCL implementations show inconsistencies in memory management (USM vs buffers) and kernel models (NDRange vs hierarchical) that reduce cross-platform reliability.

The EDGE Language: Extended General Einsums for Graph Algorithms

cs.DS · 2024-04-17

citing papers explorer

Showing 18 of 18 citing papers.

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design cs.AR · 2026-02-16 · unverdicted · none · ref 26
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer cs.DC · 2026-06-05 · unverdicted · none · ref 61
PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.
Bridge: Optimizing Collective Communication Schedules in Reconfigurable Networks with Reusable Subrings cs.NI · 2026-05-12 · conditional · none · ref 34
Bridge reduces All-to-All completion time by typically 3x to 10x and improves AllReduce by up to 6.6x over Ring by reusing optical subrings across multiple steps in reconfigurable networks.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods cs.DC · 2026-04-02 · unverdicted · none · ref 110
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge cs.AR · 2025-09-09 · unverdicted · none · ref 86
FlexiFlow optimizes carbon footprint for item-level intelligence on flexible electronics by modeling lifetime variation, delivering 1.62X microarchitectural and 14.5X algorithmic reductions plus a 30.9 kHz tape-out.
ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration cs.AR · 2026-05-21 · unverdicted · none · ref 30
ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.
A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM cs.DC · 2026-05-15 · conditional · none · ref 32
PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.
EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization cs.LG · 2026-05-14 · unverdicted · none · ref 15
EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up to 52x energy efficiency differences.
Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search cs.DC · 2026-04-13 · unverdicted · none · ref 41
R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.
DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators cs.AR · 2026-04-06 · conditional · none · ref 108
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.
ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling cs.DC · 2026-06-09 · unverdicted · none · ref 51
ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.
Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM cs.CR · 2026-05-19 · unverdicted · none · ref 26
Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference cs.DC · 2026-05-16 · unverdicted · none · ref 17 · 2 links
Charon is a unified modular simulator that predicts LLM training and inference performance with under 5.35% error and identifies throughput improvements over baselines in a real deployment case.
Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML cs.DC · 2026-04-19 · unverdicted · none · ref 32
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads cs.DC · 2026-06-08 · unverdicted · none · ref 19
A method using shared-memory occupancy shaping and elevated communication priority achieves up to 25.5% faster multi-GPU ML execution on NVIDIA and AMD GPUs.
Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training cs.PF · 2026-05-18 · unverdicted · none · ref 3
Discrete-event simulation finds optimal 10-100 km separation between AI clusters where hollow-core fiber provides 25% higher compute-communication overlap in geo-distributed data-parallel training.
Evaluating SYCL as a Unified Programming Model for Heterogeneous Systems cs.DC · 2026-04-17 · unverdicted · none · ref 25
Current SYCL implementations show inconsistencies in memory management (USM vs buffers) and kernel models (NDRange vs hierarchical) that reduce cross-platform reliability.
The EDGE Language: Extended General Einsums for Graph Algorithms cs.DS · 2024-04-17 · unreviewed · ref 31

Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer