ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna · 2023 · arXiv 7527.2023

20 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

cs.AR · 2026-02-16 · unverdicted · novelty 8.0

TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.

PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer

cs.DC · 2026-06-05 · unverdicted · novelty 7.0

PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.

Bridge: Optimizing Collective Communication Schedules in Reconfigurable Networks with Reusable Subrings

cs.NI · 2026-05-12 · conditional · novelty 7.0

Bridge reduces All-to-All completion time by typically 3x to 10x and improves AllReduce by up to 6.6x over Ring by reusing optical subrings across multiple steps in reconfigurable networks.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge

cs.AR · 2025-09-09 · unverdicted · novelty 7.0

FlexiFlow optimizes carbon footprint for item-level intelligence on flexible electronics by modeling lifetime variation, delivering 1.62X microarchitectural and 14.5X algorithmic reductions plus a 30.9 kHz tape-out.

KernelSight-LM: A Kernel-Level LLM Inference Simulator

cs.PF · 2026-06-26 · unverdicted · novelty 6.0

KernelSight-LM simulates token-level LLM inference to predict per-kernel latencies and end-to-end metrics (TTFT, TPOT, throughput) with 12.1% and 3.8% kernel errors in cross-generation and target-measured tiers.

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

cs.AR · 2026-05-21 · unverdicted · novelty 6.0

ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

cs.DC · 2026-05-15 · conditional · novelty 6.0

PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up to 52x energy efficiency differences.

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

cs.DC · 2026-04-13 · unverdicted · novelty 6.0

R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.

DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

cs.AR · 2026-04-06 · conditional · novelty 6.0

DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.

ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling

cs.DC · 2026-06-09 · unverdicted · novelty 5.0

ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.

Revisiting Bruck: Phase-Efficient All-to-All Communication in Reconfigurable Networks

cs.DC · 2026-05-26 · unverdicted · novelty 5.0

ReTri achieves all-to-all in ⌈log₃ n⌉ phases for ORNs by co-designing bidirectional exchanges and reconfiguration strategy, with simulations showing up to 10× improvement over static and 2.1× over prior reconfigurable Bruck.

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

cs.CR · 2026-05-19 · unverdicted · novelty 5.0

Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

cs.DC · 2026-05-16 · unverdicted · novelty 5.0 · 2 refs

Charon is a unified modular simulator that predicts LLM training and inference performance with under 5.35% error and identifies throughput improvements over baselines in a real deployment case.

Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML

cs.DC · 2026-04-19 · unverdicted · novelty 5.0

Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.

Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads

cs.DC · 2026-06-08 · unverdicted · novelty 4.0

A method using shared-memory occupancy shaping and elevated communication priority achieves up to 25.5% faster multi-GPU ML execution on NVIDIA and AMD GPUs.

Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training

cs.PF · 2026-05-18 · unverdicted · novelty 3.0

Discrete-event simulation finds optimal 10-100 km separation between AI clusters where hollow-core fiber provides 25% higher compute-communication overlap in geo-distributed data-parallel training.

Evaluating SYCL as a Unified Programming Model for Heterogeneous Systems

cs.DC · 2026-04-17 · unverdicted · novelty 3.0

Current SYCL implementations show inconsistencies in memory management (USM vs buffers) and kernel models (NDRange vs hierarchical) that reduce cross-platform reliability.

The EDGE Language: Extended General Einsums for Graph Algorithms

cs.DS · 2024-04-17

citing papers explorer

Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge

cs.AR · 2025-09-09 · unverdicted · none · ref 86

FlexiFlow optimizes carbon footprint for item-level intelligence on flexible electronics by modeling lifetime variation, delivering 1.62X microarchitectural and 14.5X algorithmic reductions plus a 30.9 kHz tape-out.
