hub Canonical reference

Tacos: Topology-aware collective algorithm synthesizer for distributed machine learning

William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Swati Gupta, Tushar Krishna · 2024 · arXiv 1859.2024

Canonical reference. 100% of citing Pith papers cite this work as background.

26 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 26 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8

citation-polarity summary

background 8

representative citing papers

Enabling AI ASICs for Zero Knowledge Proof

cs.AR · 2026-04-20 · conditional · novelty 8.0

MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.

SegFold: Accelerating Sparse GEMM with a Fine-Grained Dynamic Dataflow

cs.AR · 2026-06-25 · unverdicted · novelty 7.0

SegFold achieves 1.95× geometric-mean speedup over prior SpGEMM accelerators via fine-grained dynamic scheduling and remapping in its Segment dataflow.

Unprivileged Topology Certificates for Cloud GPU Attestation

cs.CR · 2026-06-22 · unverdicted · novelty 7.0

CUDA latency matrix measurements produce unprivileged certificates that fingerprint individual GPU dies, recover cross-generation topology, and bind to datacenter location via public network probes.

Bifrost: Hybrid TEE-FHE Inference for Privacy-Preserving Transformer and LLM Serving

cs.CR · 2026-06-16 · unverdicted · novelty 7.0

Bifrost achieves significant latency reductions in privacy-preserving transformer inference through a hybrid CPU TEE and accelerator FHE design, with Bifrost+ further optimizing via prefill/decode split.

ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

cs.AR · 2026-05-15 · unverdicted · novelty 7.0 · 2 refs

ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.

Enhancing Instruction Prefetching via Cache and TLB Management

cs.AR · 2026-05-12 · unverdicted · novelty 7.0

IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.

A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

cs.AR · 2026-03-30 · unverdicted · novelty 7.0

SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators

cs.AR · 2026-03-27 · unverdicted · novelty 7.0

A new NoC with Direct Compute Access delivers 5.3x multicast and 2.8x reduction speedups, yielding up to 3.8x performance gains and 1.17x energy savings versus baseline unicast designs in GEMM workloads.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

Residual GPU Cache State on Apple M4 Pro

cs.AR · 2026-06-25 · unverdicted · novelty 6.0

Characterizes a reproducible post-GPU cache-displacement window on M4 Pro and quantifies a one-pass CPU recovery mechanism via Metal experiments and PMU data.

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

cs.DC · 2026-06-11 · unverdicted · novelty 6.0

GF-DiT introduces elastic GPU parallelism scheduling for DiT serving via asynchronous trajectory tasks and group-free collectives, reporting up to 6.01x throughput gains over static configurations.

WHET: Welding Homomorphic Encryption to Accelerator Architectures

cs.CR · 2026-06-10 · unverdicted · novelty 6.0

WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

cs.AR · 2026-05-21 · unverdicted · novelty 6.0

ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

cs.DC · 2026-05-15 · conditional · novelty 6.0

PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

cs.DC · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

cs.AR · 2026-05-10 · unverdicted · novelty 6.0

KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

Stencil Computations on Cerebras Wafer-Scale Engine

cs.DC · 2026-05-08 · unverdicted · novelty 6.0

CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.

LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing

cs.DC · 2026-04-21 · unverdicted · novelty 6.0

LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.

Proxics: an efficient programming model for far memory accelerators

cs.OS · 2026-04-20 · conditional · novelty 6.0

Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.

The xPU-athalon: Quantifying the Competition of AI Acceleration

cs.AR · 2026-04-12 · unverdicted · novelty 6.0

Quantitative benchmarks across recent AI accelerators reveal that optimal hardware choice varies with workload parameters and that several platforms incur substantially higher idle power than GPUs.

Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

cs.AR · 2026-04-04 · unverdicted · novelty 6.0

Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.

AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems

cs.CR · 2026-04-03 · unverdicted · novelty 6.0

AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.

ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling

cs.DC · 2026-06-09 · unverdicted · novelty 5.0

ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.

Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML

cs.DC · 2026-04-19 · unverdicted · novelty 5.0

Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.

citing papers explorer

Showing 26 of 26 citing papers.

Enabling AI ASICs for Zero Knowledge Proof cs.AR · 2026-04-20 · conditional · none · ref 34
MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.
SegFold: Accelerating Sparse GEMM with a Fine-Grained Dynamic Dataflow cs.AR · 2026-06-25 · unverdicted · none · ref 11
SegFold achieves 1.95× geometric-mean speedup over prior SpGEMM accelerators via fine-grained dynamic scheduling and remapping in its Segment dataflow.
Unprivileged Topology Certificates for Cloud GPU Attestation cs.CR · 2026-06-22 · unverdicted · none · ref 8
CUDA latency matrix measurements produce unprivileged certificates that fingerprint individual GPU dies, recover cross-generation topology, and bind to datacenter location via public network probes.
Bifrost: Hybrid TEE-FHE Inference for Privacy-Preserving Transformer and LLM Serving cs.CR · 2026-06-16 · unverdicted · none · ref 12
Bifrost achieves significant latency reductions in privacy-preserving transformer inference through a hybrid CPU TEE and accelerator FHE design, with Bifrost+ further optimizing via prefill/decode split.
ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions cs.AR · 2026-05-15 · unverdicted · none · ref 13 · 2 links
ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
Enhancing Instruction Prefetching via Cache and TLB Management cs.AR · 2026-05-12 · unverdicted · none · ref 43
IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network cs.AR · 2026-03-30 · unverdicted · none · ref 69
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators cs.AR · 2026-03-27 · unverdicted · none · ref 4
A new NoC with Direct Compute Access delivers 5.3x multicast and 2.8x reduction speedups, yielding up to 3.8x performance gains and 1.17x energy savings versus baseline unicast designs in GEMM workloads.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators cs.AI · 2025-11-05 · unverdicted · none · ref 19
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
Residual GPU Cache State on Apple M4 Pro cs.AR · 2026-06-25 · unverdicted · none · ref 10
Characterizes a reproducible post-GPU cache-displacement window on M4 Pro and quantifies a one-pass CPU recovery mechanism via Metal experiments and PMU data.
GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving cs.DC · 2026-06-11 · unverdicted · none · ref 41
GF-DiT introduces elastic GPU parallelism scheduling for DiT serving via asynchronous trajectory tasks and group-free collectives, reporting up to 6.01x throughput gains over static configurations.
WHET: Welding Homomorphic Encryption to Accelerator Architectures cs.CR · 2026-06-10 · unverdicted · none · ref 105
WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.
ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration cs.AR · 2026-05-21 · unverdicted · none · ref 2
ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.
A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM cs.DC · 2026-05-15 · conditional · none · ref 1
PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces cs.DC · 2026-05-11 · unverdicted · none · ref 101 · 2 links
Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving cs.AR · 2026-05-10 · unverdicted · none · ref 35
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
Stencil Computations on Cerebras Wafer-Scale Engine cs.DC · 2026-05-08 · unverdicted · none · ref 3
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing cs.DC · 2026-04-21 · unverdicted · none · ref 50
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
Proxics: an efficient programming model for far memory accelerators cs.OS · 2026-04-20 · conditional · none · ref 28
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
The xPU-athalon: Quantifying the Competition of AI Acceleration cs.AR · 2026-04-12 · unverdicted · none · ref 31
Quantitative benchmarks across recent AI accelerators reveal that optimal hardware choice varies with workload parameters and that several platforms incur substantially higher idle power than GPUs.
Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models cs.AR · 2026-04-04 · unverdicted · none · ref 29
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems cs.CR · 2026-04-03 · unverdicted · none · ref 21
AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling cs.DC · 2026-06-09 · unverdicted · none · ref 50
ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.
Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML cs.DC · 2026-04-19 · unverdicted · none · ref 31
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
PIM-CACHE: High-Efficiency Content-Aware Copy for Processing-In-Memory cs.ET · 2026-03-24 · unverdicted · none · ref 38
PIM-CACHE reduces mandatory coarse-grained transfers in UPMEM-style PIM by dynamically staging only non-redundant data via content-aware copy that exploits workload similarity.
ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories cs.DC · 2026-06-10 · unverdicted · none · ref 7
ITME uses CXL-hybrid memories for byte-addressable remote memory expansion in LLM inference, achieving up to 35.7% throughput improvement over conventional CPU-offloading.

Tacos: Topology-aware collective algorithm synthesizer for distributed machine learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer