MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.
hub Canonical reference
Tacos: Topology-aware collective algorithm synthesizer for distributed machine learning
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 8polarities
background 8representative citing papers
SegFold achieves 1.95× geometric-mean speedup over prior SpGEMM accelerators via fine-grained dynamic scheduling and remapping in its Segment dataflow.
CUDA latency matrix measurements produce unprivileged certificates that fingerprint individual GPU dies, recover cross-generation topology, and bind to datacenter location via public network probes.
Bifrost achieves significant latency reductions in privacy-preserving transformer inference through a hybrid CPU TEE and accelerator FHE design, with Bifrost+ further optimizing via prefill/decode split.
ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
A new NoC with Direct Compute Access delivers 5.3x multicast and 2.8x reduction speedups, yielding up to 3.8x performance gains and 1.17x energy savings versus baseline unicast designs in GEMM workloads.
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
Characterizes a reproducible post-GPU cache-displacement window on M4 Pro and quantifies a one-pass CPU recovery mechanism via Metal experiments and PMU data.
GF-DiT introduces elastic GPU parallelism scheduling for DiT serving via asynchronous trajectory tasks and group-free collectives, reporting up to 6.01x throughput gains over static configurations.
WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.
ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.
PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.
Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
Quantitative benchmarks across recent AI accelerators reveal that optimal hardware choice varies with workload parameters and that several platforms incur substantially higher idle power than GPUs.
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
citing papers explorer
-
Enhancing Instruction Prefetching via Cache and TLB Management
IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
Stencil Computations on Cerebras Wafer-Scale Engine
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
-
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
-
Proxics: an efficient programming model for far memory accelerators
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
-
Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
-
AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.