MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.
IEEE Computer Society, 338–351
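To make the reformulation concrete: a length-n NTT over GF(p) is multiplication by the matrix W[k, j] = w^(jk) mod p, where w is a primitive n-th root of unity, so a batch of NTTs collapses into a single integer GEMM. Below is a minimal NumPy sketch of that identity, not MORPH's kernel, which also has to fit the modular arithmetic into the TPU's native operand widths.

```python
import numpy as np

# Minimal sketch (not MORPH's actual kernel): the size-n NTT over GF(p)
# is the matrix-vector product X = W @ x (mod p) with W[k, j] = w^(jk),
# where w is a primitive n-th root of unity mod p. A batch of NTTs is
# therefore one integer GEMM, the operation shape TPUs execute best.
p, n = 257, 8                      # NTT-friendly prime: n divides p - 1
w = pow(3, (p - 1) // n, p)        # 3 generates GF(257)*, so w has order n

# Vandermonde-style NTT matrix W[k, j] = w^(jk) mod p.
W = np.array([[pow(w, j * k % n, p) for j in range(n)] for k in range(n)],
             dtype=np.int64)

batch = np.random.randint(0, p, size=(n, 4)).astype(np.int64)  # 4 polynomials
ntt = (W @ batch) % p                                          # one GEMM = 4 NTTs

# Cross-check column 0 against the NTT definition.
ref = [sum(int(batch[j, 0]) * pow(w, j * k, p) for j in range(n)) % p
       for k in range(n)]
assert list(ntt[:, 0]) == ref
```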
2026: 12 representative citing papers
IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
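The translation-prefetch-buffer idea, in a heavily simplified Python sketch (the buffer, sizes, and flow here are assumptions for illustration; the paper's trimodal replacement policy is not modeled):

```python
# Heavily simplified sketch of a translation prefetch buffer (buffer,
# policy, and flow are assumptions for illustration; IP-CaT's trimodal
# replacement policy is not modeled): when the frontend predicts the
# next instruction block, its translation is walked ahead of time into
# a side buffer, so the eventual demand access fills the TLB cheaply.
PAGE = 4096
tlb = {}             # demand-filled iTLB: vpn -> ppn
prefetch_buf = {}    # small translation prefetch buffer

def walk_page_table(vpn: int) -> int:
    return vpn ^ 0xABCD              # stand-in for a real page walk

def prefetch_translation(next_pc: int) -> None:
    """Issued alongside an L1I prefetch for the predicted next block."""
    vpn = next_pc // PAGE
    if vpn not in tlb:
        prefetch_buf[vpn] = walk_page_table(vpn)

def translate(pc: int) -> str:
    vpn = pc // PAGE
    if vpn in tlb:
        return "tlb hit"
    if vpn in prefetch_buf:          # translation arrived ahead of demand
        tlb[vpn] = prefetch_buf.pop(vpn)
        return "prefetch-buffer hit (cheap)"
    tlb[vpn] = walk_page_table(vpn)  # demand miss: full page walk
    return "page walk (expensive)"

prefetch_translation(0x40_2000)      # frontend predicts the next block
print(translate(0x40_2000))          # -> prefetch-buffer hit (cheap)
print(translate(0x40_2000))          # -> tlb hit
```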
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
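A toy sketch of the quantization half of the design, under the assumption of a single shared scale per reduction (the in-switch DMA path itself cannot be reproduced in software):

```python
import numpy as np

# Toy illustration of the quantization half of the design (the in-switch
# DMA path cannot be reproduced in software, and this rounding scheme is
# an assumption, not SCIN's): workers send int8 payloads, the network
# stage accumulates them in a wider integer type, and dequantization
# happens once at the end, so each hop carries 4x fewer bytes than fp32.
def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(1024).astype(np.float32) for _ in range(8)]
scale = max(float(np.abs(g).max()) for g in grads) / 127.0  # shared scale

reduced = np.zeros(1024, dtype=np.int32)        # "in-switch" accumulator
for g in grads:
    reduced += quantize_int8(g, scale).astype(np.int32)

approx = reduced.astype(np.float32) * scale     # dequantize once
exact = np.sum(grads, axis=0)
print("max abs quantization error:", float(np.abs(approx - exact).max()))
```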
Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
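A minimal sketch of what a graph-based execution trace looks like and what it enables; the field names are illustrative stand-ins, not Chakra's actual schema:

```python
from dataclasses import dataclass, field

# Schematic of a graph-based execution trace in Chakra's spirit (the
# field names are illustrative stand-ins, not the real schema): every
# node is a compute or communication op with explicit dependencies, so
# one trace can feed simulators, replayers, and analysis tools alike.
@dataclass
class TraceNode:
    node_id: int
    kind: str                    # "compute" or "comm"
    name: str                    # e.g. "matmul", "all_reduce"
    duration_us: float
    deps: list = field(default_factory=list)   # upstream node ids

trace = [
    TraceNode(0, "compute", "embedding_lookup", 120.0),
    TraceNode(1, "compute", "matmul_qkv", 310.0, deps=[0]),
    TraceNode(2, "comm", "all_reduce_grads", 540.0, deps=[1]),
]

# One thing such a trace enables: critical-path analysis.
finish = {}
for node in trace:               # nodes assumed topologically ordered
    start = max((finish[d] for d in node.deps), default=0.0)
    finish[node.node_id] = start + node.duration_us
print("critical path (us):", max(finish.values()))
```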
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
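Block paging is easiest to see in miniature. The sketch below is generic paged-KV bookkeeping in the style popularized by vLLM's PagedAttention, not KV-RM's implementation, whose merge-staged transport is not modeled:

```python
import numpy as np

# Generic block-paged KV-cache bookkeeping (not KV-RM's implementation):
# the cache is a pool of fixed-size blocks and each sequence keeps a
# block table, so variable-length decoding grows block by block instead
# of resizing contiguous buffers.
BLOCK_TOKENS, NUM_BLOCKS, HEAD_DIM = 16, 64, 128
pool = np.zeros((NUM_BLOCKS, BLOCK_TOKENS, HEAD_DIM), dtype=np.float16)
free = list(range(NUM_BLOCKS))
tables = {}    # seq_id -> list of block ids
lengths = {}   # seq_id -> tokens written

def append_kv(seq_id: int, kv: np.ndarray) -> None:
    """Append one token's KV vector, allocating a new block on overflow."""
    n = lengths.get(seq_id, 0)
    if n % BLOCK_TOKENS == 0:                    # current block full or absent
        tables.setdefault(seq_id, []).append(free.pop())
    block = tables[seq_id][n // BLOCK_TOKENS]
    pool[block, n % BLOCK_TOKENS] = kv
    lengths[seq_id] = n + 1

for _ in range(40):                              # decode 40 tokens for seq 0
    append_kv(0, np.ones(HEAD_DIM, dtype=np.float16))
print("blocks used:", len(tables[0]))            # ceil(40 / 16) = 3
```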
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
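For reference, the kernel family in question, as a plain NumPy 5-point Jacobi sweep (CStencil's contribution is mapping this pattern onto the WSE-3 fabric, not the arithmetic):

```python
import numpy as np

# A 5-point 2D Jacobi stencil in plain NumPy. CStencil's contribution is
# mapping this access pattern onto the WSE-3's mesh of processing
# elements, not the arithmetic itself.
def jacobi_step(u: np.ndarray) -> np.ndarray:
    out = u.copy()
    out[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return out

u = np.zeros((256, 256), dtype=np.float32)
u[0, :] = 1.0                        # fixed hot boundary
for _ in range(100):
    u = jacobi_step(u)
print("interior mean after 100 sweeps:", float(u[1:-1, 1:-1].mean()))
```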
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
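The core mechanism, backward slicing from a stalled instruction, shown on a toy use-def graph (the four-instruction program is invented for illustration):

```python
# Toy backward slice over a use-def graph (the four-instruction "ISA"
# is invented for illustration and says nothing about LEO's cross-vendor
# decoding): starting from a stalled instruction, walk def-use edges
# backwards to collect every producer feeding its operands, which are
# the instructions a profiler should attribute the stall to.
instrs = {
    # id: (destination register, source registers)
    0: ("r1", []),            # ld   r1, [mem]    <- long-latency load
    1: ("r2", []),            # mov  r2, #4
    2: ("r3", ["r1", "r2"]),  # mul  r3, r1, r2
    3: ("r4", ["r3"]),        # add  r4, r3, r3   <- stalled instruction
}

def backward_slice(stalled: int) -> set:
    writer = {dest: i for i, (dest, _) in instrs.items()}  # reg -> last writer
    seen, work = set(), [stalled]
    while work:
        i = work.pop()
        if i in seen:
            continue
        seen.add(i)
        for src in instrs[i][1]:
            work.append(writer[src])
    return seen

print(sorted(backward_slice(3)))   # [0, 1, 2, 3]: the load is in the slice
```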
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
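A software-only caricature of the two abstractions, with queue-backed threads standing in for near-data hardware; the names below are mine, not the Proxics API:

```python
import queue, threading

# Software-only caricature (Channel and virtual_processor are invented
# names, not the Proxics API): a virtual processor runs next to the data
# it owns, and the channel is the low-latency request/response path.
class Channel:
    def __init__(self):
        self.req, self.resp = queue.Queue(), queue.Queue()
    def call(self, msg):
        self.req.put(msg)
        return self.resp.get()

def virtual_processor(chan: Channel, data: list) -> None:
    """Serves offloaded operations over the channel, near the data."""
    while True:
        op, arg = chan.req.get()
        if op == "sum":
            chan.resp.put(sum(data))
        elif op == "scale":
            data[:] = [x * arg for x in data]
            chan.resp.put(len(data))
        else:                          # "stop"
            chan.resp.put(None)
            return

chan = Channel()
threading.Thread(target=virtual_processor,
                 args=(chan, list(range(1000))), daemon=True).start()
print(chan.call(("sum", None)))        # 499500, computed near the data
chan.call(("scale", 2)); print(chan.call(("sum", None)))   # 999000
chan.call(("stop", None))
```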
Quantitative benchmarks across recent AI accelerators reveal that optimal hardware choice varies with workload parameters and that several platforms incur substantially higher idle power than GPUs.
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
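What inter-Einsum fusion buys is visible with just two chained contractions; the example is generic NumPy, not Mambalaya's accelerator mapping:

```python
import numpy as np

# Generic NumPy illustration of inter-Einsum fusion (not Mambalaya's
# accelerator mapping): two chained einsums materialize an intermediate
# tensor in memory, while the fused contraction computes the same result
# without ever writing it out.
B, L, D, N = 4, 64, 32, 16
x = np.random.rand(B, L, D).astype(np.float32)
A = np.random.rand(D, N).astype(np.float32)
C = np.random.rand(N, D).astype(np.float32)

# Unfused: the (B, L, N) intermediate t round-trips through memory.
t = np.einsum("bld,dn->bln", x, A)
y_unfused = np.einsum("bln,nd->bld", t, C)

# Fused: a single three-operand contraction, no materialized t.
y_fused = np.einsum("bld,dn,ne->ble", x, A, C)

assert np.allclose(y_unfused, y_fused, atol=1e-3)
```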
AEGIS (Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems) reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
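One plausible reading of the parallelism claim, as a plaintext head-parallel caricature (AEGIS operates on homomorphically encrypted tensors and mixes several partitioning axes; none of that is modeled here):

```python
import numpy as np

# Plaintext, head-parallel caricature of the scaling claim (an assumed
# reading, not AEGIS's hybrid scheme): each "device" owns a disjoint set
# of heads, so attention itself needs no inter-device traffic and
# communication is deferred to a single gather at the end.
H, L, Dh = 8, 128, 32
rng = np.random.default_rng(0)
q = rng.random((H, L, Dh), dtype=np.float32)
k = rng.random((H, L, Dh), dtype=np.float32)
v = rng.random((H, L, Dh), dtype=np.float32)

def attention(qh, kh, vh):
    s = qh @ kh.transpose(0, 2, 1) / np.sqrt(Dh)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ vh

# Four "devices", two heads each: fully independent attention.
parts = [attention(qs, ks, vs)
         for qs, ks, vs in zip(np.split(q, 4), np.split(k, 4), np.split(v, 4))]
out = np.concatenate(parts)          # the single communication point
assert out.shape == (H, L, Dh)
```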
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.