MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.
IEEE Computer Society, 338–351
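To make the reformulation concrete: a length-n NTT over GF(p) is multiplication by the matrix W[k, j] = w^(jk) mod p, where w is a primitive n-th root of unity, so a batch of NTTs collapses into a single integer GEMM. Below is a minimal NumPy sketch of that identity, not MORPH's kernel, which also has to fit the modular arithmetic into the TPU's native operand widths.

```python
import numpy as np

# Minimal sketch (not MORPH's actual kernel): the size-n NTT over GF(p)
# is the matrix-vector product X = W @ x (mod p) with W[k, j] = w^(jk),
# where w is a primitive n-th root of unity mod p. A batch of NTTs is
# therefore one integer GEMM, the operation shape TPUs execute best.
p, n = 257, 8                      # NTT-friendly prime: n divides p - 1
w = pow(3, (p - 1) // n, p)        # 3 generates GF(257)*, so w has order n

# Vandermonde-style NTT matrix W[k, j] = w^(jk) mod p.
W = np.array([[pow(w, j * k % n, p) for j in range(n)] for k in range(n)],
             dtype=np.int64)

batch = np.random.randint(0, p, size=(n, 4)).astype(np.int64)  # 4 polynomials
ntt = (W @ batch) % p                                          # one GEMM = 4 NTTs

# Cross-check column 0 against the NTT definition.
ref = [sum(int(batch[j, 0]) * pow(w, j * k, p) for j in range(n)) % p
       for k in range(n)]
assert list(ntt[:, 0]) == ref
```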
2026: 12 representative citing papers
IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
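The translation-prefetch-buffer idea, in a heavily simplified Python sketch (the buffer, sizes, and flow here are assumptions for illustration; the paper's trimodal replacement policy is not modeled):

```python
# Heavily simplified sketch of a translation prefetch buffer (buffer,
# policy, and flow are assumptions for illustration; IP-CaT's trimodal
# replacement policy is not modeled): when the frontend predicts the
# next instruction block, its translation is walked ahead of time into
# a side buffer, so the eventual demand access fills the TLB cheaply.
PAGE = 4096
tlb = {}             # demand-filled iTLB: vpn -> ppn
prefetch_buf = {}    # small translation prefetch buffer

def walk_page_table(vpn: int) -> int:
    return vpn ^ 0xABCD              # stand-in for a real page walk

def prefetch_translation(next_pc: int) -> None:
    """Issued alongside an L1I prefetch for the predicted next block."""
    vpn = next_pc // PAGE
    if vpn not in tlb:
        prefetch_buf[vpn] = walk_page_table(vpn)

def translate(pc: int) -> str:
    vpn = pc // PAGE
    if vpn in tlb:
        return "tlb hit"
    if vpn in prefetch_buf:          # translation arrived ahead of demand
        tlb[vpn] = prefetch_buf.pop(vpn)
        return "prefetch-buffer hit (cheap)"
    tlb[vpn] = walk_page_table(vpn)  # demand miss: full page walk
    return "page walk (expensive)"

prefetch_translation(0x40_2000)      # frontend predicts the next block
print(translate(0x40_2000))          # -> prefetch-buffer hit (cheap)
print(translate(0x40_2000))          # -> tlb hit
```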
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
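A toy sketch of the quantization half of the design, under the assumption of a single shared scale per reduction (the in-switch DMA path itself cannot be reproduced in software):

```python
import numpy as np

# Toy illustration of the quantization half of the design (the in-switch
# DMA path cannot be reproduced in software, and this rounding scheme is
# an assumption, not SCIN's): workers send int8 payloads, the network
# stage accumulates them in a wider integer type, and dequantization
# happens once at the end, so each hop carries 4x fewer bytes than fp32.
def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(1024).astype(np.float32) for _ in range(8)]
scale = max(float(np.abs(g).max()) for g in grads) / 127.0  # shared scale

reduced = np.zeros(1024, dtype=np.int32)        # "in-switch" accumulator
for g in grads:
    reduced += quantize_int8(g, scale).astype(np.int32)

approx = reduced.astype(np.float32) * scale     # dequantize once
exact = np.sum(grads, axis=0)
print("max abs quantization error:", float(np.abs(approx - exact).max()))
```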
Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
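A minimal sketch of what a graph-based execution trace looks like and what it enables; the field names are illustrative stand-ins, not Chakra's actual schema:

```python
from dataclasses import dataclass, field

# Schematic of a graph-based execution trace in Chakra's spirit (the
# field names are illustrative stand-ins, not the real schema): every
# node is a compute or communication op with explicit dependencies, so
# one trace can feed simulators, replayers, and analysis tools alike.
@dataclass
class TraceNode:
    node_id: int
    kind: str                    # "compute" or "comm"
    name: str                    # e.g. "matmul", "all_reduce"
    duration_us: float
    deps: list = field(default_factory=list)   # upstream node ids

trace = [
    TraceNode(0, "compute", "embedding_lookup", 120.0),
    TraceNode(1, "compute", "matmul_qkv", 310.0, deps=[0]),
    TraceNode(2, "comm", "all_reduce_grads", 540.0, deps=[1]),
]

# One thing such a trace enables: critical-path analysis.
finish = {}
for node in trace:               # nodes assumed topologically ordered
    start = max((finish[d] for d in node.deps), default=0.0)
    finish[node.node_id] = start + node.duration_us
print("critical path (us):", max(finish.values()))
```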
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
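Block paging is easiest to see in miniature. The sketch below is generic paged-KV bookkeeping in the style popularized by vLLM's PagedAttention, not KV-RM's implementation, whose merge-staged transport is not modeled:

```python
import numpy as np

# Generic block-paged KV-cache bookkeeping (not KV-RM's implementation):
# the cache is a pool of fixed-size blocks and each sequence keeps a
# block table, so variable-length decoding grows block by block instead
# of resizing contiguous buffers.
BLOCK_TOKENS, NUM_BLOCKS, HEAD_DIM = 16, 64, 128
pool = np.zeros((NUM_BLOCKS, BLOCK_TOKENS, HEAD_DIM), dtype=np.float16)
free = list(range(NUM_BLOCKS))
tables = {}    # seq_id -> list of block ids
lengths = {}   # seq_id -> tokens written

def append_kv(seq_id: int, kv: np.ndarray) -> None:
    """Append one token's KV vector, allocating a new block on overflow."""
    n = lengths.get(seq_id, 0)
    if n % BLOCK_TOKENS == 0:                    # current block full or absent
        tables.setdefault(seq_id, []).append(free.pop())
    block = tables[seq_id][n // BLOCK_TOKENS]
    pool[block, n % BLOCK_TOKENS] = kv
    lengths[seq_id] = n + 1

for _ in range(40):                              # decode 40 tokens for seq 0
    append_kv(0, np.ones(HEAD_DIM, dtype=np.float16))
print("blocks used:", len(tables[0]))            # ceil(40 / 16) = 3
```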
CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
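For reference, the kernel family in question, as a plain NumPy 5-point Jacobi sweep (CStencil's contribution is mapping this pattern onto the WSE-3 fabric, not the arithmetic):

```python
import numpy as np

# A 5-point 2D Jacobi stencil in plain NumPy. CStencil's contribution is
# mapping this access pattern onto the WSE-3's mesh of processing
# elements, not the arithmetic itself.
def jacobi_step(u: np.ndarray) -> np.ndarray:
    out = u.copy()
    out[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return out

u = np.zeros((256, 256), dtype=np.float32)
u[0, :] = 1.0                        # fixed hot boundary
for _ in range(100):
    u = jacobi_step(u)
print("interior mean after 100 sweeps:", float(u[1:-1, 1:-1].mean()))
```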
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
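The core mechanism, backward slicing from a stalled instruction, shown on a toy use-def graph (the four-instruction program is invented for illustration):

```python
# Toy backward slice over a use-def graph (the four-instruction "ISA"
# is invented for illustration and says nothing about LEO's cross-vendor
# decoding): starting from a stalled instruction, walk def-use edges
# backwards to collect every producer feeding its operands, which are
# the instructions a profiler should attribute the stall to.
instrs = {
    # id: (destination register, source registers)
    0: ("r1", []),            # ld   r1, [mem]    <- long-latency load
    1: ("r2", []),            # mov  r2, #4
    2: ("r3", ["r1", "r2"]),  # mul  r3, r1, r2
    3: ("r4", ["r3"]),        # add  r4, r3, r3   <- stalled instruction
}

def backward_slice(stalled: int) -> set:
    writer = {dest: i for i, (dest, _) in instrs.items()}  # reg -> last writer
    seen, work = set(), [stalled]
    while work:
        i = work.pop()
        if i in seen:
            continue
        seen.add(i)
        for src in instrs[i][1]:
            work.append(writer[src])
    return seen

print(sorted(backward_slice(3)))   # [0, 1, 2, 3]: the load is in the slice
```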
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
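A software-only caricature of the two abstractions, with queue-backed threads standing in for near-data hardware; the names below are mine, not the Proxics API:

```python
import queue, threading

# Software-only caricature (Channel and virtual_processor are invented
# names, not the Proxics API): a virtual processor runs next to the data
# it owns, and the channel is the low-latency request/response path.
class Channel:
    def __init__(self):
        self.req, self.resp = queue.Queue(), queue.Queue()
    def call(self, msg):
        self.req.put(msg)
        return self.resp.get()

def virtual_processor(chan: Channel, data: list) -> None:
    """Serves offloaded operations over the channel, near the data."""
    while True:
        op, arg = chan.req.get()
        if op == "sum":
            chan.resp.put(sum(data))
        elif op == "scale":
            data[:] = [x * arg for x in data]
            chan.resp.put(len(data))
        else:                          # "stop"
            chan.resp.put(None)
            return

chan = Channel()
threading.Thread(target=virtual_processor,
                 args=(chan, list(range(1000))), daemon=True).start()
print(chan.call(("sum", None)))        # 499500, computed near the data
chan.call(("scale", 2)); print(chan.call(("sum", None)))   # 999000
chan.call(("stop", None))
```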
Quantitative benchmarks across recent AI accelerators reveal that optimal hardware choice varies with workload parameters and that several platforms incur substantially higher idle power than GPUs.
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
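What inter-Einsum fusion buys is visible with just two chained contractions; the example is generic NumPy, not Mambalaya's accelerator mapping:

```python
import numpy as np

# Generic NumPy illustration of inter-Einsum fusion (not Mambalaya's
# accelerator mapping): two chained einsums materialize an intermediate
# tensor in memory, while the fused contraction computes the same result
# without ever writing it out.
B, L, D, N = 4, 64, 32, 16
x = np.random.rand(B, L, D).astype(np.float32)
A = np.random.rand(D, N).astype(np.float32)
C = np.random.rand(N, D).astype(np.float32)

# Unfused: the (B, L, N) intermediate t round-trips through memory.
t = np.einsum("bld,dn->bln", x, A)
y_unfused = np.einsum("bln,nd->bld", t, C)

# Fused: a single three-operand contraction, no materialized t.
y_fused = np.einsum("bld,dn,ne->ble", x, A, C)

assert np.allclose(y_unfused, y_fused, atol=1e-3)
```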
AEGIS (Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems) reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
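One plausible reading of the parallelism claim, as a plaintext head-parallel caricature (AEGIS operates on homomorphically encrypted tensors and mixes several partitioning axes; none of that is modeled here):

```python
import numpy as np

# Plaintext, head-parallel caricature of the scaling claim (an assumed
# reading, not AEGIS's hybrid scheme): each "device" owns a disjoint set
# of heads, so attention itself needs no inter-device traffic and
# communication is deferred to a single gather at the end.
H, L, Dh = 8, 128, 32
rng = np.random.default_rng(0)
q = rng.random((H, L, Dh), dtype=np.float32)
k = rng.random((H, L, Dh), dtype=np.float32)
v = rng.random((H, L, Dh), dtype=np.float32)

def attention(qh, kh, vh):
    s = qh @ kh.transpose(0, 2, 1) / np.sqrt(Dh)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ vh

# Four "devices", two heads each: fully independent attention.
parts = [attention(qs, ks, vs)
         for qs, ks, vs in zip(np.split(q, 4), np.split(k, 4), np.split(v, 4))]
out = np.concatenate(parts)          # the single communication point
assert out.shape == (H, L, Dh)
```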
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.