Canonical reference

Symp. High-Perform. Comput. Archit. (HPCA), 2025, pp. 409–422

2025 · arXiv 1900.2025

75% of citing Pith papers cite this work as background.

34 Pith papers citing it

Background: 75% of classified citations

citation-role summary

background: 6 · dataset: 1 · method: 1

citation-polarity summary

background: 6 · use dataset: 1 · use method: 1

representative citing papers

Enabling AI ASICs for Zero Knowledge Proof

cs.AR · 2026-04-20 · conditional · novelty 8.0

MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.

Bifrost: Hybrid TEE-FHE Inference for Privacy-Preserving Transformer and LLM Serving

cs.CR · 2026-06-16 · unverdicted · novelty 7.0

Bifrost achieves significant latency reductions in privacy-preserving transformer inference through a hybrid CPU TEE and accelerator FHE design, with Bifrost+ further optimizing via prefill/decode split.

Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

Conversation-level scheduling in ConServe observes first-turn input length and KV occupancy to route prefill once and pin decoders, cutting p95 time-to-first-effective-token by 51% and improving energy efficiency by 7.5% versus per-turn prediction baselines.

ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

cs.AR · 2026-05-15 · unverdicted · novelty 7.0

ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.

Enhancing Instruction Prefetching via Cache and TLB Management

cs.AR · 2026-05-12 · unverdicted · novelty 7.0

IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU

cs.DC · 2026-04-21 · unverdicted · novelty 7.0

Ocean uses HyperLogLog estimators to skip the costly symbolic phase of GPU SpGEMM, pairs it with dynamic workflow choice and a shared-plus-global hash accumulator, and reports 1.4-2.8x speedups over prior GPU implementations.

Design automation and space-time reduction for surface-code logical operations using a SAT-based EDA kernel compatible with general encodings

quant-ph · 2026-04-14 · unverdicted · novelty 7.0

KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.

Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge

cs.AR · 2025-09-09 · unverdicted · novelty 7.0

FlexiFlow optimizes carbon footprint for item-level intelligence on flexible electronics by modeling lifetime variation, delivering 1.62x microarchitectural and 14.5x algorithmic reductions plus a 30.9 kHz tape-out.

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

cs.DC · 2025-05-29 · conditional · novelty 7.0

GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.

DiLaServe: High SLO Attainment Serving for Diffusion Language Models

cs.LG · 2026-06-27 · unverdicted · novelty 6.0

DiLaServe improves SLO attainment for diffusion language models by up to 56.6 percentage points and reduces latency by up to 46% with less than 1% accuracy drop via deadline-aware scheduling and dynamic reconfiguration.

HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval

cs.LG · 2026-06-19 · unverdicted · novelty 6.0

HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.

General circuit mapping algorithm for neutral atom quantum computers

quant-ph · 2026-06-18 · unverdicted · novelty 6.0

A graph-theoretic nonlinear integer program solved via genetic algorithm reduces qubit transfers in neutral atom quantum circuit compilation compared to prior zoned-architecture compilers.

WHET: Welding Homomorphic Encryption to Accelerator Architectures

cs.CR · 2026-06-10 · unverdicted · novelty 6.0 · 5 refs

WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.

Don't Let a Few Network Failures Slow the Entire AllReduce

cs.DC · 2026-06-01 · unverdicted · novelty 6.0

OptCC is a pipelined AllReduce algorithm that completes within 2-6% of fault-free NCCL performance under up to 50% bandwidth loss by approaching a new lower bound showing O(1/p) unavoidable overhead for p GPUs.

Designing Datacenter Power Delivery Hierarchies for the AI Era

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.

ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

cs.LG · 2026-04-16 · unverdicted · novelty 6.0 · 2 refs

ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.

EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

cs.OS · 2026-04-10 · unverdicted · novelty 6.0

EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.

The Energy Cost of Execution-Idle in GPU Clusters

cs.DC · 2026-04-06 · unverdicted · novelty 6.0

Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.

AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems

cs.CR · 2026-04-03 · unverdicted · novelty 6.0

AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.

PICO: Performance Insights for Collective Operations

cs.DC · 2025-08-22 · unverdicted · novelty 6.0

PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44% end-to-end training time reductions in simulator replays.

Concentrated siting of AI data centers drives regional power-system stress under rising global compute demand

cs.CY · 2026-03-13 · unverdicted · novelty 5.0

AI data center electricity demand will reach 1% of global power use by 2030, with concentrated siting causing high power stress in specific regions like Oregon, Virginia, and Ireland.

DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS

cs.DC · 2026-02-21 · unverdicted · novelty 5.0

DualScale reduces energy by up to 39% in prefill and 48% in decode for disaggregated LLM serving while meeting TTFT and TPOT SLOs on a 16x H100 cluster.

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

cs.DC · 2025-10-17 · unverdicted · novelty 5.0

PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.

RAP: Runtime Adaptive Pruning for LLM Inference

cs.LG · 2025-05-22 · unverdicted · novelty 5.0

RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.

citing papers explorer

Compiling Code LLMs into Lightweight Executables

cs.SE · 2026-03-31 · unreviewed · ref 44
