Canonical reference
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
Canonical reference. 75% of citing Pith papers cite this work as background.
citing papers explorer
- Enabling AI ASICs for Zero Knowledge Proof
  MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.
- Bifrost: Hybrid TEE-FHE Inference for Privacy-Preserving Transformer and LLM Serving
  Bifrost achieves significant latency reductions in privacy-preserving transformer inference through a hybrid CPU TEE and accelerator FHE design, with Bifrost+ further optimizing via prefill/decode split.
- Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving
  Conversation-level scheduling in ConServe observes first-turn input length and KV occupancy to route prefill once and pin decoders, cutting p95 time-to-first-effective-token by 51% and improving energy efficiency by 7.5% versus per-turn prediction baselines.
- ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions
  ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
- Enhancing Instruction Prefetching via Cache and TLB Management
  IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
- Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU
  Ocean uses HyperLogLog estimators to skip the costly symbolic phase of GPU SpGEMM, pairs it with dynamic workflow choice and a shared-plus-global hash accumulator, and reports 1.4-2.8x speedups over prior GPU implementations.
- Design automation and space-time reduction for surface-code logical operations using a SAT-based EDA kernel compatible with general encodings
  KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.
- Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge
  FlexiFlow optimizes carbon footprint for item-level intelligence on flexible electronics by modeling lifetime variation, delivering 1.62X microarchitectural and 14.5X algorithmic reductions plus a 30.9 kHz tape-out.
- Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
  GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.
- HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval
  HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.
- General circuit mapping algorithm for neutral atom quantum computers
  A graph-theoretic nonlinear integer program solved via genetic algorithm reduces qubit transfers in neutral atom quantum circuit compilation compared to prior zoned-architecture compilers.
- WHET: Welding Homomorphic Encryption to Accelerator Architectures
  WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.
- Don't Let a Few Network Failures Slow the Entire AllReduce
  OptCC is a pipelined AllReduce algorithm that completes within 2-6% of fault-free NCCL performance under up to 50% bandwidth loss by approaching a new lower bound showing O(1/p) unavoidable overhead for p GPUs.
- Designing Datacenter Power Delivery Hierarchies for the AI Era
  Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.
- ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
  ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
- EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices
  EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.
- The Energy Cost of Execution-Idle in GPU Clusters
  Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.
- AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
  AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
- PICO: Performance Insights for Collective Operations
  PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44% end-to-end training time reductions in simulator replays.
- Concentrated siting of AI data centers drives regional power-system stress under rising global compute demand
  AI data center electricity demand will reach 1% of global power use by 2030, with concentrated siting causing high power stress in specific regions like Oregon, Virginia, and Ireland.
- DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
  DualScale reduces energy by up to 39% in prefill and 48% in decode for disaggregated LLM serving while meeting TTFT and TPOT SLOs on a 16x H100 cluster.
- PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
  PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.
- RAP: Runtime Adaptive Pruning for LLM Inference
  RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.
- CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters
  CoLLM unifies FL PEFT and inference on shared edge replicas via intra-replica model sharing and two-timescale inter-replica coordination, achieving up to 3x higher goodput than prior LLM systems.
- Resource Estimation for VQE on Small Molecules: Impact of Fermion Mappings and Hamiltonian Reductions
  Fermion mappings combined with Z2 tapering and frozen-core approximations reduce qubit counts by up to 50%, gate counts by up to 27.5x, and Pauli strings by up to 2.75x for VQE on small molecules.
- A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks
- SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
- Compiling Code LLMs into Lightweight Executables
- ABI: A tightly integrated, unified, sparsity-aware, reconfigurable, compute near-register file/cache GPU architecture with light-weight softmax for deep learning, linear algebra, and Ising compute
- PureMagic: A Dynamic Scheduler for Lattice Surgery
- Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores