EVE: Ephemeral Vector Engines
Canonical reference. 100% of citing Pith papers cite this work as background.
citation-role summary: background 5
citation-polarity summary: background 5
representative citing papers
citing papers explorer
- Logical Compilation for Multi-Qubit Iceberg Patches
A new heuristic compiler for multi-qubit iceberg patches reduces circuit depth by 34 percent, cuts gate counts, and improves fidelity metrics on 71 benchmarks compared with naive mapping.
- The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
- VIPIR: A Versatile GPU Framework for Integrating Private Information Retrieval Protocols
VIPIR introduces two new PIR protocols, ExpPack compression, and GPU optimizations for NTT and GEMM that deliver orders-of-magnitude higher throughput than prior systems.
- ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions
ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
- Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
- Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design
FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.
- WHET: Welding Homomorphic Encryption to Accelerator Architectures
WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.
- Distributed Persistence Domain for Persistent Memory Pooling
Proposes Distributed Persistence Domain and Persistent CXL Switch to enable low-latency persistence operations at CXL switch level while maintaining crash consistency in disaggregated memory.
- ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
- The Energy Cost of Execution-Idle in GPU Clusters
Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.
- AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
- DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format
DISCA achieves 3.59 TOPS/W per bit energy efficiency for matrix multiplication at 500 MHz in 180 nm CMOS using a compressed Bent-Pyramid stochastic format.
- ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling
ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.
- Efficient Page Migration in Hybrid Memory Systems
Duon eliminates TLB shootdown and cache invalidation costs during page migration in flat-address hybrid memory systems by updating mappings in-place, delivering 3.87% IPC gains over prior methods.
- A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks