A new heuristic compiler for multi-qubit iceberg patches reduces circuit depth by 34 percent, cuts gate counts, and improves fidelity metrics on 71 benchmarks compared with naive mapping.
hub Canonical reference
EVE: Ephemeral Vector Engines
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 16roles
background 5polarities
background 5representative citing papers
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
VIPIR introduces two new PIR protocols, ExpPack compression, and GPU optimizations for NTT and GEMM that deliver orders-of-magnitude higher throughput than prior systems.
ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.
WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.
Proposes Distributed Persistence Domain and Persistent CXL Switch to enable low-latency persistence operations at CXL switch level while maintaining crash consistency in disaggregated memory.
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.
AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
DISCA achieves 3.59 TOPS/W per bit energy efficiency for matrix multiplication at 500 MHz in 180 nm CMOS using a compressed Bent-Pyramid stochastic format.
ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.
NEM-GNN proposes a scalable DAC/ADC-less PIM architecture for GNNs with early termination and CAR execution, claiming 80-230x performance and 850-1134x energy gains over prior accelerators.
Duon eliminates TLB shootdown and cache invalidation costs during page migration in flat-address hybrid memory systems by updating mappings in-place, delivering 3.87% IPC gains over prior methods.
The paper reviews energy-aware computing literature and constructs a taxonomy organized by hardware/software aspects, measurement, optimizations, scheduling, scaling, consolidation, federated learning, and cooling.
citing papers explorer
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
-
The Energy Cost of Execution-Idle in GPU Clusters
Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.
-
AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
-
Efficient Page Migration in Hybrid Memory Systems
Duon eliminates TLB shootdown and cache invalidation costs during page migration in flat-address hybrid memory systems by updating mappings in-place, delivering 3.87% IPC gains over prior methods.