IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
John, and Jaydeep P
12 Pith papers cite this work.
citing papers explorer
- Enhancing Instruction Prefetching via Cache and TLB Management
  IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
- Sublime: Sublinear Error & Space for Unbounded Skewed Streams
  Sublime generalizes Count-Min and Count Sketch with dynamically elongating counters and expanding counter arrays to deliver sublinear error growth and lower memory use on skewed unbounded streams.
- CLIPGen: A Chiplet Link IP Modeling and Generation Framework for 2.5D Architecture Exploration
  CLIPGen is a framework for automated generation of chiplet interconnect IP with PPA estimates to support 2.5D SiP architecture exploration.
- TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments
  TLX introduces MIMW-based extensions to Triton that let developers orchestrate warp-group execution and asynchronous hardware features while preserving blocked programming productivity, with kernels deployed in large-scale training and inference.
- Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale
  Affinity Tailor improves per-CPU throughput by 12% on chiplet systems and 3% on non-chiplet systems over Linux CFS by using dynamic compact affinity hints derived from online demand estimates.
- ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling
  ASTRA-sim 3.0 introduces cache-line load-store simulation, a detailed GPU execution model, and InfraGraph to support high-fidelity distributed machine learning infrastructure simulations.
- Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM
  Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.
- A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks
  NEM-GNN proposes a scalable DAC/ADC-less PIM architecture for GNNs with early termination and CAR execution, claiming 80-230x performance and 850-1134x energy gains over prior accelerators.
- ROA-Based Subharmonic Injection Locking for Oscillator-Based Ising Machines
  ROA brick topology supplies PVT-robust 2.31 GHz SHIL that preserves 93-97% accuracy in 324-node OIM max-cut while ROSC-SHIL loses locking.
- PIM-CACHE: High-Efficiency Content-Aware Copy for Processing-In-Memory
  PIM-CACHE reduces mandatory coarse-grained transfers in UPMEM-style PIM by dynamically staging only non-redundant data via content-aware copy that exploits workload similarity.
- Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
  PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.
- Energy-Aware Computing in the Year 2026
  The paper reviews energy-aware computing literature and constructs a taxonomy organized by hardware/software aspects, measurement, optimizations, scheduling, scaling, consolidation, federated learning, and cooling.