IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
In: 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (8)
verdicts: UNVERDICTED (8)
roles: background (4)
polarities: background (4)
citing papers explorer
- Enhancing Instruction Prefetching via Cache and TLB Management
  IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
- Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference
  Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.
- ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration
  ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.
- FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
  FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.
- Exploiting the Structure in Tensor Decompositions for Matrix Multiplication
  Exploits special structural features in tensor decompositions to lower the matrix multiplication exponent for 6x6 matrices from 2.8075 to 2.8019.
- Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips
  On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.
- EPAC: The Last Dance
  The EPAC chip integrates three RISC-V tiles connected by a CHI network-on-chip and has been successfully taped out and validated in GF22FDX technology as part of the European Processor Initiative.
- Parallel Sparse and Data-Sparse Factorization-based Linear Solvers
  Review chapter summarizing advances in parallel sparse direct solvers along communication reduction and data-sparse compression axes.