In: 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp

Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K · 2024 · arXiv 7955.2024

8 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Enhancing Instruction Prefetching via Cache and TLB Management

cs.AR · 2026-05-12 · unverdicted · novelty 7.0

IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

cs.AR · 2026-05-21 · unverdicted · novelty 6.0

ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

cs.DC · 2026-05-07 · unverdicted · novelty 6.0

FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.

Exploiting the Structure in Tensor Decompositions for Matrix Multiplication

cs.SC · 2026-02-11 · unverdicted · novelty 5.0

Exploits special structural features in tensor decompositions to lower the matrix multiplication exponent for 6x6 matrices from 2.8075 to 2.8019.

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

cs.DC · 2026-05-03 · unverdicted · novelty 4.0

On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

EPAC: The Last Dance

cs.AR · 2026-04-14 · unverdicted · novelty 4.0

The EPAC chip integrates three RISC-V tiles connected by a CHI network-on-chip and has been successfully taped out and validated in GF22FDX technology as part of the European Processor Initiative.

Parallel Sparse and Data-Sparse Factorization-based Linear Solvers

cs.MS · 2026-02-15 · unverdicted · novelty 1.0

Review chapter summarizing advances in parallel sparse direct solvers along communication reduction and data-sparse compression axes.

citing papers explorer


Enhancing Instruction Prefetching via Cache and TLB Management
cs.AR · 2026-05-12 · unverdicted · none · ref 68

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference
cs.LG · 2026-05-31 · unverdicted · none · ref 13

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration
cs.AR · 2026-05-21 · unverdicted · none · ref 17

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
cs.DC · 2026-05-07 · unverdicted · none · ref 33

Exploiting the Structure in Tensor Decompositions for Matrix Multiplication
cs.SC · 2026-02-11 · unverdicted · none · ref 31

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips
cs.DC · 2026-05-03 · unverdicted · none · ref 24

EPAC: The Last Dance
cs.AR · 2026-04-14 · unverdicted · none · ref 17

Parallel Sparse and Data-Sparse Factorization-based Linear Solvers
cs.MS · 2026-02-15 · unverdicted · none · ref 136
