Roofline: An insightful visual performance model for multicore architectures

Samuel Williams, Andrew Waterman, David Patterson · 2009 · arXiv 8765.149878

Canonical reference. 31 Pith papers cite this work.

Of the classified citations, 80% (4 of 5) cite it as background.

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

Apple Neural Engine: Architecture, Programming, and Performance

cs.AR · 2026-06-21 · unverdicted · novelty 8.0

The paper delivers a reverse-engineered documentation of the Apple Neural Engine architecture, dispatch mechanisms, weight compression, and roofline performance based on measurements from M1 and M5 chips and analysis of private runtime components.

Enabling AI ASICs for Zero Knowledge Proof

cs.AR · 2026-04-20 · conditional · novelty 8.0

MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

cs.DC · 2026-05-15 · unverdicted · novelty 7.0

HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.

Efficient and Accurate Graph Classification with Hyperdimensional Computing on FPGA

cs.AR · 2025-12-08 · conditional · novelty 7.0

HyperX is the first end-to-end FPGA accelerator for Nyström-based HDC graph classification, delivering 6.85× speedup and 169× energy efficiency over CPU baselines plus 3.4% average accuracy gain on TUDataset benchmarks.

Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

cs.DC · 2024-05-21 · unverdicted · novelty 7.0

Introduces Distributed Level-Blocked MPK combining RACE cache blocking with MPI, reporting substantial speedups up to 4x on 832 cores for matrix power kernels across scientific sparse matrices.

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

cs.DC · 2026-07-02 · unverdicted · novelty 6.0

OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.

KernelSight-LM: A Kernel-Level LLM Inference Simulator

cs.PF · 2026-06-26 · unverdicted · novelty 6.0 · 2 refs

KernelSight-LM simulates LLM inference at kernel granularity with cross-generation (12.1% per-kernel error) and target-measured (3.8% error) tiers, yielding end-to-end median errors of 15.4%/12.8%/3.0% and 14.3%/6.2%/2.7% for TTFT/TPOT/throughput across six model families.

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

cs.AI · 2026-06-24 · unverdicted · novelty 6.0 · 2 refs

Quantized reasoning models produce longer chains of thought, inflating token usage and negating per-token speedups from low-bit quantization across multiple benchmarks.

NektarIR: A Domain-Specific Compiler for High-Order Finite Element Operations on Heterogeneous Hardware

cs.MS · 2026-06-18 · unverdicted · novelty 6.0

NektarIR is an MLIR-based domain-specific compiler that enables just-in-time compilation of finite element operators for spectral/hp element solvers on heterogeneous hardware.

When More Cores Hurts: The Vector Database Scaling Paradox in HPC

cs.DC · 2026-06-08 · unverdicted · novelty 6.0

Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

cs.RO · 2026-06-06 · conditional · novelty 6.0

vla.cpp is a unified C++ runtime that serves multiple VLA architectures with flow-matching and diffusion patterns, matching SOTA performance on LIBERO while running on low-memory embedded hardware.

FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies

physics.comp-ph · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

FusionRCG uses liveness-aware graph orchestration, Cartesian-to-spherical fusion, and multi-tier kernels to cut intermediate data by up to 7.7x and deliver 3.09x SCF speedup on A100 GPUs.

A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture

cs.PF · 2026-05-09 · conditional · novelty 6.0 · 2 refs

Quantum circuit simulations on Apple M4 Pro show a reproducible 4.46x timing discontinuity at 29 qubits and access-pattern-dependent speedups (3.1-10x) that exceed peak bandwidth predictions.

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

cs.CE · 2026-04-20 · unverdicted · novelty 6.0

A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.

Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

cs.AR · 2026-04-04 · unverdicted · novelty 6.0

Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.

Floating-point consistent cross-verification methodology for reproducible and interoperable DDA solvers with fair benchmarking

physics.comp-ph · 2026-03-03 · conditional · novelty 6.0

A unified methodology achieves floating-point consistent results across DDSCAT, ADDA, and IFDDA solvers and enables fair CPU/GPU benchmarking with provided equivalence tables and software.

PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

cs.PF · 2026-01-21 · unverdicted · novelty 6.0

PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

cs.CL · 2023-05-22 · unverdicted · novelty 6.0

Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters

cs.DC · 2026-06-17 · unverdicted · novelty 5.0

ShuntServe reports 1.42x and 1.35x higher throughput than baselines plus 31.9 percent and 31.2 percent cost-efficiency gains over on-demand instances for Llama-3.1-70B and Qwen3-32B on heterogeneous AWS spot clusters.

Hybrid Digital-Analog Approximate Inverse Preconditioning for Krylov Methods

math.NA · 2026-06-15 · unverdicted · novelty 5.0

Analog-aware block Jacobi schemes in flexible GMRES maintain convergence under simulated device non-idealities when block size, damping, and approximation accuracy are chosen to account for analog scaling, noise, quantization, and clipping.

On the Limits of Performance Portability in Directive-Based GPU Programming

cs.DC · 2026-06-10 · unverdicted · novelty 5.0

OpenMP port of gPLUTO achieves comparable performance to OpenACC on NVIDIA but is 3x slower at application level and up to 10x at kernel level on AMD MI250X, driven by strided memory accesses, latency bounds, and C++ abstraction overheads.

citing papers explorer

Showing 31 of 31 citing papers.

Apple Neural Engine: Architecture, Programming, and Performance

cs.AR · 2026-06-21 · unverdicted · none · ref 10

The paper delivers a reverse-engineered documentation of the Apple Neural Engine architecture, dispatch mechanisms, weight compression, and roofline performance based on measurements from M1 and M5 chips and analysis of private runtime components.

Enabling AI ASICs for Zero Knowledge Proof

cs.AR · 2026-04-20 · conditional · none · ref 41

MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · none · ref 26

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

cs.DC · 2026-05-31 · unverdicted · none · ref 31

On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

cs.DC · 2026-05-15 · unverdicted · none · ref 44

HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.

Efficient and Accurate Graph Classification with Hyperdimensional Computing on FPGA

cs.AR · 2025-12-08 · conditional · none · ref 66

HyperX is the first end-to-end FPGA accelerator for Nyström-based HDC graph classification, delivering 6.85× speedup and 169× energy efficiency over CPU baselines plus 3.4% average accuracy gain on TUDataset benchmarks.

Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

cs.DC · 2024-05-21 · unverdicted · none · ref 28

Introduces Distributed Level-Blocked MPK combining RACE cache blocking with MPI, reporting substantial speedups up to 4x on 832 cores for matrix power kernels across scientific sparse matrices.

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

cs.DC · 2026-07-02 · unverdicted · none · ref 14

OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.

KernelSight-LM: A Kernel-Level LLM Inference Simulator

cs.PF · 2026-06-26 · unverdicted · none · ref 48 · 2 links

KernelSight-LM simulates LLM inference at kernel granularity with cross-generation (12.1% per-kernel error) and target-measured (3.8% error) tiers, yielding end-to-end median errors of 15.4%/12.8%/3.0% and 14.3%/6.2%/2.7% for TTFT/TPOT/throughput across six model families.

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

cs.AI · 2026-06-24 · unverdicted · none · ref 219 · 2 links

Quantized reasoning models produce longer chains of thought, inflating token usage and negating per-token speedups from low-bit quantization across multiple benchmarks.

NektarIR: A Domain-Specific Compiler for High-Order Finite Element Operations on Heterogeneous Hardware

cs.MS · 2026-06-18 · unverdicted · none · ref 25

NektarIR is an MLIR-based domain-specific compiler that enables just-in-time compilation of finite element operators for spectral/hp element solvers on heterogeneous hardware.

When More Cores Hurts: The Vector Database Scaling Paradox in HPC

cs.DC · 2026-06-08 · unverdicted · none · ref 93

Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

cs.RO · 2026-06-06 · conditional · none · ref 32

vla.cpp is a unified C++ runtime that serves multiple VLA architectures with flow-matching and diffusion patterns, matching SOTA performance on LIBERO while running on low-memory embedded hardware.

FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies

physics.comp-ph · 2026-05-11 · unverdicted · none · ref 9 · 2 links

FusionRCG uses liveness-aware graph orchestration, Cartesian-to-spherical fusion, and multi-tier kernels to cut intermediate data by up to 7.7x and deliver 3.09x SCF speedup on A100 GPUs.

A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture

cs.PF · 2026-05-09 · conditional · none · ref 11 · 2 links

Quantum circuit simulations on Apple M4 Pro show a reproducible 4.46x timing discontinuity at 29 qubits and access-pattern-dependent speedups (3.1-10x) that exceed peak bandwidth predictions.

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

cs.LG · 2026-04-23 · unverdicted · none · ref 86

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

cs.CE · 2026-04-20 · unverdicted · none · ref 16

A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.

Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

cs.AR · 2026-04-04 · unverdicted · none · ref 44

Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.

Floating-point consistent cross-verification methodology for reproducible and interoperable DDA solvers with fair benchmarking

physics.comp-ph · 2026-03-03 · conditional · none · ref 63

A unified methodology achieves floating-point consistent results across DDSCAT, ADDA, and IFDDA solvers and enables fair CPU/GPU benchmarking with provided equivalence tables and software.

PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

cs.PF · 2026-01-21 · unverdicted · none · ref 74

PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

cs.CL · 2023-05-22 · unverdicted · none · ref 60

Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters

cs.DC · 2026-06-17 · unverdicted · none · ref 24

ShuntServe reports 1.42x and 1.35x higher throughput than baselines plus 31.9 percent and 31.2 percent cost-efficiency gains over on-demand instances for Llama-3.1-70B and Qwen3-32B on heterogeneous AWS spot clusters.

Hybrid Digital-Analog Approximate Inverse Preconditioning for Krylov Methods

math.NA · 2026-06-15 · unverdicted · none · ref 30

Analog-aware block Jacobi schemes in flexible GMRES maintain convergence under simulated device non-idealities when block size, damping, and approximation accuracy are chosen to account for analog scaling, noise, quantization, and clipping.

On the Limits of Performance Portability in Directive-Based GPU Programming

cs.DC · 2026-06-10 · unverdicted · none · ref 49

OpenMP port of gPLUTO achieves comparable performance to OpenACC on NVIDIA but is 3x slower at application level and up to 10x at kernel level on AMD MI250X, driven by strided memory accesses, latency bounds, and C++ abstraction overheads.

Instant GPU Efficiency Visibility at Fleet Scale

cs.DC · 2026-05-20 · unverdicted · none · ref 46

OFU is a hardware-counter metric that approximates application MFU to within 2 percentage points after tile correction and shows r=0.78 correlation on 608 production jobs.

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

cs.CR · 2026-05-19 · unverdicted · none · ref 76

Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.

Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

cs.CE · 2026-05-12 · unverdicted · none · ref 24

LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads

cs.AR · 2026-04-22 · unverdicted · none · ref 49

EnergAIzer predicts module-level GPU utilization from structured kernel patterns and feeds it into a power model to estimate dynamic power with 8% error on Ampere GPUs and 7% on H100 forecasts.

Exploiting repeated matrix block structures for more efficient CFD on modern supercomputers

physics.flu-dyn · 2025-08-08 · unverdicted · none · ref 1

Exploiting repeated block structures converts SpMV to SpMM in CFD operators while an inline coarse-to-fine mesh strategy reduces time to statistically steady state, producing speed-ups up to over 50 percent on tested cases.

ZONOS2 Technical Report

cs.SD · 2026-06-23 · unverdicted · none · ref 137 · 2 links

ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.

The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

cs.LG · 2026-06-22 · unverdicted · none · ref 42

A scaling law model derived from roofline analysis and a speedup-based efficiency factor predicts training energy for BERT models across GPU parallelism configurations.