Roofline: An insightful visual performance model for multicore architectures
Canonical reference. 80% of citing Pith papers cite this work as background.
citing papers explorer
-
Apple Neural Engine: Architecture, Programming, and Performance
The paper provides reverse-engineered documentation of the Apple Neural Engine's architecture, dispatch mechanisms, weight compression, and roofline performance, based on measurements from M1 and M5 chips and analysis of private runtime components.
-
Enabling AI ASICs for Zero Knowledge Proof
MORPH reformulates ZKP MSM and NTT kernels into GEMM operations for TPUs using a new Big-T complexity model, achieving up to 10x NTT throughput over GZKP.
-
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
-
Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics
On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.
-
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.
-
Efficient and Accurate Graph Classification with Hyperdimensional Computing on FPGA
HyperX is the first end-to-end FPGA accelerator for Nyström-based HDC graph classification, delivering 6.85× speedup and 169× energy efficiency over CPU baselines plus 3.4% average accuracy gain on TUDataset benchmarks.
-
Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels
The paper introduces Distributed Level-Blocked MPK, which combines RACE cache blocking with MPI, and reports speedups of up to 4x on 832 cores for matrix power kernels across scientific sparse matrices.
-
OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters
OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.
-
KernelSight-LM: A Kernel-Level LLM Inference Simulator
KernelSight-LM simulates LLM inference at kernel granularity with cross-generation (12.1% per-kernel error) and target-measured (3.8% error) tiers, yielding end-to-end median errors of 15.4%/12.8%/3.0% and 14.3%/6.2%/2.7% for TTFT/TPOT/throughput across six model families.
-
Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models
Quantized reasoning models produce longer chains of thought, inflating token usage and negating per-token speedups from low-bit quantization across multiple benchmarks.
-
NektarIR: A Domain-Specific Compiler for High-Order Finite Element Operations on Heterogeneous Hardware
NektarIR is an MLIR-based domain-specific compiler that enables just-in-time compilation of finite element operators for spectral/hp element solvers on heterogeneous hardware.
-
When More Cores Hurts: The Vector Database Scaling Paradox in HPC
Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.
-
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
vla.cpp is a unified C++ runtime that serves multiple VLA architectures with flow-matching and diffusion patterns, matching SOTA performance on LIBERO while running on low-memory embedded hardware.
-
FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies
FusionRCG uses liveness-aware graph orchestration, Cartesian-to-spherical fusion, and multi-tier kernels to cut intermediate data by up to 7.7x and deliver 3.09x SCF speedup on A100 GPUs.
-
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture
Quantum circuit simulations on Apple M4 Pro show a reproducible 4.46x timing discontinuity at 29 qubits and access-pattern-dependent speedups (3.1-10x) that exceed peak bandwidth predictions.
-
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Recurrent Transformers add per-layer recurrent memory via self-attention on their own activations, plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
-
Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels
A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.
-
Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
-
Floating-point consistent cross-verification methodology for reproducible and interoperable DDA solvers with fair benchmarking
A unified methodology achieves floating-point consistent results across DDSCAT, ADDA, and IFDDA solvers and enables fair CPU/GPU benchmarking with provided equivalence tables and software.
-
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.
-
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
-
ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters
ShuntServe reports 1.42x and 1.35x higher throughput than baselines plus 31.9 percent and 31.2 percent cost-efficiency gains over on-demand instances for Llama-3.1-70B and Qwen3-32B on heterogeneous AWS spot clusters.
-
Hybrid Digital-Analog Approximate Inverse Preconditioning for Krylov Methods
Analog-aware block Jacobi schemes in flexible GMRES maintain convergence under simulated device non-idealities when block size, damping, and approximation accuracy are chosen to account for analog scaling, noise, quantization, and clipping.
-
On the Limits of Performance Portability in Directive-Based GPU Programming
The OpenMP port of gPLUTO achieves performance comparable to OpenACC on NVIDIA but is 3x slower at application level and up to 10x slower at kernel level on AMD MI250X, driven by strided memory accesses, latency bounds, and C++ abstraction overheads.
-
Instant GPU Efficiency Visibility at Fleet Scale
OFU is a hardware-counter metric that approximates application MFU to within 2 percentage points after tile correction and shows r=0.78 correlation on 608 production jobs.
-
Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM
Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads
EnergAIzer predicts module-level GPU utilization from structured kernel patterns and feeds it into a power model to estimate dynamic power with 8% error on Ampere GPUs and 7% on H100 forecasts.
-
Exploiting repeated matrix block structures for more efficient CFD on modern supercomputers
Exploiting repeated block structures converts SpMV to SpMM in CFD operators, while an inline coarse-to-fine mesh strategy reduces time to statistically steady state, producing speed-ups that reach over 50 percent on the tested cases.
-
ZONOS2 Technical Report
ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.
-
The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model
A scaling law model derived from roofline analysis and a speedup-based efficiency factor predicts training energy for BERT models across GPU parallelism configurations.