archive
Every paper Pith has read.
115 papers in cs.PF · page 1
-
Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation
-
LLM tunes Linux knobs for a stable 72% gain over defaults
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models
-
Heterogeneous solvers up to 32% faster than GPU-only for big matrices
Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
-
Block-scale search cuts quantization error 27% in BFP
Search Your Block Floating Point Scales!
-
Adaptive packed layouts enable efficient VLA ML code
Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation
-
Joint TLB-cache tweaks boost instruction prefetching 8.7%
Enhancing Instruction Prefetching via Cache and TLB Management
-
Under node failures, wireless capacity and delay scale with the square root of reliable nodes
On Capacity and Delay of Wireless Networks with Node Failures
-
Power capping leaves LLM decode energy untouched
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
-
Chakra standardizes graph traces for AI workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
-
DMI-Lib cuts LLM internal observability overhead to 0.4-6.8%
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
-
Edge micro-agent fixes failures safely with no destructive actions
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
-
Inverted culling speeds dynamic LiDAR ray tracing
Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation
-
KEM-IES upgrades ECIES with PQC KEM and Ascon
Key Encapsulation Mechanism-Based Integrated Encryption Scheme (KEM-IES)
-
Caching reuses diffusion steps for 4.6x faster robot plans
Muninn: Your Trajectory Diffusion Model But Faster
-
Mamba-2 classifies network bursts directly from raw bytes
MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
-
Cloud trace decomposition predicts performance at 2% error
Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study
-
Adaptive DNN splits cut energy by 27-36% on real edge-cloud hardware
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
-
Apple MPS shows 21x latency spikes in narrow decoding ranges
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
-
4.46x jump in quantum sim time at 29 qubits on M4 Pro; GPU speedups reach 10x despite 1.85x bandwidth limit
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture
-
Single-thread JPEG benchmarks misrank decoders for DataLoaders
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
-
DDR5 single sub-channel matches cache lines but loses 40-60% bandwidth
Single 32-bit Sub-Channel DDR5 DIMMs: Architecture, Performance Bounds, and Standardisation
-
Cyclic tuning raises RAG quality by up to 54%
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG
-
Unified runtime delivers 2.55x decode speedup for low-rank transformers
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
-
Fluxion speeds long-context inference 1.5x-3.7x via CPU-GPU hybrid sparse attention
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
-
First benchmark supplies real data for LLM hyperparameter tuning
LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems
-
AD replaces finite differences in INLA for 4-8x gradient speedups
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
-
Pipeline speeds power-of-two DNNs on edge FPGAs by up to 3.6x
PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs
-
LLMs automate FPGA accelerator design space exploration
LLM-Driven Design Space Exploration of FPGA-based Accelerators
-
Int4 KV cache outruns fp16 on Apple Silicon
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
-
Task category explains 3x more variance than method in LLM kernel correctness
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
-
Algebraic coarsening delivers 3x speedup in GPU contact solves
AGIPC: Adaptive In-Solve Algebraic Coarsening for GPU IPC
-
LLM agents turn GPU profiles into optimization advice
KEET: Explaining Performance of GPU Kernels Using LLM Agents
-
Light storage limits turn content-provider competition into a potential game
Decentralized Edge Caching under Budget and Storage Constraints: A Game-Theoretic Approach
-
4-5 workloads preserve 96-99% of SPEC CPU2026 behavior, a suite with higher instruction volume and cache pressure
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
-
GPU layer speeds exascale trace analysis by up to 314x
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
-
Same LLM name produces different services by host
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
-
Streaming top-k scales compressed sparse attention to 1M tokens in 6 GB
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
-
Two post-quantum signatures pass Australia's payment speed test
Post-Quantum Cryptography Migration in Australian Real-Time Payment Infrastructure: A Monte Carlo Simulation Study of the New Payments Platform
-
SPEC CPU2026 standardizes mixed-workload CPU benchmarking
SPEC CPU: The Next Generation
-
Response time distributions derived for priority queues with preemption overhead
Priority Scheduling in the M/G/1 with Preemption Overhead
-
Compiler splits recursive datatypes into separate field buffers
SoCal: A Language for Memory-Layout Factorization of Recursive Datatypes
-
Fixed-core approach yields 211x higher efficiency for edge GEMM
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
-
Apple Silicon runs 80B LLMs at 23x Nvidia energy efficiency
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
-
Workflow turns raw measurements into defensible ECE/CS results
How to Do Statistical Evaluations in ECE/CS Papers: A Practical Playbook for Defensible Results