archive
Every paper Pith has read.
115 papers in cs.PF · page 1
-
Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation
-
LLM tunes Linux knobs for a stable 72% gain over defaults
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models
-
Heterogeneous solvers up to 32% faster than GPU-only for big matrices
Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
-
Block-scale search cuts quantization error 27% in BFP
Search Your Block Floating Point Scales!
-
Adaptive packed layouts enable efficient VLA ML code
Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation
-
Joint TLB-cache tweaks boost instruction prefetching 8.7%
Enhancing Instruction Prefetching via Cache and TLB Management
-
Under node failures, wireless capacity and delay scale with the square root of reliable nodes
On Capacity and Delay of Wireless Networks with Node Failures
-
Power capping leaves LLM decode energy untouched
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
-
Chakra standardizes graph traces for AI workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
-
DMI-Lib cuts LLM internal observability overhead to 0.4-6.8%
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
-
Edge micro-agent fixes failures safely with no destructive actions
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
-
Inverted culling speeds dynamic LiDAR ray tracing
Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation
-
KEM-IES upgrades ECIES with PQC KEM and Ascon
Key Encapsulation Mechanism-Based Integrated Encryption Scheme (KEM-IES)
-
Caching reuses diffusion steps for 4.6x faster robot plans
Muninn: Your Trajectory Diffusion Model But Faster
-
Mamba-2 classifies network bursts directly from raw bytes
MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
-
Cloud trace decomposition predicts performance at 2% error
Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study
-
Adaptive DNN splits cut energy by 27-36% on real edge-cloud hardware
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
-
Apple MPS shows 21x latency spikes in narrow decoding ranges
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
-
4.46x jump in quantum sim time at 29 qubits on M4 Pro; GPU speedups reach 10x despite 1.85x bandwidth limit
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture
-
Single-thread JPEG benchmarks misrank decoders for DataLoaders
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
-
DDR5 single sub-channel matches cache lines but loses 40-60% bandwidth
Single 32-bit Sub-Channel DDR5 DIMMs: Architecture, Performance Bounds, and Standardisation
-
Cyclic tuning raises RAG quality by up to 54%
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG
-
Unified runtime delivers 2.55x decode speedup for low-rank transformers
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
-
Fluxion speeds long-context inference 1.5x-3.7x via CPU-GPU hybrid sparse attention
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
-
First benchmark supplies real data for LLM hyperparameter tuning
LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems
-
AD replaces finite differences in INLA for 4-8x gradient speedups
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
-
Pipeline speeds power-of-two DNNs on edge FPGAs by up to 3.6x
PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs
-
LLMs automate FPGA accelerator design space exploration
LLM-Driven Design Space Exploration of FPGA-based Accelerators
-
Int4 KV cache outruns fp16 on Apple Silicon
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
-
Task category explains 3x more variance than method in LLM kernel correctness
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
-
Algebraic coarsening delivers 3x speedup in GPU contact solves
AGIPC: Adaptive In-Solve Algebraic Coarsening for GPU IPC
-
LLM agents turn GPU profiles into optimization advice
KEET: Explaining Performance of GPU Kernels Using LLM Agents
-
Light storage limits turn content-provider competition into a potential game
Decentralized Edge Caching under Budget and Storage Constraints: A Game-Theoretic Approach
-
4-5 workloads preserve 96-99% of SPEC CPU2026 behavior, a suite with higher instruction volume and cache pressure
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
-
GPU layer speeds exascale trace analysis by up to 314x
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
-
Same LLM name produces different services by host
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
-
Streaming top-k scales compressed sparse attention to 1M tokens in 6 GB
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
-
Two post-quantum signatures pass Australia's payment speed test
Post-Quantum Cryptography Migration in Australian Real-Time Payment Infrastructure: A Monte Carlo Simulation Study of the New Payments Platform
-
SPEC CPU2026 standardizes mixed-workload CPU benchmarking
SPEC CPU: The Next Generation
-
Response time distributions derived for priority queues with preemption overhead
Priority Scheduling in the M/G/1 with Preemption Overhead
-
Compiler splits recursive datatypes into separate field buffers
SoCal: A Language for Memory-Layout Factorization of Recursive Datatypes
-
Fixed-core approach yields 211x higher efficiency for edge GEMM
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
-
Apple Silicon runs 80B LLMs at 23x Nvidia energy efficiency
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
-
Workflow turns raw measurements into defensible ECE/CS results
How to Do Statistical Evaluations in ECE/CS Papers: A Practical Playbook for Defensible Results