archive
Every paper Pith has read.
344 papers in cs.AR · page 1
-
Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation
-
Time-domain near-memory MAC reaches 7.62 TOPS/W
Time Domain Near Memory Computing Engine
-
ViTs reach 84% accuracy by replacing layer norm with evolved scalars
Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation
-
End-to-end DVS-memristor system is the missing piece for low-power vision
Memristor Technologies for Dynamic Vision Sensors: A Critical Assessment and Research Roadmap
-
FPGA accelerator skips sparse beams for 2x faster MIMO localization
Efficient Implementation of an Adaptive Transformer Accelerator for Massive MIMO Outdoor Localization
-
7B model surpasses 671B baselines on SVA generation
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
-
FPGA lock agents boost OLTP throughput 51x over CPUs
FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration
-
PoisonCap gives CHERI strict use-after-free at zero overhead
PoisonCap: Efficient Hierarchical Temporal Safety for CHERI
-
Block-scale search cuts quantization error 27% in BFP
Search Your Block Floating Point Scales!
-
Joint TLB-cache tweaks boost instruction prefetching 8.7%
Enhancing Instruction Prefetching via Cache and TLB Management
-
FPGA SoC matches silicon SNN accuracy for neuromorphic edge tasks
Heterogeneous SoC Integrating an Open-Source Recurrent SNN Accelerator for Neuromorphic Edge Computing on FPGA
-
Calibration feedback control cuts optimization gaps in local and tight-loop regimes
Runtime Calibration as State-Trajectory Feedback Control in Quantum-Classical Workflows
-
Cumulative updates fix gradient flow in low-power RNNs
Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications
-
Dynamic scheduler lifts MoE inference 1.3-1.6x on PIM hardware
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
-
TLX gives Triton direct warp-group control for modern GPU hardware
TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments
-
LLMs automate chip design but create security risks
LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges
-
Hybrid chip runs GNN at 2.94M events/sec for physics triggers
Reconfigurable Computing Challenge: Real-Time Graph Neural Networks for Online Event Selection in Big Science
-
Error profiles detect stolen approximate circuit IP despite mimicry
ObfAx: Obfuscation and IP Piracy Detection in Approximate Circuits
-
Piezoelectric sensors turn desk vibrations into six-gesture commands
Towards an End-To-End System for Real-Time Gesture Recognition from Surface Vibrations
-
Semantic clustering cuts hardware assertion sets by 76%
Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring
-
LLM agents size RF amplifiers via resource allocation
RFAmpDesigner: A Self-Evolving Multi-Agent LLM Framework for Automated Radio Frequency Amplifier Design
-
KV-cache movement regularization cuts static-graph LLM latency spikes
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
-
Wafer integration of three 2D devices decides next computing decade
Emerging 2D Materials for Beyond von Neumann Computing: A Perspective
-
LLM accuracy depends only on evicted tokens
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
-
ReRAM-on-logic chip reaches 14-136 tokens per second on LLMs
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
-
Memoized heuristics scale ion-trap qubit mapping
Scaling Qubit Mapping and Routing With Position Graph Abstraction and Memoization
-
Apple MPS shows 21x latency spikes in narrow decoding ranges
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
-
HyDRA cache bypass meets accelerator deadlines while boosting heterogeneous system throughput
HyDRA: Deadline and Reuse-Aware Cacheability for Hardware Accelerators
-
Low-complexity denoiser matches heavy mmWave MIMO methods
Low-Complexity Beamspace Channel Denoiser for mmWave Massive MIMO with Low-Resolution ADCs
-
Reconfigurable multiplier cuts power 44-68% in RISC-V core
A Reconfigurable Multiplier Architecture for Error-Resilient Applications in RISC-V Core
-
DDR5 single sub-channel matches cache lines but loses 40-60% bandwidth
Single 32-bit Sub-Channel DDR5 DIMMs: Architecture, Performance Bounds, and Standardisation
-
Edge processor hits 109 TFLOPS/W on DeepSeek
DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing
-
Coprime test vectors localize faulty rows in systolic arrays after one pass
FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors
-
Static checker decides barrier sufficiency for accelerator races
AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs
-
Model runs 1024-core chip sims 115x faster at under 7% error
Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling
-
Plasma simulations need three post-Moore tech tiers
Post-Moore Technologies for Plasma Simulation: A Community Roadmap
-
GNNs for EDA succeed when matched to each task's native algebra
Graph Computation Meets Circuit Algebra: A Task-Aligned Analysis of Graph Neural Networks for Electronic Design Automation
-
Bit-hardening methods surpass ECC for reliable DNNs with no memory cost
Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs
-
TREA accelerator reduces edge detection latency up to 9x
TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification
-
Reconfigurable FPU gives up to 8x throughput for low-precision dot products
TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines
-
Open schema and datasets released for ML benchmarks in chip design
EDA-Schema-V2: A Multimodal Schema, Open Datasets, and Benchmarks for Machine Learning in Digital Physical Design
-
Agents solve only 37% of practical chip design rule problems
Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing
-
Tuning CORDIC iteration depth trims inference cycles by 33%
CARMEN: CORDIC-Accelerated Resource-Efficient Multi-Precision Inference Engine for Deep Learning
-
Posit engine cuts ADAS power by 72% with near-full accuracy
EULER-ADAS: Energy-Efficient & SIMD-Unified Logarithmic-Posit Engine for Precision-Reconfigurable Approximate ADAS Acceleration
-
FPGA YOLOv3-Tiny system detects targets in 0.211 seconds
Development of embedded target detection system based on FPGA and YOLOv3-Tiny
-
Self-supervised pretraining yields tiny wildfire spotters for satellites
On-Orbit Real-Time Wildfire Detection Under On-Board Constraints