MANOJAVAM unifies matrix multiplication and SVD for PCA on FPGA with block-streaming systolic arrays and pipelined Jacobi-CORDIC, delivering up to 22.75x SVD speedup and 42.14x lower energy than an NVIDIA A6000 GPU.
In-datacenter performance analysis of a tensor processing unit
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6representative citing papers
A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 across multiple SLMs.
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
Magnetization dynamics in MTJ-based CMOS+X neurons support nonlinear computation in spiking networks via threshold activation, response latency, and absolute refraction, as shown in XOR classification simulations.
Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.
citing papers explorer
-
MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis
MANOJAVAM unifies matrix multiplication and SVD for PCA on FPGA with block-streaming systolic arrays and pipelined Jacobi-CORDIC, delivering up to 22.75x SVD speedup and 42.14x lower energy than an NVIDIA A6000 GPU.
-
EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 across multiple SLMs.
-
M100: An Orchestrated Dataflow Architecture Powering General AI Computing
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
-
Biologically Realistic Dynamics for Nonlinear Classification in CMOS+X Neurons
Magnetization dynamics in MTJ-based CMOS+X neurons support nonlinear computation in spiking networks via threshold activation, response latency, and absolute refraction, as shown in XOR classification simulations.
-
Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators
Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
-
Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.