In-datacenter performance analysis of a tensor processing unit · 2017

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis

cs.AR · 2026-05-02 · unverdicted · novelty 6.0

MANOJAVAM unifies matrix multiplication and SVD for PCA on FPGA with block-streaming systolic arrays and pipelined Jacobi-CORDIC, delivering up to 22.75x SVD speedup and 42.14x lower energy than an NVIDIA A6000 GPU.

EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

cs.AR · 2026-04-13 · unverdicted · novelty 6.0

A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 across multiple SLMs.

M100: An Orchestrated Dataflow Architecture Powering General AI Computing

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.

Biologically Realistic Dynamics for Nonlinear Classification in CMOS+X Neurons

cs.NE · 2026-04-03 · conditional · novelty 5.0

Magnetization dynamics in MTJ-based CMOS+X neurons support nonlinear computation in spiking networks via threshold activation, response latency, and absolute refraction, as shown in XOR classification simulations.

Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

cs.AR · 2026-04-29 · unverdicted · novelty 4.0

Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

cs.DC · 2026-04-13 · unverdicted · novelty 4.0

StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.

citing papers explorer

MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis
cs.AR · 2026-05-02 · unverdicted · none · ref 42

EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
cs.AR · 2026-04-13 · unverdicted · none · ref 4

M100: An Orchestrated Dataflow Architecture Powering General AI Computing
cs.LG · 2026-04-20 · unverdicted · none · ref 26

Biologically Realistic Dynamics for Nonlinear Classification in CMOS+X Neurons
cs.NE · 2026-04-03 · conditional · none · ref 3

Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators
cs.AR · 2026-04-29 · unverdicted · none · ref 11

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
cs.DC · 2026-04-13 · unverdicted · none · ref 55
