Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications

Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani,

Andrey Malevich, Satish Nadathur, Juan Pino, Martin Schatz, Alexander Sidorov, Viswanath Sivakumar, Andrew Tulloch, Xiaodong Wang, Yiming Wu, Hector Yuen, Utku Diril, Dmytro Dzhulgakov, Kim Hazelwood, Bill Jia, Yangqing Jia, Lin Qiao, Vijay Rao, Nadav Rotem, Sungjoo Yoo, Mikhail Smelyanskiy

classification: cs.LG, stat.ML

keywords: learning, deep, models, centers, data, facebook, hardware, inference

Abstract

The application of deep learning techniques has resulted in remarkable improvements in machine learning models. This paper provides detailed characterizations of the deep learning models used in many Facebook social network services. We present the computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose and accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics, and computing platforms to address the challenges of the workloads that typically run in data centers.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
cs.CR 2026-04 unverdicted novelty 7.0

Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.

FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
cs.AR 2026-03 conditional novelty 6.0

FireBridge enables cycle-accurate hardware-firmware co-verification in standard simulators using randomized memory bridges, delivering up to 50x faster debug iterations than FPGA-based flows for accelerators such as s...