FBGEMM: Enabling high-performance low-precision deep learning inference
7 Pith papers cite this work.
citing papers explorer
- Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
  Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead. (a toy frequency-split sketch follows this list)
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
  HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% in NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions. (a NumPy sketch of the decomposition follows this list)
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
  LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration. (a simulated E4M3 cast follows this list)
- One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
  HELM adaptively partitions HBM between embedding (EMB) and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads. (a toy rebalancing rule follows this list)
- An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
  Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention. (a top-k attention sketch follows this list)
- At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts
  A systolic-array CNN accelerator on the Lattice iCE40UP5K FPGA achieves 98% validation accuracy for SCG feature classification while using 8.55 mW, 95.5 ms inference time, 2,861 LUTs, and 7 DSP blocks. (a cycle-level dataflow simulation follows this list)
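A few of the techniques above are concrete enough to sketch in code; the snippets below are minimal Python illustrations under stated assumptions, not the papers' implementations. First, the idea of partitioning a signal by frequency and perturbing the offloaded part, as in Privatar. The band split here is a generic radial low/high-pass, and the Gaussian noise merely stands in for Privatar's distribution-aware minimal perturbation; `cutoff`, `noise_std`, and the function name are made up for illustration.

```python
import numpy as np

def split_and_perturb(img, cutoff=0.25, noise_std=0.01, seed=0):
    """Split a 2-D image into a low-frequency band (kept local) and a
    high-frequency band that is noised before being offloaded.

    Generic band split; NOT Privatar's horizontal frequency partitioning
    or its distribution-aware perturbation.
    """
    h, w = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))          # centered 2-D spectrum
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)      # distance from the DC bin
    low_mask = radius <= cutoff * min(h, w) / 2    # keep a central disc of the spectrum
    low = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real
    high = img - low                               # complementary high-frequency band
    rng = np.random.default_rng(seed)
    return low, high + rng.normal(0.0, noise_std, high.shape)
```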
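LLM.int8()'s decomposition is well documented and sketches faithfully: feature dimensions whose activation magnitude crosses a threshold (6.0 by default in the paper) are multiplied in higher precision, while the rest use vector-wise int8 quantization, one scale per activation row and one per weight column, with int32 accumulation.

```python
import numpy as np

def llm_int8_matmul(X, W, threshold=6.0):
    """Mixed-precision X @ W in the style of LLM.int8().

    X: (tokens, features) activations; W: (features, out) weights.
    Outlier feature dimensions go through a higher-precision matmul
    (fp16 in the paper, float64 here); the rest through int8.
    """
    outlier = np.any(np.abs(X) >= threshold, axis=0)       # outlier feature columns
    out_hi = X[:, outlier] @ W[outlier, :]

    Xr, Wr = X[:, ~outlier], W[~outlier, :]
    sx = np.maximum(np.abs(Xr).max(axis=1, keepdims=True), 1e-8) / 127.0  # row scales
    sw = np.maximum(np.abs(Wr).max(axis=0, keepdims=True), 1e-8) / 127.0  # column scales
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    out_lo = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)      # dequantize

    return out_hi + out_lo

# Plant an outlier column and compare against the exact product;
# the error stays small relative to the output scale.
X = np.random.randn(4, 512); X[:, 7] *= 20.0
W = np.random.randn(512, 256)
err = np.abs(llm_int8_matmul(X, W) - X @ W).max()
```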
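LoKA's profiling pipeline and kernel orchestration are not public detail to reconstruct here, but the number format it targets is standard: FP8 E4M3, with 4 exponent bits, 3 mantissa bits, and a max normal of ±448. A simulated round-to-nearest E4M3 cast like the one below is the usual way to profile offline which tensors tolerate FP8; the per-tensor scaling in `fp8_matmul` is a common pairing, not LoKA's scheme.

```python
import numpy as np

def quantize_e4m3(x):
    """Round float values to the nearest representable FP8 E4M3 number,
    simulated in float32 (3 mantissa bits, normal binades 2^-6..2^8,
    subnormal spacing 2^-9, max normal +/-448)."""
    x = np.asarray(x, dtype=np.float32)
    mag = np.maximum(np.abs(x), 2.0 ** -9)        # avoid log2(0)
    e = np.clip(np.floor(np.log2(mag)), -6, 8)    # binade of each value
    quantum = 2.0 ** (e - 3)                      # spacing given 3 mantissa bits
    return np.clip(np.round(x / quantum) * quantum, -448.0, 448.0)

def fp8_matmul(A, B):
    """Matmul with both operands cast to E4M3 under a per-tensor scale."""
    sa = max(np.abs(A).max(), 1e-12) / 448.0      # scale A into the E4M3 range
    sb = max(np.abs(B).max(), 1e-12) / 448.0
    return (quantize_e4m3(A / sa) @ quantize_e4m3(B / sb)) * (sa * sb)
```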
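HELM's three-layer PPO controller does not compress into a snippet, so the toy rule below only shows the object it controls: a single HBM pool split between an embedding cache and a KV cache. A naive shift-toward-the-hungrier-cache heuristic stands in for the learned policy; the step size and bounds are arbitrary.

```python
def rebalance_hbm(emb_frac, emb_miss_rate, kv_miss_rate, step=0.02):
    """Shift the HBM split toward whichever cache is currently missing more.

    Toy stand-in for HELM's PPO controller; returns the fraction of the
    shared HBM pool assigned to the embedding (EMB) cache, rest goes to KV.
    """
    if emb_miss_rate > kv_miss_rate:
        emb_frac += step
    elif kv_miss_rate > emb_miss_rate:
        emb_frac -= step
    return min(max(emb_frac, 0.05), 0.95)   # never starve either cache
```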
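Fluxion's CPU-GPU pipelining is likewise out of scope for a sketch, but its hybrid sparse attention rests on a standard step: score every cached key cheaply, then run exact attention over only the top-k. In an offloaded design, K and V live in host memory and only the selected rows cross the bus; the function below keeps everything in NumPy, and the fixed `k` is an assumption of mine.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """Attend over only the k highest-scoring cached keys for one query.

    q: (d,) query; K: (seq, d) cached keys; V: (seq, d) cached values.
    """
    scores = K @ q / np.sqrt(q.shape[-1])      # full (cheap) scoring pass
    idx = np.argpartition(scores, -k)[-k:]     # indices of the top-k keys
    s = scores[idx]
    w = np.exp(s - s.max())                    # numerically stable softmax
    w /= w.sum()
    return w @ V[idx]                          # only k value rows are touched
```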
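Finally, the dataflow behind a systolic-array accelerator like the iCE40UP5K design can be simulated cycle by cycle. In the generic output-stationary variant below (not the paper's exact architecture), PE (i, j) owns C[i, j], and operands arrive skewed by one cycle per row and per column, which is why the array needs n + m + k - 2 cycles to drain.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-accurate toy simulation of an output-stationary systolic array.

    A: (n, k), B: (k, m). At cycle t, PE (i, j) receives the s-th operand
    pair of its dot product, where s = t - i - j (the row/column skew).
    """
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for t in range(n + m + k - 2):             # cycles until the last PE finishes
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:                 # operand pair present this cycle
                    C[i, j] += A[i, s] * B[s, j]
    return C

A, B = np.random.randn(3, 5), np.random.randn(5, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```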