hub Canonical reference

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, William J. Dally · 2015 · cs.CV · arXiv 1510.00149

Canonical reference. 83% of citing Pith papers cite this work as background.

62 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 62 citing papers arXiv PDF

abstract

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 method 2

citation-polarity summary

background 10 use method 2

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

cs.LG · 2026-05-04 · conditional · novelty 8.0 · 2 refs

INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

Federated Learning: Strategies for Improving Communication Efficiency

cs.LG · 2016-10-18 · conditional · novelty 8.0

Structured updates (low-rank or masked) and sketched updates (quantized, rotated, subsampled) reduce uplink communication in federated learning by up to two orders of magnitude on convolutional and recurrent networks.

When Bits Break Recourse: Counterfactual-Faithful Quantization

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.

Winning Lottery Tickets in Neural Networks via a Quantum-Inspired Classical Algorithm

quant-ph · 2026-05-13 · conditional · novelty 7.0

A classical polynomial-time algorithm for optimized sampling of lottery tickets in neural networks removes the exponential dependence on data dimension from prior classical approaches.

Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

On the Decompositionality of Neural Networks

cs.LO · 2026-04-09 · unverdicted · novelty 7.0

Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization

cs.CV · 2024-08-01 · unverdicted · novelty 7.0

CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

cs.CV · 2017-04-17 · accept · novelty 7.0

MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.

AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

AutoMCU uses feasibility-first LLM multi-agent coordination to automate MCU-constrained neural network design, delivering competitive accuracy on CIFAR-10/100 in 1-2 hours versus hundreds of GPU hours for prior HW-NAS methods.

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

eess.SP · 2026-05-16 · unverdicted · novelty 6.0

Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.

ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems

cs.LG · 2026-05-12 · conditional · novelty 6.0

ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.

ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.

Theory-optimal Quantization Based on Flatness

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on LLaMA-3-8B.

Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream tasks.

ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while matching or exceeding it on two text-classification benchmarks and compressing the

Homodyne Photonic Tensor Processor exceeds 1,000-TOPS

cs.ET · 2026-04-20 · unverdicted · novelty 6.0

A homodyne photonic tensor processor using TFLN transmitters and Si/SiN circuits demonstrates 1,000-6,000 TOPS throughput with 6-7 bit accuracy at up to 120 Gbaud/s clock rates.

UCCL-Zip: Lossless Compression Supercharged GPU Communication

cs.DC · 2026-04-19 · unverdicted · novelty 6.0

UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.

Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition

cs.AR · 2026-04-17 · unverdicted · novelty 6.0

A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

cs.CL · 2026-04-10 · unverdicted · novelty 6.0

Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.

citing papers explorer

Showing 50 of 62 citing papers.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts cs.LG · 2026-05-13 · unverdicted · none · ref 22 · internal anchor
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning cs.LG · 2026-05-04 · conditional · none · ref 37 · 2 links · internal anchor
INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
Federated Learning: Strategies for Improving Communication Efficiency cs.LG · 2016-10-18 · conditional · none · ref 10 · internal anchor
Structured updates (low-rank or masked) and sketched updates (quantized, rotated, subsampled) reduce uplink communication in federated learning by up to two orders of magnitude on convolutional and recurrent networks.
When Bits Break Recourse: Counterfactual-Faithful Quantization cs.LG · 2026-05-16 · unverdicted · none · ref 12 · internal anchor
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis cs.LG · 2026-05-15 · unverdicted · none · ref 57 · internal anchor
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
Winning Lottery Tickets in Neural Networks via a Quantum-Inspired Classical Algorithm quant-ph · 2026-05-13 · conditional · none · ref 1 · internal anchor
A classical polynomial-time algorithm for optimized sampling of lottery tickets in neural networks removes the exponential dependence on data dimension from prior classical approaches.
Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns cs.LG · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals cs.CR · 2026-05-08 · unverdicted · none · ref 45 · internal anchor
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
On the Decompositionality of Neural Networks cs.LO · 2026-04-09 · unverdicted · none · ref 13 · internal anchor
Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling cs.CL · 2025-12-01 · conditional · none · ref 35 · internal anchor
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization cs.CV · 2024-08-01 · unverdicted · none · ref 2 · internal anchor
CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications cs.CV · 2017-04-17 · accept · none · ref 5 · internal anchor
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems cs.LG · 2026-05-20 · unverdicted · none · ref 7 · internal anchor
AutoMCU uses feasibility-first LLM multi-agent coordination to automate MCU-constrained neural network design, delivering competitive accuracy on CIFAR-10/100 in 1-2 hours versus hundreds of GPU hours for prior HW-NAS methods.
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis eess.SP · 2026-05-16 · unverdicted · none · ref 260 · internal anchor
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems cs.LG · 2026-05-12 · conditional · none · ref 51 · internal anchor
ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models cs.LG · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices cs.LG · 2026-05-11 · unverdicted · none · ref 57 · 3 links · internal anchor
DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
Theory-optimal Quantization Based on Flatness cs.LG · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on LLaMA-3-8B.
Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning cs.LG · 2026-05-09 · unverdicted · none · ref 48 · internal anchor
Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream tasks.
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models cs.CL · 2026-04-27 · unverdicted · none · ref 4 · internal anchor
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while matching or exceeding it on two text-classification benchmarks and compressing the
Homodyne Photonic Tensor Processor exceeds 1,000-TOPS cs.ET · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
A homodyne photonic tensor processor using TFLN transmitters and Si/SiN circuits demonstrates 1,000-6,000 TOPS throughput with 6-7 bit accuracy at up to 120 Gbaud/s clock rates.
UCCL-Zip: Lossless Compression Supercharged GPU Communication cs.DC · 2026-04-19 · unverdicted · none · ref 18 · internal anchor
UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.
Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition cs.AR · 2026-04-17 · unverdicted · none · ref 17 · internal anchor
A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism cs.CL · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization cs.CV · 2026-04-10 · unverdicted · none · ref 7 · internal anchor
DeFakeQ introduces an adaptive bidirectional quantization method tailored for deepfake detectors that maintains detection accuracy while enabling real-time performance on resource-constrained edge devices.
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models cs.LG · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
LILogic Net: Compact Logic Gate Networks with Learnable Connectivity for Efficient Hardware Deployment cs.LG · 2025-11-15 · unverdicted · none · ref 8 · internal anchor
LILogicNet trains compact logic-gate networks with learnable sparse connectivity via Top-K selection, reaching 98.45% MNIST accuracy with 8k gates and 60.98% CIFAR-10 accuracy with 256k gates while using far fewer gates than prior logic models.
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs cs.LG · 2025-06-15 · unverdicted · none · ref 14 · internal anchor
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
Poisoning with A Pill: Circumventing Detection in Federated Learning cs.LG · 2024-07-22 · unverdicted · none · ref 42 · internal anchor
A three-stage pill-based augmentation makes existing FL poisoning attacks evade popular defenses while raising error rates up to 7x on both IID and non-IID data.
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation cs.LG · 2023-10-19 · conditional · none · ref 195 · internal anchor
SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cs.LG · 2023-06-24 · unverdicted · none · ref 55 · internal anchor
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance cs.LG · 2023-05-09 · accept · none · ref 10 · internal anchor
FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.
Memory- and Communication-Aware Model Compression for Distributed Deep Learning Inference on IoT stat.ML · 2019-07-26 · unverdicted · none · ref 4 · internal anchor
NoNN partitions a teacher model into disjoint compressed students via network science for distributed IoT inference, matching teacher accuracy with far lower per-device memory and communication.
Open DNN Box by Power Side-Channel Attack cs.CR · 2019-07-21 · unverdicted · none · ref 47 · internal anchor
Power side-channel analysis recovers DNN architecture and parameters at 96.5% average accuracy on real embedded devices.
A Unified Optimization Approach for CNN Model Inference on Integrated GPUs cs.DC · 2019-07-03 · unverdicted · none · ref 12 · internal anchor
A unified IR plus ML-based scheduling for CNN inference on multi-vendor integrated GPUs matches or exceeds vendor libraries (up to 1.62x) on image models while supporting more models.
COP: Customized Deep Model Compression via Regularized Correlation-Based Filter-Level Pruning cs.CV · 2019-06-25 · unverdicted · none · ref 4 · internal anchor
COP prunes CNN filters using correlation-based importance with global normalization and dual regularization on parameter quantity and FLOPs to enable customized compression.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation cs.CL · 2016-09-26 · accept · none · ref 20 · internal anchor
GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human evaluations.
SGDR: Stochastic Gradient Descent with Warm Restarts cs.LG · 2016-08-13 · accept · none · ref 4 · internal anchor
SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
Multibit neural inference in a N-ary crossbar architecture cs.AR · 2026-04-28 · unverdicted · none · ref 15 · internal anchor
Simulation of 4-state MTJ crossbars achieves 94.48% MNIST accuracy for neural inference, close to 97.56% software baseline, with analysis showing quantization as primary error and an optimal number of states per cell.
FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices cs.LG · 2026-04-28 · unverdicted · none · ref 57 · internal anchor
Fed-FSTQ reduces uplink traffic by 46x and improves time-to-accuracy by 52% in federated LLM fine-tuning using Fisher-guided token quantization and selection.
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models cs.LG · 2025-12-25 · unverdicted · none · ref 4 · 2 links · internal anchor
A post-training quantization technique for 1-bit LLMs that corrects layer-wise error accumulation and anisotropic representation distortion to preserve output behavior more effectively than existing methods.
Vanishing Contributions: A Unified Framework for Smooth and Iterative Model Compression cs.LG · 2025-10-09 · unverdicted · none · ref 5 · internal anchor
VCON is a unified framework for smooth iterative DNN compression that uses parallel execution and an affine combination to progressively replace the original model with its compressed form during fine-tuning.
AutoSculpt: A Pattern-based Model Auto-pruning Framework Using Reinforcement Learning and Graph Learning cs.AI · 2024-12-24 · unverdicted · none · ref 11 · internal anchor
AutoSculpt models DNNs as graphs, embeds pruning patterns, and uses deep reinforcement learning to reach up to 90% pruning and 18% better FLOPs reduction than baselines on ResNet, MobileNet, VGG, and Vision Transformers.
Neuron ranking -- an informed way to condense convolutional neural networks architecture cs.LG · 2019-07-03 · unverdicted · none · ref 4 · internal anchor
Shapley value and variational importance switch methods produce consistent rankings of filter importance in CNNs, enabling compression and interpretability.
One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-off in Machine Learning Cloud Service APIs via Tolerance Tiers cs.LG · 2019-06-26 · unverdicted · none · ref 34 · internal anchor
Proposes Tolerance Tiers architecture for MLaaS to let consumers select accuracy-latency trade-offs, shown to outperform single-version deployment on ASR and vision workloads.
New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms cs.CV · 2019-06-25 · unverdicted · none · ref 7 · internal anchor
Replacing pointwise convolutions with DWHT yields a model with 79.1% fewer parameters, 48.4% fewer FLOPs, and 1.49% higher accuracy than MobileNet-V1 on CIFAR-100.
MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization cs.AR · 2026-05-22 · unverdicted · none · ref 8 · internal anchor
MASQ claims up to 16.06x speedup and 4.18x energy gain over A100 for masked diffusion via stage-wise multi-precision quantization and specialized hardware units while preserving quality.
GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection cs.CV · 2026-05-20 · unverdicted · none · ref 24 · internal anchor
GSA-YOLO modifies YOLOv8n with structured sparsity via Group Lasso and Sparse Structure Selection plus Adaptive Knowledge Distillation, reporting 189.62 FPS and mAP50:95 gains of 2.4% and 1.8% on HiXray and PIDray datasets.
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder cs.CL · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
Trajectory-Aware Adaptive Inference in Object Detection Models cs.CV · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
Introduces an early-exit mechanism in YOLOv8 that uses inter-vessel distance and closing speed from trajectories to adapt computation depth per frame in maritime scenes.

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer