
arxiv: 2606.00567 · v2 · pith:DPE5MEECnew · submitted 2026-05-30 · 💻 cs.AR · cs.PF

Activation Concentration: Characterizing Column-Level Output Sparsity Across Diffusion Model Architectures

Pith reviewed 2026-06-28 18:21 UTC · model grok-4.3

classification 💻 cs.AR cs.PF
keywords diffusion models · activation sparsity · column-level sparsity · systolic arrays · UNet · transformers · hardware acceleration · sparsity exploitation

The pith

Element-level sparsity overstates hardware-exploitable sparsity in diffusion models by up to 78 percentage points, and column-level measurement sorts the workloads into a three-way taxonomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures activation sparsity at the column granularity required by systolic-array hardware instead of at individual elements. It finds that prior element-level reports of 52–85 percent sparsity greatly exceed what accelerators can actually skip, with the gap reaching 78 percentage points in some cases. The measurements sort workloads into a three-way taxonomy: UNet-plus-transformer models concentrate non-zeros and permit up to 30.6 percent cycle reductions, the pure-transformer DiT model disperses them for only 12.4 percent savings, and motion models vary widely, reaching 50.8 percent for MLD. Cycle-level simulation shows memory stalls account for 84–89 percent of total cycles, and accuracy responds differently across the groups as thresholds rise: UNet-plus-transformer workloads degrade gracefully, while motion models hit an accuracy cliff.

Core claim

Measurements across seven diffusion workloads reveal that element-level sparsity overstates hardware-exploitable sparsity by up to 78 percentage points and expose a three-way taxonomy. UNet+transformer workloads exhibit activation concentration with workload-dependent cycle reductions up to 30.6 percent. Pure-transformer DiT shows dispersion, yielding 12.4 percent. Motion/dance transformer workloads range from modest reductions to 50.8 percent for MLD. Cycle-level simulation on a GDDR6-based accelerator confirms that memory stalls account for up to 84–89 percent of total cycles and that layout sensitivity tracks the profiling-based taxonomy. A full accuracy sweep across five thresholds reveals that UNet+transformer workloads degrade gracefully, while motion models exhibit an accuracy cliff between the primary operating point and the next threshold.

What carries the argument

Activation concentration: non-zero GELU outputs cluster in a small number of columns rather than spreading evenly across all of them.
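The difference between the two granularities can be sketched in a few lines. This is a minimal illustration, not the paper's profiling code: the function names and the toy matrices are assumptions, the matrix is laid out as tokens × columns, and a column counts as hot if any token exceeds the threshold, as in the Figure 3 excerpt. The value τ = 0.164 is the one quoted in the figure captions.

```python
import numpy as np

def element_sparsity(acts: np.ndarray, tau: float) -> float:
    """Fraction of individual activations with |value| <= tau."""
    return float(np.mean(np.abs(acts) <= tau))

def column_sparsity(acts: np.ndarray, tau: float) -> float:
    """Fraction of columns in which every token has |value| <= tau."""
    hot = (np.abs(acts) > tau).any(axis=0)  # hot if any token exceeds tau
    return float(1.0 - hot.mean())

tau = 0.164  # threshold quoted in the figure captions

# Hypothetical "concentrated" case: all large values sit in 2 of 8 columns.
concentrated = np.zeros((4, 8))
concentrated[:, :2] = 1.0

# Hypothetical "dispersed" case: the same count of large values,
# but exactly one per column.
dispersed = np.zeros((4, 8))
dispersed[np.arange(8) % 4, np.arange(8)] = 1.0

for name, acts in (("concentrated", concentrated), ("dispersed", dispersed)):
    print(name, element_sparsity(acts, tau), column_sparsity(acts, tau))
```

Both toy matrices have the same element-level sparsity, yet only the concentrated one leaves whole columns that a column-granular skip could drop.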

If this is right

  • UNet+transformer workloads permit up to 30.6 percent cycle reductions when column skipping is enabled.
  • Pure-transformer DiT workloads deliver only 12.4 percent reductions because of activation dispersion.
  • Motion transformer workloads produce cycle reductions between modest values and 50.8 percent depending on token dimension and expansion ratio.
  • Memory stalls account for 84-89 percent of total cycles on a GDDR6-based accelerator regardless of sparsity.
  • Workload group together with model dimensions determines whether column-level memory layout changes improve performance.
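One way to see why token dimension matters is a back-of-envelope baseline that is not taken from the paper: assume every activation is independently below threshold with the same probability, so a column of M tokens is cold only if all M entries are. The function name and the independence assumption are hypothetical; M follows the paper's notation for token dimension.

```python
def cold_column_probability(p_element_cold: float, num_tokens: int) -> float:
    """Chance that all num_tokens entries of one column are below threshold,
    under the (hypothetical) assumption of independent, identical entries."""
    return p_element_cold ** num_tokens

# 0.85 is the upper end of the element-level sparsity range quoted above.
baseline = {m: cold_column_probability(0.85, m) for m in (1, 4, 16, 64)}
for m, prob in baseline.items():
    print(f"M={m:>3}: P(column cold) = {prob:.4f}")
```

Under this baseline a column with many tokens is almost never cold, so any sizeable column sparsity at large M would have to come from correlation across tokens. That is one reading of what the paper calls concentration, and it is consistent with the paper's remark that M alone does not determine the outcome.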

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware designs for diffusion inference could add column-aware skipping units that activate only for UNet-hybrid models.
  • Model designers could adjust expansion ratios in transformer blocks to increase column concentration and raise hardware efficiency.
  • Threshold selection for sparsity must be validated per workload group to avoid the accuracy cliff observed in motion models.
  • The same column-level measurement method could be applied to non-diffusion generative models to test whether the three-way pattern holds.
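The per-workload threshold validation suggested above could start from a simple proxy. The Figure 6 excerpt says the paper's setup measures masking-induced output shift; the exact metric is not given in this page, so the relative L2 error below, the function name, and the random test data are all assumptions made for illustration.

```python
import numpy as np

def masking_output_shift(hidden: np.ndarray, w2: np.ndarray, tau: float) -> float:
    """Relative L2 change of the fc2 output when cold columns are zeroed.

    hidden: GELU output, shape (tokens, columns)
    w2:     fc2 weight, shape (columns, out_features)
    """
    hot = (np.abs(hidden) > tau).any(axis=0)  # per-column hot flag
    masked = hidden * hot                     # broadcast: zero cold columns
    reference = hidden @ w2
    skipped = masked @ w2
    return float(np.linalg.norm(skipped - reference) / np.linalg.norm(reference))

# Synthetic data (hypothetical): mostly small values, a few active columns.
rng = np.random.default_rng(0)
hidden = rng.normal(scale=0.1, size=(16, 64))
hidden[:, :8] += 1.0
w2 = rng.normal(size=(64, 32))

for tau in (0.0, 0.164, 0.3, 0.5):
    print(tau, masking_output_shift(hidden, w2, tau))
```

A sharp jump in this number between two adjacent thresholds would be the kind of cliff the paper reports for motion models, although the paper's own accuracy measure may differ.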

Load-bearing premise

The seven chosen diffusion workloads and four modalities are representative enough that the observed three-way taxonomy and cycle-reduction numbers will generalize to other diffusion architectures and future models.

What would settle it

Profiling column-level sparsity on one additional diffusion model outside the original seven and finding its cycle reduction lies far outside the ranges reported for its workload group.

Figures

Figures reproduced from arXiv: 2606.00567 by Byeong Kil Lee, Dazhi Yang, Jeeho Ryoo, Shafayat Mowla Anik.

Figure 1: Element vs. column sparsity per model.
Adjacent text: … each model’s feed-forward network (FFN) layers share a common fc1 → GELU → fc2 structure whose near-zero GELU outputs create activation sparsity that hardware can exploit. Prior work reports 52–85% element-level sparsity across three diffusion workloads [15], suggesting that this optimization opportunity is broadly and uniformly available. However, this element-level m…
Figure 4: Element- vs. column-level sparsity. Column-level sparsity is therefore always ≤ element-level sparsity.
Figure 2: Three workload groups used in this paper.
Figure 3: FFN-Reuse dataflow.
Adjacent text: … than individual scalar activations. A column j is therefore classified as hot if any token exceeds the threshold (∃ i such that |GELU output_ij| > τ), and cold otherwise. The potential speedup depends on two factors: how many columns are cold, and how the remaining hot columns are arranged in memory. Prior work reports 52–85% element-level sparsity across DiT, Stable Diffusion, and Vid…
Figure 5: DRAM row-major and grouped layouts.
Adjacent text: … major layout where columns are stored in their original order. When a sparse execution path reads only a subset of columns, the accessed columns may fall in different DRAM rows, causing row switches. The right half shows an alternative layout where the columns that are read together are grouped contiguously in memory. In this example, the same set of reads now falls with…
Figure 6: Element vs. column sparsity at τ=0.164.
Adjacent text: … setup measures masking-induced output shift rather than standalone generation quality. [Section 3.5, Cycle-Level Simulator] Following prior state-of-the-art diffusion accelerator work [15], we develop a custom cycle-level simulator and integrate it with Ramulator 2.0 [29], a widely used DRAM simulator, to model DRAM latency at the channel, bank, and row-buffer level. The accel…
Figure 9: Jaccard stability across iterations.
Adjacent text: … calibrated thresholds settle within the physical activation range of 0.14–0.34, and the resulting cycle reduction reflects genuine column-skipping work that adapts to encoder-decoder asymmetry. For layers without durable natural sparsity, calibration can degenerate. DiT is the clearest example because late iterations naturally fall to low column sparsity, yet per-layer …
Figure 8: Column sparsity vs. token dimension M.
Adjacent text: … to contain at least one hot token. However, M alone is not sufficient. EDGE has M=3,300 yet shows low column sparsity because its observed activation distribution is more dispersed across columns, while MLD has very small M and a wider expansion ratio that create many columns that remain entirely cold. The MLD exception is explained by two compounding factors. First, …
Figure 11: Cycle reduction vs. threshold τ.
Adjacent section heading: 5.2. Uniform Threshold Sweep
Figure 12: Per-layer reduction vs. ratio r.
Adjacent plot labels: models SD, MaA, VC2, DiT, MDM, MLD, EDGE; y-axis Cycle Reduction (%), 0 to 80; series Uniform (τ=0.164) and Per-layer (r=0.164).
Figure 13: Layout sensitivity at τ=0.164 and r=0.164.
Adjacent text: UNet+transformer per-layer reductions remain workload-dependent. Stable Diffusion ranges from 3.3–9.7%, Make-an-Audio from 7.7–20.4%, and VideoCrafter2 from 2.5–7.9%. At r=0.164, reductions are 4.6%, 10.6%, and 3.5%, respectively. Pure transformer DiT achieves 53.3–59.4% under per-layer calibration, a substantial improvement over its 7.3–32.5% uniform range. How…
Figure 15: Accuracy degradation vs. threshold.
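Figure 9 refers to Jaccard stability across iterations. This page does not say which sets are compared, so the sketch below assumes the natural candidate: the set of hot columns of one layer at two denoising iterations. The helper names and the toy activations are hypothetical.

```python
import numpy as np

def hot_columns(acts: np.ndarray, tau: float) -> set:
    """Indices of columns where at least one token exceeds tau in magnitude."""
    return set(np.flatnonzero((np.abs(acts) > tau).any(axis=0)).tolist())

def jaccard(a: set, b: set) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)

tau = 0.164
# Toy activations (tokens x columns) for two consecutive iterations.
acts_t0 = np.array([[0.5, 0.0, 0.3, 0.0],
                    [0.0, 0.4, 0.0, 0.0]])
acts_t1 = np.array([[0.0, 0.6, 0.0, 0.2],
                    [0.0, 0.0, 0.5, 0.0]])

stability = jaccard(hot_columns(acts_t0, tau), hot_columns(acts_t1, tau))
print(stability)
```

A value near 1 would mean the hot set barely changes between iterations, which is the condition under which a column grouping chosen once could be reused.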
Original abstract

Recent diffusion accelerators exploit activation sparsity by skipping near-zero GELU outputs, reporting 52–85% element-level sparsity. However, systolic-array hardware processes activations at column granularity, where a single non-zero element forces the entire column to be computed. We present the first systematic column-level sparsity characterization across seven diffusion workloads spanning three workload groups and four modalities. Our measurements reveal that element-level sparsity overstates hardware-exploitable sparsity by up to 78 percentage points and exposes a three-way taxonomy. UNet+transformer workloads exhibit activation concentration with workload-dependent cycle reductions up to 30.6%. Pure-transformer DiT shows dispersion, yielding 12.4%. Motion/dance transformer workloads range from modest reductions to 50.8% for MLD, driven by its extreme token dimension and expansion ratio. Cycle-level simulation on a GDDR6-based accelerator confirms that memory stalls account for up to 84–89% of total cycles and that layout sensitivity tracks the profiling-based taxonomy. A full accuracy sweep across five thresholds reveals that UNet+transformer workloads degrade gracefully, while motion models exhibit an accuracy cliff between the primary operating point and the next threshold. Our characterization shows that workload group and model dimensions jointly determine whether column-level memory layout optimization is beneficial, and element-level sparsity alone is insufficient for that prediction.
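The layout question in the abstract's last sentence can be made concrete with a toy model of the two layouts contrasted in the Figure 5 excerpt. Everything here is a simplification assumed for illustration: one bank, a fixed number of columns per DRAM row, and a hypothetical set of hot columns. The paper's simulator models channels, banks, and row buffers in far more detail.

```python
def dram_rows_touched(positions, cols_per_row):
    """Number of distinct DRAM rows opened to read the given storage slots."""
    return len({pos // cols_per_row for pos in positions})

cols_per_row = 16                  # hypothetical: 16 columns fit in one DRAM row
hot = [3, 17, 34, 48, 61]          # hypothetical hot columns out of 64

# Row-major layout: column j is stored in slot j.
row_major_slots = hot
# Grouped layout: the hot columns are packed into the first slots.
grouped_slots = list(range(len(hot)))

print("row-major rows opened:", dram_rows_touched(row_major_slots, cols_per_row))
print("grouped rows opened:  ", dram_rows_touched(grouped_slots, cols_per_row))
```

In this toy case the grouped layout serves the same reads from a single open row. When the hot columns already sit close together, or when almost every column is hot, the two layouts open a similar number of rows, which matches the paper's point that the benefit depends on the workload.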

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents the first systematic column-level sparsity characterization across seven diffusion workloads spanning three groups (UNet+transformer, pure-transformer DiT, motion/dance) and four modalities. It claims element-level sparsity overstates hardware-exploitable (column-level) sparsity by up to 78 percentage points, exposing a three-way taxonomy: activation concentration in UNet+transformer workloads (workload-dependent cycle reductions up to 30.6%), dispersion in DiT (12.4%), and variable behavior in motion workloads (up to 50.8% for MLD, linked to token dimension and expansion ratio). Cycle-level simulation on a GDDR6 accelerator shows memory stalls account for 84-89% of cycles with layout sensitivity matching the taxonomy; a full accuracy sweep across five thresholds indicates graceful degradation for UNet+transformer but an accuracy cliff for motion models. The work concludes that workload group and model dimensions jointly determine the benefit of column-level memory layout optimization.

Significance. If the taxonomy and associated cycle-reduction numbers generalize, the result would be significant for guiding sparsity-exploiting accelerator designs in diffusion models, demonstrating that element-level metrics alone are insufficient predictors and that layout sensitivity and model dimensions (token count, expansion ratio) must be considered. The concrete simulation breakdown of memory stalls (84-89%) and the accuracy-threshold analysis provide actionable data for hardware-software co-design. The empirical scope across modalities is a strength of the characterization.

major comments (2)
  1. [Experimental setup / workload selection] Workload selection and taxonomy derivation: The manuscript provides no explicit selection criteria, coverage argument, or comparison against held-out models for the seven workloads and four modalities. This is load-bearing for the central claim, as the three-way taxonomy (concentration/dispersion/variable) and specific numbers (78 pp overstatement, 30.6%/12.4%/50.8% cycle reductions) could be artifacts of the chosen instances rather than defining workload groups.
  2. [Methods / abstract] Methods description: No details are given on measurement methodology, error bars, raw data, statistical significance, or sensitivity of the post-hoc grouping into three categories. This directly affects the reliability of all reported percentages and the taxonomy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer justification of workload selection and expanded methods details. We address both major comments below and commit to revisions that strengthen the manuscript without altering its core claims or results.

Point-by-point responses
  1. Referee: [Experimental setup / workload selection] Workload selection and taxonomy derivation: The manuscript provides no explicit selection criteria, coverage argument, or comparison against held-out models for the seven workloads and four modalities. This is load-bearing for the central claim, as the three-way taxonomy (concentration/dispersion/variable) and specific numbers (78 pp overstatement, 30.6%/12.4%/50.8% cycle reductions) could be artifacts of the chosen instances rather than defining workload groups.

    Authors: The seven workloads were deliberately chosen as canonical, publicly available representatives of the three primary architectural families in diffusion models: UNet+transformer hybrids (Stable Diffusion v1.5/v2.1 and similar), pure-transformer DiT, and motion/dance transformers (MLD and related models). These span the four modalities noted and reflect the most commonly deployed designs in the literature. The taxonomy is not arbitrary but follows directly from measured differences in column-level activation concentration, which align with architectural parameters such as token dimension and expansion ratio. While we lack resources for a formal coverage argument or held-out model experiments, the patterns are consistent within each group. In revision we will add an explicit subsection detailing selection criteria and architectural rationale. revision: partial

  2. Referee: [Methods / abstract] Methods description: No details are given on measurement methodology, error bars, raw data, statistical significance, or sensitivity of the post-hoc grouping into three categories. This directly affects the reliability of all reported percentages and the taxonomy.

    Authors: We agree the current methods description is insufficiently detailed. Column-level sparsity is obtained by thresholding activations and counting non-zero columns per tensor; cycle estimates derive from a deterministic cycle-level simulator of a GDDR6 systolic-array accelerator. Because results are fully determined by model weights, inputs, and thresholds, error bars and statistical tests were omitted. The three-category grouping is post-hoc but driven by clear quantitative gaps in non-zero column fraction and cycle savings. In the revised manuscript we will expand the methods section with measurement pseudocode, a sensitivity discussion of the grouping thresholds, and availability of raw data. revision: yes
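The thresholding procedure described in this response, and the promised sensitivity discussion, could look roughly like the sweep below. It is a sketch under assumptions: the function name, the toy activations, and all threshold values other than 0.164 are hypothetical, and the paper's five thresholds are not listed on this page.

```python
import numpy as np

def column_sparsity_sweep(acts: np.ndarray, thresholds) -> dict:
    """Column-level sparsity at each threshold.

    A column is cold at tau exactly when its largest magnitude is <= tau,
    so one pass over the per-column maxima is enough for the whole sweep.
    """
    col_max = np.abs(acts).max(axis=0)
    return {float(t): float(np.mean(col_max <= t)) for t in thresholds}

# Toy activations (tokens x columns); per-column maxima are 0.1, 0.3, 0.9.
acts = np.array([[0.10, 0.20, 0.90],
                 [0.05, 0.30, 0.00]])

sweep = column_sparsity_sweep(acts, (0.05, 0.164, 0.5, 1.0))
for tau, sparsity in sweep.items():
    print(f"tau={tau}: column sparsity = {sparsity:.3f}")
```

Because the computation is deterministic for fixed inputs, rerunning it cannot produce error bars. Variation would have to come from changing the prompts, seeds, or denoising iterations that generate the activations.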

Circularity Check

0 steps flagged

No circularity: pure empirical profiling study

Full rationale

The paper reports direct measurements of activation sparsity at element vs. column granularity across seven diffusion workloads, followed by cycle-level simulation on a GDDR6 accelerator. No equations, fitted parameters, or derivations are present that could reduce any claimed result to prior fitted quantities or self-citations by construction. The three-way taxonomy, cycle-reduction percentages, and memory-stall observations are outputs of the profiling and simulation runs themselves, not inputs renamed or fitted. Self-citation is absent from the provided text, and the central claims rest on the chosen workloads' measured behavior rather than any imported uniqueness theorem or ansatz. This is a standard self-contained empirical characterization; the representativeness concern raised by the skeptic is a question of external validity, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical measurement study with no mathematical derivations, free parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5780 in / 946 out tokens · 19722 ms · 2026-06-28T18:21:30.903302+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 3 canonical work pages · 1 internal anchor

  [1] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2015, pp. 105–117.
  [2] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, and R. Rombach, “Stable video diffusion: Scaling latent video diffusion models to large datasets,” 2023.
  [3] D. Bolya and J. Hoffman, “Token merging for fast stable diffusion,” 2023.
  [4] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, “VideoCrafter2: Overcoming data limitations for high-quality video diffusion models,” arXiv preprint arXiv:2401.09047, 2024.
  [5] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” 2020.
  [6] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 269–284.
  [7] X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18000–18010.
  [8] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2016, pp. 367–379.
  [9] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014, pp. 609–622.
  [10] J. Cho, M. Kim, H. Choi, G. Heo, and J. Park, “LLMServingSim: A HW/SW co-simulation infrastructure for LLM inference serving at scale,” in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), 2024, pp. 15–29.
  [11] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” 2021.
  [12] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics,” in Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–13.
  [13] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” 2016.
  [14] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “ExTensor: An accelerator for sparse tensor algebra,” in Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019, pp. 319–333.
  [15] J. Heo, A. Putra, J. Yoon, S. Yune, H. Lee, J.-H. Kim, and J.-Y. Kim, “EXION: Exploiting inter- and intra-iteration output sparsity for diffusion models,” in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2025, pp. 324–337.
  [16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” 2020.
  [17] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” 2022.
  [18] R. Huang, J. Huang, D. Yang, Y. Ren, L. Liu, M. Li, Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models,” arXiv preprint arXiv:2301.12661, 2023.
  [19] N. P. Jouppi, C. Young, N. Patil, D. Patterson et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2017, pp. 1–12.
  [20] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” 2022.
  [21] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible DRAM simulator,” IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, 2016.
  [22] W. Kong, Y. Hao, Q. Guo, Y. Zhao, X. Song, X. Li, M. Zou, Z. Du, R. Zhang, C. Liu, Y. Wen, P. Jin, X. Hu, W. Li, Z. Xu, and T. Chen, “Cambricon-D: Full-network differential acceleration for diffusion models,” in Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2024, pp. 903–914.
  [23] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” 2021.
  [24] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018, pp. 461–475.
  [25] X. Li, Y. Liu, L. Lian, H. Yang, Z. Dong, D. Kang, S. Zhang, and K. Keutzer, “Q-Diffusion: Quantizing diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17535–17545.
  [26] Y. Li, H. Wang, Q. Jin, J. Hu, P. Chemerys, Y. Fu, Y. Wang, S. Tulyakov, and J. Ren, “SnapFusion: Text-to-image diffusion model on mobile devices within two seconds,” 2023.
  [27] L. Liu, Y. Ren, Z. Lin, and Z. Zhao, “Pseudo numerical methods for diffusion models on manifolds,” 2022.
  [28] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models,” arXiv preprint arXiv:2211.01095, 2022.
  [29] H. Luo, Y. C. Tugrul, F. N. Bostanci, A. Olgun, A. G. Yaglikci, and O. Mutlu, “Ramulator 2.0: A modern, modular, and extensible DRAM simulator,” IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 112–116, 2024.
  [30] X. Ma, G. Fang, and X. Wang, “DeepCache: Accelerating diffusion models for free,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15762–15772.
  [31] A. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” 2021.
  [32] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. T. Blaauw, T. N. Mudge, and R. G. Dreslinski, “OuterSPACE: An outer product based sparse matrix multiplication accelerator,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.
  [33] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A systematic approach to DNN accelerator evaluation,” in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315.
  [34] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2017, pp. 27–40.
  [35] S. Pati, S. Aga, M. Islam, N. Jayasena, and M. D. Sinclair, “Tale of two Cs: Computation vs. communication scaling for future transformers on future hardware,” in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), 2023.
  [36] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4195–4205.
  [37] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 58–70.
  [38] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” 2022.
  [39] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson et al., “MLPerf inference benchmark,” in Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2020, pp. 446–459.
  [40] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
  [41] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory system simulator,” IEEE Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, 2011.
  [42] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” 2022.
  [43] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” 2022.
  [44] Y. Shang, Z. Yuan, B. Xie, B. Wu, and Y. Yan, “Post-training quantization on diffusion models,” 2023.
  [45] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations (ICLR), 2021.
  [46] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” 2021.
  [47] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, “MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 766–780.
  [48] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” in International Conference on Learning Representations (ICLR), 2023.
  [49] J. Tseng, R. Castellon, and K. Liu, “EDGE: Editable dance generation from music,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 448–458.
  [50] Y. Yang, J. Emer, and D. Sanchez, “Trapezoid: A versatile accelerator for dense and sparse matrix multiplications,” in Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2024, pp. 931–945.
  [51] G. Zhang, N. Attaluri, J. S. Emer, and D. Sanchez, “Gamma: Leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2021, pp. 687–701.
  [52] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
  [53] Z. Zhang, H. Wang, S. Han, and W. J. Dally, “SpArch: Efficient architecture for sparse matrix multiplication,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 261–274.