pith. sign in

arxiv: 1906.11915 · v3 · pith:BBDGQV4Onew · submitted 2019-06-27 · 💻 cs.AR

Mixed-Signal Charge-Domain Acceleration of Deep Neural networks through Interleaved Bit-Partitioned Arithmetic

Pith reviewed 2026-05-25 13:42 UTC · model grok-4.3

classification 💻 cs.AR
keywords mixed-signal accelerationdeep neural networkscharge-domain computingbit-partitioned arithmeticswitched-capacitor circuitsanalog dot-product3D-stacked architecture
0
0 comments X

The pith

Bit-partitioning dot-products into interleaved low-bitwidth groups lets mixed-signal circuits accumulate in charge domain and share A/D converters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a vector dot-product can be reformulated as groups of spatially parallel low-bitwidth operations interleaved across vector elements. This change turns the accelerator building blocks into wide yet low-bitwidth multiply-accumulate units that run in the analog domain and share a single A/D converter. Switched-capacitor circuitry performs the group multiplications in the charge domain and accumulates results over multiple cycles. The approach addresses limited encoding range and noise through low bitwidth while removing the need for per-operation A/D conversion.

Core claim

A vector dot-product can be bit-partitioned into groups of spatially parallel low-bitwidth operations interleaved across multiple elements of the vectors, so that groups of wide yet low-bitwidth multiply-accumulate units operate in the analog domain and share a single A/D converter, with switched-capacitor circuitry performing the group multiplications in the charge domain and accumulating the results of the group in its capacitors over multiple cycles.

What carries the argument

Interleaved bit-partitioned arithmetic realized through switched-capacitor charge-domain accumulation that shares one A/D converter across a group.

Load-bearing premise

Low-bitwidth bit-partitioned operations performed in the analog domain can handle encoding range limits and noise while the interleaved capacitive accumulation preserves the accuracy of the original DNN computation.

What would settle it

A direct accuracy comparison between a full DNN inference run on the proposed charge-domain interleaved bit-partitioned units versus an equivalent high-precision digital implementation, or power measurements showing whether A/D conversion energy per dot-product actually drops.

Figures

Figures reproduced from arXiv: 1906.11915 by Amir Yazdanbakhsh, Doug Burger, Hadi Esmaeilzadeh, Hardik Sharma, Kambiz Samadi, Nam Sung Kim, Sean Kinzer, Soroush Ghodrati.

Figure 1
Figure 1. Figure 1: Wide, interleaved, and bit-partitioned mathematical formulation. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Hierarchically clustered architecture of B [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: depicts the design of a single 3-bit sign-magnitude MACC. The xsx1x0 and wsw1w0 denote the bit-partitions operands. The result of each MACC operation is retained as electric charge in the accumulating capacitor (CACC). In addi￾tion to CACC, the MACC unit contains two capacitive Digital￾to-Analog Converters, one for inputs (C-DACX) and one for weights (C-DACW). The C-DACX and C-DACW convert the C-DACx C-DAC… view at source ↗
Figure 4
Figure 4. Figure 4: Charge-domain MACC; phase by phase. 2-bit magnitude of the input and weight to the analog domain as an electric charge proportional to |x| and |w| respectively. C-DACX and C-DACW are each composed of two capacitors ((CX, 2CX) and (CW, 2CW)) which operate in parallel and are combined to convert the operands to analog domain. Each of these capacitors are controlled by a pair of transmission gates which deter… view at source ↗
Figure 5
Figure 5. Figure 5: Mixed-Signal bit-partitioned MACC unit. With this choice, Qsw becomes Qsw =|x|×|w| CWvDD 3 . Clkφ(3) : In the last phase, ( [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ResNet-50 and VGG-16 accuracy after fine-tuning. experiments show that, variations in voltage can be mitigated up to 20%. The extensive amount of vector dot-product operations in DNNs, allows for the minimum and maximum values of the distributions being sampled sufficient amount of times, leading to coverage of the corner cases. Atop all these considerations, we use differential signaling for ADCs which at… view at source ↗
Figure 7
Figure 7. Figure 7: BIHIWE compilation stack. 7 [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Speedup and energy improvement over TETRIS. GPU comparison. We also compare BIHIWE to two Nvidia GPUs (i.e., RTX 2080 TI and Titan Xp) based on Turing and Pascal architecture respectively, listed in [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Energy breakdown of BIHIWE and TETRIS. normalized to TETRIS. Energy breakdown is reported across four major architectural components: (1) on-chip compute units, (2) on-chip memory (buffers and register file), (3) interconnect, and (4) 3D-stacked DRAM. DRAM accesses account for the highest portion of the energy in BIHIWE, since BIHIWE significantly reduces the on-chip compute energy. While BIHIWE has a sign… view at source ↗
Figure 10
Figure 10. Figure 10: Iso-area comparison with TETRIS. 9 [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Performance comparison to GPUs. weights. Therefore, PTB-RNN and PTB-LSTM use more energy for DRAM accesses compared to other benchmarks. Unlike the fully-digital PEs in TETRIS that perform a single operation in a cycle, BIHIWE uses MS-WAGGs which perform wide vectorized operations–crucial in BIHIWE to amortize the high cost of ADCs. As shown in [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 14
Figure 14. Figure 14: Design space exploration for MS-BPMACC. A/D conversion and the number of MACC units (n) are two main parameters of MS-BPMACC which define resolution and the sample rate of the ADC, determining its power [PITH_FULL_IMAGE:figures/full_fig_p011_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Design space exploration for # core per cluster. [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗
read the original abstract

Low-power potential of mixed-signal design makes it an alluring option to accelerate Deep Neural Networks (DNNs). However, mixed-signal circuitry suffers from limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) conversion overheads. This paper aims to address these challenges by offering and leveraging the insight that a vector dot-product (the basic operation in DNNs) can be bit-partitioned into groups of spatially parallel low-bitwidth operations, and interleaved across multiple elements of the vectors. As such, the building blocks of our accelerator become a group of wide, yet low-bitwidth multiply-accumulate units that operate in the analog domain and share a single A/D converter. The low-bitwidth operation tackles the encoding range limitation and facilitates noise mitigation. Moreover, we utilize the switched-capacitor design for our bit-level reformulation of DNN operations. The proposed switched-capacitor circuitry performs the group multiplications in the charge domain and accumulates the results of the group in its capacitors over multiple cycles. The capacitive accumulation combined with wide bit-partitioned operations alleviate the need for A/D conversion per operation. With such mathematical reformulation and its switched-capacitor implementation, we define a 3D-stacked microarchitecture, dubbed BIHIWE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a bit-partitioned reformulation of vector dot-products for DNN acceleration, enabling groups of low-bitwidth analog multiply-accumulate operations that share a single A/D converter. It describes a switched-capacitor implementation for charge-domain group multiplications with capacitive accumulation across cycles, and defines a 3D-stacked microarchitecture (BIHIWE) intended to reduce per-operation A/D overheads while addressing encoding range and noise issues via low-bitwidth operations.

Significance. If the claims on accuracy preservation and overhead reduction hold under realistic noise and process variation, the approach could enable more efficient mixed-signal DNN accelerators by minimizing A/D conversions through interleaved bit-partitioned charge-domain accumulation. The switched-capacitor reformulation is a concrete implementation idea that merits further exploration if supported by analysis.

major comments (2)
  1. [Abstract] Abstract: The central claims regarding noise mitigation, encoding range handling, and A/D overhead reduction via low-bitwidth bit-partitioned operations and interleaved capacitive accumulation are presented at a high level only, with no quantitative results, error analysis, simulations, circuit derivations, or accuracy evaluations provided to support them.
  2. [Abstract (paragraph on insight and implementation)] The weakest assumption—that low-bitwidth analog operations combined with switched-capacitor accumulation can maintain sufficient accuracy without per-operation A/D conversions—is not accompanied by any supporting derivation, noise model, or empirical validation in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing where additional support is needed and outlining revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims regarding noise mitigation, encoding range handling, and A/D overhead reduction via low-bitwidth bit-partitioned operations and interleaved capacitive accumulation are presented at a high level only, with no quantitative results, error analysis, simulations, circuit derivations, or accuracy evaluations provided to support them.

    Authors: We agree that the abstract presents the claims at a high level. The manuscript body provides the bit-partitioned reformulation of dot-products, the switched-capacitor implementation details, and the 3D-stacked microarchitecture definition. In revision we will expand the abstract to incorporate key quantitative estimates (such as A/D conversion reduction factors) drawn from the analysis already present in the paper. revision: yes

  2. Referee: [Abstract (paragraph on insight and implementation)] The weakest assumption—that low-bitwidth analog operations combined with switched-capacitor accumulation can maintain sufficient accuracy without per-operation A/D conversions—is not accompanied by any supporting derivation, noise model, or empirical validation in the manuscript.

    Authors: The referee correctly notes that the accuracy claim requires explicit support. The current manuscript argues qualitatively that low-bitwidth operations mitigate encoding-range and noise issues but does not include a noise model or derivation. We will add an analytical noise model and supporting derivation in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a novel architectural proposal that reformulates vector dot-products as interleaved bit-partitioned low-bitwidth analog operations implemented via switched-capacitor circuits, leading to the BIHIWE microarchitecture. No load-bearing step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the central claims introduce independent design elements (group MAC units sharing A/D, capacitive accumulation over cycles) whose validity rests on the stated circuit properties rather than prior outputs of the same work. The provided abstract and description contain no equations or citations that exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; insufficient information to identify any.

pith-pipeline@v0.9.0 · 5793 in / 1099 out tokens · 39668 ms · 2026-05-25T13:42:30.556348+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 3 internal anchors

  1. [1]

    Niehues, N.-Q

    J. Niehues, N.-Q. Pham, T.-L. Ha, M. Sperber, and A. Waibel. Low-Latency Neural Speech Translation. ArXiv e-prints, August 2018

  2. [2]

    Mo and J

    J. Mo and J. Sattar. SafeDrive: Enhancing Lane Appearance for Autonomous and Assisted Driving Under Limited Visibility.ArXiv e-prints, July 2018

  3. [3]

    R. Li, Y . Shu, J. Su, H. Feng, and J. Wang. Using deep Residual Network to search for galaxy-Ly {\alpha} emitter lens candidates based on spectroscopic-selection. ArXiv e-prints, July 2018

  4. [4]

    Rohde, S

    D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and A. Karatzoglou. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommenda- tion in Online Advertising. ArXiv e-prints, August 2018

  5. [5]

    Grabec, E

    I. Grabec, E. Švegl, and M. Sok. Development of a sensory-neural network for medical diagnosing.ArXiv e-prints, July 2018

  6. [6]

    Amant, Karthikeyan Sankaralingam, and Doug Burger

    Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. InISCA, 2011

  7. [7]

    Hardavellas, M

    N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers.IEEE Micro, 31(4):6–15, July–Aug. 2011

  8. [8]

    Conservation cores: Reducing the energy of mature computations

    Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo- Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation cores: Reducing the energy of mature computations. In ASPLOS, 2010

  9. [9]

    Optimizing fpga-based accelerator design for deep convolutional neural networks

    Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In FPGA, 2015

  10. [10]

    Neural acceleration for general-purpose approximate programs

    Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. to apear in Commun. ACM , 2013

  11. [11]

    Dadiannao: A machine-learning supercomputer

    Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In MICRO, 2014

  12. [12]

    Tetris: Scalable and efficient neural network acceleration with 3d memory

    Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. InASPLOS, 2017

  13. [13]

    Tartan: Accelerating fully-connected and convolutional layers in deep learning networks by exploiting numerical precision variability.arXiv, 2017

    Alberto Delmas, Sayeh Sharify, Patrick Judd, and An- dreas Moshovos. Tartan: Accelerating fully-connected and convolutional layers in deep learning networks by exploiting numerical precision variability.arXiv, 2017

  14. [14]

    TABLA: A unified template-based framework for accelerating statistical machine learning

    Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In HPCA, 2016

  15. [15]

    Cambricon-x: An accelerator for sparse neural networks

    Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-x: An accelerator for sparse neural networks. In MICRO, 2016

  16. [16]

    Cnvlutin: ineffectual-neuron-free deep neural network computing

    Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: ineffectual-neuron-free deep neural network computing. In ISCA, 2016

  17. [17]

    Stripes: Bit- serial deep neural network computing

    Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. Stripes: Bit- serial deep neural network computing. InMICRO, 2016

  18. [18]

    From high-level deep neural models to fpgas

    Hardik Sharma, Jongse Park, Divya Mahajan, Em- manuel Amaro, Joon Kim, Chenkai Shao, Asit Misra, and Hadi Esmaeilzadeh. From high-level deep neural models to fpgas. In MICRO, 2016

  19. [19]

    Accelerating persistent neural networks at datacenter scale

    Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengil, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Christian Boehn, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Tamas Juhasz, Ratna Kumar Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Pat...

  20. [20]

    SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

    Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. InISCA, 2017

  21. [21]

    Yodann: An ultra-low power convolutional neural network accelerator based on binary weights

    Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. Yodann: An ultra-low power convolutional neural network accelerator based on binary weights. arXiv, 2016

  22. [22]

    Eie: efficient inference engine on compressed deep neural 12 network

    Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on compressed deep neural 12 network. In ISCA, 2016

  23. [23]

    Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks

    Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. InISCA, 2016

  24. [24]

    Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks

    Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivi- enne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2017

  25. [25]

    Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory

    Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 380–392. IEEE, 2016

  26. [26]

    In- datacenter performance analysis of a tensor processing unit

    Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In- datacenter performance analysis of a tensor processing unit. In ISCA, 2017

  27. [27]

    Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

    Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, 2014

  28. [28]

    Bit fusion: Bit-level dynamically compos- able architecture for accelerating deep neural networks

    Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Es- maeilzadeh. Bit fusion: Bit-level dynamically compos- able architecture for accelerating deep neural networks

  29. [29]

    Vahide Aklaghi, Amir Yazdanbakhsh, Kambiz Samadi, Hadi Esmaeilzadeh, and Rajesh K. Gupta. Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. InISCA, 2018

  30. [30]

    UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition

    Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W Fletcher. Ucnn: Exploiting computational reuse in deep neural networks via weight repetition. arXiv preprint arXiv:1804.06508, 2018

  31. [31]

    Unpu: A 50.6 tops/w unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision

    Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, and Hoi-Jun Yoo. Unpu: A 50.6 tops/w unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision. In ISSCC, 2018

  32. [32]

    Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars

    Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. InISCA, 2016

  33. [33]

    Promise: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms

    Prakalp Srivastava, Mingu Kang, Sujan K Gonu- gondla, Sungmin Lim, Jungwook Choi, Vikram Adve, Nam Sung Kim, and Naresh Shanbhag. Promise: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018

  34. [34]

    Switched-capacitor neu- ral networks

    YP Tsividis and D Anastassiou. Switched-capacitor neu- ral networks. Electronics Letters, 23(18):958–959, 1987

  35. [35]

    Redeye: analog convnet image sensor architecture for continuous mobile vision

    Robert LiKamWa, Yunhui Hou, Julian Gao, Mia Polansky, and Lin Zhong. Redeye: analog convnet image sensor architecture for continuous mobile vision. In ACM SIGARCH Computer Architecture News , volume 44, pages 255–266. IEEE Press, 2016

  36. [36]

    Passive charge redistribution digital-to-analogue multiplier

    Daniel Bankman and Boris Murmann. Passive charge redistribution digital-to-analogue multiplier. Electronics Letters, 51(5):386–388, 2015

  37. [37]

    E. H. Lee and S. S. Wong. Analysis and design of a passive switched-capacitor matrix multiplier for approximate computing. IEEE Journal of Solid-State Circuits, 52(1):261–271, Jan 2017. ISSN 0018-9200. doi: 10.1109/JSSC.2016.2599536

  38. [38]

    An always-on 3.8µj/86% cifar-10 mixed-signal binary cnn processor with all memory on chip in 28nm cmos

    Daniel Bankman, Lita Yang, Bert Moons, Marian Verhelst, and Boris Murmann. An always-on 3.8µj/86% cifar-10 mixed-signal binary cnn processor with all memory on chip in 28nm cmos. InSolid-State Circuits Conference-(ISSCC), 2018 IEEE International, pages 222–224. IEEE, 2018

  39. [39]

    A 3.43 tops/w 48.9 pj/pixel 50.1 nj/classification 512 analog neuron sparse coding neural network with on-chip learning and classification in 40nm cmos

    Fred N Buhler, Peter Brown, Jiabo Li, Thomas Chen, Zhengya Zhang, and Michael P Flynn. A 3.43 tops/w 48.9 pj/pixel 50.1 nj/classification 512 analog neuron sparse coding neural network with on-chip learning and classification in 40nm cmos. In VLSI Circuits, 2017 Symposium on, pages C30–C31. IEEE, 2017

  40. [40]

    Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger

    Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. General-purpose code acceleration with limited-precision analog computation. In ISCA, 2014

  41. [41]

    18.4 a matrix-multiplying adc implementing a machine- learning classifier directly with data conversion

    Jintao Zhang, Zhuo Wang, and Naveen Verma. 18.4 a matrix-multiplying adc implementing a machine- learning classifier directly with data conversion. In Solid-State Circuits Conference-(ISSCC), 2015 IEEE International, pages 1–3. IEEE, 2015

  42. [42]

    Analysis and Design of a Passive Switched-Capacitor Matrix Multiplier for Approximate Computing

    Edward H Lee and S Simon Wong. Analysis and Design of a Passive Switched-Capacitor Matrix Multiplier for Approximate Computing. IEEE Journal of Solid-State Circuits, 52(1):261–271, 2017

  43. [43]

    Analysis and design of analog integrated circuits

    Paul R Gray, Paul Hurst, Robert G Meyer, and Stephen Lewis. Analysis and design of analog integrated circuits. Wiley, 2001

  44. [44]

    Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory

    Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ISCA, 2016

  45. [45]

    Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks

    Sayeh Sharify, Alberto Delmas Lascorz, Patrick Judd, and Andreas Moshovos. Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks. arXiv, 2017

  46. [46]

    Tetris: Scalable and efficient neural network acceleration with 3d memory

    Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. https: //github.com/stanford-mast/nn_dataflow, 2017

  47. [47]

    Caterpillar: Coarse grain reconfigurable architecture for accelerating the training of deep neural networks

    Yuanfang Li and Ardavan Pedram. Caterpillar: Coarse grain reconfigurable architecture for accelerating the training of deep neural networks. InApplication-specific Systems, Architectures and Processors (ASAP), 2017 IEEE 28th International Conference on , pages 1–10. IEEE, 2017

  48. [48]

    A high speed and low power 8 bit x 8 bit multiplier design using novel two transistor (2t) xor gates

    Himani Upadhyay and Shubhajit Roy Chowdhury. A high speed and low power 8 bit x 8 bit multiplier design using novel two transistor (2t) xor gates. Journal of Low Power Electronics , 01 2015. doi: 13 10.1166/jolpe.2015.1362

  49. [49]

    Hybrid memory cube specification 1.0.Last Revision Jan, 2013

    Hybrid Memory Cube Consortium et al. Hybrid memory cube specification 1.0.Last Revision Jan, 2013

  50. [50]

    Hybrid memory cube new dram architecture increases density and performance

    Joe Jeddeloh and Brent Keeth. Hybrid memory cube new dram architecture increases density and performance. In VLSI Technology (VLSIT), 2012 Symposium on, pages 87–88. IEEE, 2012

  51. [51]

    Wolfe, Kambiz Samadi, Hadi Esmaeilzadeh, and Nam Sung Kim

    Amir Yazdanbakhsh, Hajar Falahati, Philip J. Wolfe, Kambiz Samadi, Hadi Esmaeilzadeh, and Nam Sung Kim. GANAX: A Unified SIMD-MIMD Acceleration for Generative Adversarial Network. InISCA, 2018

  52. [52]

    McGraw-Hill New York, 1994

    Mohammed Ismail and Terri Fiez.Analog VLSI: signal and information processing, volume 166. McGraw-Hill New York, 1994

  53. [53]

    Mismatch characterization of small metal fringe capacitors.IEEE Transactions on Circuits and Systems I: Regular Papers, 61(8):2236–2242, 2014

    Vaibhav Tripathi and Boris Murmann. Mismatch characterization of small metal fringe capacitors.IEEE Transactions on Circuits and Systems I: Regular Papers, 61(8):2236–2242, 2014

  54. [54]

    Thermal feasibility of die-stacked processing in memory

    Yasuko Eckert, Nuwan Jayasena, and Gabriel H Loh. Thermal feasibility of die-stacked processing in memory. 2014

  55. [55]

    Facebook AI Research. Caffe2. https://caffe2.ai/

  56. [56]

    One weird trick for parallelizing convolutional neural networks

    Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv, 2014

  57. [57]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. URL http://image-net.org/

  58. [58]

    Very deep con- volutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep con- volutional networks for large-scale image recognition. arXiv, 2014

  59. [59]

    Quantized neural networks: Training neural networks with low precision weights and activations

    Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv, 2016

  60. [60]

    Learning multi- ple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multi- ple layers of features from tiny images. Computer Sci- ence Department, University of Toronto, Tech. Rep, 2009

  61. [61]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015

  62. [62]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016

  63. [63]

    YOLOv3: An Incremental Improvement

    Joseph Redmon and Ali Farhadi. Yolov3: An incre- mental improvement. arXiv preprint arXiv:1804.02767, 2018

  64. [64]

    Building a large annotated corpus of english: The penn treebank

    Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 1993

  65. [65]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997

  66. [66]

    Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients

    Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016

  67. [67]

    Mishra, Eriko Nurvitadhi, Jeffrey J

    Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. WRPN: wide reduced-precision networks. arXiv, 2017

  68. [68]

    Ternary weight networks

    Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv, 2016

  69. [69]

    LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

    Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. arXiv preprint arXiv:1807.10029, 2018

  70. [70]

    https://developer.nvidia.com/ tensorrt

    Nvidia tensor rt 5.1. https://developer.nvidia.com/ tensorrt

  71. [71]

    Pipelayer: A pipelined reram-based accelerator for deep learning

    Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. Pipelayer: A pipelined reram-based accelerator for deep learning. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pages 541–552. IEEE, 2017

  72. [72]

    Freepdk45, 2018

    NCSU. Freepdk45, 2018. URL https: //www.eda.ncsu.edu/wiki/FreePDK45

  73. [73]

    B. Murmann. ADC Performance Survey 1997-2016 . murmann/adcsurvey.html, [Online]. Available. URL http://web.stanford.edu/

  74. [74]

    A 0.0013 mm2 10b 10ms/s sar adc with a 0.0048 mm2 42db-rejection passive fir filter

    Pieter Harpe. A 0.0013 mm2 10b 10ms/s sar adc with a 0.0048 mm2 42db-rejection passive fir filter. In2018 IEEE Custom Integrated Circuits Conference, CICC

  75. [75]

    Institute of Electrical and Electronics Engineers Inc., 2018

  76. [76]

    S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi. CACTI-P: Architecture-level Modeling for SRAM-based Structures with Advanced Leakage Reduction Techniques. In ICCAD, 2011

  77. [77]

    Automatic differentiation in pytorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. InNIPS-W, 2017

  78. [78]

    Neural network distiller, June 2018

    Neta Zmora, Guy Jacob, and Gal Novik. Neural network distiller, June 2018. URL https://doi.org/10.5281/zenodo.1297430

  79. [79]

    Reram-based processing-in-memory architecture for recurrent neural network acceleration

    Yun Long, Taesik Na, and Saibal Mukhopadhyay. Reram-based processing-in-memory architecture for recurrent neural network acceleration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (99):1–14, 2018

  80. [80]

    Switched-opamp: An approach to realize full cmos switched-capacitor circuits at very low power supply voltages

    Jan Crols and Michel Steyaert. Switched-opamp: An approach to realize full cmos switched-capacitor circuits at very low power supply voltages. IEEE Journal of Solid-State Circuits, 29(8):936–942, 1994

Showing first 80 references.