MCFlash: Bulk Bitwise Processing in 3D NAND with Dynamic Sensing and Multi-level Encoding
Pith reviewed 2026-05-08 16:00 UTC · model grok-4.3
The pith
MCFlash performs bulk bitwise operations directly inside commercial 3D NAND flash chips using only standard user-mode instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MCFlash is a technique that executes bulk bitwise operations directly within commercial off-the-shelf 3D NAND flash chips. It relies solely on standard user-mode instructions, combining Multi-Level Cell data encodings with dynamically tuned read reference voltages to execute in-place bitwise operations. Evaluations across diverse NAND chips, both floating-gate and charge-trap, demonstrate error-free operation sustaining over one billion operations on fresh blocks and bit-error rates below 0.015 percent even after 10,000 program/erase cycles.
What carries the argument
The pairing of multi-level cell charge encodings with on-the-fly adjustment of read reference voltages, which turns standard sense operations into bitwise logic gates performed inside the NAND array itself.
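This mechanism can be illustrated with a toy simulation. The threshold-voltage levels, noise width, and the mapping from operand bits to charge states below are illustrative assumptions, not values from the paper; the point is only that where the read reference voltage falls between MLC levels selects which Boolean function a single sense operation computes:

```python
import random

random.seed(0)

# Illustrative MLC model (not the paper's values): a cell's threshold
# voltage grows with the number of programmed operand bits.
LEVELS = {0: 1.0, 1: 2.5, 2: 4.0}  # mean Vth in volts per count of set operands

def program(a, b, sigma=0.1):
    """Store operands a and b in one cell; return its (noisy) threshold voltage."""
    return random.gauss(LEVELS[a + b], sigma)

def sense(vth, vref):
    """A standard NAND read: the cell conducts (reads 1) when Vth < Vref."""
    return 1 if vth < vref else 0

def in_array_op(a_bits, b_bits, vref):
    """One read over a whole page computes the same gate on every bit pair."""
    return [sense(program(a, b), vref) for a, b in zip(a_bits, b_bits)]

A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
# Vref between levels 0 and 1: the cell conducts only when a = b = 0 -> NOR.
nor_ab = in_array_op(A, B, vref=1.75)
# Vref between levels 1 and 2: the cell conducts unless both bits are set -> NAND.
nand_ab = in_array_op(A, B, vref=3.25)
print("NOR :", nor_ab)   # [1, 0, 0, 0]
print("NAND:", nand_ab)  # [1, 1, 1, 0]
```

Reading twice with the two references and inverting off-chip would then yield OR and AND; the paper's actual encodings and voltage settings may differ.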
If this is right
- Bulk bitwise operations become possible without moving data between the NAND array and an external processor.
- Energy and latency costs for scans and reductions over large bit vectors drop because computation stays inside the memory.
- The same approach works across floating-gate and charge-trap cell technologies from different generations.
- Reliability holds for more than a billion operations on fresh blocks and continues after heavy wear.
Where Pith is reading between the lines
- Storage arrays could act as simple compute units for database filters or neural-network bit packing without extra hardware.
- Chaining several bitwise steps inside the array might support wider arithmetic without leaving the chip.
- Power savings in large-scale analytics systems would grow if the method scales to entire planes or dies running in parallel.
Load-bearing premise
That any commercial 3D NAND chip will accept and correctly respond to the same user-mode read commands and voltage settings without hidden internal behaviors that would break the bitwise results.
What would settle it
Running the described bitwise sequences on a new commercial 3D NAND chip model, applying the same dynamic voltage adjustments through ordinary commands, and checking whether the output bit-error rate stays below 0.015 percent after 10,000 program/erase (P/E) cycles; failure on either fresh or aged blocks would disprove the central claim.
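The decision rule in that test reduces to a small check, with thresholds taken from the abstract (error-free on fresh blocks; BER below 0.015 percent, i.e. 1.5e-4, after 10,000 P/E cycles). This is a minimal sketch of the acceptance criterion, not the paper's measurement procedure:

```python
def bit_error_rate(expected, observed):
    """Fraction of mismatched bits between a known pattern and its read-back."""
    assert len(expected) == len(observed)
    errors = sum(e != o for e, o in zip(expected, observed))
    return errors / len(expected)

def claim_holds(ber_fresh, ber_aged, threshold=1.5e-4):
    """Abstract's thresholds: error-free on fresh blocks, and BER below
    0.015% (1.5e-4) after 10,000 P/E cycles. Failing either falsifies it."""
    return ber_fresh == 0.0 and ber_aged < threshold

print(bit_error_rate([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.25
print(claim_holds(0.0, 1.2e-4))  # True
print(claim_holds(0.0, 2.0e-4))  # False
```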
Original abstract
This paper presents MCFlash, a practical and immediately deployable technique for executing bulk bitwise operations directly within commercial off-the-shelf (COTS) 3D NAND flash chips. MCFlash relies solely on standard user-mode instructions, combining Multi-Level Cell (MLC) data encodings with dynamically tuned read reference voltages to execute in-place bitwise operations. We evaluate MCFlash across diverse NAND flash chips, both floating-gate and charge-trap variants, from different generations. Our results represent the first demonstration of error-free, on-chip bitwise operations, sustaining over one billion operations on fresh blocks and maintaining bit-error rates below 0.015% even after 10,000 program/erase (P/E) cycles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MCFlash, a technique for performing bulk bitwise operations directly inside commercial off-the-shelf 3D NAND flash chips. It combines multi-level cell encodings with dynamically tuned read reference voltages applied via standard user-mode instructions, and reports error-free execution of over one billion operations on fresh blocks together with bit-error rates below 0.015% after 10,000 P/E cycles across both floating-gate and charge-trap devices from multiple generations.
Significance. If the central claims are substantiated, the work would constitute a meaningful advance in in-memory computing for NAND-based systems by demonstrating a practical, hardware-agnostic method for on-chip bitwise processing that requires no custom silicon and exhibits strong endurance. The cross-generation, cross-technology evaluation is a positive feature that supports broader applicability.
Major comments (2)
- [Methods] Methods section (voltage tuning procedure): the claim that bitwise operations are performed using only standard user-mode instructions plus dynamically tuned read voltages is load-bearing for both the 'immediately deployable on COTS' assertion and the 'first demonstration' status, yet the manuscript provides insufficient detail on the exact command sequences, calibration steps, and safeguards against FTL intervention or vendor-specific restrictions. Without this, it is impossible to confirm that the tuning does not rely on non-standard access unavailable in normal operation.
- [Evaluation] Evaluation / Results section: the abstract states concrete metrics (error-free operation for >1 billion cycles, BER < 0.015% after 10k P/E), but the manuscript lacks accompanying data tables, statistical summaries, or raw measurement logs that would allow independent verification of these numbers under the stated conditions.
Minor comments (1)
- [Abstract] The abstract would benefit from explicitly naming the bitwise operations (AND, OR, XOR, etc.) demonstrated so that the scope of the claimed error-free execution is immediately clear.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of MCFlash as a practical in-memory computing approach on COTS 3D NAND. We address each major comment point by point below and outline the revisions we will make to improve clarity and verifiability.
Point-by-point responses
Referee: [Methods] Methods section (voltage tuning procedure): the claim that bitwise operations are performed using only standard user-mode instructions plus dynamically tuned read voltages is load-bearing for both the 'immediately deployable on COTS' assertion and the 'first demonstration' status, yet the manuscript provides insufficient detail on the exact command sequences, calibration steps, and safeguards against FTL intervention or vendor-specific restrictions. Without this, it is impossible to confirm that the tuning does not rely on non-standard access unavailable in normal operation.
Authors: We agree that the current Methods section would benefit from greater specificity to allow readers to reproduce the exact procedure. In the revised manuscript we will add a dedicated subsection that enumerates the precise sequence of standard user-mode NAND commands (including the read, program, and erase opcodes issued through the standard interface), the iterative calibration algorithm used to select the dynamic read reference voltages for each multi-level encoding, and the explicit steps taken to operate on raw blocks while disabling or bypassing file-system and FTL layers (e.g., by using direct block-level access on unmounted devices). These additions will be supported by pseudocode and timing diagrams so that the claim of standard-instruction-only operation can be independently verified. revision: yes
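The iterative calibration the authors promise to document could plausibly take the shape of a read-retry-style sweep: read a known pattern at each candidate reference voltage and keep the setting that minimizes mismatches. The `read_page` interface, voltage grid, and mock error model below are hypothetical stand-ins, not the paper's actual command sequence:

```python
import random

def calibrate_vref(read_page, known_pattern, vref_grid):
    """Sweep candidate read reference voltages; keep the one that
    minimizes mismatches against a known test pattern."""
    best_vref, best_errs = None, float("inf")
    for vref in vref_grid:
        observed = read_page(vref)
        errs = sum(a != b for a, b in zip(observed, known_pattern))
        if errs < best_errs:
            best_vref, best_errs = vref, errs
    return best_vref, best_errs

def make_mock_reader(true_bits, ideal_vref=2.0):
    """Mock chip: read-back error probability grows as Vref drifts off-center."""
    def read_page(vref):
        flip_p = min(1.0, abs(vref - ideal_vref))
        return [b ^ (random.random() < flip_p) for b in true_bits]
    return read_page

random.seed(1)
pattern = [random.randint(0, 1) for _ in range(256)]
reader = make_mock_reader(pattern)
vref, errs = calibrate_vref(reader, pattern, [1.0, 1.5, 2.0, 2.5, 3.0])
print(vref, errs)  # 2.0 0
```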
Referee: [Evaluation] Evaluation / Results section: the abstract states concrete metrics (error-free operation for >1 billion cycles, BER < 0.015% after 10k P/E), but the manuscript lacks accompanying data tables, statistical summaries, or raw measurement logs that would allow independent verification of these numbers under the stated conditions.
Authors: The aggregate results are derived from repeated trials across multiple devices and P/E points, but we concur that tabular presentation and statistical detail would strengthen verifiability. The revised manuscript will include new tables that report, for each device generation and technology, the mean BER, standard deviation, number of trials, and total operations performed at 0, 1k, 5k, and 10k P/E cycles. A supplementary data file containing per-trial BER logs for a representative subset of experiments will also be provided. Full raw traces exceed practical appendix size; therefore we will make a curated subset available upon request while ensuring the tables allow direct confirmation of the reported thresholds. revision: partial
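The per-condition statistical tables the authors commit to could be produced by a small aggregation over per-trial BER logs. The dictionary layout and sample numbers here are a hypothetical sketch of that table's shape, not data from the paper:

```python
from statistics import mean, stdev

def summarize_ber(trials):
    """trials maps (device, pe_cycles) -> list of per-trial BER values;
    returns one summary row per condition, as the promised tables would."""
    rows = []
    for (device, pe), bers in sorted(trials.items()):
        rows.append({
            "device": device,
            "pe_cycles": pe,
            "n_trials": len(bers),
            "mean_ber": mean(bers),
            "std_ber": stdev(bers) if len(bers) > 1 else 0.0,
        })
    return rows

# Illustrative input only: zero errors when fresh, small BERs after wear.
trials = {
    ("chipA", 0): [0.0, 0.0, 0.0],
    ("chipA", 10_000): [1.1e-4, 1.3e-4, 1.2e-4],
}
for row in summarize_ber(trials):
    print(row)
```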
Circularity Check
No circularity: experimental hardware technique with no equations or fitted predictions
Full rationale
The paper describes an empirical hardware technique for in-place bitwise operations on COTS 3D NAND using standard user-mode commands and dynamic read-voltage tuning. No mathematical derivation chain, equations, parameters fitted to subsets of data, or self-citation load-bearing premises are present. Claims rest on direct measurements across multiple chip generations and types, with error rates reported from physical experiments rather than any constructed prediction that reduces to the inputs. This is the expected non-finding for a purely experimental systems paper.