The Synthesis-Sequencing Channel for DNA-based Data Storage
Pith reviewed 2026-06-27 20:38 UTC · model grok-4.3
The pith
The synthesis-sequencing channel has an exact information-theoretic capacity given by matching converse and achievability bounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The synthesis-sequencing channel is formed by composing a synthesis stage followed by a sequencing stage. When synthesis and sequencing are each modeled by a binary symmetric channel, matching converse and achievability bounds establish the exact capacity under mild assumptions on the error probabilities and coverage parameters.
What carries the argument
The synthesis-sequencing channel, whose output statistics arise from the successive action of synthesis coverage, synthesis errors, sequencing coverage, and sequencing errors.
If this is right
- The maximum reliable rate is determined by the product of synthesis coverage, sequencing coverage, and the two binary symmetric channel capacities adjusted for the induced bias.
- Increasing physical coverage after synthesis trades off against increasing sequencing coverage in a manner that is not symmetric.
- The model produces correlated errors across reads even when each individual stage is memoryless.
- Exact capacity formulas become available for any parameter tuple obeying the mild conditions.
Where Pith is reading between the lines
- Designers could allocate total sequencing effort between deeper synthesis and deeper readout to maximize the derived capacity.
- The same two-stage composition idea could be applied to other storage media that undergo fabrication and readout stages.
- Numerical evaluation of the capacity expression for concrete coverage and error values would immediately yield concrete rate targets for laboratory systems.
Load-bearing premise
The mild assumptions on the channel parameters that make the converse and achievability bounds coincide.
What would settle it
An explicit input distribution and decoding rule whose achieved rate exceeds the claimed capacity expression for any set of parameters satisfying the mild assumptions would falsify the result.
Figures
read the original abstract
We introduce and study the synthesis-sequencing channel, a two-stage model for DNA-based data storage that jointly captures synthesis and sequencing effects. The synthesis-sequencing channel provides a more nuanced and realistic model of the DNA storage process compared to prior work, as it distinguishes between physical coverage after synthesis and sequencing coverage after readout, relaxes the assumption of independent errors across reads, and naturally induces coverage bias through the composition of synthesis and sequencing stages. We establish the information-theoretic capacity of this channel by deriving matching converse and achievability bounds for the case where synthesis and sequencing errors are modeled by binary symmetric channels with possibly different error probabilities, under mild assumptions on the channel parameters. Our results reveal multiple trade-offs between physical coverage, synthesis errors, sequencing coverage, and sequencing errors that influence the maximum achievable rate for reliable data storage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the synthesis-sequencing channel, a two-stage model for DNA-based data storage that composes synthesis and sequencing stages, each modeled as a binary symmetric channel (BSC) with possibly distinct crossover probabilities. It distinguishes physical coverage after synthesis from sequencing coverage after readout, relaxes independent-error assumptions across reads, and induces coverage bias naturally. The central claim is that the information-theoretic capacity is established via matching converse and achievability bounds under mild assumptions on the channel parameters, with the results highlighting trade-offs among physical coverage, synthesis errors, sequencing coverage, and sequencing errors.
Significance. If the matching bounds hold under the stated assumptions, the work supplies the first capacity characterization for a more realistic DNA-storage channel that captures coverage bias and error dependence. This would be a substantive advance over prior models that treat synthesis and sequencing separately or assume uniform coverage, and the explicit trade-off analysis could inform practical parameter choices in DNA storage systems.
major comments (1)
- [Abstract and capacity theorem statement] The abstract states that matching converse and achievability bounds are obtained 'under mild assumptions on the channel parameters,' yet the precise form of these assumptions (e.g., any inequalities relating the synthesis BSC crossover p, sequencing BSC crossover q, synthesis coverage, and sequencing coverage) is not visible in the provided text. Because the capacity result is explicitly conditioned on these assumptions, their necessity, scope, and practical relevance must be stated explicitly (ideally with a dedicated lemma or theorem statement) so that readers can determine whether the result applies to the general model or only to a restricted parameter region.
minor comments (2)
- [Model definition] Notation for the two coverage parameters (physical vs. sequencing) should be introduced with a clear diagram or table early in the model section to avoid later ambiguity when discussing bias.
- [Introduction] The abstract claims the model 'relaxes the assumption of independent errors across reads'; the manuscript should include a short paragraph contrasting the induced dependence structure with the independent-read models in prior work.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive suggestion. We address the major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract and capacity theorem statement] The abstract states that matching converse and achievability bounds are obtained 'under mild assumptions on the channel parameters,' yet the precise form of these assumptions (e.g., any inequalities relating the synthesis BSC crossover p, sequencing BSC crossover q, synthesis coverage, and sequencing coverage) is not visible in the provided text. Because the capacity result is explicitly conditioned on these assumptions, their necessity, scope, and practical relevance must be stated explicitly (ideally with a dedicated lemma or theorem statement) so that readers can determine whether the result applies to the general model or only to a restricted parameter region.
Authors: We agree that the assumptions conditioning the capacity result should be stated explicitly and visibly. In the revised version we will add a dedicated Lemma 1 that lists all required conditions on p, q, synthesis coverage, and sequencing coverage (including any inequalities needed for the matching bounds), together with a short discussion of their necessity for the proof and their practical relevance. The abstract and the statement of the main capacity theorem will be updated to reference this lemma directly. revision: yes
Circularity Check
No circularity; standard capacity derivation with explicit assumptions.
full rationale
The paper derives matching converse and achievability bounds for the synthesis-sequencing channel (modeled via two BSCs) using standard information-theoretic techniques. The result is conditioned on explicitly stated mild assumptions on parameters, but no step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain. The central claim remains independent of the paper's own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- synthesis error probability
- sequencing error probability
axioms (1)
- domain assumption Mild assumptions on the channel parameters allow the converse and achievability bounds to match.
Reference graph
Works this paper leans on
-
[1]
Next-generation digital information storage in DNA,
G. M. Church, Y . Gao, and S. Kosuri, “Next-generation digital information storage in DNA,”Science, vol. 337, no. 6102, pp. 1628–1628, 2012
2012
-
[2]
Robust chemical preservation of digital information on DNA in silica with error-correcting codes,
R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark, “Robust chemical preservation of digital information on DNA in silica with error-correcting codes,”Angewandte Chemie International Edition, vol. 54, no. 8, pp. 2552–2555, 2015
2015
-
[3]
Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,
N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,”nature, vol. 494, no. 7435, pp. 77–80, 2013
2013
-
[4]
Portable and error-free DNA-based data storage,
S. H. T. Yazdi, R. Gabrys, and O. Milenkovic, “Portable and error-free DNA-based data storage,”Scientific reports, vol. 7, no. 1, p. 5011, 2017
2017
-
[5]
Improved read/write cost tradeoff in DNA-based data storage using LDPC codes,
S. Chandak, K. Tatwawadi, B. Lau, J. Mardia, M. Kubit, J. Neu, P. Griffin, M. Wootters, T. Weissman, and H. Ji, “Improved read/write cost tradeoff in DNA-based data storage using LDPC codes,” in2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 147–156
2019
-
[6]
Hedges error-correcting code for DNA storage corrects indels and allows sequence constraints,
W. H. Press, J. A. Hawkins, S. K. Jones Jr, J. M. Schaub, and I. J. Finkelstein, “Hedges error-correcting code for DNA storage corrects indels and allows sequence constraints,”Proceedings of the National Academy of Sciences, vol. 117, no. 31, pp. 18 489–18 496, 2020
2020
-
[7]
DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage,
M. Welzel, P. M. Schwarz, H. F. Löchel, T. Kabdullayeva, S. Clemens, A. Becker, B. Freisleben, and D. Heider, “DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage,”Nature Communications, vol. 14, no. 1, p. 628, 2023
2023
-
[8]
Scalable and robust DNA-based storage via coding theory and deep learning,
D. Bar-Lev, I. Orr, O. Sabary, T. Etzion, and E. Yaakobi, “Scalable and robust DNA-based storage via coding theory and deep learning,”Nature Machine Intelligence, vol. 7, no. 4, pp. 639–649, 2025
2025
-
[9]
DNA-MGC+: A versatile codec for reliable and resource-efficient data storage on synthetic DNA,
R. Khabbaz, J. Mateos, M. Antonini, and S. Kas Hanna, “DNA-MGC+: A versatile codec for reliable and resource-efficient data storage on synthetic DNA,”bioRxiv preprint, 2026
2026
-
[10]
A characterization of the DNA data storage channel,
R. Heckel, G. Mikutis, and R. N. Grass, “A characterization of the DNA data storage channel,”Scientific reports, vol. 9, no. 1, p. 9663, 2019
2019
-
[11]
Information-theoretic foundations of DNA data storage,
I. Shomorony and R. Heckel, “Information-theoretic foundations of DNA data storage,”Foundations and Trends® in Communications and Information Theory, vol. 19, no. 1, pp. 1–106, 2022
2022
-
[12]
Fundamental limits of DNA storage systems,
R. Heckel, I. Shomorony, K. Ramchandran, and D. N. Tse, “Fundamental limits of DNA storage systems,” in2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 3130–3134
2017
-
[13]
Capacity results for the noisy shuffling channel,
I. Shomorony and R. Heckel, “Capacity results for the noisy shuffling channel,” in2019 IEEE International Symposium on Information Theory (ISIT). IEEE, 2019, pp. 762–766
2019
-
[14]
DNA-based storage: Models and fundamental limits,
——, “DNA-based storage: Models and fundamental limits,”IEEE Transactions on Information Theory, vol. 67, no. 6, pp. 3675–3689, 2021
2021
-
[15]
The DNA storage channel: Capacity and error probability bounds,
N. Weinberger and N. Merhav, “The DNA storage channel: Capacity and error probability bounds,”IEEE Transactions on Information Theory, vol. 68, no. 9, pp. 5657–5700, 2022
2022
-
[16]
The noisy drawing channel: Reliable data storage in DNA sequences,
A. Lenz, P. H. Siegel, A. Wachter-Zeh, and E. Yaakobi, “The noisy drawing channel: Reliable data storage in DNA sequences,”IEEE Transactions on Information Theory, vol. 69, no. 5, pp. 2757–2778, 2023
2023
-
[17]
DNA storage in the short molecule regime,
R. Tamir, N. Weinberger, and A. Guillén i Fàbregas, “DNA storage in the short molecule regime,” 2025. [Online]. Available: https://arxiv.org/abs/2511.14284
arXiv 2025
-
[18]
R. Tamir, N. Weinberger, and A. Guillén i Fàbregas, “Concatenated codes for short-molecule DNA storage with sequencing channels of positive zero-undetected-error capacity,” 2026, https://arxiv.org/abs/2602.12800
Pith/arXiv arXiv 2026
-
[19]
Error probability bounds for coded-index DNA storage systems,
N. Weinberger, “Error probability bounds for coded-index DNA storage systems,”IEEE Transactions on Information Theory, vol. 68, no. 11, pp. 7005–7022, 2022
2022
-
[20]
Exact error exponents of concatenated codes for DNA storage,
Y . H. Ling and J. Scarlett, “Exact error exponents of concatenated codes for DNA storage,”IEEE Transactions on Information Theory, vol. 71, no. 9, pp. 6566–6585, 2025
2025
-
[21]
Error exponents for DNA storage codes with a variable number of reads,
Y . H. Ling, N. Weinberger, and J. Scarlett, “Error exponents for DNA storage codes with a variable number of reads,” IEEE Journal on Selected Areas in Information Theory, vol. 6, pp. 205–216, 2025
2025
-
[22]
Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction,
P. L. Antkowiak, J. Lietard, M. Z. Darestani, M. M. Somoza, W. J. Stark, R. Heckel, and R. N. Grass, “Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction,”Nature communications, vol. 11, no. 1, p. 5345, 2020
2020
-
[23]
Comparison of state-of-the-art error-correction coding for sequence-based DNA data storage,
A. L. Gimpel, A. Remschak, W. J. Stark, R. Heckel, and R. N. Grass, “Comparison of state-of-the-art error-correction coding for sequence-based DNA data storage,”Nature Communications, 2026
2026
-
[24]
A digital twin for DNA data storage based on comprehensive quantification of errors and biases,
A. L. Gimpel, W. J. Stark, R. Heckel, and R. N. Grass, “A digital twin for DNA data storage based on comprehensive quantification of errors and biases,”Nature Communications, vol. 14, no. 1, p. 6026, 2023
2023
-
[25]
T. M. Cover and J. A. Thomas,Elements of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, July 2006
2006
-
[26]
A mathematical theory of communication,
C. E. Shannon, “A mathematical theory of communication,”The Bell System Technical Journal, vol. 27, pp. 379–423,
-
[27]
Available: http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
[Online]. Available: http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.