
arxiv: 2606.07216 · v1 · pith:HBT2NZPUnew · submitted 2026-06-05 · 💻 cs.IT · cs.ET · math.IT

The Synthesis-Sequencing Channel for DNA-based Data Storage

Pith reviewed 2026-06-27 20:38 UTC · model grok-4.3

classification 💻 cs.IT · cs.ET · math.IT
keywords DNA-based data storage · synthesis-sequencing channel · information-theoretic capacity · binary symmetric channel · coverage bias · converse bound · achievability bound

The pith

The synthesis-sequencing channel has an exact information-theoretic capacity given by matching converse and achievability bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the synthesis-sequencing channel as a two-stage model for DNA data storage that first applies synthesis to produce physical strands and then applies sequencing to produce reads. It derives matching upper and lower bounds on capacity when both stages are binary symmetric channels with possibly different error probabilities. The bounds hold under mild assumptions on the parameters and incorporate distinct physical coverage and sequencing coverage. This matters because the model captures coverage bias and relaxes independent-error assumptions that appeared in earlier work. The resulting capacity expression therefore quantifies the fundamental storage rate limit under the combined effects of synthesis errors, sequencing errors, and coverage depths.

Core claim

The synthesis-sequencing channel is formed by composing a synthesis stage followed by a sequencing stage. When synthesis and sequencing are each modeled by a binary symmetric channel, matching converse and achievability bounds establish the exact capacity under mild assumptions on the error probabilities and coverage parameters.
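
The two-stage composition can be made concrete with a small simulation. What follows is a minimal sketch for a single binary strand, not the paper's formal definition: the helper names `bsc` and `synthesis_sequencing_channel`, the fixed number of physical copies, and the rule that each read picks a physical copy uniformly at random are all assumptions made for illustration.

```python
import numpy as np

def bsc(bits, crossover, rng):
    """Binary symmetric channel: flip each bit independently with probability `crossover`."""
    flips = rng.random(bits.shape) < crossover
    return np.bitwise_xor(bits, flips.astype(bits.dtype))

def synthesis_sequencing_channel(strand, n_copies, n_reads, p, q, rng):
    """Hypothetical two-stage sketch for one strand.

    Stage 1 (synthesis): make `n_copies` physical copies, each through BSC(p).
    Stage 2 (sequencing): each read picks one physical copy uniformly at
    random (an assumption) and passes it through BSC(q).
    """
    copies = bsc(np.tile(strand, (n_copies, 1)), p, rng)
    picks = rng.integers(0, n_copies, size=n_reads)
    reads = bsc(copies[picks], q, rng)
    return copies, picks, reads

rng = np.random.default_rng(0)
strand = rng.integers(0, 2, size=16, dtype=np.uint8)
copies, picks, reads = synthesis_sequencing_channel(
    strand, n_copies=3, n_reads=5, p=0.05, q=0.01, rng=rng
)
print("copy index behind each read:", picks)
print("reads:\n", reads)
```

Reads that share a copy index inherit the same synthesis errors, which is one way to picture how the composed channel couples reads even though each stage is memoryless.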

What carries the argument

The synthesis-sequencing channel, whose output statistics arise from the successive action of synthesis coverage, synthesis errors, sequencing coverage, and sequencing errors.

If this is right

  • The maximum reliable rate is determined by the product of synthesis coverage, sequencing coverage, and the two binary symmetric channel capacities adjusted for the induced bias.
  • Increasing physical coverage after synthesis trades off against increasing sequencing coverage in a manner that is not symmetric.
  • The model produces correlated errors across reads even when each individual stage is memoryless.
  • Exact capacity formulas become available for any parameter tuple obeying the mild conditions.
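
The bullets above refer to binary symmetric channel capacities without giving numbers. The sketch below computes only the textbook building blocks: the capacity 1 - h(p) of a single BSC, and the effective crossover p(1 - q) + q(1 - p) seen by one read of one physical copy when the two stages act independently. It is not the paper's capacity expression, which is not reproduced on this page, and the function names are illustrative.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a single BSC with crossover probability p, in bits per use."""
    return 1.0 - binary_entropy(p)

def cascade_crossover(p, q):
    """Crossover of BSC(p) followed by an independent BSC(q)."""
    return p * (1 - q) + q * (1 - p)

p, q = 0.05, 0.01  # example values, chosen arbitrarily
eff = cascade_crossover(p, q)
print("synthesis BSC capacity: ", bsc_capacity(p))
print("sequencing BSC capacity:", bsc_capacity(q))
print("single-read cascade crossover:", eff, "capacity:", bsc_capacity(eff))
```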

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could allocate total sequencing effort between deeper synthesis and deeper readout to maximize the derived capacity.
  • The same two-stage composition idea could be applied to other storage media that undergo fabrication and readout stages.
  • Numerical evaluation of the capacity expression for concrete coverage and error values would immediately yield concrete rate targets for laboratory systems.

Load-bearing premise

The mild assumptions on the channel parameters that make the converse and achievability bounds coincide.

What would settle it

An explicit input distribution and decoding rule whose achieved rate exceeds the claimed capacity expression for any set of parameters satisfying the mild assumptions would falsify the result.

Figures

Figures reproduced from arXiv: 2606.07216 by João Ribeiro, Keshav Goyal, Samuel Pearson, Serge Kas Hanna.

Figure 1: Schematic illustration of the DNA synthesis–sequencing channel. Red symbols indicate synthesis errors that appear …
Figure 2: Capacity (a) and coverage distributions (b) for the sequencing-only channel and several instances of the synthesis– …
Original abstract

We introduce and study the synthesis-sequencing channel, a two-stage model for DNA-based data storage that jointly captures synthesis and sequencing effects. The synthesis-sequencing channel provides a more nuanced and realistic model of the DNA storage process compared to prior work, as it distinguishes between physical coverage after synthesis and sequencing coverage after readout, relaxes the assumption of independent errors across reads, and naturally induces coverage bias through the composition of synthesis and sequencing stages. We establish the information-theoretic capacity of this channel by deriving matching converse and achievability bounds for the case where synthesis and sequencing errors are modeled by binary symmetric channels with possibly different error probabilities, under mild assumptions on the channel parameters. Our results reveal multiple trade-offs between physical coverage, synthesis errors, sequencing coverage, and sequencing errors that influence the maximum achievable rate for reliable data storage.
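
The abstract says coverage bias arises from composing the two stages. One way to see why is a small simulation. The sketch below rests on assumptions the abstract does not confirm: the number of physical copies per strand is drawn as Poisson with mean equal to the physical coverage, and each read picks a physical copy uniformly at random from the pool. The function name `reads_per_strand` is hypothetical.

```python
import numpy as np

def reads_per_strand(n_strands, physical_coverage, sequencing_coverage, rng):
    """Hypothetical two-stage sampling sketch.

    Assumption: copies per strand ~ Poisson(physical_coverage).
    Assumption: each read picks one physical copy uniformly from the pool,
    so a strand is read in proportion to its number of copies.
    """
    copies = rng.poisson(physical_coverage, size=n_strands)
    n_reads = int(round(sequencing_coverage * n_strands))
    total = copies.sum()
    if total == 0:
        # Nothing was synthesized, so nothing can be read.
        return copies, np.zeros(n_strands, dtype=np.int64)
    reads = rng.multinomial(n_reads, copies / total)
    return copies, reads

rng = np.random.default_rng(0)
copies, two_stage = reads_per_strand(10_000, 2.0, 5.0, rng)
# Baseline for comparison: reads spread uniformly over strands (sequencing only).
one_stage = rng.multinomial(50_000, np.full(10_000, 1 / 10_000))
print("variance of reads per strand, two-stage:", two_stage.var())
print("variance of reads per strand, one-stage:", one_stage.var())
print("strands never read, two-stage:", int((two_stage == 0).sum()))
```

Under these assumptions, strands with few or no physical copies are read less often, so the read counts spread out more than in the sequencing-only baseline.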

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the synthesis-sequencing channel, a two-stage model for DNA-based data storage that composes synthesis and sequencing stages, each modeled as a binary symmetric channel (BSC) with possibly distinct crossover probabilities. It distinguishes physical coverage after synthesis from sequencing coverage after readout, relaxes independent-error assumptions across reads, and induces coverage bias naturally. The central claim is that the information-theoretic capacity is established via matching converse and achievability bounds under mild assumptions on the channel parameters, with the results highlighting trade-offs among physical coverage, synthesis errors, sequencing coverage, and sequencing errors.

Significance. If the matching bounds hold under the stated assumptions, the work supplies the first capacity characterization for a more realistic DNA-storage channel that captures coverage bias and error dependence. This would be a substantive advance over prior models that treat synthesis and sequencing separately or assume uniform coverage, and the explicit trade-off analysis could inform practical parameter choices in DNA storage systems.

major comments (1)
  1. [Abstract and capacity theorem statement] The abstract states that matching converse and achievability bounds are obtained 'under mild assumptions on the channel parameters,' yet the precise form of these assumptions (e.g., any inequalities relating the synthesis BSC crossover p, sequencing BSC crossover q, synthesis coverage, and sequencing coverage) is not visible in the provided text. Because the capacity result is explicitly conditioned on these assumptions, their necessity, scope, and practical relevance must be stated explicitly (ideally with a dedicated lemma or theorem statement) so that readers can determine whether the result applies to the general model or only to a restricted parameter region.
minor comments (2)
  1. [Model definition] Notation for the two coverage parameters (physical vs. sequencing) should be introduced with a clear diagram or table early in the model section to avoid later ambiguity when discussing bias.
  2. [Introduction] The abstract claims the model 'relaxes the assumption of independent errors across reads'; the manuscript should include a short paragraph contrasting the induced dependence structure with the independent-read models in prior work.
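
The dependence structure raised in the second minor comment can be illustrated at a single bit position. The sketch below assumes that two reads of the same physical copy share its synthesis error and receive independent sequencing errors. This is an interpretation of the abstract, not the paper's stated model, and the function names are illustrative. The comparison model treats each read as an independent pass through the cascade of the two BSCs.

```python
def both_wrong_shared_copy(p, q):
    """P(both reads wrong) when two reads come from the same physical copy.

    Synthesis error (prob. p): a read is wrong unless sequencing flips it back.
    No synthesis error (prob. 1 - p): a read is wrong only if sequencing flips it.
    """
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def both_wrong_independent(p, q):
    """P(both reads wrong) if each read saw an independent synthesis error."""
    eff = p * (1 - q) + q * (1 - p)  # cascade crossover of BSC(p) then BSC(q)
    return eff ** 2

p, q = 0.05, 0.01  # example values, chosen arbitrarily
shared = both_wrong_shared_copy(p, q)
indep = both_wrong_independent(p, q)
print("shared copy:", shared)
print("independent:", indep)
print("covariance of the two error indicators:", shared - indep)
```

The gap between the two probabilities equals p(1 - p)(1 - 2q)^2, which is positive whenever 0 < p < 1 and q differs from 1/2, so errors in reads of a common copy are positively correlated under this assumption.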

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive suggestion. We address the major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and capacity theorem statement] The abstract states that matching converse and achievability bounds are obtained 'under mild assumptions on the channel parameters,' yet the precise form of these assumptions (e.g., any inequalities relating the synthesis BSC crossover p, sequencing BSC crossover q, synthesis coverage, and sequencing coverage) is not visible in the provided text. Because the capacity result is explicitly conditioned on these assumptions, their necessity, scope, and practical relevance must be stated explicitly (ideally with a dedicated lemma or theorem statement) so that readers can determine whether the result applies to the general model or only to a restricted parameter region.

    Authors: We agree that the assumptions conditioning the capacity result should be stated explicitly and visibly. In the revised version we will add a dedicated Lemma 1 that lists all required conditions on p, q, synthesis coverage, and sequencing coverage (including any inequalities needed for the matching bounds), together with a short discussion of their necessity for the proof and their practical relevance. The abstract and the statement of the main capacity theorem will be updated to reference this lemma directly.

    revision: yes

Circularity Check

0 steps flagged

No circularity; standard capacity derivation under stated assumptions.

Full rationale

The paper derives matching converse and achievability bounds for the synthesis-sequencing channel (modeled via two BSCs) using standard information-theoretic techniques. The result is conditioned on mild assumptions on the parameters, which the abstract invokes without spelling out, but no step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain. The central claim remains independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Abstract-only review; ledger populated from explicitly stated elements. The model introduces a composite channel but relies on standard BSC assumptions and unspecified mild parameter conditions for the capacity result.

free parameters (2)
  • synthesis error probability
    Parameter of the synthesis BSC stage; required for the capacity expression.
  • sequencing error probability
    Parameter of the sequencing BSC stage; required for the capacity expression.
axioms (1)
  • domain assumption: Mild assumptions on the channel parameters allow the converse and achievability bounds to match.
    Explicitly invoked in the abstract as necessary for the capacity result.

pith-pipeline@v0.9.1-grok · 5673 in / 1247 out tokens · 33907 ms · 2026-06-27T20:38:38.518287+00:00 · methodology

