Recognition: unknown
Faster Algorithms for Shortest Unique or Absent Substrings
Pith reviewed 2026-05-08 17:03 UTC · model grok-4.3
The pith
Algorithms compute shortest unique and absent substrings in O(n log σ / √log n) time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing the search according to the length and period of candidate substrings and employing synchronizing sets, run analysis, and wavelet trees, the computation of a shortest unique substring reduces to a simple geometric problem solvable in O(n log σ / √log n) time; adapting the same reduction together with efficient de Bruijn sequence construction gives the same bound for shortest absent substrings.
What carries the argument
Decomposition of the problem by length and period of the target substring, together with synchronizing sets, run analysis, and wavelet trees that reduce SUS and SAS computation to a geometric problem.
Load-bearing premise
The length-and-period decomposition, synchronizing sets, run analysis, and wavelet trees correctly reduce SUS and SAS to a geometric problem solvable in the claimed time bound inside the word-RAM model.
What would settle it
A concrete string S of length n over small alphabet where the algorithm either returns an incorrect SUS or SAS or exceeds O(n log σ / √log n) operations on a standard word-RAM implementation.
Figures
read the original abstract
We revisit two well-known algorithmic problems on strings: computing a shortest unique substring (SUS) and a shortest absent substring (SAS) of a string $S$ of length $n$. Both problems admit folklore $\mathcal{O}(n)$-time solutions using the suffix tree of $S$. However, for small alphabets, this complexity is not necessarily optimal in the word RAM model, where a string of length $n$ over alphabet $[0,\sigma)$ can be stored in $\mathcal{O}(n \log \sigma/\log n)$ space and read in $\mathcal{O}(n \log \sigma/\log n)$ time. We present an $\mathcal{O}(n \log \sigma/\sqrt{\log n})$-time algorithm for computing a SUS of $S$. This algorithm decomposes the problem according to the length and the period of the sought substring and uses several tools and techniques, such as synchronizing sets, the analysis of runs, and wavelet trees, to reduce the computation of a SUS to a simple geometric problem. Further, we adapt this algorithm and combine it with an efficient construction of de Bruijn sequences in order to obtain an $\mathcal{O}(n \log \sigma/\sqrt{\log n})$-time algorithm for computing a SAS of $S$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims O(n log σ / √log n)-time algorithms for computing a shortest unique substring (SUS) and a shortest absent substring (SAS) of a string S of length n. The SUS algorithm decomposes the problem by substring length and period, then uses synchronizing sets, run analysis, and wavelet trees to reduce SUS computation to a geometric problem. The SAS algorithm adapts the same framework and combines it with an efficient de Bruijn sequence construction to achieve the same bound, improving on the folklore O(n) suffix-tree solutions in the word-RAM model.
Significance. If the reductions and time analysis hold, the result offers a non-trivial asymptotic improvement over linear-time methods for these string problems when the alphabet is small relative to n. The work appropriately credits and builds upon established primitives (runs, synchronizing sets, wavelet trees, de Bruijn sequences) rather than reinventing them, which strengthens the contribution. The approach could be relevant for large-scale string processing applications.
major comments (2)
- [SUS algorithm reduction (length/period decomposition and geometric mapping)] The O(n log σ / √log n) bound for SUS rests on the length/period decomposition correctly partitioning all candidate substrings and the subsequent reduction via synchronizing sets and run analysis mapping the problem to geometry without extra logarithmic factors (see the description following the abstract and the claimed reduction). The manuscript must explicitly verify coverage of long-period cases and boundary overlaps in the word-RAM model; otherwise the improvement over O(n) fails.
- [Time analysis of geometric queries via wavelet trees] The wavelet-tree solution to the resulting geometric problem must be shown to achieve the full √log n cancellation in the word-RAM model without hidden log factors from query overhead or data-structure construction (the time analysis paragraph after the reduction).
minor comments (2)
- [Abstract] The abstract could include a one-sentence pointer to the specific prior O(n) suffix-tree results being improved upon.
- [Preliminaries] Notation for alphabet size (σ) and string length (n) is clear but should be restated at the start of the technical sections for readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We appreciate the recognition of the significance of the asymptotic improvements and the appropriate use of established primitives. We address each major comment below and will incorporate the requested clarifications and expansions into the revised version.
read point-by-point responses
-
Referee: [SUS algorithm reduction (length/period decomposition and geometric mapping)] The O(n log σ / √log n) bound for SUS rests on the length/period decomposition correctly partitioning all candidate substrings and the subsequent reduction via synchronizing sets and run analysis mapping the problem to geometry without extra logarithmic factors (see the description following the abstract and the claimed reduction). The manuscript must explicitly verify coverage of long-period cases and boundary overlaps in the word-RAM model; otherwise the improvement over O(n) fails.
Authors: We agree that explicit verification strengthens the presentation. The length/period decomposition partitions candidates into short-period and long-period regimes, with synchronizing sets identifying unique positions and run analysis (via the runs theorem) handling periodic repetitions to ensure every possible substring is considered exactly once. Boundary overlaps are resolved by the geometric mapping, which encodes positions and periods without introducing extra log factors in the word-RAM model. In the revision we will insert a dedicated lemma (with proof) that formally verifies complete coverage of long-period cases and all boundary conditions. revision: yes
-
Referee: [Time analysis of geometric queries via wavelet trees] The wavelet-tree solution to the resulting geometric problem must be shown to achieve the full √log n cancellation in the word-RAM model without hidden log factors from query overhead or data-structure construction (the time analysis paragraph after the reduction).
Authors: The wavelet-tree queries are performed on a structure built over the geometric encoding of candidate positions; each query costs O(log σ) time in the word-RAM model, and the total number of queries is bounded by O(n log σ / √log n) after the decomposition. Construction time is linear in the size of the encoded instance (O(n log σ / log n) words), which is absorbed into the overall bound. We will expand the time-analysis paragraph to include an explicit step-by-step accounting that isolates the √log n cancellation and confirms no additional logarithmic overhead arises from either queries or preprocessing. revision: yes
Circularity Check
No significant circularity in the algorithmic reduction for SUS and SAS
full rationale
The paper's derivation is a constructive reduction: it decomposes SUS/SAS computation by substring length and period, then applies standard stringology primitives (synchronizing sets, maximal runs, wavelet trees) to map the remaining cases to a geometric problem whose solution yields the stated O(n log σ/√log n) bound in the word-RAM model. The SAS variant further adapts the same reduction and invokes an efficient de Bruijn sequence construction, a well-known independent primitive. No equation, time bound, or correctness claim reduces by definition to the paper's own inputs, nor does any load-bearing step rely on a self-citation chain that itself lacks external verification. All cited components (wavelet trees, run analysis, synchronizing sets) are established results outside the present work, making the overall derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Suffix trees and runs in strings can be constructed and analyzed in linear time under the word-RAM model.
- domain assumption Synchronizing sets and wavelet trees admit efficient construction and query times in the word-RAM model for small alphabets.
Reference graph
Works this paper leans on
-
[1]
Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya
Maxim A. Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. Wavelet trees meetsuffixtrees. InProceedingsoftheTwenty-SixthAnnualACM-SIAMSymposiumonDiscreteAlgorithms, SODA 2015, pages 572–591, 2015.doi:10.1137/1.9781611973730.39
-
[2]
Ricardo A. Baeza-Yates. Improved string searching.Softw. Pract. Exp., 19(3):257–271, 1989.doi: 10.1002/SPE.4380190305
-
[3]
Lyndon arrays in sublinear time
Hideo Bannai and Jonas Ellert. Lyndon arrays in sublinear time. In31st Annual European Symposium on Algorithms, ESA 2023, volume 274 ofLIPIcs, pages 14:1–14:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.doi:10.4230/LIPICS.ESA.2023.14
-
[4]
Jérémy Barbay, Johannes Fischer, and Gonzalo Navarro. Lrm-trees: Compressed indices, adaptive sorting, andcompressedpermutations.Theor.Comput.Sci.,459:26–41,2012.URL:https://doi.org/10.1016/ j.tcs.2012.08.010,doi:10.1016/J.TCS.2012.08.010
-
[5]
Worst case efficient single and multiple string matching in the RAM model
Djamal Belazzougui. Worst case efficient single and multiple string matching in the RAM model. In Combinatorial Algorithms - 21st International Workshop, IWOCA 2010, volume 6460 ofLecture Notes in Computer Science, pages 90–102. Springer, 2010.doi:10.1007/978-3-642-19222-7_10. 27
-
[6]
Towards optimal packed string matching.Theoretical Computer Science, 525:111–129, 2014.doi:10
Oren Ben-Kiki, Philip Bille, Dany Breslauer, Leszek Gasieniec, Roberto Grossi, and Oren Weimann. Towards optimal packed string matching.Theoretical Computer Science, 525:111–129, 2014.doi:10. 1016/j.tcs.2013.06.013
2014
-
[7]
LATIN 2000: Theoretical Informatics
Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario,andAlfredoViola,editors,LATIN2000: TheoreticalInformatics,4thLatinAmericanSymposium, 2000, Proceedings, volume 1776 ofLecture Notes in Computer Science, pages 88–94. Springer, 2000. doi:10.1007/10719839\_9
-
[8]
Philip Bille. Fast searching in packed strings.Journal of Discrete Algorithms, 9(1):49–56, 2011.doi: 10.1016/j.jda.2010.09.003
-
[9]
Deterministic indexing for packed strings
Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen. Deterministic indexing for packed strings. In28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, volume 78 ofLIPIcs, pages 6:1–6:11.SchlossDagstuhl-Leibniz-ZentrumfürInformatik,2017.doi:10.4230/LIPIcs.CPM.2017.6
-
[10]
Constant-time word-size string matching
Dany Breslauer, Leszek Gasieniec, and Roberto Grossi. Constant-time word-size string matching. In Combinatorial Pattern Matching - 23rd Annual Symposium, CPM 2012, Proceedings, volume 7354 of Lecture Notes in Computer Science, pages 83–96. Springer, 2012.doi:10.1007/978-3-642-31265-6\ _7
-
[11]
Panagiotis Charalampopoulos, Maxime Crochemore, Gabriele Fici, Robert Mercas, and Solon P. Pissis. Alignment-free sequence comparison using absent words.Inf. Comput., 262:57–68, 2018.doi:10.1016/ J.IC.2018.06.002
2018
-
[12]
Iliopoulos, Tomasz Kociumaka, Solon P
Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis,JakubRadoszewski,WojciechRytter,andTomaszWalen. Linear-TimeAlgorithmforLongLCFwith k Mismatches. In29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018), volume 105 ofLeibniz International Proceedings in Informatics (LIPIcs), pages 23:1–23:...
-
[13]
Iliopoulos, Chang Liu, and Solon P
Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Property suffix array with applications in indexing weighted sequences.ACM J. Exp. Algorithmics, 25:1–16, 2020.doi: 10.1145/3385898
-
[14]
Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, and Solon P. Pissis. Faster algo- rithms for longest common substring.ACM Trans. Algorithms, 2025.doi:10.1145/3774754
-
[15]
Counting distinct square substrings in sublinear time
Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen, and Wiktor Zuba. Counting distinct square substrings in sublinear time. In50th International Symposium on Mathematical Foundations of Computer Science, MFCS 2025, volume 345 ofLIPIcs, pages 36:1–36:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2025.doi:1...
-
[16]
Pissis, and Jakub Radoszewski
Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest palindromic substring in sublinear time. In33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, volume 223 ofLIPIcs, pages 20:1–20:9. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.doi:10.4230/ LIPICS.CPM.2022.20
2022
-
[17]
Data compression using antidictionaries.Proc
Maxime Crochemore, Filippo Mignosi, Antonio Restivo, and Sergio Salemi. Data compression using antidictionaries.Proc. IEEE, 88(11):1756–1768, 2000.doi:10.1109/5.892711
-
[18]
A combinatorial problem.Proc
Nicolaas Govert de Bruijn. A combinatorial problem.Proc. Koninklijke Nederlandse Akademie V. Wetenschappen, 49:758–764, 1946. URL:http://resolver.tudelft.nl/uuid: f9a56551-402a-4428-98e6-1469e38e64c1. 28
1946
-
[19]
Jean-Pierre Duval. Génération d’une section des classes de conjugaison et arbre des mots de lyndon de longueur bornée.Theor. Comput. Sci., 60:255–283, 1988.doi:10.1016/0304-3975(88)90113-2
-
[20]
String Processing and Information Retrieval - 30th International Symposium,
Jonas Ellert. Sublinear time Lempel-Ziv (LZ77) factorization. InString Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Proceedings, volume 14240 ofLecture Notes in Computer Science, pages 171–187. Springer, 2023.doi:10.1007/978-3-031-43980-3\_14
-
[21]
Time-optimal construction of string synchronizing sets
Jonas Ellert and Tomasz Kociumaka. Time-optimal construction of string synchronizing sets. In43rd International Symposium on Theoretical Aspects of Computer Science, STACS 2026, LIPIcs, pages 36:1– 36:22.SchlossDagstuhl-Leibniz-ZentrumfürInformatik,2026.doi:10.4230/LIPICS.STACS.2026.36
-
[22]
Optimal suffix tree construction with large alphabets
Martin Farach. Optimal suffix tree construction with large alphabets. In38th Annual Symposium on Foundations of Computer Science, FOCS 1997, pages 137–143. IEEE Computer Society, 1997.doi: 10.1109/SFCS.1997.646102
-
[23]
Alphabet-dependent string searching with wexponential search trees
Johannes Fischer and Paweł Gawrychowski. Alphabet-dependent string searching with wexponential search trees. InCombinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Proceed- ings, volume 9133 ofLecture Notes in Computer Science, pages 160–171. Springer, 2015.doi: 10.1007/978-3-319-19929-0\_14
-
[24]
Necklaces of beads in k colors and k-ary de bruijn sequences
Harold Fredricksen and James Maiorana. Necklaces of beads in k colors and k-ary de bruijn sequences. Discret. Math., 23(3):207–210, 1978.doi:10.1016/0012-365X(78)90002-X
-
[25]
Kimmo Fredriksson. Shift-or string matching with super-alphabets.Information Processing Letters, 87(4):201–204, 2003.doi:10.1016/S0020-0190(03)00296-5
-
[26]
Szymon Grabowski and Kimmo Fredriksson. Bit-parallel string matching under hamming distance in o(n[m/w])worstcasetime.Inf.Process.Lett.,105(5):182–187,2008.doi:10.1016/J.IPL.2007.08.021
-
[27]
High-order entropy-compressed text indexes
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841–850,
- [28]
-
[29]
Genome comparison with- out alignment using shortest unique substrings.BMC Bioinform., 6:123, 2005.doi:10.1186/ 1471-2105-6-123
Bernhard Haubold, Nora Pierstorff, Friedrich Möller, and Thomas Wiehe. Genome comparison with- out alignment using shortest unique substrings.BMC Bioinform., 6:123, 2005.doi:10.1186/ 1471-2105-6-123
2005
-
[30]
Costas S. Iliopoulos, Dennis W. G. Moore, and William F. Smyth. A characterization of the squares in a fibonaccistring.Theor.Comput.Sci.,172(1-2):281–291,1997.doi:10.1016/S0304-3975(96)00141-7
-
[31]
Linear-time longest-common- prefix computation in suffix arrays and its applications
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common- prefix computation in suffix arrays and its applications. InCombinatorial Pattern Matching, 12th Annual Symposium, CPM 2001, volume 2089 ofLecture Notes in Computer Science, pages 181–192. Springer, 2001.doi:10.1007/3-540-48194-X\_17
-
[32]
An Optimal Space Lower Bound for Approximating MAX-CUT
DominikKempaandTomaszKociumaka. Stringsynchronizingsets: sublinear-timeBWTconstructionand optimal LCE data structure. InProceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 756–767. ACM, 2019.doi:10.1145/3313276.3316368
-
[33]
Breaking theO(n)-barrier in the construction of compressed suffix arrays and suffix trees
Dominik Kempa and Tomasz Kociumaka. Breaking the𝑂(n)-barrier in the construction of compressed suffix arrays and suffix trees. InProceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, pages 5122–5202. SIAM, 2023.doi:10.1137/1.9781611977554.CH187. 29
-
[34]
Dominik Kempa and Tomasz Kociumaka. Lempel-Ziv (LZ77) factorization in sublinear time. In65th IEEEAnnualSymposiumonFoundationsofComputerScience,FOCS2024,pages2045–2055.IEEE,2024. doi:10.1109/FOCS61266.2024.00122
-
[35]
On the hardness hierarchy for the o(n√log n) complexity in the word RAM
Dominik Kempa and Tomasz Kociumaka. On the hardness hierarchy for the o(n√log n) complexity in the word RAM. InProceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC 2025, pages 290–300. ACM, 2025.doi:10.1145/3717823.3718291
-
[36]
Ian Munro, Yakov Nekrich, and Jeffrey Scott Vitter
J. Ian Munro, Yakov Nekrich, and Jeffrey Scott Vitter. Fast construction of wavelet trees.Theor. Comput. Sci., 638:91–97, 2016.doi:10.1016/j.tcs.2015.11.011
-
[37]
A bit-parallel approach to suffix automata: Fast extended string matching
Gonzalo Navarro and Mathieu Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. InCombinatorialPatternMatching,9thAnnualSymposium,CPM98,Proceedings,volume1448 ofLecture Notes in Computer Science, pages 14–33. Springer, 1998.doi:10.1007/BFB0030778
-
[38]
On shortest unique substring queries
Jian Pei, Wush Chi-Hsuan Wu, and Mi-Yen Yeh. On shortest unique substring queries. In29th IEEE InternationalConferenceonDataEngineering,ICDE2013,pages937–948.IEEEComputerSociety,2013. doi:10.1109/ICDE.2013.6544887
-
[39]
Diogo Pratas and Jorge Miguel Silva. Persistent minimal sequences of SARS-CoV-2.Bioinform., 36(21):5129–5132, 2021.doi:10.1093/BIOINFORMATICS/BTAA686
-
[40]
On computing the smallest suffixient set
Jakub Radoszewski and Wiktor Zuba. Computing string covers in sublinear time. InString Processing and InformationRetrieval-31stInternationalSymposium, SPIRE2024, Proceedings, volume14899ofLecture Notes in Computer Science, pages 272–288. Springer, 2024.doi:10.1007/978-3-031-72200-4\_21
-
[41]
Silva, Diogo Pratas, Luísa Castro, Armando J
Raquel M. Silva, Diogo Pratas, Luísa Castro, Armando J. Pinho, and Paulo Jorge S. G. Ferreira. Three minimal sequences found in ebola virus genomes and absent from human DNA.Bioinform., 31(15):2421– 2425, 2015.doi:10.1093/BIOINFORMATICS/BTV189
-
[42]
Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees.J. Comput. Syst. Sci., 26(3):362–391, 1983.doi:10.1016/0022-0000(83)90006-5
-
[43]
Takuya Takagi, Shunsuke Inenaga, Kunihiko Sadakane, and Hiroki Arimura. Packed compact tries: A fast andefficientdatastructureforonlinestringprocessing.IEICETrans.Fundam.Electron.Commun.Comput. Sci., 100-A(9):1785–1793, 2017.doi:10.1587/TRANSFUN.E100.A.1785
-
[44]
Linear pattern matching algorithms
Peter Weiner. Linear pattern matching algorithms. In14th Annual Symposium on Switching and Automata Theory, pages 1–11. IEEE Computer Society, 1973.doi:10.1109/SWAT.1973.13. 30
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.