arxiv: 2604.02244 · v3 · submitted 2026-04-02 · 💻 cs.FL · cs.LG

Recognition: no theorem link

PAC learning PDFA from data streams

Robert Baumgartner , Sicco Verwer

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:54 UTC · model grok-4.3

classification 💻 cs.FL cs.LG

keywords PDFAdata streamsPAC learningstate mergingautomata inferencesketchingstreaming algorithms

0 comments

The pith

A streaming algorithm learns PDFAs from data streams with PAC guarantees and faster runtime for large samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a generic method for learning probabilistic deterministic finite automata from continuous data streams. It uses a merge heuristic based on sketches to manage incomplete prefix trees built from streaming data. The approach is formally shown to be capable of learning within the PAC framework, providing bounds on sample size for accurate models. A theoretical improvement is introduced to enhance runtime efficiency without affecting correctness as sample sizes grow. This addresses the gap in learning state machines when data arrives continuously rather than all at once.

Core claim

The paper establishes a streaming state merging algorithm for PDFAs that incrementally processes data, maintaining an approximate prefix tree and using frequency sketches to decide state merges. It proves that this algorithm satisfies PAC learning conditions, ensuring that with enough samples it converges to the target model. Additionally, it demonstrates a way to optimize the merge process for better performance on larger streams.

What carries the argument

The sketch-based heuristic for deciding merges in an incrementally built prefix tree, which approximates state probabilities to enable correct decisions under streaming constraints.

If this is right

The algorithm can learn accurate PDFA models from ongoing data streams.
It provides runtime improvements for larger sample sizes while maintaining PAC correctness.
Implementation shows better runtime and memory use compared to existing methods on benchmark datasets.
The formal PAC analysis guarantees learnability with sufficient data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could support real-time monitoring and modeling of systems like networks or software.
Extensions might include handling concept drift where the underlying PDFA changes over time.
Similar sketching techniques could improve other online automata learning algorithms.

Load-bearing premise

The underlying process generating the data stream is a fixed PDFA, and the sketches provide accurate enough approximations for merge decisions to satisfy the PAC sample complexity bounds.

What would settle it

Running the algorithm on streams from a known PDFA with a sample size meeting the PAC bound and verifying whether the output model is epsilon-close to the true distribution with high probability.

Figures

Figures reproduced from arXiv: 2604.02244 by Robert Baumgartner, Sicco Verwer.

**Figure 2.** Figure 2: Boxplots of all heuristics. Due to the difference in magitude in between the SpaceSave heuristic and the other heuristics we separated them into two subplots. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: The function f(m0) with varying µ and α. be extracted exactly the same from (Palmer and Goldberg, 2007), and simply arrive at the next theorem, which is a direct copy of Theorem 2 of Palmer et al.(Palmer and Goldberg, 2007): Theorem 9 Let our algorithm read samples with a batch-size of B ≥ max 8n 2 |Σ| 2 ϵ 2 ln 2n 2 |Σ| 2 δ ′ , 4m0n|Σ| ϵ , then there exists T ′ a subset of the transitions of A, and… view at source ↗

read the original abstract

This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof on PAC-bounds, and the discussion and analysis of a similar approach has been moved from the appendix and now has a full dedicated section. State machine models are models that simulate the behavior of discrete event systems, capable of representing systems such as software systems, network interactions, and control systems, and have been researched extensively. The nature of most learning algorithms however is the assumption that all data be available at the beginning of the algorithm, and little research has been done in learning state machines from streaming data. In this paper, we want to close this gap further by presenting a generic method for learning state machines from data streams, as well as a merge heuristic that uses sketches to account for incomplete prefix trees. We implement our approach in an open-source state merging library and compare it with existing methods. We show the effectiveness of our approach with respect to run-time, memory consumption, and quality of results on a well known open dataset. Additionally, we provide a formal analysis of our algorithm, showing that it is capable of learning within the PAC framework, and show a theoretical improvement to increase run-time, without sacrificing correctness of the algorithm in larger sample sizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This extends the 2023 ICGI version with a PAC proof for a sketch-based streaming PDFA learner, but the approximation error in unbounded streams needs explicit checking in the analysis.

read the letter

The main point to know is that this is an extended version of their 2023 ICGI paper, now including a formal PAC learning proof for the streaming algorithm and a full section analyzing similar approaches. They describe a generic strategy for learning from data streams by maintaining a prefix tree and using sketches to approximate frequencies for merge decisions. This allows handling unbounded data without storing everything. The improved heuristic aims to speed up runtime for larger samples without losing correctness. They provide an open-source implementation and test it on a known dataset, showing gains in time and memory over existing methods while maintaining result quality. The formal analysis is the new element, claiming the algorithm learns within PAC bounds. That's useful for the subfield, as most automata learning assumes batch data. The potential issue is whether the PAC proof properly accounts for the sketch approximations. Sketches introduce additive errors, and in a continuous stream, the number of merge decisions grows, so it's not obvious that the failure probability stays below delta. If the proof treats sketches as exact or doesn't bound the error accumulation, that would weaken the guarantee. The abstract mentions the proof, so presumably they address this, but it would be good to verify the derivation. This work is aimed at people working on grammatical inference for practical systems like software verification or network analysis. Readers interested in streaming algorithms with theoretical guarantees will find value in the proof and the empirical results. It deserves peer review because the core idea is solid and the addition of the PAC analysis makes it more substantial than the conference version.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a generic streaming algorithm for learning PDFAs via state merging on data streams, using a sketch-based heuristic to handle incomplete prefix trees. It includes an open-source implementation, empirical comparisons on a standard dataset for runtime/memory/quality, a formal PAC analysis claiming learnability within PAC bounds, and a theoretical runtime improvement that preserves correctness for large samples.

Significance. If the PAC analysis holds, the work would advance streaming automata learning by closing the gap between batch state-merging methods and online settings, with practical benefits for modeling discrete-event systems. The open-source library, empirical evaluation, and explicit PAC proof are strengths that would make the contribution notable in formal language learning.

major comments (2)

[Formal PAC Analysis] Formal PAC Analysis section: the proof sketch assumes that merge decisions based on frequency comparisons are exact (or controlled solely by the PAC sample size n and delta), but the sketch heuristic introduces additive approximation error of order 1/w. This error is not bounded relative to the minimum distribution gap in the analysis, so the probability of incorrect merges (and thus failure to achieve epsilon-closeness) can exceed delta even when the exact-count version succeeds. A revised sample-complexity bound incorporating w and the number of candidate pairs is needed.
[Theoretical Improvement] Section on theoretical runtime improvement: the claim that the improvement increases runtime without sacrificing correctness for larger samples relies on the same merge heuristic; if sketch errors invalidate merges, the correctness preservation does not follow. Please clarify how the improvement interacts with the PAC guarantee.

minor comments (2)

[Abstract] Abstract: the statement that the approach is 'capable of learning within the PAC framework' should be qualified to note that it holds under the sketch approximation assumptions made explicit in the analysis.
[Implementation and Experiments] The manuscript mentions an open-source state merging library but provides no repository link, version, or exact name in the text; this should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and revise the formal analysis and related sections accordingly.

read point-by-point responses

Referee: [Formal PAC Analysis] Formal PAC Analysis section: the proof sketch assumes that merge decisions based on frequency comparisons are exact (or controlled solely by the PAC sample size n and delta), but the sketch heuristic introduces additive approximation error of order 1/w. This error is not bounded relative to the minimum distribution gap in the analysis, so the probability of incorrect merges (and thus failure to achieve epsilon-closeness) can exceed delta even when the exact-count version succeeds. A revised sample-complexity bound incorporating w and the number of candidate pairs is needed.

Authors: We agree that the current proof sketch does not explicitly incorporate the additive approximation error of order 1/w from the sketches. In the revised manuscript we will strengthen the PAC analysis by deriving a modified sample-complexity bound that accounts for this error. The bound will be expressed in terms of the sketch width w, the minimum distribution gap, and a union bound over the (polynomially many) candidate merge pairs, ensuring that the overall probability of an incorrect merge remains below delta. revision: yes
Referee: [Theoretical Improvement] Section on theoretical runtime improvement: the claim that the improvement increases runtime without sacrificing correctness for larger samples relies on the same merge heuristic; if sketch errors invalidate merges, the correctness preservation does not follow. Please clarify how the improvement interacts with the PAC guarantee.

Authors: We acknowledge the need for explicit clarification. The runtime improvement is intended for regimes in which the sample size n is large enough that the sketch error 1/w is dominated by the distribution gaps used in the PAC analysis. In the revision we will add a precise statement of the parameter regime (relating n, w, and the minimum gap) under which the improved procedure inherits the same PAC guarantee as the base algorithm, together with a short argument showing that erroneous merges due to sketches are absorbed into the existing delta term. revision: yes

Circularity Check

1 steps flagged

Minor self-citation to prior conference version; PAC analysis is an independent extension

specific steps

self citation load bearing [Abstract]
"This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof on PAC-bounds"

The core algorithm originates in the authors' prior self-publication; the new PAC analysis is described as an extension rather than derived from first principles within this manuscript alone. However, because the PAC proof is explicitly added as novel content and does not reduce the learning guarantee to the prior paper's fitted heuristics, the self-citation remains non-load-bearing.

full rationale

The paper is an extended version of the authors' own ICGI 2023 work, with the new formal PAC proof added here. No derivation step reduces a claimed prediction or bound to a fitted parameter or to the self-citation by construction. The streaming state-merging algorithm and sketch heuristic are presented as a generic strategy whose correctness under PAC is analyzed separately using standard sample-complexity arguments. The self-citation is therefore minor and non-load-bearing for the central PAC claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method appears to extend existing state-merging techniques without new postulated objects.

pith-pipeline@v0.9.0 · 5545 in / 1039 out tokens · 44104 ms · 2026-05-13T20:54:37.226420+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

ISBN 1581138857

Association for Computing Machinery. ISBN 1581138857. doi: 10.1145/997817. 997857. URLhttps://doi-org.tudelft.idm.oclc.org/10.1145/997817.997857. B. de Balle Pigem, R.G. Mestre, and Universitat Polit` ecnica de Catalunya. Departament de Llenguatges i Sistemes Inform` atics.Learning Finite-state Machines: Statistical and Algorithmic Aspects. PhD thesis, 20...

work page doi:10.1145/997817 2014
[2]

ISBN 3540242880

Springer-Verlag. ISBN 3540242880. doi: 10.1007/978-3-540-30570-5 27. URL https://doi.org/10.1007/978-3-540-30570-5_27. Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation.SIGPLAN Not., 42(6):89–100, jun 2007. ISSN 0362-1340. doi: 10.1145/1273442.1250746. URLhttps://doi.org/10.1145/1273442.1250746. A....

work page doi:10.1007/978-3-540-30570-5 2007
[3]

doi: 10.1007/s10994-013-5409-9

ISSN 15730565. doi: 10.1007/s10994-013-5409-9. URLhttp://ai.cs.umbc.edu/ icgi2012/challenge/Pautomac/committee.php. Asmir Vodenˇ carevi´ c, Hans Kleine B¨ urring, Oliver Niggemann, and Alexander Maier. Iden- tifying behavior models for process plants. InETFA2011, pages 1–8, 2011. doi: 10.1109/ETFA.2011.6059080. Neil Walkinshaw, Ramsay Taylor, and John Der...

work page doi:10.1007/s10994-013-5409-9 2011