Prague Dependency Treebank -- Consolidated 2.0: Enriching a Complex Annotation Scheme

Barbora Štěpánková; Jan Hajič; Jan Štěpánek; Jiří Mírovský; Marie Mikulová; Milan Straka; Pavlína Synková

arxiv: 2606.24324 · v1 · pith:GWDGOMIKnew · submitted 2026-06-23 · 💻 cs.CL

Pith reviewed 2026-06-26 00:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords Prague Dependency Treebank · Czech corpus · multi-layer annotation · coreference · discourse relations · dependency parsing · NLP resources

The pith

The Prague Dependency Treebank reaches consolidated version 2.0 as a uniformly annotated, multi-layer Czech resource of almost 4 million tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the second consolidated version of the Prague Dependency Treebank, PDT-C 2.0, as the end point of a nearly 30-year project. This version unifies annotations across multiple language layers into one coherent scheme for Czech. The result is a genre-diversified corpus of almost 4 million tokens accompanied by compatible lexicons. The framework stands out for linking syntax, semantics, coreference, and discourse relations in a single resource that supports both linguistic inquiry and NLP tool development.

Core claim

What carries the argument

The multi-layer annotation scheme that links morphology, syntax, tectogrammatical semantics, coreference, and discourse relations into one uniform structure.

If this is right

The corpus supports continuous linguistic research on Czech through its uniform multi-layer structure.
International comparisons of traditional and novel NLP tools become possible with the consistent annotations.
Conversions of the annotations into other formalisms are enabled by the rich linked layers.
Trained parsers derived from the resource are released for direct use in further work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consolidation approach could be tested on treebanks for languages other than Czech to produce comparable resources.
Researchers may now examine interactions between syntactic and discourse layers that were harder to study before unification.
The CC BY-NC-SA licence opens the data for wider experimentation in both academic and applied NLP settings.

Load-bearing premise

Annotations from different layers and prior versions have been successfully unified into a single coherent scheme without unresolved inconsistencies.

What would settle it

Finding inconsistencies or incompatibilities when inspecting or processing the linked layers in the released PDT-C 2.0 data would show the unification claim does not hold.
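A first pass at such an inspection could be sketched as follows. This is a minimal illustration, not the authors' verification procedure. It assumes the data are read from an MRP-style export (the paper lists MRP, a JSON-based format, among its release formats) with one JSON object per line, where each object carries `nodes` (each with an `id`) and `edges` (each with `source` and `target`), as in the CoNLL 2019 MRP serialization. Those field names, and the helper names `dangling_edges` and `check_lines`, are assumptions to verify against the released files.

```python
import json
from typing import Dict, Iterable, List, Tuple


def dangling_edges(graph: dict) -> List[Tuple]:
    """Return (source, target) pairs that point at a node id missing from the graph.

    Field names ("nodes", "edges", "id", "source", "target") are assumed,
    following the MRP serialization; check them against the actual data.
    """
    node_ids = {node["id"] for node in graph.get("nodes", [])}
    return [
        (edge.get("source"), edge.get("target"))
        for edge in graph.get("edges", [])
        if edge.get("source") not in node_ids or edge.get("target") not in node_ids
    ]


def check_lines(lines: Iterable[str]) -> Dict:
    """Map each graph id to its dangling edges; graphs without problems are omitted."""
    report = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        graph = json.loads(line)
        bad = dangling_edges(graph)
        if bad:
            report[graph.get("id")] = bad
    return report
```

An empty report only shows that links resolve within each graph. It says nothing about whether the annotation decisions themselves are consistent across layers, which is the harder part of the unification claim.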

Figures

Figures reproduced from arXiv: 2606.24324 by Barbora Štěpánková, Jan Hajič, Jan Štěpánek, Jiří Mírovský, Marie Mikulová, Milan Straka, Pavlína Synková.

Original abstract

The Prague Dependency Treebank framework is unique in its attempt to systematically include and link different layers of language, including a meaning representation with several types of inter-sentential phenomena, especially coreference and discourse relations. We present its second consolidated version (PDT-C 2.0), which concludes almost 30-years long project of sustained development of the resource to a uniformly and coherently annotated, genre-diversified, almost 4 million token language resource of Czech language, with accompanying fully compatible lexicons. In addition to continuous linguistic research, the richly linguistically annotated corpus is also widely used in international comparisons of the development of traditional and novel NLP tools as well as in conversions into other formalisms. The corpus and the trained parsers are available under the CC BY-NC-SA licence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a resource release paper for PDT-C 2.0, a consolidated multi-layer Czech treebank of almost 4 million tokens that concludes a nearly 30-year project.

The main point is that PDT-C 2.0 is now available as the second consolidated version of the Prague Dependency Treebank. It combines nearly 4 million tokens of Czech text into one uniformly annotated resource that links dependency syntax with meaning layers, coreference, and discourse relations, plus matching lexicons. The data is released under CC BY-NC-SA and has already supported tool comparisons and formalism conversions.

What the paper does well is lay out the scale and the multi-layer design clearly. Sustained work on a single language resource over decades is uncommon, and the genre diversification plus the explicit linking of inter-sentential phenomena gives users a concrete, large-scale dataset for Czech NLP.

The soft spot is the absence of any numbers on annotation consistency. The text asserts uniformity and coherence but supplies no agreement figures, verification steps, or error analysis, so the central claim rests on description alone. For a resource paper this is typical, yet it leaves the quality hard to assess without downloading the data.

This is for researchers who need annotated Czech data or who study multi-layer schemes. Readers looking for new methods or broad theoretical claims will not find them here. It deserves peer review because the resource itself is substantial and the community benefits from documented updates to established treebanks.

Referee Report

1 major / 1 minor

Summary. The manuscript presents the second consolidated version of the Prague Dependency Treebank (PDT-C 2.0), concluding a nearly 30-year project. It describes a uniformly and coherently annotated, genre-diversified Czech corpus of almost 4 million tokens with multiple linked layers (morphology, dependency syntax, tectogrammatical semantics, coreference, and discourse relations) plus fully compatible lexicons. The resource is released under CC BY-NC-SA and noted for use in NLP tool comparisons and formalism conversions.

Significance. If the uniformity and coherence claims hold, PDT-C 2.0 would constitute a major, long-term resource for Czech linguistics and NLP, enabling sustained research on multi-layer phenomena and serving as a benchmark for parser development and cross-formalism conversions. The public release with lexicons is a clear strength for reproducibility and community use.

major comments (1)

[Abstract] Abstract: The central claim that the corpus is 'uniformly and coherently annotated' after consolidation of prior versions is asserted without any supporting metrics (e.g., inter-annotator agreement, layer-consistency scores, or verification procedures for resolving inconsistencies across the 30-year history). This directly underpins the paper's descriptive contribution and requires explicit evidence.
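For readers unfamiliar with the kind of metric this comment asks for, one common inter-annotator agreement measure is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The sketch below is illustrative only: the paper as summarized here does not say which agreement measure, if any, was used, and the label values in the usage note are made-up examples.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the chance agreement implied by each annotator's label distribution.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

With the hypothetical label sequences `["ACT", "PAT", "ACT", "ADDR"]` and `["ACT", "PAT", "PAT", "ADDR"]`, observed agreement is 3/4 and chance agreement is 5/16, giving a kappa of 7/11 (about 0.64).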

minor comments (1)

[Abstract] Abstract: The title emphasizes 'Enriching a Complex Annotation Scheme' while the text focuses on consolidation; a brief clarification of what new enrichment occurred in version 2.0 versus prior releases would improve alignment.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the recognition of the resource's potential significance. We address the single major comment below and agree that strengthening the evidential basis for the uniformity claim will improve the manuscript.

Referee: [Abstract] Abstract: The central claim that the corpus is 'uniformly and coherently annotated' after consolidation of prior versions is asserted without any supporting metrics (e.g., inter-annotator agreement, layer-consistency scores, or verification procedures for resolving inconsistencies across the 30-year history). This directly underpins the paper's descriptive contribution and requires explicit evidence.

Authors: We accept this observation. While the full manuscript (Sections 3–5) describes the multi-stage consolidation workflow, including manual cross-layer consistency checks and the application of unified annotation guidelines developed over the project lifetime, it does not present aggregated quantitative metrics in a single location. We will therefore add a new subsection (provisionally 5.3) that collates available inter-annotator agreement figures from the original annotation campaigns, reports on the verification procedures used to resolve historical inconsistencies, and summarises layer-consistency statistics obtained during the final consolidation pass. The abstract will be revised to reference this subsection. This is a partial revision: the descriptive account of the process remains unchanged, but explicit supporting evidence will be supplied.

revision: partial

Circularity Check

0 steps flagged

No circularity: descriptive resource release with no derivations

The paper presents PDT-C 2.0 as the outcome of sustained annotation unification over 30 years. No equations, predictions, fitted parameters, or load-bearing derivations appear in the abstract or described content. The claim of uniformity and coherence is an empirical assertion about the released data itself, externally verifiable by inspection of the corpus rather than by internal reduction to self-citations or ansatzes. No steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a corpus description paper with no mathematical content; it introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5717 in / 1032 out tokens · 22292 ms · 2026-06-26T00:07:52.603473+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 1 linked inside Pith

[1]

Compared to the previous 1.0 version, the data are now fully manually annotated

Introduction We present the Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0; Hajič et al., 2024a), a second consolidated release of the existing original PDT corpora of Czech data published in one package to allow for easier data handling. Compared to the previous 1.0 version, the data are now fully manually annotated. Included is a morpholog...

Pith/arXiv arXiv 2024
[2]

annotated

Multi-layer Architecture The PDT annotation scheme is based on the well-developed theory of language description, Functional Generative Description (Sgall et al., 1986) and was reflected in several annotation manuals available from the project website. The multi-layer architecture (linked from meaning to text) allows a comprehensive description of t...

1986
[3]

3.1), translated texts (Sect

Genre Diversified Data PDT-C 2.0 consists of four different datasets: written texts (Sect. 3.1), translated texts (Sect. 3.2), spoken texts (Sect. 3.3), and user-generated texts (Sect. 3.4). The datasets are uniformly published in three formats: pml, mrp, and treex. The Prague Markup Language format (PML, Pajas and Štěpánek,
[4]

Treex is technically also a PML format, used in the NLP system Treex (all annotation layers are in a single file; Žabokrtský, 2011)

is a language-independent, XML-based format customized for multi-layer linguistic annotation. Treex is technically also a PML format, used in the NLP system Treex (all annotation layers are in a single file; Žabokrtský, 2011). MRP is a JSON-based format used in the CoNLL 2019 and 2020 shared tasks on meaning representation parsing (Oepen et al., 2019,...

2011
[5]

Volume of Data The data volume is given in Tab. 1. Altogether, the consolidated treebank contains almost 4 million tokens with manual morphological annotation (Sect. 5.2), 3.5 million tokens with manual surface syntactic annotation (Sect. 5.3), and 2.7 million with manual deep syntactic and other semantic annotations (Sect. 5.4). The different nu...
[6]

translate

Rich Linguistic Annotation The long-run Prague Dependency Treebank project is unique in its attempt to systematically cover and link different layers of language description including a rich semantic annotation. Tab. 2 provides an overview of the different types of annotation across the three annotation layers (see Sect. 2) for each dataset (see Sect....

2008
[7]

They can be used to distinguish the different meanings of words and also to maintain or monitor the consistency of annotations

External Resources An important part of annotation also involves various dictionaries. They can be used to distinguish the different meanings of words and also to maintain or monitor the consistency of annotations. The PDT-C annotation is associated with the morphological (Sect. 6.1) and valency (6.2) dictionaries. 6.1. MorfFlex MorfFlex (the lates...

2026
[8]

The richly linguistically annotated PDT corpora are also widely used in international comparisons in the NLP field

Related Data and Tools Throughout their development, the PDT corpora have served as an invaluable resource for linguistic research, for enriching the description of the Czech language system, and for developing general methods of language description. The richly linguistically annotated PDT corpora are also widely used in international comparisons in th...

2014
[9]

The rich annotation at the t-layer serves as a source for conversion into various semantic and knowledge representations

within the Prague Discourse Treebank 4.0 (PDiT 4.0) release (Mírovský and Synková, 2026). The rich annotation at the t-layer serves as a source for conversion into various semantic and knowledge representations. Among earlier conversions, let us mention the transformation into the formal-logical format Minimal Recursion Semantics (Copestake et al., 2005...

2026
[10]

Future Work The description of language is far from complete. Despite the remarkable success of large language models (LLMs), we are still far from achieving systems that truly understand natural language, and fundamental linguistic research remains essential. In this respect, we aim to continue our efforts in systematically describing language from for...

2025
[11]

Conclusion In the contribution, we present the Prague Dependency Treebank – Consolidated 2.0, a comprehensive, multi-layer linguistic resource that integrates semantic, syntactic, and morphological information, including inter-sentential phenomena such as coreference and discourse relations. The long-term development of the PDT framework has resu...
[12]

Some types of annotation are currently available only for a subset of the treebank; this applies in particular to grammatemes (Sect

Limitations While we present a large, genre-diversified, and richly annotated language resource, there are several limitations we are aware of. Some types of annotation are currently available only for a subset of the treebank; this applies in particular to grammatemes (Sect. 5.9), as well as to bridging relations (Sect. 5.10) and multiword expressio...

2026
[13]

The work described herein has also been supported by the Ministry of Education, Youth and Sports of the Czech Republic, Project No

Acknowledgements The research reported here has been supported by the Czech Science Foundation under the project 22-03269S. The work described herein has also been supported by the Ministry of Education, Youth and Sports of the Czech Republic, Project No. LM2023062 LINDAT/CLARIAH-CZ.
[14]

Bibliographical References Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal Recursion Semantics: An Introduction. Research on Language and Computation, 3(4):281–332. Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308. Kir...

2005
[15]

In Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece

Coreference in Annotating a Large Corpus. In Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece. European Language Resources Association. Eva Hajičová, Petr Sgall, and Barbara Partee. 1998. Topic-focus articulation, tripartite structures, and semantic content. Kluwer Academic Publishers, Dordrecht, ...

1998
[16]

In Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valletta, Malta

Mapping between Dependency Structures and Compositional Semantic Representations. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valletta, Malta. European Language Resources Association. Markéta Lopatková, Eva Fučíková, Federica Gamba, Jan Štěpánek, Daniel Zeman, and Šárka Zikánová. 2024. Towards a Conver...

2024
[17]

2024a. Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

Language Resource References Hajič, Jan and Bejček, Eduard and Bémová, Alevtina and Buráňová, Eva and Fučíková, Eva and Hajičová, Eva and Havelka, Jiří and Hlaváčová, Jaroslava and Homola, Petr and Ircing, Pavel and Kárník, Jiří and Kettnerová, Václava and Klyueva, Natalia and Kolářová, Veronika and Kučová, Lucie and Lopatková, Markéta a...

2024
