pith. machine review for the scientific record. sign in

arxiv: 2605.09156 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

Ahan Chatterjee, Esteban Garces Arias, Matthias A{\ss}enmacher, Matthias Sch\"offel

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords grammatical genderdiachronic changeLatinOccitandeep learninginterpretable modelsRomance languageshistorical linguistics
0
0 comments X

The pith

An interpretable neural framework shows grammatical gender cues shifting from Latin word forms to Occitan sentence context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a deep learning framework to examine the historical change in grammatical gender from Latin's three genders to Occitan's two. It improves tokenization for historical texts and then quantifies how much gender prediction depends on the word's own morphology versus the words around it. This approach lets us see the balance of gender information between the lemma and its context during language evolution. The resulting analyses and public code offer a new way to study such diachronic processes.

Core claim

We introduce an interpretable deep learning framework to investigate the restructuring of grammatical gender from a tripartite to a bipartite system at both lexical and contextual levels. Analyses show that a custom tokenizer outperforms standard ones on low-resource historical data, morphological features contribute to lexical gender prediction, and different part-of-speech categories contribute variably to contextual prediction, together characterizing the distribution of gender information between the lemma and its sentential context.

What carries the argument

The interpretable deep learning framework using feature attributions to measure morphological contributions at the lexical level and part-of-speech contributions at the contextual level for gender prediction.

If this is right

  • Custom tokenization improves model performance over conventional strategies in low-resource historical settings.
  • Morphological features of lemmas contribute substantially to gender prediction at the lexical level.
  • Contributions from different part-of-speech categories can be quantified for grammatical gender at the contextual level.
  • The gender information is distributed between the lemma and its sentential context in a measurable way.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar frameworks could quantify shifts in other grammatical categories like case or number across language families.
  • Applying this to larger corpora of other Romance languages might reveal if the pattern of increasing contextual dependence is general.
  • The public release of code and data enables direct testing on additional historical periods or languages.

Load-bearing premise

That the neural network's feature attributions on limited historical data capture genuine diachronic changes in language rather than artifacts of the model or data scarcity.

What would settle it

If expert linguists annotate the gender-carrying elements in sample Latin and Occitan sentences and these annotations do not match the model's attributed contributions from lemmas versus contexts.

Figures

Figures reproduced from arXiv: 2605.09156 by Ahan Chatterjee, Esteban Garces Arias, Matthias A{\ss}enmacher, Matthias Sch\"offel.

Figure 1
Figure 1. Figure 1: Map representing the spread of the Occitan [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Gender Shift Frequencies across all three [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Examples of hybrid tokenization capturing or [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Proposed Architecture to assess the impact [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: SHAP summary plot for the best-performing [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example in which the lemma-only model mis [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Gender shift frequencies for different Lemma endings. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Attention-based contextual evidence for grammatical gender prediction shown for two representative [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: SHAP beeswarm plot showing feature contributions to model error prediction. Each dot represents a [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
read the original abstract

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims to introduce an interpretable deep learning framework for investigating the diachronic shift in grammatical gender from Latin's tripartite (masculine, feminine, neuter) to Occitan's bipartite (masculine, feminine) system. It argues that conventional tokenizers are insufficient for this low-resource historical setting and that a proposed custom tokenizer improves performance; at the lexical level it evaluates morphological feature contributions to gender prediction, and at the contextual level it quantifies part-of-speech category contributions, together characterizing the distribution of gender information between the lemma and sentential context. Code, datasets, and results are released publicly.

Significance. If the empirical results prove robust and the feature attributions align with established philological observations on neuter loss and gender merger, the work could offer a quantitative, interpretable bridge between computational methods and historical linguistics. The public release of code and data is a clear strength that supports reproducibility and extension by others in the field.

major comments (2)
  1. Abstract: the central claim that the framework characterizes the genuine distribution of gender information between lemma and context requires that model predictions and attributions recover linguistic reality rather than artifacts; however, no external validation against established facts on neuter loss or merger patterns is referenced, leaving the quantified POS and morphological contributions open to the possibility that they reflect data scarcity or inductive biases instead.
  2. Results section on tokenizer evaluation: the assertion that the custom tokenizer improves performance over conventional baselines is load-bearing for the low-resource setting claim, yet the abstract supplies no metrics, baseline definitions, error bars, or statistical tests; without these the improvement cannot be assessed as substantive rather than marginal or artifactual.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will make to strengthen the manuscript while preserving its core contributions.

read point-by-point responses
  1. Referee: Abstract: the central claim that the framework characterizes the genuine distribution of gender information between lemma and context requires that model predictions and attributions recover linguistic reality rather than artifacts; however, no external validation against established facts on neuter loss or merger patterns is referenced, leaving the quantified POS and morphological contributions open to the possibility that they reflect data scarcity or inductive biases instead.

    Authors: We agree that explicit linkage to philological knowledge strengthens interpretability claims. The manuscript's analyses are motivated by and consistent with known patterns of neuter loss and gender merger in the Latin-to-Romance transition, but we did not include a dedicated comparison subsection. In the revised version we will add a short Discussion paragraph that directly maps our morphological and POS attribution results to established historical linguistics findings (e.g., loss of neuter in specific semantic classes and merger trajectories), citing the relevant philological sources. This addition will make the alignment with linguistic reality explicit and reduce the risk that readers interpret the numbers as purely model-driven artifacts. revision: yes

  2. Referee: Results section on tokenizer evaluation: the assertion that the custom tokenizer improves performance over conventional baselines is load-bearing for the low-resource setting claim, yet the abstract supplies no metrics, baseline definitions, error bars, or statistical tests; without these the improvement cannot be assessed as substantive rather than marginal or artifactual.

    Authors: The Results section already reports the full tokenizer comparison, including accuracy/F1 deltas, baseline tokenizers, standard deviations across runs, and statistical tests. The abstract, however, states the improvement only qualitatively. We will revise the abstract to include one concise quantitative clause (e.g., “our custom tokenizer yields a 4.2-point absolute F1 improvement over subword baselines, significant at p<0.01”) while keeping the abstract within length limits. This change makes the load-bearing claim immediately verifiable without altering the paper’s technical content. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent evaluations

full rationale

The paper's claims rest on training interpretable models, comparing tokenizer performance against baselines, and quantifying POS/morphological contributions via feature attributions on held-out historical data. No equations, derivations, or predictions are shown that reduce to fitted parameters or self-citations by construction. The distribution characterization follows directly from the model's learned behavior on external data splits rather than any self-definitional loop or renamed input.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full paper likely details additional model assumptions and data preprocessing choices not visible here.

free parameters (2)
  • tokenizer hyperparameters
    Proposed tokenizer parameters are tuned on historical data but not enumerated.
  • neural network hyperparameters
    Standard deep learning model settings required for training and interpretation.
axioms (1)
  • domain assumption Grammatical gender in historical texts can be reliably recovered from lemma morphology and sentential context via neural networks.
    Foundational premise enabling both lexical and contextual analyses.

pith-pipeline@v0.9.0 · 5454 in / 1159 out tokens · 51755 ms · 2026-05-12T03:51:09.540267+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 2 internal anchors

  1. [1]

    Proceedings of the 1st Workshop on Multilingual Representation Learning , pages=

    Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings , author=. Proceedings of the 1st Workshop on Multilingual Representation Learning , pages=

  2. [2]

    ACM SIGIR Forum , volume=

    Shaping the Future of Endangered and Low-Resource Languages---Our Role in the Age of LLMs: A Keynote at ECIR 2024 , author=. ACM SIGIR Forum , volume=. 2024 , organization=

  3. [3]

    Journal of Quantitative Linguistics , volume=

    Stability of meanings versus rate of replacement of words: an experimental test , author=. Journal of Quantitative Linguistics , volume=. 2021 , publisher=

  4. [4]

    Language , volume=

    Development of gender classifications: Modeling the historical change from Latin to French , author=. Language , volume=. 2003 , publisher=

  5. [5]

    Acta Antiqua Academiae Scientiarum Hungaricae , volume=

    Preliminary examination of the Latin neuter on inscriptions , author=. Acta Antiqua Academiae Scientiarum Hungaricae , volume=. 2023 , publisher=

  6. [6]

    C orpus A ri \`e ja: Building an Annotated Corpus with Variation in O ccitan

    Poujade, Clamenca and Bras, Myriam and Urieli, Assaf. C orpus A ri \`e ja: Building an Annotated Corpus with Variation in O ccitan. Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024. 2024

  7. [7]

    arXiv preprint arXiv:2506.17715 , year=

    Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages , author=. arXiv preprint arXiv:2506.17715 , year=

  8. [8]

    2000 , publisher=

    Speech & language processing , author=. 2000 , publisher=

  9. [9]

    Advances in neural information processing systems , volume=

    Attention is all you need , author=. Advances in neural information processing systems , volume=

  10. [10]

    2022 , eprint=

    Training language models to follow instructions with human feedback , author=. 2022 , eprint=

  11. [11]

    2019 , eprint=

    Recurrent Neural Networks (RNNs): A gentle Introduction and Overview , author=. 2019 , eprint=

  12. [12]

    Computational linguistics , volume=

    The mathematics of statistical machine translation: Parameter estimation , author=. Computational linguistics , volume=

  13. [13]

    Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages , year=

    Natural Language Processing for Less Privileged Languages: Where do we come from? Where are we going? , author=. Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages , year=

  14. [14]

    Natural Language Engineering , volume=

    Natural language processing for similar languages, varieties, and dialects: A survey , author=. Natural Language Engineering , volume=. 2020 , publisher=

  15. [15]

    Efficient Estimation of Word Representations in Vector Space

    Efficient estimation of word representations in vector space , author=. arXiv preprint arXiv:1301.3781 , year=

  16. [16]

    Enriching word vectors with subword information.arXiv preprint arXiv:1607.04606, 2016

    Enriching Word Vectors with Subword Information , author=. arXiv preprint arXiv:1607.04606 , year=

  17. [17]

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

  18. [18]

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou

    Unsupervised cross-lingual representation learning at scale , author=. arXiv preprint arXiv:1911.02116 , year=

  19. [19]

    Transactions of the Association for Computational Linguistics , volume=

    ByT5: Towards a token-free future with pre-trained byte-to-byte models , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

  20. [20]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Recall@ k surrogate loss with large batches and similarity mixup , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  21. [21]

    Proceedings of the international conference on web intelligence , pages=

    Mrr: an unsupervised algorithm to rank reviews by relevance , author=. Proceedings of the international conference on web intelligence , pages=

  22. [22]

    2013 , eprint=

    A Theoretical Analysis of NDCG Type Ranking Measures , author=. 2013 , eprint=

  23. [23]

    Neural Machine Translation of Rare Words with Subword Units

    Neural machine translation of rare words with subword units , author=. arXiv preprint arXiv:1508.07909 , year=

  24. [24]

    arXiv preprint arXiv:1804.10959 , year=

    Subword regularization: Improving neural network translation models with multiple subword candidates , author=. arXiv preprint arXiv:1804.10959 , year=

  25. [25]

    2025 IEEE International Conference on Electro Information Technology (eIT) , pages=

    Tokenization matters: Improving zero-shot ner for indic languages , author=. 2025 IEEE International Conference on Electro Information Technology (eIT) , pages=. 2025 , organization=

  26. [26]

    Journal of machine learning research , volume=

    A neural probabilistic language model , author=. Journal of machine learning research , volume=

  27. [27]

    , author=

    Recurrent neural network based language model. , author=. Interspeech , volume=. 2010 , organization=

  28. [28]

    Learning internal representations by error propagation , author=

  29. [29]

    IEEE 1988 International Conference on Neural Networks , pages=

    Backpropagation: Past and future , author=. IEEE 1988 International Conference on Neural Networks , pages=. 1988 , organization=

  30. [30]

    Neural computation , volume=

    Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=

  31. [31]

    2014 , eprint=

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , author=. 2014 , eprint=

  32. [32]

    2016 , eprint=

    Neural Machine Translation by Jointly Learning to Align and Translate , author=. 2016 , eprint=

  33. [33]

    2018 , publisher=

    Improving language understanding by generative pre-training , author=. 2018 , publisher=

  34. [34]

    Deep contextualized word representations

    Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10...

  35. [35]

    2017 , eprint=

    A Unified Approach to Interpreting Model Predictions , author=. 2017 , eprint=

  36. [36]

    2022 , eprint=

    Auto-Encoding Variational Bayes , author=. 2022 , eprint=

  37. [37]

    2017 , eprint=

    Semi-Supervised Classification with Graph Convolutional Networks , author=. 2017 , eprint=

  38. [38]

    The Year’s Work in Modern Language Studies , volume=

    Occitan Studies: Language and Linguistics , author=. The Year’s Work in Modern Language Studies , volume=. 2019 , publisher=

  39. [39]

    2018 , publisher=

    Gender from Latin to Romance: History, geography, typology , author=. 2018 , publisher=

  40. [40]

    1897 , publisher=

    The nature and origin of the noun genders in the Indo-European languages: A lecture delivered on the occasion of the sesquicentennial celebration of Princeton University , author=. 1897 , publisher=

  41. [41]

    Corbett, Greville G. , year=. Gender , publisher=

  42. [42]

    Journal of French Language Studies , volume=

    Predictability in French gender attribution: A corpus analysis , author=. Journal of French Language Studies , volume=. 2006 , publisher=

  43. [43]

    What ' s in a name? I n some languages, grammatical gender

    Nastase, Vivi and Popescu, Marius. What ' s in a name? I n some languages, grammatical gender. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009

  44. [44]

    Quantifying the Semantic Core of Gender Systems

    Williams, Adina and Blasi, Damian and Wolf-Sonkin, Lawrence and Wallach, Hanna and Cotterell, Ryan. Quantifying the Semantic Core of Gender Systems. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1577

  45. [45]

    Minimally Supervised Induction of Grammatical Gender

    Cucerzan, Silviu and Yarowsky, David. Minimally Supervised Induction of Grammatical Gender. Proceedings of the 2003 Human Language Technology Conference of the North A merican Chapter of the Association for Computational Linguistics. 2003

  46. [46]

    Natural Language Engineering , volume=

    Efficiently generating correction suggestions for garbled tokens of historical language , author=. Natural Language Engineering , volume=. 2011 , publisher=

  47. [47]

    Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects , pages=

    Morphological disambiguation and text normalization for southern quechua varieties , author=. Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects , pages=

  48. [48]

    Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

    Automatic normalisation of early Modern French , author=. Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

  49. [49]

    2016 , publisher=

    The Oxford guide to the Romance languages , author=. 2016 , publisher=

  50. [50]

    International Conference on Artificial Neural Networks , pages=

    Text generation in discrete space , author=. International Conference on Artificial Neural Networks , pages=. 2020 , organization=

  51. [51]

    Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders

    Zhang, Yingji and Carvalho, Danilo and Valentino, Marco and Pratt-Hartmann, Ian and Freitas, Andre. Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders. Findings of the Association for Computational Linguistics: EACL 2024. 2024

  52. [52]

    Zeitschrift f

    Sprachdatenbasierte Modellierung von Wissensnetzen in der mittelalterlichen Romania (ALMA): Projektskizze , author=. Zeitschrift f. 2023 , publisher=

  53. [53]

    2025 , month = may, howpublished =

    Wiedner, Marinus , title =. 2025 , month = may, howpublished =. doi:10.5281/zenodo.15300719 , url =

  54. [54]

    Dictionnaire de l’occitan médiéval (DOM) , howpublished =

  55. [55]

    ALMA: Knowledge Networks of Medieval Romance-Speaking Europe , author =

  56. [56]

    Degradation Prediction of Semiconductor Lasers Using Conditional Variational Autoencoder , volume=

    Abdelli, Khouloud and Grieser, Helmut and Neumeyr, Christian and Hohenleitner, Robert and Pachnicke, Stephan , year=. Degradation Prediction of Semiconductor Lasers Using Conditional Variational Autoencoder , volume=. Journal of Lightwave Technology , publisher=. doi:10.1109/jlt.2022.3188831 , number=

  57. [57]

    2019 , eprint=

    Soft edit distance for differentiable comparison of symbolic sequences , author=. 2019 , eprint=

  58. [58]

    2024 , eprint=

    Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels , author=. 2024 , eprint=

  59. [59]

    2017 , eprint=

    Categorical Reparameterization with Gumbel-Softmax , author=. 2017 , eprint=

  60. [60]

    Improving Lemmatization of Non-Standard Languages with Joint Learning

    Manjavacas, Enrique and K \'a d \'a r, \'A kos and Kestemont, Mike. Improving Lemmatization of Non-Standard Languages with Joint Learning. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1153

  61. [61]

    Automatic Transcription of Handwritten Old O ccitan Language

    Garces Arias, Esteban and Pai, Vallari and Sch. Automatic Transcription of Handwritten Old O ccitan Language. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.953