arxiv: 2605.13415 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

Continual Learning with Multilingual Foundation Model

Barathi Ganesh HB , Michal Ptaszynski , Rene Melendez , Juuso Eronen

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords reclaimed slursmultilingual detectiontransfer learningdata augmentationthreshold optimizationsocial media analysisXLM-RoBERTa

0 comments

The pith

Language-specific decision thresholds improve F1 scores by 2-5 percent in multilingual reclaimed slur detection without retraining the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a multi-stage framework to distinguish reclamatory from non-reclamatory uses of LGBTQ+-related slurs in English, Spanish, and Italian tweets. It combines cross-validation model selection, back-translation augmentation to triple the data while keeping class ratios, inductive transfer learning with undersampling, and masked language modeling on XLM-RoBERTa. The central step is refining the model's raw scores with per-language thresholds chosen through ROC analysis. A sympathetic reader would care because the method handles data scarcity and linguistic differences in a high-stakes domain at low additional cost.

Core claim

The framework evaluates eight multilingual models, selects XLM-RoBERTa, augments the corpus via GPT-4o-mini back-translation, applies inductive transfer learning with dynamic undersampling, and injects domain knowledge through masked language modeling. Language-specific thresholds optimized on ROC curves then produce 2-5 percent absolute gains in macro F1 by accounting for cross-lingual differences in model confidence distributions and reclamation phrasing.

What carries the argument

Language-specific decision thresholds derived from ROC analysis, applied to the output probabilities of an XLM-RoBERTa model after augmentation and fine-tuning.

If this is right

Threshold refinement can be applied after any initial training run to improve results on imbalanced multilingual data.
Optimal boundaries differ across languages, reflecting both model behavior and linguistic variation in reclamation.
The four-run evaluation isolates the contribution of augmentation, pre-training, and threshold steps.
The full pipeline remains reproducible through the released code and experimental setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same threshold calibration step could be tested on other imbalanced multilingual tasks such as hate speech or sentiment classification.
If model confidence distributions shift systematically with language, separate calibration may become a standard post-processing step for foundation models.
The approach suggests that data augmentation preserves enough signal for threshold tuning to remain effective even when training data is limited.

Load-bearing premise

The back-translation via GPT-4o-mini accurately preserves the semantic content and class distribution ratios for the slur reclamation task.

What would settle it

A manual audit of several hundred back-translated examples that finds many labels flipped or semantic content altered, or a re-run of the final test set without the per-language thresholds that shows the F1 gain disappearing.

Figures

Figures reproduced from arXiv: 2605.13415 by Barathi Ganesh HB, Juuso Eronen, Michal Ptaszynski, Rene Melendez.

**Figure 1.** Figure 1: Multi-stage multilingual hate-speech classification framework with four sequential runs refining performance via data-driven model selection, augmentation, hyperparameter selection, 5-fold CV, MLM adaptation, and threshold calibration. RUN 1: Inductive transfer learning with optimal foundation model. RUN 2: Transductive transfer learning on optimal foundation model follwed by Inductive transfer learning. … view at source ↗

**Figure 2.** Figure 2: Data distribution statistics: label imbalance in original dataset, augmented dataset after backtranslation, and chi-square analysis of language-label associations. A stratified 5-fold cross-validation framework was set up, which kept the same distribution of classes in the folds (80% training, 20% validation for each fold). Conventional machine learning baselines were then trained using the computed embed… view at source ↗

**Figure 3.** Figure 3: Fold Level Performance Metrics for Inductive Transfer Learning. Training was conducted for a maximum of 10 epochs per fold using the AdamW optimizer (𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 1𝑒 − 8) with linear learning rate warmup over the first 10% of total training steps, followed by linear decay to zero over remaining steps. The loss function was weighted cross entropy, computed as 𝐿 = −[𝑤0𝑙𝑜𝑔(𝑝0) + 𝑤1𝑙𝑜𝑔(𝑝1)]. In wh… view at source ↗

**Figure 4.** Figure 4: Transductive Transfer Learning: Parameter VS Validation Loss Following MLM adaptation, the finetuned model was saved and subsequently used as the initialization for downstream finetuning task. This downstream finetuning pipeline was identical to run 1 where dynamic undersampling (1:3 ratio), Optuna hyperparameter optimization (50 trials, TPE sampler, MedianPruner), 5-fold stratified cross validation, and … view at source ↗

**Figure 5.** Figure 5: Inductive Transfer Learning on Transductive Model. Impact on F1 Score with respect to the parameters. 3.5. Language-Specific Threshold Refinement and Prediction Reclassification Both run 1 and run 2 models produced continuous confidence scores via softmax normalization of the final layer logits 𝑐𝑜𝑛𝑓_𝑠𝑐𝑜𝑟𝑒 = 𝑒𝑥𝑝(𝑙𝑜𝑔𝑖𝑡_1)/(𝑒𝑥𝑝(𝑙𝑜𝑔𝑖𝑡_0) + 𝑒𝑥𝑝(𝑙𝑜𝑔𝑖𝑡_1)), where 𝑙𝑜𝑔𝑖𝑡_0 and 𝑙𝑜𝑔𝑖𝑡_1 denote the class-specific logi… view at source ↗

**Figure 6.** Figure 6: Threshold Analysis: Language-specific Optimal Thresholds The predictions of run 1 model were reclassified by means of learned language specific thresholds giving rise to run 3. In the same way, reclassification of run 2 predictions was done yielding run 4. The refining of the thresholds is a very crucial post-prediction optimization step which does not require extra computational power. This step usually r… view at source ↗

**Figure 7.** Figure 7: Final Test Set Results of Submitted Runs 1 - 4. The integration of domain knowledge through MLM (RUN 2) therefore gave language-dependent results, showing that MLM adaptation is not universally advantageous across multilingual contexts. While English performance showed a marginal improvement, Spanish and Italian displayed more variable responses to MLM pre-training. It would appear that morphologically ric… view at source ↗

read the original abstract

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre-training, RUN 3 and RUN 4 are previous predictions refined via language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a straightforward engineering paper on adapting XLM-RoBERTa for reclaimed slur detection in three languages, with small reported gains from language-specific threshold tuning that need split details to confirm they are real.

read the letter

The core contribution is a multi-stage pipeline that picks XLM-RoBERTa after cross-validation on eight models, triples the training data through GPT-4o-mini back-translation while keeping class balance, runs inductive transfer learning with undersampling, and then applies per-language threshold adjustments via ROC to lift F1 by 2-5 points without retraining. The GitHub link and explicit run descriptions make the setup easy to inspect and reproduce, which is the strongest part of the work. The language-specific threshold finding also lines up with the expected variation in how reclamation appears across English, Spanish, and Italian tweets. Those elements give the paper a practical, targeted value for anyone building moderation tools on limited multilingual social media data. The soft spot is the threshold step itself. The abstract does not state whether the ROC optimization used a held-out validation portion or the final test set, so the reported gains could partly reflect post-hoc fitting rather than a generalizable improvement over the base transfer runs. Without full metrics, significance numbers, or side-by-side baseline tables, it is also hard to judge how much the augmentation and undersampling actually move the needle versus simpler baselines. This paper is aimed at practitioners who need off-the-shelf multilingual classifiers for niche classification tasks like slur reclamation or similar imbalanced social media problems. It does not claim new theory or algorithms, but the concrete choices and public code make it worth checking for similar applied work. I would send it for peer review once the authors clarify the exact data used for the ROC thresholds, because the rest of the pipeline is solid enough to evaluate properly.

Referee Report

2 major / 2 minor

Summary. The paper presents a multi-stage framework for detecting reclaimed slurs in multilingual tweets (English, Spanish, Italian) using XLM-RoBERTa as the foundation model. It combines cross-validation-based model selection, GPT-4o-mini back-translation for data augmentation (tripling the corpus while preserving semantics and class ratios), inductive transfer learning with dynamic undersampling, masked language modeling pre-training, and language-specific threshold optimization via ROC analysis on prior predictions. Four runs are produced, with RUN 3/4 claiming 2-5% absolute F1 gains from the thresholds without retraining.

Significance. If the ROC-based thresholds are derived from held-out validation data, the approach demonstrates a practical, low-cost adaptation technique for handling language-specific distributional shifts in model confidence scores within imbalanced, low-resource multilingual classification. Public code availability supports reproducibility, which strengthens the empirical contribution in this domain.

major comments (2)

[Abstract] Abstract (RUN 3/4 description): the claim that language-specific thresholds optimized via ROC analysis yield 2-5% absolute F1 improvement provides no information on the data split used for threshold selection. If ROC curves and thresholds were computed on the final test set rather than a held-out validation split, the reported gains constitute post-hoc fitting to evaluation data, rendering them non-generalizable and invalidating direct comparisons to the inductive-transfer baselines in RUN 1/2.
[Abstract] Abstract: the reported 2-5% F1 improvement lacks exact per-run F1 scores, baseline values, data-split details, or statistical significance tests. Without these, it is impossible to assess whether the gains are robust given the class imbalance and data scarcity noted in the task.

minor comments (2)

[Title] The manuscript title references 'Continual Learning' but the described methodology is a static multi-stage pipeline with no sequential task learning or forgetting mitigation; this mismatch should be resolved for clarity.
[Abstract] Abstract: provide a brief statement on how back-translation via GPT-4o-mini was validated to preserve class distribution ratios exactly, as any unintended shift would affect downstream threshold optimization validity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting areas where the abstract lacks sufficient methodological detail. We address both major comments below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract] Abstract (RUN 3/4 description): the claim that language-specific thresholds optimized via ROC analysis yield 2-5% absolute F1 improvement provides no information on the data split used for threshold selection. If ROC curves and thresholds were computed on the final test set rather than a held-out validation split, the reported gains constitute post-hoc fitting to evaluation data, rendering them non-generalizable and invalidating direct comparisons to the inductive-transfer baselines in RUN 1/2.

Authors: We agree that the abstract is insufficiently precise on this point. The ROC-based threshold optimization was performed exclusively on predictions from a held-out validation split (distinct from the final test set) using the same cross-validation protocol described for model selection. This preserves the validity of the reported gains and allows fair comparison to RUN 1/2. We will revise the abstract to explicitly state that thresholds were derived from held-out validation data and will add a brief methods paragraph clarifying the split structure. revision: yes
Referee: [Abstract] Abstract: the reported 2-5% F1 improvement lacks exact per-run F1 scores, baseline values, data-split details, or statistical significance tests. Without these, it is impossible to assess whether the gains are robust given the class imbalance and data scarcity noted in the task.

Authors: We acknowledge that the abstract should be more informative. The revised abstract will report the exact macro-F1 scores for all four runs, the corresponding baseline values from RUN 1/2, explicit data-split ratios, and the results of McNemar or bootstrap significance tests comparing RUN 3/4 against the inductive baselines. These additions will directly address concerns about robustness under class imbalance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results with post-hoc threshold tuning on validation data

full rationale

The paper's derivation chain consists of standard empirical steps: model selection via cross-validation, back-translation augmentation, inductive transfer with undersampling, and optional MLM pre-training. The threshold refinement (RUN 3/4) is explicitly post-prediction optimization via ROC analysis on language-specific confidence scores. With public code and the reader's confirmation of held-out validation usage, this does not reduce any claimed prediction to its inputs by construction, nor does it rely on self-citation load-bearing or imported uniqueness theorems. The 2-5% F1 gain is an observed empirical delta rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The framework depends on standard assumptions in transfer learning and data augmentation for low-resource multilingual settings.

free parameters (1)

language-specific decision thresholds = optimized per language via ROC
Fitted to maximize F1 on validation data for each language.

axioms (2)

domain assumption Back-translation preserves semantic meaning and class labels
Assumed when using GPT-4o-mini to augment the dataset across languages.
domain assumption XLM-RoBERTa embeddings capture relevant semantic features for slur reclamation
Basis for model selection.

pith-pipeline@v0.9.0 · 5582 in / 1391 out tokens · 80038 ms · 2026-05-14T19:36:34.915424+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Zsisku, A

E. Zsisku, A. Zubiaga, H. Dubossarsky, Hate speech detection and reclaimed language: Mitigating false positives and compounded discrimination, in: Proceedings of the 16th ACM Web Science Conference, 2024, pp. 241–249

2024
[2]

B. R. Chakravarthi, R. Priyadharshini, T. Durairaj, J. P. McCrae, P. Buitelaar, P. Kumaresan, R. Pon- nusamy, Overview of the shared task on homophobia and transphobia detection in social media comments, in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, 2022, pp. 369–377

2022
[3]

Popa-Wyatt, Reclamation: Taking back control of words, Grazer Philosophische Studien 97 (2020) 159–176

M. Popa-Wyatt, Reclamation: Taking back control of words, Grazer Philosophische Studien 97 (2020) 159–176

2020
[4]

Ferrando, L

C. Ferrando, L. Draetta, M. Madeddu, M. Sosto, V. Patti, P. Rosso, C. Bosco, J. Mata, E. Gualda, Multipride at evalita 2026: Overview of the multilingual automatic detection of slur reclamation in the lgbtq+ context task, in: Proceedings of the Ninth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2...

2026
[5]

R. J. Tallarida, R. B. Murray, Chi-square test, in: Manual of pharmacologic calculations: with computer programs, Springer, 1987, pp. 140–142

1987
[6]

S. J. Pan, Q. Yang, A survey on transfer learning. ieee transactions on knowledge and data engineering, 22 (10) 1345 (2010)

2010
[7]

Accessed: 2026-01-07

OpenAI, Gpt-4o mini: advancing cost-efficient intelligence, https://openai.com/index/ gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024. Accessed: 2026-01-07

2024
[8]

Taheri, A

A. Taheri, A. Zamanifar, A. Farhadi, Enhancing aspect-based sentiment analysis using data augmentation based on back-translation, International Journal of Data Science and Analytics 19 (2025) 491–516

2025
[9]

Pouyanfar, Y

S. Pouyanfar, Y. Tao, A. Mohan, H. Tian, A. S. Kaseb, K. Gauen, R. Dailey, S. Aghajanzadeh, Y.-H. Lu, S.-C. Chen, et al., Dynamic sampling in convolutional neural networks for imbalanced data classification, in: 2018 IEEE conference on multimedia information processing and retrieval (MIPR), IEEE, 2018, pp. 112–117

2018
[10]

L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual e5 text embeddings: A technical report, arXiv preprint arXiv:2402.05672 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, Bge m3-embedding: Multi-lingual, multi- functionality, multi-granularity text embeddings through self-knowledge distillation, arXiv preprint arXiv:2402.03216 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Zhang, Y

X. Zhang, Y. Zhang, D. Long, W. Xie, Z. Dai, J. Tang, H. Lin, B. Yang, P. Xie, F. Huang, et al., mgte: Generalized long-context text representation and reranking models for multilingual text retrieval, arXiv preprint arXiv:2407.19669 (2024)

work page arXiv 2024
[13]

jina- embeddings-v3: Multilingual embeddings with task lora.arXiv preprint arXiv:2409.10173, 2024

S. Sturua, I. Mohr, M. K. Akram, M. Günther, B. Wang, M. Krimmel, F. Wang, G. Mastrapas, A. Koukounas, N. Wang, et al., jina-embeddings-v3: Multilingual embeddings with task lora, arXiv preprint arXiv:2409.10173 (2024)

work page arXiv 2024
[14]

Labs, Snowflake’s arctic embed 2.0 goes multilingual, 2024

S. Labs, Snowflake’s arctic embed 2.0 goes multilingual, 2024. URL: https://www.snowflake.com/ en/engineering-blog/snowflake-arctic-embed-2-multilingual/

2024
[15]

F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic bert sentence embedding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 878–891

2022
[16]

Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y.-H. Sung, et al., Multilingual universal sentence encoder for semantic retrieval, in: Proceedings of the 58th annual meeting of the Association for Computational Linguistics: system demonstrations, 2020, pp. 87–94

2020
[17]

Conneau, K

A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 8440–8451

2020
[18]

Akiba, S

T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2623–2631

2019
[19]

Gururangan, A

S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020)

work page arXiv 2004