Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

Amy Rechkemmer; Denny Vrandečić; Elena Simperl; Elizabeth Black; Gerrit Quaremba

arxiv: 2605.31136 · v1 · pith:XRXU7JVYnew · submitted 2026-05-29 · 💻 cs.CL


Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl

Pith reviewed 2026-06-28 22:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords: citation needed detection, multilingual, cross-lingual transfer, small language models, Wikipedia, automated fact-checking, lower-resource languages, encoder-style fine-tuning


The pith

Fine-tuned small language models outperform prompted large models for detecting missing citations on Wikipedia across 18 languages, including via English-only training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the MCN corpus covering citation-needed claims in 18 languages at three resource levels. It tests small decoder-based language models fine-tuned with an encoder-style objective against prompted large language models on the citation needed detection task. The fine-tuned small models show stronger performance both within each language and in cross-lingual settings where models see only English training data. This setup is presented as more accessible for lower-resource Wikipedia communities that cannot rely on large models. The work frames compact task-specific models as a practical alternative to prompted LLMs for this narrow fact-checking step.

Core claim

We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation.

What carries the argument

The MCN corpus paired with encoder-style fine-tuning of small decoder-based language models for the citation needed detection task.
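
This page does not spell out what the encoder-style objective looks like. One common reading, offered here as an assumption rather than the authors' implementation, is that the decoder's per-token hidden states are pooled into one vector and passed to a small classification head trained with cross-entropy, instead of asking the model to generate a label token. The sketch below shows that idea on toy numbers. Mean pooling, the two-dimensional hidden states, and the names `mean_pool`, `classify`, and `bce_loss` are all illustrative choices.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def classify(pooled, weights, bias):
    """Logistic head: probability that the claim needs a citation."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def bce_loss(prob, label):
    """Binary cross-entropy for a single example; label is 0 or 1."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

# Toy hidden states for a 3-token claim plus one padding position (assumed values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = mean_pool(hidden, mask)
prob = classify(pooled, weights=[0.5, -0.5], bias=0.0)
loss = bce_loss(prob, label=1)
```

In a real setup the head weights and the decoder would be trained jointly by gradient descent on this loss. Last-token pooling is another plausible choice for decoder models, and the page gives no indication of which one the paper uses.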

If this is right

Compact models become preferable to large models for citation needed detection in lower-resource Wikipedia settings.
English-only fine-tuning transfers effectively to other languages for this task with minimal adaptation.
Lower-resource communities gain a practical tool that does not require access to large language models.
Task-specific fine-tuning on a dedicated corpus yields better results than prompting for citation needed detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fine-tuning approach could be tested on other narrow Wikipedia maintenance tasks beyond citation detection.
If the pattern holds, organizations could maintain small specialized models instead of depending on general large models for repeated fact-checking steps.
Cross-lingual gains may reduce the data collection burden for languages with few labeled examples.

Load-bearing premise

That the performance comparison between fine-tuned small models and prompted large models is fair and that the MCN corpus gives an unbiased sample of citation-needed claims across the 18 languages.

What would settle it

A re-evaluation on a fresh sample of Wikipedia claims from the same languages where prompted large models match or exceed the fine-tuned small models under identical evaluation rules.
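
A re-evaluation of this kind needs a rule for deciding whether two systems scored on the same claims really differ. The sketch below uses an exact two-sided McNemar test on paired predictions. This is a standard choice for paired classifiers, not a procedure the page attributes to the paper. The function name `mcnemar_exact` and the toy labels are hypothetical.

```python
from math import comb

def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only system A gets right
    and c counts items only system B gets right.
    """
    if not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("all three sequences must have the same length")
    b = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a == g and o != g)
    c = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a != g and o == g)
    n = b + c
    if n == 0:
        # The systems never disagree in correctness, so there is no evidence of a gap.
        return b, c, 1.0
    # Under the null, each discordant item favours A or B with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data: 1 = citation needed, 0 = no citation needed (assumed values).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
slm_pred = [1, 0, 1, 1, 0, 0, 1, 0]
llm_pred = [1, 0, 0, 0, 0, 1, 0, 1]

b, c, p_value = mcnemar_exact(gold, slm_pred, llm_pred)
```

Only the discordant items enter the test, so it is insensitive to how many claims both systems classify the same way. A fresh sample would need enough discordant pairs per language for the p-value to be informative.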

Figures

Figures reproduced from arXiv: 2605.31136 by Amy Rechkemmer, Denny Vrandečić, Elena Simperl, Elizabeth Black, Gerrit Quaremba.

**Figure 1.** MCN Language Coverage.

**Figure 2.** Dataset examples from the Wikipedia articles

**Figure 3.** Zero-shot CND accuracy by test data setting, …

**Figure 4.** Descriptive statistics

**Figure 5.** Zero-shot CND F1 scores by test data setting, …

Original abstract

In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New 18-language MCN corpus for citation needed detection plus cross-lingual SLM results that beat prompted LLMs, but the size of that gap depends on unshown details of the LLM baselines.


The paper's main deliverable is the MCN corpus covering citation needed detection across 18 languages at three resource levels, along with experiments that fine-tune small decoder models in an encoder-style setup and test them cross-lingually.

Releasing the data and code is the clearest positive. That makes the resource immediately usable for groups maintaining Wikipedia in lower-resource languages who cannot run large models. The cross-lingual finding, that English-only training still beats prompted LLMs with little target adaptation, is the part that is not already in the cited prior work.

The comparison between fine-tuned SLMs and prompted LLMs is the load-bearing claim. The abstract states clear outperformance, but the stress-test note is right that prompt templates, output parsing, and identical sampling of positive and negative examples across languages need to be verified in the methods. If those controls are loose, the reported advantage could shrink. The paper does not appear to have circularity or invented entities; it is a standard empirical benchmark.
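
Output parsing is one of the controls that can quietly move the prompted-LLM numbers. The sketch below assumes the prompted model is asked to answer with a JSON object of the form `{"label": 0}` or `{"label": 1}`. That format, the function name `parse_label`, and the strictness rules are assumptions for illustration, not details confirmed on this page.

```python
import json
import re

# Matches flat JSON-like objects with no nested braces.
_JSON_OBJECT = re.compile(r"\{[^{}]*\}")

def parse_label(raw_output):
    """Return 0 or 1 if the output contains a valid label object, else None."""
    for candidate in _JSON_OBJECT.findall(raw_output):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        label = obj.get("label")
        # bool is a subclass of int in Python, so reject True/False explicitly.
        if isinstance(label, int) and not isinstance(label, bool) and label in (0, 1):
            return label
    return None

clean = parse_label('{"label": 1}')
wrapped = parse_label('Sure! {"label": 0} because the claim is common knowledge.')
free_text = parse_label("label: yes")
string_label = parse_label('{"label": "1"}')
```

Whatever the parser returns for unparseable outputs, the scoring rule for those cases (count as wrong, map to a default class, or exclude) should be fixed before evaluation and applied identically in every language, because it can shift accuracy and F1 for the prompted models.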

This is for researchers in multilingual NLP or automated fact-checking who need labeled data for low-resource settings. A reader working on Wikipedia quality tools would get direct value from the corpus even if the model rankings require closer inspection.

It deserves a serious referee because the dataset is new and the cross-lingual angle is checkable once the full experimental protocol is on the table.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the MCN corpus, a multilingual dataset for Citation Needed Detection (CND) spanning 18 languages across three resource levels. It evaluates small decoder-based language models (SLMs) fine-tuned using an encoder-style objective and reports that these models substantially outperform prompted large language models (LLMs) in both monolingual and cross-lingual settings; English-only fine-tuned SLMs are shown to surpass LLMs on target languages with minimal adaptation. The authors release the full dataset and code.

Significance. If the empirical results hold, the work is significant for lower-resource Wikipedia communities because it demonstrates that compact, task-specific models can be preferable to LLMs for CND and provides one of the first cross-lingual studies on the task. The public release of the MCN corpus and code is a clear strength that supports reproducibility and follow-on research.

minor comments (2)

[Abstract] The claim of outperformance is stated without reference to specific metrics, data splits, or statistical tests; adding one sentence summarizing the evaluation protocol would improve immediate verifiability.
[Experiments] The comparison between fine-tuned SLMs and prompted LLMs is central. The code release lets readers check the setup themselves, but a brief explicit statement in the experimental section confirming identical positive/negative sampling and metric computation across all 18 languages would strengthen the fairness claim.
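
The second comment can be made concrete with a single scoring routine that every system and every language passes through. The sketch below computes accuracy, F1 on the positive class, and the share of positive examples per language, so that differences in class balance are visible next to the scores. The function names, the row layout, and the placeholder language code `xx` are hypothetical; the paper's own evaluation code may differ.

```python
from collections import defaultdict

def accuracy_and_f1(gold, pred):
    """Accuracy and F1 for the positive class (1 = citation needed)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(gold), f1

def per_language_report(rows):
    """rows: iterable of (language, gold_label, predicted_label) tuples."""
    by_lang = defaultdict(lambda: ([], []))
    for lang, gold, pred in rows:
        by_lang[lang][0].append(gold)
        by_lang[lang][1].append(pred)
    report = {}
    for lang, (gold, pred) in sorted(by_lang.items()):
        acc, f1 = accuracy_and_f1(gold, pred)
        report[lang] = {
            "n": len(gold),
            "positive_rate": sum(gold) / len(gold),
            "accuracy": acc,
            "f1": f1,
        }
    return report

# Toy rows; "xx" stands in for any target language (assumed values).
rows = [
    ("en", 1, 1), ("en", 0, 0), ("en", 1, 0), ("en", 0, 0),
    ("xx", 1, 1), ("xx", 0, 1),
]
report = per_language_report(rows)
```

Running the SLM and LLM predictions through the same `per_language_report` call, on the same rows, is one simple way to document that sampling and metric computation were identical.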

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the MCN corpus, the significance assessment for lower-resource Wikipedia communities, and the recommendation of minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with released data


The paper is an empirical study that introduces the MCN corpus spanning 18 languages and reports experimental comparisons between fine-tuned SLMs and prompted LLMs, including cross-lingual settings. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claims rest on benchmark results and released code rather than any self-definitional or load-bearing reductions. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard supervised learning assumptions and the new dataset; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (1)

[standard math] Standard machine learning assumptions, including representative train/test splits and appropriate classification metrics. Implicit in any supervised benchmarking study on text classification.


