arxiv: 2604.26597 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.AI

Recognition: unknown

Translating Under Pressure: Domain-Aware LLMs for Crisis Communication

Antonio Castaldo, Francesca Chiusaroli, Johanna Monti, Maria Carmen Staiano, Sheila Castilho

Authors on Pith no claims yet

Pith reviewed 2026-05-07 10:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords crisis communicationmachine translationdomain adaptationsimplified Englishpreference optimizationlingua francaemergency responsereadability

0 comments

The pith

Simplified English, combined with domain adaptation, can function as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a pipeline for adapting language models to the crisis domain and simplifying their English outputs can deliver effective translations for disasters. It starts with a small reference corpus and enlarges it by retrieving and filtering material from general text collections. The enlarged data is used to fine-tune a compact language model, after which preference optimization steers the outputs toward simple A2-level English. Automatic and human checks confirm that readability rises while the core meaning stays intact. This matters for real emergencies because it offers a workable way to communicate across languages when building full translation systems for every possible pair is too slow or data-poor.

Core claim

By expanding a small crisis reference corpus through retrieval and filtering from general corpora, fine-tuning a small language model for domain-specific translation, and applying preference optimization to favor CEFR A2-level English, the system achieves improved readability in translations while maintaining strong adequacy. This supports the use of simplified English as a practical lingua franca for emergency communication.

What carries the argument

The domain-adaptive pipeline that retrieves and filters data from general corpora to expand a reference corpus, fine-tunes a small language model for crisis translation, and applies preference optimization to bias toward CEFR A2 English.

Load-bearing premise

The assumption that data pulled and filtered from general corpora accurately captures crisis-domain specifics without adding noise or bias that harms translation quality.

What would settle it

Human evaluation in a simulated crisis where participants must act on the translated instructions; if they fail to follow critical steps due to the simplification, the claim is falsified.

Figures

Figures reproduced from arXiv: 2604.26597 by Antonio Castaldo, Francesca Chiusaroli, Johanna Monti, Maria Carmen Staiano, Sheila Castilho.

**Figure 1.** Figure 1: Overview of our two-stage data retrieval pipeline. Stage 1 focuses on cleaning and clustering the reference corpus to generate distinct semantic centroids. Stage 2 leverages these centroids to retrieve in-domain sentences from general corpora (OPUS) via embedding similarity, validated by a stratified manual annotation. allowing candidate segments to be matched to the most relevant crisis profile, and impro… view at source ↗

**Figure 2.** Figure 2: Relationship between weighted MQM score and DA score for the DPO model. Bubble size reflects the frequency of the segments. The shaded region (DA ≥ 75) highlights translations judged high quality by DA despite MQM penalties. Results. Our results find that the DPO model, optimized for readability, produces mostly accurate translations with a mean score of 83 points, compared to 95 for SFT. The translatio… view at source ↗

**Figure 3.** Figure 3: Distribution of dominant error categories per model. DPO shows substantially higher rates of errors related to its simplification behavior on the crisis domain, and then optimized via preferential learning to bias its outputs toward CEFR A2-level English. Our results demonstrate that this two-stage optimization process enables translations that balance translation quality and accessibility, producing o… view at source ↗

read the original abstract

Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis communication is limited by the scarcity of curated parallel data. We propose a domain-adaptive pipeline that expands a small reference corpus, by retrieving and filtering data from general corpora. We use the resulting dataset to fine-tune a small language model for crisis-domain translation and then apply preference optimization to bias outputs toward CEFR A2-level English. Automatic and human evaluation shows that this approach improves readability, while maintaining strong adequacy. Our results indicate that simplified English, combined with domain adaptation, can function as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clear pipeline for bootstrapping crisis translation data and pushing outputs to simple A2 English, but the evaluation details are too thin to tell if the domain adaptation actually delivers.

read the letter

The main takeaway is a practical pipeline that starts with a small crisis reference corpus, expands it by retrieving and filtering from general corpora, fine-tunes a small language model on the result, and then runs preference optimization to favor CEFR A2 readability. The authors claim this keeps translation adequacy while improving readability, positioning simplified English as a workable fallback when full multilingual coverage is impossible in emergencies.

Referee Report

3 major / 2 minor

Summary. The paper proposes a domain-adaptive pipeline to address data scarcity in crisis communication translation. It expands a small reference corpus by retrieving and filtering data from general corpora, fine-tunes a small language model on the resulting dataset for crisis-domain translation, and applies preference optimization to produce CEFR A2-level simplified English outputs. Automatic and human evaluations are reported to show gains in readability while preserving adequacy, leading to the claim that simplified English combined with domain adaptation can serve as a practical lingua franca for emergency communication when full multilingual coverage is unavailable.

Significance. If the results hold under rigorous validation, the work offers a pragmatic, low-resource solution for high-stakes multilingual crisis messaging with clear societal value. The pipeline design and inclusion of human evaluation alongside automatic metrics are positive elements that ground the approach in real-world constraints.

major comments (3)

[§3] §3 (Data Expansion): The retrieval-and-filtering procedure for expanding the reference corpus provides no quantitative metrics on filter precision, recall, or retention of crisis-specific terminology and protocols. This is load-bearing for the central domain-adaptation claim; without such evidence the expanded dataset may contain substantial off-domain noise, rendering downstream fine-tuning ineffective.
[§4] §4 (Preference Optimization): The paper must demonstrate that the optimization step for A2 English does not degrade adequacy on crisis-critical items (e.g., safety instructions, terminology). No pre/post-optimization comparisons on domain-specific test cases are described, leaving the risk that simplification trades off precision unaddressed.
[§5] §5 (Evaluation): The results claim improvements in readability and adequacy but supply no concrete automatic metrics, baseline systems, statistical significance tests, or details of the human evaluation protocol (annotator count, agreement, item selection). These omissions prevent assessment of effect size and replicability.

minor comments (2)

[Abstract] The abstract would benefit from at least one illustrative numerical result (e.g., a readability score delta or adequacy rating) to substantiate the reported improvements.
[Introduction] Notation for CEFR A2 level is introduced without a brief definition or reference in the opening sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment point by point below and revised the manuscript to incorporate the requested evidence and details where feasible.

read point-by-point responses

Referee: [§3] §3 (Data Expansion): The retrieval-and-filtering procedure for expanding the reference corpus provides no quantitative metrics on filter precision, recall, or retention of crisis-specific terminology and protocols. This is load-bearing for the central domain-adaptation claim; without such evidence the expanded dataset may contain substantial off-domain noise, rendering downstream fine-tuning ineffective.

Authors: We agree that quantitative validation of the filtering step is necessary to substantiate the domain-adaptation claim. In the revised manuscript we have added a new subsection in §3 reporting precision and recall of the retrieval-and-filtering heuristics, computed on a held-out manually annotated set of crisis documents. We also include the retention rate of a glossary of crisis-specific terminology and protocols. These metrics confirm that off-domain noise remains low and support the downstream fine-tuning results. revision: yes
Referee: [§4] §4 (Preference Optimization): The paper must demonstrate that the optimization step for A2 English does not degrade adequacy on crisis-critical items (e.g., safety instructions, terminology). No pre/post-optimization comparisons on domain-specific test cases are described, leaving the risk that simplification trades off precision unaddressed.

Authors: We acknowledge the need to explicitly verify that preference optimization preserves adequacy on safety-critical content. The revised version of §4 now contains pre- and post-optimization comparisons on a dedicated set of crisis-specific test cases covering safety instructions and key terminology. Adequacy scores (both automatic and human) remain stable with no statistically significant degradation, demonstrating that the simplification step does not trade off precision on these items. revision: yes
Referee: [§5] §5 (Evaluation): The results claim improvements in readability and adequacy but supply no concrete automatic metrics, baseline systems, statistical significance tests, or details of the human evaluation protocol (annotator count, agreement, item selection). These omissions prevent assessment of effect size and replicability.

Authors: We apologize for the lack of concrete details in the original evaluation section. The revised manuscript now reports all automatic metrics (BLEU, COMET, and Flesch-Kincaid readability scores), the full set of baseline systems (general-domain MT models and non-adapted LLMs), statistical significance results from paired t-tests, and complete human-evaluation protocol information: five annotators, Cohen’s kappa agreement, item-selection criteria, and annotation guidelines. These additions enable assessment of effect sizes and support replicability. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with no derivations or self-referential fitting

full rationale

The paper describes a data-expansion pipeline (retrieve/filter from general corpora), fine-tuning of a small LM, preference optimization toward A2 English, and subsequent automatic/human evaluation. No equations, no fitted parameters renamed as predictions, no self-citation chains invoked as uniqueness theorems, and no ansatzes or renamings of known results. The central claim rests on experimental outcomes rather than any reduction to its own inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters or invented entities are mentioned; the approach relies on standard NLP assumptions about adaptation techniques.

axioms (2)

domain assumption Fine-tuning language models on domain-specific data improves translation performance in that domain
Core to the proposed pipeline
domain assumption Preference optimization can bias model outputs toward specific readability levels like CEFR A2 without major loss in adequacy
Assumed in the second stage of the method

pith-pipeline@v0.9.0 · 5416 in / 986 out tokens · 68794 ms · 2026-05-07T10:55:50.269687+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 4 canonical work pages · 2 internal anchors

[1]

InConference on Machine Translation

Findings of the 2017 conference on machine translation (wmt17). InConference on Machine Translation. Cadwell, Patrick, Sharon O’Brien, and Eric DeLuca

2017
[2]

The Faiss library

More than tweets: A critical reflection on de- veloping and testing crisis machine translation tech- nology.Translation Spaces, 8(2):300–333. Coche, Julien, Jess Kropczynski, Aur ´elie Montarnal, Andrea Tapia, and Frederick Benaben. 2021. Ac- tionability in a Situation Awareness world: Impli- cations for social media processing system design. InISCRAM 202...

work page internal anchor Pith review arXiv 2021
[3]

Glavaˇs, Goran, Marc Franco-Salvador, Simone P Ponzetto, and Paolo Rosso

International network in crisis translation- recommendations on policies. Glavaˇs, Goran, Marc Franco-Salvador, Simone P Ponzetto, and Paolo Rosso. 2018. A resource-light method for cross-lingual semantic textual similarity. Knowledge-based systems, 143:1–9. Graham, Yvette, Timothy Baldwin, and Nitika Mathur

2018
[4]

Accurate Evaluation of Segment-level Ma- chine Translation Metrics. In Mihalcea, Rada, Joyce Chai, and Anoop Sarkar, editors,Proceedings of the 2015 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, pages 1183–1191, Denver, Colorado, May. Association for Computa- tional Linguistics. H...

work page internal anchor Pith review arXiv 2015
[5]

arXiv preprint arXiv:2503.20959

Sociotechnical effects of machine translation. arXiv preprint arXiv:2503.20959. Musacchio, Maria Teresa, Raffaella Panizzon, et al

work page arXiv
[6]

O’Brien, Sharon

Localising or globalising? multilingualism and lingua franca in the management of emergencies from natural disasters.Cultus, 10:92–107. O’Brien, Sharon. 2022. Crisis translation: A snapshot in time.INContext: Studies in Translation and In- terculturalism, 2(1):84–108. O’Brien, Sharon and Federico Marco Federici. 2020. Crisis translation: Considering langu...

2022
[7]

Paesi Bassi

Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Lan- guage Models, January. Issue: arXiv:2401.08350 arXiv:2401.08350 [cs]. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Isabelle, Pierre, Eugene Charniak, and Dekang Lin, editors, P...

work page arXiv 2002