Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

Andr\'e F.T. Martins; Beatrice Savoldi; Daniel Chechelnitsky; Giuseppe Attanasio; Maarten Sap; Marine Carpuat; Matteo Negri

arxiv: 2606.06177 · v1 · pith:SZSWIASTnew · submitted 2026-06-04 · 💻 cs.CL · cs.HC

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

Giuseppe Attanasio , Beatrice Savoldi , Daniel Chechelnitsky , Matteo Negri , Marine Carpuat , Maarten Sap , Andr\'e F.T. Martins This is my paper

Pith reviewed 2026-06-28 01:08 UTC · model grok-4.3

classification 💻 cs.CL cs.HC

keywords speech translationusability evaluationuser-centered frameworkreal-world scenariosQA-based metricsdemographic differencesmachine translation evaluationsituated evaluation

0 comments

The pith

A user study finds speech translation usable in only about half of real one-to-one interactions, with large gaps by demographic group and QA metrics as stronger predictors than standard scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ouvia to test how usable speech translation feels to users when they need to convey messages across languages in actual conversations. It collected more than 1750 English-to-Portuguese interactions through a custom web app in healthcare and everyday settings using four different systems and speakers from varied dialects and genders. Results show only around half the interactions rated usable, with clear differences in ratings across groups. QA-based metrics aligned much more closely with those user ratings than conventional quality measures. The work argues that evaluation must shift from abstract quality scores to situated user needs if the technology is to serve communication effectively.

Core claim

Ouvia measures user-perceived usability of speech translation by placing English speakers and Portuguese speakers in one-to-one request-conveying tasks that mimic real healthcare and daily situations. Data from over 1750 mediated interactions across four systems show modern speech translation produces usable results in only around half of cases, with substantial reported differences by English dialect and gender. Among available quality metrics, question-answering evaluation predicts these usability ratings substantially better than standard approaches.

What carries the argument

Ouvia, a framework that gathers usability ratings from situated one-to-one communication tasks run inside a custom web app.

If this is right

Only around half of speech translation interactions receive usable ratings in realistic settings.
Usability ratings differ significantly across demographic groups including English dialects and genders.
QA-based evaluation predicts real-world usability ratings substantially better than standard quality metrics.
Evaluation of speech translation benefits from moving beyond decontextualized test sets to user-centered, situated designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be applied to additional language pairs to test whether similar usability ceilings appear.
Demographic gaps may trace to uneven performance on particular dialects or speaker characteristics in current models.
Developers might adopt QA-style checks during system tuning to improve alignment with actual user experience.

Load-bearing premise

The custom web app and multi-phase study design with healthcare and everyday interactions accurately reflect real-world communication needs and user-perceived usability for the tested systems.

What would settle it

A follow-up study that uses different real-world scenarios or in-person interactions and reports usability rates well above or below 50 percent, or no demographic differences, would challenge the central findings.

Figures

Figures reproduced from arXiv: 2606.06177 by Andr\'e F.T. Martins, Beatrice Savoldi, Daniel Chechelnitsky, Giuseppe Attanasio, Maarten Sap, Marine Carpuat, Matteo Negri.

**Figure 1.** Figure 1: Study Workflow. 1 : A English-speaking sender reads aloud and records a conversation starter, which we translate with a speech translation system. 2 : A L1 Portuguese receiver answers up to 10 open-text questions about the translated text. 3 : A validator, fluent in both languages, assesses translation quality and flags the questions answered incorrectly in 2 . 4 : the initial sender rates their perceived … view at source ↗

**Figure 2.** Figure 2: (a) Survival curves of usability scores by language group. Each curve shows the fraction of the group whose usability rating reaches or exceeds a given score on the x-axis. Annotations at u = 4. (b) Fixed-effect estimates (95% CI) from a LMM predicting usability (u) (§4.1). Filled circles p < 0.05, diamonds 0.05 ≤ p < 0.1, hollow non-significant. Estimates annotated right of each interval. (c) Estimated ma… view at source ↗

**Figure 3.** Figure 3: (a) Fixed-effect estimates (95% CI) from a linear mixed model predicting validator translation score. Reference: Hindi, Man, DeSTA2, Everyday, MED-MT. Filled circles p < 0.05, diamonds 0.05 ≤ p < 0.1, hollow non-significant. (b) Model-adjusted marginal means (±95% CI) by language group and gender, marginalized over topic, model, and source. (c) Descriptive means (±95% CI) by translation model and conversat… view at source ↗

**Figure 4.** Figure 4: Mean QA Score and COMET (±95% CI) among records whose usability meets or exceeds each threshold value on the x-axis. Error bars are 95% confidence intervals. Annotations at thresholds 3, 4, and 5. ity predicts usability, but not all quality signals are equally informative. Standard quality estimation metrics rank highly on leaderboards for overall quality prediction (Kocmi et al., 2025), yet in communic… view at source ↗

**Figure 5.** Figure 5: Causal Graph. The links establish how different factors interact in the study. Boxes describe contextual (yellow shading) and latent (gray) factors that shape the result variables (white), alongside the study participants. Help me generate a dataset of conversation starters. Each conversation takes place within a specific context, and the person is seeking a particular piece of information. The generated s… view at source ↗

**Figure 6.** Figure 6: Example prompt for synthetic data generation. Over-the-counter pharmacy scenario. You are a professional English to Portuguese translator, tasked with providing translations suitable for use in Portugal. Your goal is to accurately convey the meaning and nuances of the original English text while adhering to Portuguese grammar, vocabulary, and cultural sensitivities. ,→ ,→ Please translate the following Eng… view at source ↗

**Figure 7.** Figure 7: Prompt for Question Translation. We use it to prepare Portuguese questions in 2 . starters and 200 questions). The analysis was conducted by one author with a background in translation studies. We verified the starter content and the grounding of such questions in the starter content. No issues were found with the starters. Out of 200 questions, only one raised an ambiguity, making it arguably unanswera… view at source ↗

**Figure 8.** Figure 8: Prompt for Question Generation. We use the generated questions in 2 . B.3 Scalar Quality Metric For the quality assessment, we use the SQM scale that features seven labeled tick marks indicating different quality labels combining accuracy and grammatical correctness described as follows: • 6: Perfect Meaning and Grammar: The meaning of the translation is completely consistent with the source and the surro… view at source ↗

**Figure 10.** Figure 10: Mean usability scores by translation model and conversation topic. Left: estimated means ±95% CI for Everyday and Health conversations. Right: per-model difference ∆u = uHealth − uEveryday with 95% CI; positive values indicate higher usability for healthrelated conversations. composite measure. Given the nested structure of the data—10 interaction-level observations per participant—we decompose variance … view at source ↗

**Figure 11.** Figure 11: Spearman’s ρ correlation between Usability (u) and translation quality metrics. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

read the original abstract

Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Ouvia introduces a user study framework for speech translation usability in real conversations, finding roughly 50% success rate with demographic gaps and QA metrics as stronger predictors than standard ones.

read the letter

The main thing here is that Ouvia gives a way to test speech translation in actual back-and-forth talks rather than just on test sets, and their study with 1750 interactions finds that only half the time it works well enough for users, with big differences depending on who is speaking.

What stands out is the effort to run this in healthcare and everyday settings with different English dialects and genders, using four different ST systems. They also show that QA-style metrics do a better job predicting those usability ratings than the usual approaches.

This is a step forward from the usual quality-focused evals because it ties directly to whether the translation lets the communication happen. The scale of the data collection is substantial for this kind of work.

The soft spots are in the study design. The custom web app and chosen situations may not capture the full range of real-world use, and the definition of usable could be sensitive to how they phrased the questions to participants. The demographic gaps are interesting but rest on their specific setup, so generalizability needs checking.

This paper is for researchers in machine translation evaluation and HCI who want to move beyond decontextualized scores. A reader focused on deployment and user needs would get concrete data points from the metric comparisons.

I would send it to peer review. The empirical approach and the angle on who the systems actually serve make it worth referee time, even if some details on analysis and controls might need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ouvia, a user-centered evaluation framework for measuring the usability of speech translation (ST) outputs in real-world one-to-one communication scenarios (English speaker conveying requests to a Portuguese speaker). It describes a custom web app and multi-phase study collecting over 1,750 interactions across healthcare and everyday situations using four ST systems, with participants from three English dialects and two genders. Key claims are that modern ST is usable in only around half of interactions, with significant demographic gaps in usability ratings, and that QA-based metrics substantially outperform standard quality metrics as predictors of real-world usability.

Significance. If the empirical results hold after detailed validation, the work would provide concrete evidence that current ST systems have limited practical utility and that evaluation must move beyond decontextualized holistic scores to situated, user-centered measures that account for demographic factors. This could shift research priorities in speech translation and human-AI communication toward more ecologically valid testing protocols.

major comments (2)

[§3] §3 (Study Design): The central claim that the custom web app and multi-phase protocol accurately capture real-world usability rests on an untested assumption that simulated healthcare/everyday interactions via the app generalize to natural face-to-face or device-mediated communication; without a validation study or comparison to field data, this undermines the external validity of the 50% usability rate and demographic-gap findings.
[§4] §4 (Results): The assertion that QA-based evaluation is a 'substantially stronger predictor' of usability requires the specific correlation coefficients, regression models, and baseline comparisons (e.g., vs. BLEU, COMET) to be reported with confidence intervals and cross-validation details; the abstract alone does not allow assessment of whether this superiority is robust or driven by particular operationalizations of 'usable'.

minor comments (2)

[Abstract] The abstract and introduction would benefit from an explicit operational definition of 'usable' (e.g., rating threshold or binary criterion) to allow readers to interpret the 'around half' figure.
[§4] Table or figure presenting the exact distribution of usability ratings across the four ST systems and demographic strata should be added for transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and describe planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: §3 (Study Design): The central claim that the custom web app and multi-phase protocol accurately capture real-world usability rests on an untested assumption that simulated healthcare/everyday interactions via the app generalize to natural face-to-face or device-mediated communication; without a validation study or comparison to field data, this undermines the external validity of the 50% usability rate and demographic-gap findings.

Authors: We acknowledge that the study relies on a simulated web-app environment rather than direct field observations. This design enables ethical, large-scale data collection (1,750+ interactions) with controlled variables across demographics and scenarios. We agree this introduces an assumption about generalizability. In revision we will expand the limitations section to explicitly discuss the simulation's scope and implications for the reported usability rates and demographic differences. revision: partial
Referee: §4 (Results): The assertion that QA-based evaluation is a 'substantially stronger predictor' of usability requires the specific correlation coefficients, regression models, and baseline comparisons (e.g., vs. BLEU, COMET) to be reported with confidence intervals and cross-validation details; the abstract alone does not allow assessment of whether this superiority is robust or driven by particular operationalizations of 'usable'.

Authors: We agree that detailed statistical reporting is required for evaluation of the claim. The manuscript body contains the relevant analyses, but we will revise to present explicit correlation coefficients, regression models, confidence intervals, and cross-validation results in a dedicated table or subsection, ensuring readers can fully assess the comparison to BLEU, COMET, and other baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical user study collecting 1750+ real-world interactions via a custom web app across healthcare and everyday scenarios, with usability ratings provided directly by participants. Central claims (approximately 50% usability rate, demographic gaps, and QA metrics outperforming standard ones) rest on observed data and statistical comparisons rather than any derivation, equation, fitted parameter renamed as prediction, or self-citation chain. No mathematical model, ansatz, uniqueness theorem, or self-referential definition appears in the provided text; the evaluation framework is defined by the study protocol itself and is externally falsifiable through replication of the user ratings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the work is empirical and framework-based.

pith-pipeline@v0.9.1-grok · 5754 in / 1122 out tokens · 54173 ms · 2026-06-28T01:08:56.090415+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 28 canonical work pages

[1]

Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, and 1 others. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743

Pith/arXiv arXiv 2025
[2]

AIMA . 2025. https://aima.gov.pt/media/pages/documents/fec4d6a712-1760603125/relatorio-migracoes-e-asilo-2024.pdf Relatório de migrações e asilo 2024 . Technical report, Agência para a Integração, Migrações e Asilo (AIMA I.P.), Lisboa. Edição Digital. Coordenação: Sílvia Lopes

2025
[3]

Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1188 Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21318--21340, Miami, Florida, USA. As...

work page doi:10.18653/v1/2024.emnlp-main.1188 2024
[4]

Jeffrey Basoah, Daniel Chechelnitsky, Tao Long, Katharina Reinecke, Chrysoula Zerva, Kaitlyn Zhou, Mark D \' az, and Maarten Sap. 2025. Not like us, hunty: Measuring perceptions and behavioral effects of minoritized anthropomorphic cues in llms. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 710--745

2025
[5]

Anol Bhattacherjee. 2001. Understanding information systems continuance: An expectation-confirmation model1. MIS quarterly, 25(3):351--370

2001
[6]

Marc Brysbaert. 2019. How many participants do we have to include in properly powered experiments? a tutorial of power analysis with reference tables. Journal of cognition, 2(1):16

2019
[7]

Khoong, William D

Marine Carpuat, Omri Asscher, Kalika Bali, Luisa Bentivogli, Fr \'e d \'e ric Blain, Lynne Bowker, Monojit Choudhury, Hal Daum \'e III, Kevin Duh, Ge Gao, Alvin Grissom II, Marzena Karpinska, Elaine C. Khoong, William D. Lewis, Andr \'e F. T. Martins, Mary Nurminen, Douglas W. Oard, Maja Popovic, Michel Simard, and Fran c ois Yvon. 2025. https://doi.org/1...

work page doi:10.18653/v1/2025.emnlp-main.1164 2025
[8]

Gary Charness, Uri Gneezy, and Michael A Kuhn. 2012. Experimental methods: Between-subject and within-subject design. Journal of economic behavior & organization, 81(1):1--8

2012
[9]

Sonia Colina. 2009. Further evidence for a functionalist approach to translation quality evaluation. Target. International Journal of Translation Studies, 21(2):235--264

2009
[10]

Amin Farajian, Ant \'o nio V

M. Amin Farajian, Ant \'o nio V. Lopes, Andr \'e F. T. Martins, Sameen Maruf, and Gholamreza Haffari. 2020. https://doi.org/10.18653/v1/2020.wmt-1.3 Findings of the WMT 2020 shared task on chat translation . In Proceedings of the Fifth Conference on Machine Translation, pages 65--75, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.wmt-1.3 2020
[11]

Faiha Fareez, Tishya Parikh, Christopher Wavell, Saba Shahab, Meghan Chevalier, Scott Good, Isabella De Blasi, Rafik Rhouma, Christopher McMahon, Jean-Paul Lam, and 1 others. 2022. A dataset of simulated patient-physician medical interviews with a focus on respiratory cases. Scientific Data, 9(1):313

2022
[12]

Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, Andr \'e FT Martins, and Graham Neubig. 2025. Do llms understand your translations? evaluating paragraph-level mt with question answering. arXiv preprint arXiv:2504.07583

arXiv 2025
[13]

Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli, Andre Martins, and Giuseppe Attanasio. 2025. https://doi.org/10.18653/v1/2025.acl-short.78 Different speech translation models encode and translate speaker gender differently . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), page...

work page doi:10.18653/v1/2025.acl-short.78 2025
[14]

Susanne Fuchs and Martine Toda. 2010. Do differences in male versus female/s/reflect biological or sociophonetic factors . Turbulent sounds: An interdisciplinary guide, 21:281--302

2010
[15]

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2020. https://doi.org/10.18653/v1/2020.coling-main.350 Breeding gender-aware direct speech translation systems . In Proceedings of the 28th International Conference on Computational Linguistics, pages 3951--3964, Barcelona, Spain (Online). International Committee on Computati...

work page doi:10.18653/v1/2020.coling-main.350 2020
[16]

Ariana Genovese, Sahar Borna, Cesar A Gomez-Cabello, Syed Ali Haider, Srinivasagam Prabha, Antonio J Forte, and Benjamin R Veenstra. 2024. Artificial intelligence in clinical settings: a systematic review of its role in language translation and interpretation. Annals of Translational Medicine, 12(6):117

2024
[17]

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. https://aclanthology.org/W13-2305/ Continuous measurement scales in human evaluation of machine translation . In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33--41, Sofia, Bulgaria. Association for Computational Linguistics

2013
[18]

and Rei, Ricardo and Stigt, Daan van and Coheur, Luisa and Colombo, Pierre and Martins, Andr \'e F

Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and Andr \'e F. T. Martins. 2024. https://doi.org/10.1162/tacl_a_00683 x COMET : Transparent machine translation evaluation through fine-grained error detection . Transactions of the Association for Computational Linguistics, 12:979--995

work page doi:10.1162/tacl_a_00683 2024
[19]

HyoJung Han, Kevin Duh, and Marine Carpuat. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1218 S peech QE : Estimating the quality of direct speech translation . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21852--21867, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.1218 2024
[20]

Camille Harris, Chijioke Mgbahurike, Neha Kumar, and Diyi Yang. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.890 Modeling gender and dialect bias in automatic speech recognition . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15166--15184, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.findings-emnlp.890 2024
[21]

Hoffman, Shane T

Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. 2023. https://doi.org/10.3389/fcomp.2023.1096257 Measures for explainable ai: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-ai performance . Frontiers in Computer Science, Volume 5 - 2023

work page doi:10.3389/fcomp.2023.1096257 2023
[22]

Eduard Hovy, Margaret King, and Andrei Popescu-Belis. 2002. Principles of context-based machine translation evaluation. Machine Translation, 17(1):43--75

2002
[23]

John Hutchins. 2005. Current commercial machine translation systems and computer-based translation tools: system types and their uses. International journal of translation, 17(1-2):5--38

2005
[24]

ISO . 2018. https://www.iso.org/obp/ui/en/#iso:std:iso:9241:-11:ed-2:v1:en ISO 9241-11:2018 --- ergonomics of human-system interaction---part 11: Usability: Definitions and concepts . Standard, International Organization for Standardization

2018
[25]

Eshin Jolly. 2018. Pymer4: Connecting r and python for linear mixed modeling. Journal of Open Source Software, 3(31):862

2018
[26]

Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.35 M etric X -24: The G oogle submission to the WMT 2024 metrics shared task . In Proceedings of the Ninth Conference on Machine Translation, pages 492--504, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.wmt-1.35 2024
[27]

Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, and Markus Freitag. 2025. https://doi.org/10.18653/v1/2025.wmt-1.70 M etric X -25 and G em S pan E val: G oogle T ranslate submissions to the WMT 25 evaluation shared task . In Proceedings of the Tenth Conference on Machine Translation, pages 957--9...

work page doi:10.18653/v1/2025.wmt-1.70 2025
[28]

Elaine C Khoong, Eric Steinbrook, Cortlyn Brown, and Alicia Fernandez. 2019. Assessing the use of google translate for spanish and chinese translations of emergency department discharge instructions. JAMA internal medicine, 179(4):580--582

2019
[29]

Dayeon Ki, Kevin Duh, and Marine Carpuat. 2025 a . https://doi.org/10.18653/v1/2025.findings-acl.899 A sk QE : Question answering as automatic evaluation for machine translation . In Findings of the Association for Computational Linguistics: ACL 2025, pages 17478--17515, Vienna, Austria. Association for Computational Linguistics

work page doi:10.18653/v1/2025.findings-acl.899 2025
[30]

Dayeon Ki, Kevin Duh, and Marine Carpuat. 2025 b . https://doi.org/10.18653/v1/2025.emnlp-main.606 Should I share this translation? evaluating quality feedback for user reliance on machine translation . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12069--12092, Suzhou, China. Association for Computationa...

work page doi:10.18653/v1/2025.emnlp-main.606 2025
[31]

Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ond r ej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, and 10 others. 2025. https://doi.org/10.18653...

work page doi:10.18653/v1/2025.wmt-1.22 2025
[32]

Tom Kocmi, Rachel Bawden, Ond r ej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Nov \'a k, Martin Popel, and Maja Popovi \'c . 2022. https://doi.org/10.18653/v1/2022.wmt-1.1...

work page doi:10.18653/v1/2022.wmt-1.1 2022
[33]

Proceedings of the National Academy of Sciences , author =

Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. 2020. https://doi.org/10.1073/pnas.1915768117 Racial disparities in automated speech recognition . Proceedings of the National Academy of Sciences, 117(14):7684--7689

work page doi:10.1073/pnas.1915768117 2020
[34]

Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, and Matteo Negri. 2025. https://aclanthology.org/2025.coling-main.455/ Speech foundation models and crowdsourcing for efficient, high-quality data collection . In Proceedings of the 31st International Conference on Computational Linguistics, pages 6816--6826, Abu Dhabi, UAE. Association for Co...

2025
[35]

Daniel Liebling, Katherine Heller, Samantha Robertson, and Wesley Deng. 2022. https://doi.org/10.18653/v1/2022.findings-naacl.17 Opportunities for human-centered evaluation of machine translation systems . In Findings of the Association for Computational Linguistics: NAACL 2022, pages 229--240, Seattle, United States. Association for Computational Linguistics

work page doi:10.18653/v1/2022.findings-naacl.17 2022
[36]

Alexander H Liu, Andy Ehrenberg, Andy Lo, Cl \'e ment Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, and 1 others. 2025. Voxtral. arXiv preprint arXiv:2507.13264

arXiv 2025
[37]

Ting Liu, Chi-kiu Lo, Elizabeth Marshman, and Rebecca Knowles. 2024. https://aclanthology.org/2024.amta-research.17/ Evaluation briefs: Drawing on translation studies for human evaluation of MT . In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 190--208, Chicago, USA. Associ...

2024
[38]

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, and Hung-yi Lee. 2025. Developing instruction-following speech language model without speech instruction-tuning data. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

2025
[39]

McGrath, Oliver Lack, James Tisch, and Andreas Duenser

Melanie J. McGrath, Oliver Lack, James Tisch, and Andreas Duenser. 2025. https://doi.org/10.3389/frai.2025.1582880 Measuring trust in artificial intelligence: validation of an established scale and its short form . Frontiers in Artificial Intelligence, Volume 8 - 2025

work page doi:10.3389/frai.2025.1582880 2025
[40]

Nikita Mehandru, Sweta Agrawal, Yimin Xiao, Ge Gao, Elaine Khoong, Marine Carpuat, and Niloufar Salehi. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.712 Physician detection of clinical harm in machine translation: Quality estimation aids in reliance and backtranslation identifies critical errors . In Proceedings of the 2023 Conference on Empirical Me...

work page doi:10.18653/v1/2023.emnlp-main.712 2023
[41]

Mary Nurminen. 2021. Investigating the influence of context in the use and reception of raw machine translation

2021
[42]

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vil \'e m Zouhar, Carlos Escolano, Gerard I G \'a llego, Jorge Iranzo-S \'a nchez, Ahrii Kim, Dominik Mach \'a c ek, Patricia Schmidtova, and 1 others. 2025. Hearing to translate: The effectiveness of speech modality integration into llms. arXiv preprint arXiv:2512.16378

Pith/arXiv arXiv 2025
[43]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

2023
[44]

Ricardo Rei, Jos \'e G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and Andr \'e F. T. Martins. 2022. https://doi.org/10.18653/v1/2022.wmt-1.52 COMET -22: Unbabel- IST 2022 submission for the metrics shared task . In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578--5...

work page doi:10.18653/v1/2022.wmt-1.52 2022
[45]

Ricardo Rei, Nuno M Guerreiro, Jos \'e Pombal, Jo \ a o Alves, Pedro Teixeirinha, Amin Farajian, and Andr \'e FT Martins. 2025. Tower+: Bridging generality and translation specialization in multilingual llms. arXiv preprint arXiv:2506.17080

arXiv 2025
[46]

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.213 COMET : A neural framework for MT evaluation . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685--2702, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.emnlp-main.213 2020
[47]

Elizabeth Salesky, Marcello Federico, and Antonis Anastasopoulos, editors. 2025. https://doi.org/10.18653/v1/2025.iwslt-1.0 Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025) . Association for Computational Linguistics, Vienna, Austria (in-person and online)

work page doi:10.18653/v1/2025.iwslt-1.0 2025
[48]

Ariadna Sanchez, Alice Ross, and Nina Markl. 2024. Beyond the binary: Limitations and possibilities of gender-related speech technology research. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 526--532. IEEE

2024
[49]

Beatrice Savoldi, Sara Papi, Matteo Negri, Ana Guerberof-Arenas, and Luisa Bentivogli. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1002 What the harm? quantifying the tangible impact of gender bias in machine translation with a human-centered study . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 180...

work page doi:10.18653/v1/2024.emnlp-main.1002 2024
[50]

Beatrice Savoldi, Alan Ramponi, Matteo Negri, and Luisa Bentivogli. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.700 Translation in the hands of many: Centering lay users in machine translation interactions . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13876--13889, Suzhou, China. Association for C...

work page doi:10.18653/v1/2025.emnlp-main.700 2025
[51]

Breena R Taira, Vanessa Kreger, Aristides Orue, and Lisa C Diamond. 2021. A pragmatic assessment of google translate for emergency department instructions. Journal of General Internal Medicine, 36(11):3361--3365

2021
[52]

Ungless, Sunipa Dev, Cynthia L

Eddie L. Ungless, Sunipa Dev, Cynthia L. Bennett, Rebecca Gulotta, Jasmijn Bastings, and Remi Denton. 2025. https://doi.org/10.18653/v1/2025.acl-long.1001 Amplifying trans and nonbinary voices: A community-centred harm taxonomy for LLM s . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

work page doi:10.18653/v1/2025.acl-long.1001 2025
[53]

google translate is our best friend here

Susana Valdez and Ana Guerberof-Arenas. 2025. “google translate is our best friend here” a vignette-based interview study on machine translation use for health communication. Translation Spaces, 14(2):253--276

2025
[54]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[55]

Martindale, Charlotte Vaughn, Ge Gao, and Marine Carpuat

Yimin Xiao, Yongle Zhang, Dayeon Ki, Calvin Bao, Marianna J. Martindale, Charlotte Vaughn, Ge Gao, and Marine Carpuat. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1725 Toward machine translation literacy: How lay users perceive and rely on imperfect translations . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process...

work page doi:10.18653/v1/2025.emnlp-main.1725 2025
[56]

Emmanouil Zaranis, Giuseppe Attanasio, Sweta Agrawal, and Andre Martins. 2025. https://doi.org/10.18653/v1/2025.acl-long.1228 Watching the watchers: Exposing gender disparities in machine translation quality estimation . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25261--25284, ...

work page doi:10.18653/v1/2025.acl-long.1228 2025
[57]

Yuanyuan Zhang, Yixuan Zhang, Bence Mark Halpern, Tanvina Patel, and Odette Scharenborg. 2022. Mitigating bias against non-native accents. In Interspeech, pages 3168--3172

2022
[58]

Lal Zimman. 2020. Sociophonetics. The International Encyclopedia of Linguistic Anthropology, pages 1--5

2020
[59]

Lal Zimman. 2021. Gender diversity and the voice. In The Routledge handbook of language, gender, and sexuality, pages 69--90. Routledge

2021

[1] [1]

Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, and 1 others. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743

Pith/arXiv arXiv 2025

[2] [2]

AIMA . 2025. https://aima.gov.pt/media/pages/documents/fec4d6a712-1760603125/relatorio-migracoes-e-asilo-2024.pdf Relatório de migrações e asilo 2024 . Technical report, Agência para a Integração, Migrações e Asilo (AIMA I.P.), Lisboa. Edição Digital. Coordenação: Sílvia Lopes

2025

[3] [3]

Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1188 Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21318--21340, Miami, Florida, USA. As...

work page doi:10.18653/v1/2024.emnlp-main.1188 2024

[4] [4]

Jeffrey Basoah, Daniel Chechelnitsky, Tao Long, Katharina Reinecke, Chrysoula Zerva, Kaitlyn Zhou, Mark D \' az, and Maarten Sap. 2025. Not like us, hunty: Measuring perceptions and behavioral effects of minoritized anthropomorphic cues in llms. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 710--745

2025

[5] [5]

Anol Bhattacherjee. 2001. Understanding information systems continuance: An expectation-confirmation model1. MIS quarterly, 25(3):351--370

2001

[6] [6]

Marc Brysbaert. 2019. How many participants do we have to include in properly powered experiments? a tutorial of power analysis with reference tables. Journal of cognition, 2(1):16

2019

[7] [7]

Khoong, William D

Marine Carpuat, Omri Asscher, Kalika Bali, Luisa Bentivogli, Fr \'e d \'e ric Blain, Lynne Bowker, Monojit Choudhury, Hal Daum \'e III, Kevin Duh, Ge Gao, Alvin Grissom II, Marzena Karpinska, Elaine C. Khoong, William D. Lewis, Andr \'e F. T. Martins, Mary Nurminen, Douglas W. Oard, Maja Popovic, Michel Simard, and Fran c ois Yvon. 2025. https://doi.org/1...

work page doi:10.18653/v1/2025.emnlp-main.1164 2025

[8] [8]

Gary Charness, Uri Gneezy, and Michael A Kuhn. 2012. Experimental methods: Between-subject and within-subject design. Journal of economic behavior & organization, 81(1):1--8

2012

[9] [9]

Sonia Colina. 2009. Further evidence for a functionalist approach to translation quality evaluation. Target. International Journal of Translation Studies, 21(2):235--264

2009

[10] [10]

Amin Farajian, Ant \'o nio V

M. Amin Farajian, Ant \'o nio V. Lopes, Andr \'e F. T. Martins, Sameen Maruf, and Gholamreza Haffari. 2020. https://doi.org/10.18653/v1/2020.wmt-1.3 Findings of the WMT 2020 shared task on chat translation . In Proceedings of the Fifth Conference on Machine Translation, pages 65--75, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.wmt-1.3 2020

[11] [11]

Faiha Fareez, Tishya Parikh, Christopher Wavell, Saba Shahab, Meghan Chevalier, Scott Good, Isabella De Blasi, Rafik Rhouma, Christopher McMahon, Jean-Paul Lam, and 1 others. 2022. A dataset of simulated patient-physician medical interviews with a focus on respiratory cases. Scientific Data, 9(1):313

2022

[12] [12]

Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, Andr \'e FT Martins, and Graham Neubig. 2025. Do llms understand your translations? evaluating paragraph-level mt with question answering. arXiv preprint arXiv:2504.07583

arXiv 2025

[13] [13]

Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli, Andre Martins, and Giuseppe Attanasio. 2025. https://doi.org/10.18653/v1/2025.acl-short.78 Different speech translation models encode and translate speaker gender differently . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), page...

work page doi:10.18653/v1/2025.acl-short.78 2025

[14] [14]

Susanne Fuchs and Martine Toda. 2010. Do differences in male versus female/s/reflect biological or sociophonetic factors . Turbulent sounds: An interdisciplinary guide, 21:281--302

2010

[15] [15]

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2020. https://doi.org/10.18653/v1/2020.coling-main.350 Breeding gender-aware direct speech translation systems . In Proceedings of the 28th International Conference on Computational Linguistics, pages 3951--3964, Barcelona, Spain (Online). International Committee on Computati...

work page doi:10.18653/v1/2020.coling-main.350 2020

[16] [16]

Ariana Genovese, Sahar Borna, Cesar A Gomez-Cabello, Syed Ali Haider, Srinivasagam Prabha, Antonio J Forte, and Benjamin R Veenstra. 2024. Artificial intelligence in clinical settings: a systematic review of its role in language translation and interpretation. Annals of Translational Medicine, 12(6):117

2024

[17] [17]

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. https://aclanthology.org/W13-2305/ Continuous measurement scales in human evaluation of machine translation . In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33--41, Sofia, Bulgaria. Association for Computational Linguistics

2013

[18] [18]

and Rei, Ricardo and Stigt, Daan van and Coheur, Luisa and Colombo, Pierre and Martins, Andr \'e F

Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and Andr \'e F. T. Martins. 2024. https://doi.org/10.1162/tacl_a_00683 x COMET : Transparent machine translation evaluation through fine-grained error detection . Transactions of the Association for Computational Linguistics, 12:979--995

work page doi:10.1162/tacl_a_00683 2024

[19] [19]

HyoJung Han, Kevin Duh, and Marine Carpuat. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1218 S peech QE : Estimating the quality of direct speech translation . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21852--21867, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.1218 2024

[20] [20]

Camille Harris, Chijioke Mgbahurike, Neha Kumar, and Diyi Yang. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.890 Modeling gender and dialect bias in automatic speech recognition . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15166--15184, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.findings-emnlp.890 2024

[21] [21]

Hoffman, Shane T

Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. 2023. https://doi.org/10.3389/fcomp.2023.1096257 Measures for explainable ai: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-ai performance . Frontiers in Computer Science, Volume 5 - 2023

work page doi:10.3389/fcomp.2023.1096257 2023

[22] [22]

Eduard Hovy, Margaret King, and Andrei Popescu-Belis. 2002. Principles of context-based machine translation evaluation. Machine Translation, 17(1):43--75

2002

[23] [23]

John Hutchins. 2005. Current commercial machine translation systems and computer-based translation tools: system types and their uses. International journal of translation, 17(1-2):5--38

2005

[24] [24]

ISO . 2018. https://www.iso.org/obp/ui/en/#iso:std:iso:9241:-11:ed-2:v1:en ISO 9241-11:2018 --- ergonomics of human-system interaction---part 11: Usability: Definitions and concepts . Standard, International Organization for Standardization

2018

[25] [25]

Eshin Jolly. 2018. Pymer4: Connecting r and python for linear mixed modeling. Journal of Open Source Software, 3(31):862

2018

[26] [26]

Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.35 M etric X -24: The G oogle submission to the WMT 2024 metrics shared task . In Proceedings of the Ninth Conference on Machine Translation, pages 492--504, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.wmt-1.35 2024

[27] [27]

Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, and Markus Freitag. 2025. https://doi.org/10.18653/v1/2025.wmt-1.70 M etric X -25 and G em S pan E val: G oogle T ranslate submissions to the WMT 25 evaluation shared task . In Proceedings of the Tenth Conference on Machine Translation, pages 957--9...

work page doi:10.18653/v1/2025.wmt-1.70 2025

[28] [28]

Elaine C Khoong, Eric Steinbrook, Cortlyn Brown, and Alicia Fernandez. 2019. Assessing the use of google translate for spanish and chinese translations of emergency department discharge instructions. JAMA internal medicine, 179(4):580--582

2019

[29] [29]

Dayeon Ki, Kevin Duh, and Marine Carpuat. 2025 a . https://doi.org/10.18653/v1/2025.findings-acl.899 A sk QE : Question answering as automatic evaluation for machine translation . In Findings of the Association for Computational Linguistics: ACL 2025, pages 17478--17515, Vienna, Austria. Association for Computational Linguistics

work page doi:10.18653/v1/2025.findings-acl.899 2025

[30] [30]

Dayeon Ki, Kevin Duh, and Marine Carpuat. 2025 b . https://doi.org/10.18653/v1/2025.emnlp-main.606 Should I share this translation? evaluating quality feedback for user reliance on machine translation . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12069--12092, Suzhou, China. Association for Computationa...

work page doi:10.18653/v1/2025.emnlp-main.606 2025

[31] [31]

Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ond r ej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, and 10 others. 2025. https://doi.org/10.18653...

work page doi:10.18653/v1/2025.wmt-1.22 2025

[32] [32]

Tom Kocmi, Rachel Bawden, Ond r ej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Nov \'a k, Martin Popel, and Maja Popovi \'c . 2022. https://doi.org/10.18653/v1/2022.wmt-1.1...

work page doi:10.18653/v1/2022.wmt-1.1 2022

[33] [33]

Proceedings of the National Academy of Sciences , author =

Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. 2020. https://doi.org/10.1073/pnas.1915768117 Racial disparities in automated speech recognition . Proceedings of the National Academy of Sciences, 117(14):7684--7689

work page doi:10.1073/pnas.1915768117 2020

[34] [34]

Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, and Matteo Negri. 2025. https://aclanthology.org/2025.coling-main.455/ Speech foundation models and crowdsourcing for efficient, high-quality data collection . In Proceedings of the 31st International Conference on Computational Linguistics, pages 6816--6826, Abu Dhabi, UAE. Association for Co...

2025

[35] [35]

Daniel Liebling, Katherine Heller, Samantha Robertson, and Wesley Deng. 2022. https://doi.org/10.18653/v1/2022.findings-naacl.17 Opportunities for human-centered evaluation of machine translation systems . In Findings of the Association for Computational Linguistics: NAACL 2022, pages 229--240, Seattle, United States. Association for Computational Linguistics

work page doi:10.18653/v1/2022.findings-naacl.17 2022

[36] [36]

Alexander H Liu, Andy Ehrenberg, Andy Lo, Cl \'e ment Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, and 1 others. 2025. Voxtral. arXiv preprint arXiv:2507.13264

arXiv 2025

[37] [37]

Ting Liu, Chi-kiu Lo, Elizabeth Marshman, and Rebecca Knowles. 2024. https://aclanthology.org/2024.amta-research.17/ Evaluation briefs: Drawing on translation studies for human evaluation of MT . In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 190--208, Chicago, USA. Associ...

2024

[38] [38]

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, and Hung-yi Lee. 2025. Developing instruction-following speech language model without speech instruction-tuning data. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

2025

[39] [39]

McGrath, Oliver Lack, James Tisch, and Andreas Duenser

Melanie J. McGrath, Oliver Lack, James Tisch, and Andreas Duenser. 2025. https://doi.org/10.3389/frai.2025.1582880 Measuring trust in artificial intelligence: validation of an established scale and its short form . Frontiers in Artificial Intelligence, Volume 8 - 2025

work page doi:10.3389/frai.2025.1582880 2025

[40] [40]

Nikita Mehandru, Sweta Agrawal, Yimin Xiao, Ge Gao, Elaine Khoong, Marine Carpuat, and Niloufar Salehi. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.712 Physician detection of clinical harm in machine translation: Quality estimation aids in reliance and backtranslation identifies critical errors . In Proceedings of the 2023 Conference on Empirical Me...

work page doi:10.18653/v1/2023.emnlp-main.712 2023

[41] [41]

Mary Nurminen. 2021. Investigating the influence of context in the use and reception of raw machine translation

2021

[42] [42]

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vil \'e m Zouhar, Carlos Escolano, Gerard I G \'a llego, Jorge Iranzo-S \'a nchez, Ahrii Kim, Dominik Mach \'a c ek, Patricia Schmidtova, and 1 others. 2025. Hearing to translate: The effectiveness of speech modality integration into llms. arXiv preprint arXiv:2512.16378

Pith/arXiv arXiv 2025

[43] [43]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

2023

[44] [44]

Ricardo Rei, Jos \'e G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and Andr \'e F. T. Martins. 2022. https://doi.org/10.18653/v1/2022.wmt-1.52 COMET -22: Unbabel- IST 2022 submission for the metrics shared task . In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578--5...

work page doi:10.18653/v1/2022.wmt-1.52 2022

[45] [45]

Ricardo Rei, Nuno M Guerreiro, Jos \'e Pombal, Jo \ a o Alves, Pedro Teixeirinha, Amin Farajian, and Andr \'e FT Martins. 2025. Tower+: Bridging generality and translation specialization in multilingual llms. arXiv preprint arXiv:2506.17080

arXiv 2025

[46] [46]

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.213 COMET : A neural framework for MT evaluation . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685--2702, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.emnlp-main.213 2020

[47] [47]

Elizabeth Salesky, Marcello Federico, and Antonis Anastasopoulos, editors. 2025. https://doi.org/10.18653/v1/2025.iwslt-1.0 Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025) . Association for Computational Linguistics, Vienna, Austria (in-person and online)

work page doi:10.18653/v1/2025.iwslt-1.0 2025

[48] [48]

Ariadna Sanchez, Alice Ross, and Nina Markl. 2024. Beyond the binary: Limitations and possibilities of gender-related speech technology research. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 526--532. IEEE

2024

[49] [49]

Beatrice Savoldi, Sara Papi, Matteo Negri, Ana Guerberof-Arenas, and Luisa Bentivogli. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1002 What the harm? quantifying the tangible impact of gender bias in machine translation with a human-centered study . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 180...

work page doi:10.18653/v1/2024.emnlp-main.1002 2024

[50] [50]

Beatrice Savoldi, Alan Ramponi, Matteo Negri, and Luisa Bentivogli. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.700 Translation in the hands of many: Centering lay users in machine translation interactions . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13876--13889, Suzhou, China. Association for C...

work page doi:10.18653/v1/2025.emnlp-main.700 2025

[51] [51]

Breena R Taira, Vanessa Kreger, Aristides Orue, and Lisa C Diamond. 2021. A pragmatic assessment of google translate for emergency department instructions. Journal of General Internal Medicine, 36(11):3361--3365

2021

[52] [52]

Ungless, Sunipa Dev, Cynthia L

Eddie L. Ungless, Sunipa Dev, Cynthia L. Bennett, Rebecca Gulotta, Jasmijn Bastings, and Remi Denton. 2025. https://doi.org/10.18653/v1/2025.acl-long.1001 Amplifying trans and nonbinary voices: A community-centred harm taxonomy for LLM s . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

work page doi:10.18653/v1/2025.acl-long.1001 2025

[53] [53]

google translate is our best friend here

Susana Valdez and Ana Guerberof-Arenas. 2025. “google translate is our best friend here” a vignette-based interview study on machine translation use for health communication. Translation Spaces, 14(2):253--276

2025

[54] [54]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020

[55] [55]

Martindale, Charlotte Vaughn, Ge Gao, and Marine Carpuat

Yimin Xiao, Yongle Zhang, Dayeon Ki, Calvin Bao, Marianna J. Martindale, Charlotte Vaughn, Ge Gao, and Marine Carpuat. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1725 Toward machine translation literacy: How lay users perceive and rely on imperfect translations . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process...

work page doi:10.18653/v1/2025.emnlp-main.1725 2025

[56] [56]

Emmanouil Zaranis, Giuseppe Attanasio, Sweta Agrawal, and Andre Martins. 2025. https://doi.org/10.18653/v1/2025.acl-long.1228 Watching the watchers: Exposing gender disparities in machine translation quality estimation . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25261--25284, ...

work page doi:10.18653/v1/2025.acl-long.1228 2025

[57] [57]

Yuanyuan Zhang, Yixuan Zhang, Bence Mark Halpern, Tanvina Patel, and Odette Scharenborg. 2022. Mitigating bias against non-native accents. In Interspeech, pages 3168--3172

2022

[58] [58]

Lal Zimman. 2020. Sociophonetics. The International Encyclopedia of Linguistic Anthropology, pages 1--5

2020

[59] [59]

Lal Zimman. 2021. Gender diversity and the voice. In The Routledge handbook of language, gender, and sexuality, pages 69--90. Routledge

2021