arxiv: 2606.06177 · v1 · pith:SZSWIASTnew · submitted 2026-06-04 · 💻 cs.CL · cs.HC

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

Pith reviewed 2026-06-28 01:08 UTC · model grok-4.3

classification 💻 cs.CL cs.HC
keywords speech translation · usability evaluation · user-centered framework · real-world scenarios · QA-based metrics · demographic differences · machine translation evaluation · situated evaluation

The pith

A user study finds speech translation usable in only about half of real one-to-one interactions, with large gaps between demographic groups, and finds QA-based metrics to be stronger predictors of usability than standard quality scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ouvia to test how usable speech translation feels to users when they need to convey messages across languages in actual conversations. It collected more than 1,750 English-to-Portuguese interactions through a custom web app in healthcare and everyday settings, using four speech translation systems and speakers from three English dialects and two genders. Only around half of the interactions were rated usable, with clear differences in ratings across groups. QA-based metrics aligned much more closely with those user ratings than conventional quality measures. The work argues that evaluation must shift from abstract quality scores to situated user needs if the technology is to serve communication effectively.

Core claim

Ouvia measures user-perceived usability of speech translation by placing English speakers and Portuguese speakers in one-to-one request-conveying tasks that mimic real healthcare and daily situations. Data from over 1750 mediated interactions across four systems show modern speech translation produces usable results in only around half of cases, with substantial reported differences by English dialect and gender. Among available quality metrics, question-answering evaluation predicts these usability ratings substantially better than standard approaches.

What carries the argument

Ouvia, a framework that gathers usability ratings from situated one-to-one communication tasks run inside a custom web app.

If this is right

  • Only around half of speech translation interactions receive usable ratings in realistic settings.
  • Usability ratings differ significantly across demographic groups including English dialects and genders.
  • QA-based evaluation predicts real-world usability ratings substantially better than standard quality metrics.
  • Evaluation of speech translation benefits from moving beyond decontextualized test sets to user-centered, situated designs.
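
The first two points can be made concrete with a small sketch. It computes the share of interactions rated usable, overall and per group, assuming a hypothetical cutoff (ratings of 4 or higher count as usable) and invented example ratings. Neither the cutoff nor the data come from the paper, which this summary does not quote on how "usable" is operationalized.

```python
from collections import defaultdict

# Hypothetical records: (group label, usability rating).
# Values are invented for illustration and are not data from the study.
ratings = [
    ("group_a", 5), ("group_a", 4), ("group_a", 2), ("group_a", 3),
    ("group_b", 4), ("group_b", 1), ("group_b", 2), ("group_b", 2),
]


def usable_share(records, threshold=4):
    """Return {group: fraction of ratings >= threshold}.

    The threshold is an assumption, not the paper's stated criterion.
    """
    totals = defaultdict(int)
    usable = defaultdict(int)
    for group, rating in records:
        totals[group] += 1
        if rating >= threshold:
            usable[group] += 1
    return {group: usable[group] / totals[group] for group in totals}


shares = usable_share(ratings)
overall = sum(rating >= 4 for _, rating in ratings) / len(ratings)
print(shares)
print(overall)
```

Changing `threshold` shows how sensitive an "around half" headline figure can be to the chosen definition of usable.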

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be applied to additional language pairs to test whether similar usability ceilings appear.
  • Demographic gaps may trace to uneven performance on particular dialects or speaker characteristics in current models.
  • Developers might adopt QA-style checks during system tuning to improve alignment with actual user experience.

Load-bearing premise

The custom web app and multi-phase study design with healthcare and everyday interactions accurately reflect real-world communication needs and user-perceived usability for the tested systems.

What would settle it

A follow-up study that uses different real-world scenarios or in-person interactions and reports usability rates well above or below 50 percent, or no demographic differences, would challenge the central findings.

Figures

Figures reproduced from arXiv: 2606.06177 by André F.T. Martins, Beatrice Savoldi, Daniel Chechelnitsky, Giuseppe Attanasio, Maarten Sap, Marine Carpuat, Matteo Negri.

Figure 1: Study Workflow. (1) An English-speaking sender reads aloud and records a conversation starter, which we translate with a speech translation system. (2) An L1 Portuguese receiver answers up to 10 open-text questions about the translated text. (3) A validator, fluent in both languages, assesses translation quality and flags the questions answered incorrectly in (2). (4) The initial sender rates their perceived …
Figure 2: (a) Survival curves of usability scores by language group. Each curve shows the fraction of the group whose usability rating reaches or exceeds a given score on the x-axis. Annotations at u = 4. (b) Fixed-effect estimates (95% CI) from a linear mixed model (LMM) predicting usability (u) (§4.1). Filled circles p < 0.05, diamonds 0.05 ≤ p < 0.1, hollow non-significant. Estimates annotated right of each interval. (c) Estimated ma…
Figure 3: (a) Fixed-effect estimates (95% CI) from a linear mixed model predicting validator translation score. Reference: Hindi, Man, DeSTA2, Everyday, MED-MT. Filled circles p < 0.05, diamonds 0.05 ≤ p < 0.1, hollow non-significant. (b) Model-adjusted marginal means (±95% CI) by language group and gender, marginalized over topic, model, and source. (c) Descriptive means (±95% CI) by translation model and conversat…
Figure 4: Mean QA Score and COMET (±95% CI) among records whose usability meets or exceeds each threshold value on the x-axis. Error bars are 95% confidence intervals. Annotations at thresholds 3, 4, and 5. …ity predicts usability, but not all quality signals are equally informative. Standard quality estimation metrics rank highly on leaderboards for overall quality prediction (Kocmi et al., 2025), yet in communic…
Figure 5: Causal Graph. The links establish how different factors interact in the study. Boxes describe contextual (yellow shading) and latent (gray) factors that shape the result variables (white), alongside the study participants. Help me generate a dataset of conversation starters. Each conversation takes place within a specific context, and the person is seeking a particular piece of information. The generated s…
Figure 6: Example prompt for synthetic data generation. Over-the-counter pharmacy scenario. You are a professional English to Portuguese translator, tasked with providing translations suitable for use in Portugal. Your goal is to accurately convey the meaning and nuances of the original English text while adhering to Portuguese grammar, vocabulary, and cultural sensitivities. Please translate the following Eng…
Figure 7: Prompt for Question Translation. We use it to prepare Portuguese questions in (2). …starters and 200 questions). The analysis was conducted by one author with a background in translation studies. We verified the starter content and the grounding of such questions in the starter content. No issues were found with the starters. Out of 200 questions, only one raised an ambiguity, making it arguably unanswera…
Figure 8: Prompt for Question Generation. We use the generated questions in (2). B.3 Scalar Quality Metric: For the quality assessment, we use the SQM scale that features seven labeled tick marks indicating different quality labels combining accuracy and grammatical correctness, described as follows: • 6: Perfect Meaning and Grammar: The meaning of the translation is completely consistent with the source and the surro…
Figure 10: Mean usability scores by translation model and conversation topic. Left: estimated means ±95% CI for Everyday and Health conversations. Right: per-model difference ∆u = u_Health − u_Everyday with 95% CI; positive values indicate higher usability for health-related conversations. …composite measure. Given the nested structure of the data (10 interaction-level observations per participant), we decompose variance …
Figure 11: Spearman’s ρ correlation between Usability (u) and translation quality metrics.
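
As a rough illustration of the quantities plotted in Figures 2(a) and 4, the following sketch computes the fraction of ratings at or above each score and the mean of a quality metric among records at or above a usability threshold. The function names, the 1 to 5 rating scale, and all values are assumptions made for illustration; they are not the authors' code or data.

```python
def survival_curve(scores, levels):
    """Fraction of scores that reach or exceed each level."""
    n = len(scores)
    return {level: sum(s >= level for s in scores) / n for level in levels}


def mean_metric_at_or_above(records, threshold):
    """Mean metric among (usability, metric) records with usability >= threshold.

    Returns None when no record meets the threshold.
    """
    kept = [metric for usability, metric in records if usability >= threshold]
    return sum(kept) / len(kept) if kept else None


# Invented usability ratings on an assumed 1-5 scale.
usability = [1, 2, 3, 4, 4, 5, 5, 5]
curve = survival_curve(usability, levels=range(1, 6))

# Invented (usability, metric score) pairs.
records = [(1, 0.2), (3, 0.5), (4, 0.75), (5, 1.0)]
mean_at_4 = mean_metric_at_or_above(records, threshold=4)

print(curve)
print(mean_at_4)
```

Reading `curve[4]` gives the share of interactions rated 4 or higher, which is the kind of value the "Annotations at u = 4" note in Figure 2 appears to refer to.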
Original abstract

Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ouvia, a user-centered evaluation framework for measuring the usability of speech translation (ST) outputs in real-world one-to-one communication scenarios (English speaker conveying requests to a Portuguese speaker). It describes a custom web app and multi-phase study collecting over 1,750 interactions across healthcare and everyday situations using four ST systems, with participants from three English dialects and two genders. Key claims are that modern ST is usable in only around half of interactions, with significant demographic gaps in usability ratings, and that QA-based metrics substantially outperform standard quality metrics as predictors of real-world usability.

Significance. If the empirical results hold after detailed validation, the work would provide concrete evidence that current ST systems have limited practical utility and that evaluation must move beyond decontextualized holistic scores to situated, user-centered measures that account for demographic factors. This could shift research priorities in speech translation and human-AI communication toward more ecologically valid testing protocols.

major comments (2)
  1. [§3, Study Design] The central claim, that the custom web app and multi-phase protocol accurately capture real-world usability, rests on the untested assumption that simulated healthcare and everyday interactions in the app generalize to natural face-to-face or device-mediated communication. Without a validation study or a comparison to field data, the external validity of the 50% usability rate and the demographic-gap findings is undermined.
  2. [§4, Results] The assertion that QA-based evaluation is a 'substantially stronger predictor' of usability requires reporting the specific correlation coefficients, regression models, and baseline comparisons (e.g., vs. BLEU, COMET), with confidence intervals and cross-validation details. The abstract alone does not allow assessment of whether this superiority is robust or driven by a particular operationalization of 'usable'.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from an explicit operational definition of 'usable' (e.g., a rating threshold or binary criterion) so that readers can interpret the 'around half' figure.
  2. [§4] A table or figure presenting the exact distribution of usability ratings across the four ST systems and demographic strata should be added for transparency.
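
The distribution table requested in the last minor comment could be tabulated along the following lines. This is only a sketch: the system names, stratum labels, and ratings are placeholders, and the layout is one of several reasonable choices.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (system, demographic stratum, usability rating).
# All labels and values are placeholders, not data from the paper.
rows = [
    ("system_1", "stratum_x", 4),
    ("system_1", "stratum_x", 4),
    ("system_1", "stratum_y", 2),
    ("system_2", "stratum_x", 5),
    ("system_2", "stratum_y", 1),
    ("system_2", "stratum_y", 3),
]


def rating_distribution(rows):
    """Map (system, stratum) to a Counter of rating -> number of interactions."""
    table = defaultdict(Counter)
    for system, stratum, rating in rows:
        table[(system, stratum)][rating] += 1
    return dict(table)


dist = rating_distribution(rows)
for (system, stratum), counts in sorted(dist.items()):
    cells = ", ".join(f"{rating}: {n}" for rating, n in sorted(counts.items()))
    print(f"{system} / {stratum} -> {cells}")
```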

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and describe planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: §3 (Study Design): The central claim, that the custom web app and multi-phase protocol accurately capture real-world usability, rests on the untested assumption that simulated healthcare and everyday interactions in the app generalize to natural face-to-face or device-mediated communication. Without a validation study or a comparison to field data, the external validity of the 50% usability rate and the demographic-gap findings is undermined.

    Authors: We acknowledge that the study relies on a simulated web-app environment rather than direct field observations. This design enables ethical, large-scale data collection (1,750+ interactions) with controlled variables across demographics and scenarios. We agree this introduces an assumption about generalizability. In revision we will expand the limitations section to explicitly discuss the simulation's scope and implications for the reported usability rates and demographic differences. revision: partial

  2. Referee: §4 (Results): The assertion that QA-based evaluation is a 'substantially stronger predictor' of usability requires reporting the specific correlation coefficients, regression models, and baseline comparisons (e.g., vs. BLEU, COMET), with confidence intervals and cross-validation details. The abstract alone does not allow assessment of whether this superiority is robust or driven by a particular operationalization of 'usable'.

    Authors: We agree that detailed statistical reporting is required for evaluation of the claim. The manuscript body contains the relevant analyses, but we will revise to present explicit correlation coefficients, regression models, confidence intervals, and cross-validation results in a dedicated table or subsection, ensuring readers can fully assess the comparison to BLEU, COMET, and other baselines. revision: yes
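
A minimal sketch of the kind of reporting discussed in this exchange: Spearman's ρ between usability ratings and a metric, with a percentile bootstrap confidence interval. It uses only the Python standard library. The helper names, the resampling settings, and the example values are hypothetical and do not reproduce the paper's analysis.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman's rho over record resamples."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # skip NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Invented usability ratings and metric scores for eight interactions.
usability = [1, 2, 2, 3, 4, 4, 5, 5]
qa_score = [0.1, 0.3, 0.2, 0.5, 0.6, 0.8, 0.9, 1.0]

rho = spearman(usability, qa_score)
lower, upper = bootstrap_ci(usability, qa_score)
print(rho, lower, upper)
```

Because the study collects several interactions per participant, resampling whole participants rather than individual records would likely be more appropriate; the sketch resamples records only for brevity.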

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical user study collecting 1750+ real-world interactions via a custom web app across healthcare and everyday scenarios, with usability ratings provided directly by participants. Central claims (approximately 50% usability rate, demographic gaps, and QA metrics outperforming standard ones) rest on observed data and statistical comparisons rather than any derivation, equation, fitted parameter renamed as prediction, or self-citation chain. No mathematical model, ansatz, uniqueness theorem, or self-referential definition appears in the provided text; the evaluation framework is defined by the study protocol itself and is externally falsifiable through replication of the user ratings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the work is empirical and framework-based.

Reference graph

Works this paper leans on

59 extracted references · 28 canonical work pages

  1. [1]

    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, and 1 others. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743

  2. [2]

    AIMA. 2025. https://aima.gov.pt/media/pages/documents/fec4d6a712-1760603125/relatorio-migracoes-e-asilo-2024.pdf Relatório de migrações e asilo 2024. Technical report, Agência para a Integração, Migrações e Asilo (AIMA I.P.), Lisboa. Edição Digital. Coordenação: Sílvia Lopes

  3. [3]

    Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1188 Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21318--21340, Miami, Florida, USA. As...

  4. [4]

    Jeffrey Basoah, Daniel Chechelnitsky, Tao Long, Katharina Reinecke, Chrysoula Zerva, Kaitlyn Zhou, Mark Díaz, and Maarten Sap. 2025. Not like us, hunty: Measuring perceptions and behavioral effects of minoritized anthropomorphic cues in llms. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 710--745

  5. [5]

    Anol Bhattacherjee. 2001. Understanding information systems continuance: An expectation-confirmation model. MIS quarterly, 25(3):351--370

  6. [6]

    Marc Brysbaert. 2019. How many participants do we have to include in properly powered experiments? a tutorial of power analysis with reference tables. Journal of cognition, 2(1):16

  7. [7]

    Marine Carpuat, Omri Asscher, Kalika Bali, Luisa Bentivogli, Frédéric Blain, Lynne Bowker, Monojit Choudhury, Hal Daumé III, Kevin Duh, Ge Gao, Alvin Grissom II, Marzena Karpinska, Elaine C. Khoong, William D. Lewis, André F. T. Martins, Mary Nurminen, Douglas W. Oard, Maja Popovic, Michel Simard, and François Yvon. 2025. https://doi.org/1...

  8. [8]

    Gary Charness, Uri Gneezy, and Michael A Kuhn. 2012. Experimental methods: Between-subject and within-subject design. Journal of economic behavior & organization, 81(1):1--8

  9. [9]

    Sonia Colina. 2009. Further evidence for a functionalist approach to translation quality evaluation. Target. International Journal of Translation Studies, 21(2):235--264

  10. [10]

    M. Amin Farajian, António V. Lopes, André F. T. Martins, Sameen Maruf, and Gholamreza Haffari. 2020. https://doi.org/10.18653/v1/2020.wmt-1.3 Findings of the WMT 2020 shared task on chat translation. In Proceedings of the Fifth Conference on Machine Translation, pages 65--75, Online. Association for Computational Linguistics

  11. [11]

    Faiha Fareez, Tishya Parikh, Christopher Wavell, Saba Shahab, Meghan Chevalier, Scott Good, Isabella De Blasi, Rafik Rhouma, Christopher McMahon, Jean-Paul Lam, and 1 others. 2022. A dataset of simulated patient-physician medical interviews with a focus on respiratory cases. Scientific Data, 9(1):313

  12. [12]

    Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, André FT Martins, and Graham Neubig. 2025. Do llms understand your translations? evaluating paragraph-level mt with question answering. arXiv preprint arXiv:2504.07583

  13. [13]

    Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli, Andre Martins, and Giuseppe Attanasio. 2025. https://doi.org/10.18653/v1/2025.acl-short.78 Different speech translation models encode and translate speaker gender differently. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), page...

  14. [14]

    Susanne Fuchs and Martine Toda. 2010. Do differences in male versus female /s/ reflect biological or sociophonetic factors? Turbulent sounds: An interdisciplinary guide, 21:281--302

  15. [15]

    Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2020. https://doi.org/10.18653/v1/2020.coling-main.350 Breeding gender-aware direct speech translation systems. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3951--3964, Barcelona, Spain (Online). International Committee on Computati...

  16. [16]

    Ariana Genovese, Sahar Borna, Cesar A Gomez-Cabello, Syed Ali Haider, Srinivasagam Prabha, Antonio J Forte, and Benjamin R Veenstra. 2024. Artificial intelligence in clinical settings: a systematic review of its role in language translation and interpretation. Annals of Translational Medicine, 12(6):117

  17. [17]

    Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. https://aclanthology.org/W13-2305/ Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33--41, Sofia, Bulgaria. Association for Computational Linguistics

  18. [18]

    Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and André F. T. Martins. 2024. https://doi.org/10.1162/tacl_a_00683 xCOMET: Transparent machine translation evaluation through fine-grained error detection. Transactions of the Association for Computational Linguistics, 12:979--995

  19. [19]

    HyoJung Han, Kevin Duh, and Marine Carpuat. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1218 SpeechQE: Estimating the quality of direct speech translation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21852--21867, Miami, Florida, USA. Association for Computational Linguistics

  20. [20]

    Camille Harris, Chijioke Mgbahurike, Neha Kumar, and Diyi Yang. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.890 Modeling gender and dialect bias in automatic speech recognition. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15166--15184, Miami, Florida, USA. Association for Computational Linguistics

  21. [21]

    Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. 2023. https://doi.org/10.3389/fcomp.2023.1096257 Measures for explainable ai: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-ai performance. Frontiers in Computer Science, Volume 5 - 2023

  22. [22]

    Eduard Hovy, Margaret King, and Andrei Popescu-Belis. 2002. Principles of context-based machine translation evaluation. Machine Translation, 17(1):43--75

  23. [23]

    John Hutchins. 2005. Current commercial machine translation systems and computer-based translation tools: system types and their uses. International journal of translation, 17(1-2):5--38

  24. [24]

    ISO. 2018. https://www.iso.org/obp/ui/en/#iso:std:iso:9241:-11:ed-2:v1:en ISO 9241-11:2018 --- ergonomics of human-system interaction --- part 11: Usability: Definitions and concepts. Standard, International Organization for Standardization

  25. [25]

    Eshin Jolly. 2018. Pymer4: Connecting r and python for linear mixed modeling. Journal of Open Source Software, 3(31):862

  26. [26]

    Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.35 MetricX-24: The Google submission to the WMT 2024 metrics shared task. In Proceedings of the Ninth Conference on Machine Translation, pages 492--504, Miami, Florida, USA. Association for Computational Linguistics

  27. [27]

    Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, and Markus Freitag. 2025. https://doi.org/10.18653/v1/2025.wmt-1.70 MetricX-25 and GemSpanEval: Google Translate submissions to the WMT25 evaluation shared task. In Proceedings of the Tenth Conference on Machine Translation, pages 957--9...

  28. [28]

    Elaine C Khoong, Eric Steinbrook, Cortlyn Brown, and Alicia Fernandez. 2019. Assessing the use of google translate for spanish and chinese translations of emergency department discharge instructions. JAMA internal medicine, 179(4):580--582

  29. [29]

    Dayeon Ki, Kevin Duh, and Marine Carpuat. 2025a. https://doi.org/10.18653/v1/2025.findings-acl.899 AskQE: Question answering as automatic evaluation for machine translation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17478--17515, Vienna, Austria. Association for Computational Linguistics

  30. [30]

    Dayeon Ki, Kevin Duh, and Marine Carpuat. 2025b. https://doi.org/10.18653/v1/2025.emnlp-main.606 Should I share this translation? evaluating quality feedback for user reliance on machine translation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12069--12092, Suzhou, China. Association for Computationa...

  31. [31]

    Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, and 10 others. 2025. https://doi.org/10.18653...

  32. [32]

    Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022. https://doi.org/10.18653/v1/2022.wmt-1.1...

  33. [33]

    Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. 2020. https://doi.org/10.1073/pnas.1915768117 Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684--7689

  34. [34]

    Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, and Matteo Negri. 2025. https://aclanthology.org/2025.coling-main.455/ Speech foundation models and crowdsourcing for efficient, high-quality data collection. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6816--6826, Abu Dhabi, UAE. Association for Co...

  35. [35]

    Daniel Liebling, Katherine Heller, Samantha Robertson, and Wesley Deng. 2022. https://doi.org/10.18653/v1/2022.findings-naacl.17 Opportunities for human-centered evaluation of machine translation systems. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 229--240, Seattle, United States. Association for Computational Linguistics

  36. [36]

    Alexander H Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, and 1 others. 2025. Voxtral. arXiv preprint arXiv:2507.13264

  37. [37]

    Ting Liu, Chi-kiu Lo, Elizabeth Marshman, and Rebecca Knowles. 2024. https://aclanthology.org/2024.amta-research.17/ Evaluation briefs: Drawing on translation studies for human evaluation of MT. In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 190--208, Chicago, USA. Associ...

  38. [38]

    Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, and Hung-yi Lee. 2025. Developing instruction-following speech language model without speech instruction-tuning data. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

  39. [39]

    Melanie J. McGrath, Oliver Lack, James Tisch, and Andreas Duenser. 2025. https://doi.org/10.3389/frai.2025.1582880 Measuring trust in artificial intelligence: validation of an established scale and its short form. Frontiers in Artificial Intelligence, Volume 8 - 2025

  40. [40]

    Nikita Mehandru, Sweta Agrawal, Yimin Xiao, Ge Gao, Elaine Khoong, Marine Carpuat, and Niloufar Salehi. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.712 Physician detection of clinical harm in machine translation: Quality estimation aids in reliance and backtranslation identifies critical errors. In Proceedings of the 2023 Conference on Empirical Me...

  41. [41]

    Mary Nurminen. 2021. Investigating the influence of context in the use and reception of raw machine translation

  42. [42]

    Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, and 1 others. 2025. Hearing to translate: The effectiveness of speech modality integration into llms. arXiv preprint arXiv:2512.16378

  43. [43]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

  44. [44]

    Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022. https://doi.org/10.18653/v1/2022.wmt-1.52 COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578--5...

  45. [45]

    Ricardo Rei, Nuno M Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, and André FT Martins. 2025. Tower+: Bridging generality and translation specialization in multilingual llms. arXiv preprint arXiv:2506.17080

  46. [46]

    Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.213 COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685--2702, Online. Association for Computational Linguistics

  47. [47]

    Elizabeth Salesky, Marcello Federico, and Antonis Anastasopoulos, editors. 2025. https://doi.org/10.18653/v1/2025.iwslt-1.0 Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025). Association for Computational Linguistics, Vienna, Austria (in-person and online)

  48. [48]

    Ariadna Sanchez, Alice Ross, and Nina Markl. 2024. Beyond the binary: Limitations and possibilities of gender-related speech technology research. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 526--532. IEEE

  49. [49]

    Beatrice Savoldi, Sara Papi, Matteo Negri, Ana Guerberof-Arenas, and Luisa Bentivogli. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1002 What the harm? quantifying the tangible impact of gender bias in machine translation with a human-centered study. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 180...

  50. [50]

    Beatrice Savoldi, Alan Ramponi, Matteo Negri, and Luisa Bentivogli. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.700 Translation in the hands of many: Centering lay users in machine translation interactions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13876--13889, Suzhou, China. Association for C...

  51. [51]

    Breena R Taira, Vanessa Kreger, Aristides Orue, and Lisa C Diamond. 2021. A pragmatic assessment of google translate for emergency department instructions. Journal of General Internal Medicine, 36(11):3361--3365

  52. [52]

    Eddie L. Ungless, Sunipa Dev, Cynthia L. Bennett, Rebecca Gulotta, Jasmijn Bastings, and Remi Denton. 2025. https://doi.org/10.18653/v1/2025.acl-long.1001 Amplifying trans and nonbinary voices: A community-centred harm taxonomy for LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

  53. [53]

    Susana Valdez and Ana Guerberof-Arenas. 2025. “google translate is our best friend here” a vignette-based interview study on machine translation use for health communication. Translation Spaces, 14(2):253--276

  54. [54]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

  55. [55]

    Yimin Xiao, Yongle Zhang, Dayeon Ki, Calvin Bao, Marianna J. Martindale, Charlotte Vaughn, Ge Gao, and Marine Carpuat. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1725 Toward machine translation literacy: How lay users perceive and rely on imperfect translations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process...

  56. [56]

    Emmanouil Zaranis, Giuseppe Attanasio, Sweta Agrawal, and Andre Martins. 2025. https://doi.org/10.18653/v1/2025.acl-long.1228 Watching the watchers: Exposing gender disparities in machine translation quality estimation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25261--25284, ...

  57. [57]

    Yuanyuan Zhang, Yixuan Zhang, Bence Mark Halpern, Tanvina Patel, and Odette Scharenborg. 2022. Mitigating bias against non-native accents. In Interspeech, pages 3168--3172

  58. [58]

    Lal Zimman. 2020. Sociophonetics. The International Encyclopedia of Linguistic Anthropology, pages 1--5

  59. [59]

    Lal Zimman. 2021. Gender diversity and the voice. In The Routledge handbook of language, gender, and sexuality, pages 69--90. Routledge