arxiv: 2604.24794 · v1 · submitted 2026-04-25 · 💻 cs.CR · cs.AI· cs.CY· cs.ET· cs.HC

Recognition: unknown

V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data

Tanusree Sharma , Anish Krishnagiri , Lili Dudas , Ahmed Adnan , Visar Berisha

Authors on Pith no claims yet

Pith reviewed 2026-05-08 07:42 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CYcs.ETcs.HC

keywords synthetic voice generationrisk taxonomyvoice synthesis risksAI incident analysisvoice data privacythreat modelingempirical risk assessment

0 comments

The pith

The V.O.I.C.E taxonomy classifies risks from synthetic voice generation by drawing on empirical incidents, reports, and discussions to show how harms emerge and interact with exposure, visibility, and legal protections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative voice models introduce privacy, security, and governance risks that uniform threat models do not capture well. This paper presents the V.O.I.C.E taxonomy, built from 569 incidents in major databases, 1067 direct reports from diverse U.S. participants, and 2221 Reddit discussions. It models how risks arise and differ according to a person's degree of exposure, social visibility, and available legal protections. A reader would care because unconsented voice data reuse now affects voice actors, public figures, and ordinary people in distinct ways. The taxonomy supplies a grounded structure for anticipating and addressing these harms.

Core claim

The central claim is that a taxonomy named V.O.I.C.E (Voice, Ownership, Identity, Control, Expression), derived from multi-source real-world data, explicitly models the emergence of synthetic voice risks and their interactions with contextual factors such as degree of exposure, social visibility, and the availability of legal protections across affected groups, thereby filling gaps in existing uniform threat models.

What carries the argument

The V.O.I.C.E taxonomy, which stands for Voice, Ownership, Identity, Control, Expression, functions as the organizing framework that categorizes risks and traces their dependence on contextual factors like exposure and legal protections.

Load-bearing premise

The collected incidents, reports, and discussions are representative enough to produce a comprehensive taxonomy that captures all major risk interactions with exposure, visibility, and legal protections.

What would settle it

Discovery of a major synthetic voice risk scenario that cannot be placed in any V.O.I.C.E category or that does not interact with exposure, visibility, or legal protections in the manner the taxonomy predicts.

Figures

Figures reproduced from arXiv: 2604.24794 by Ahmed Adnan, Anish Krishnagiri, Lili Dudas, Tanusree Sharma, Visar Berisha.

**Figure 1.** Figure 1: Three-tier taxonomy of Voice Generation & Synthesis risks. This visualizes the hierarchy, mapping view at source ↗

read the original abstract

As generative voice models are rapidly advancing in both capabilities and public utilization, the unconsented collection, reuse, and synthesis of voice data are introducing new classes of privacy, security and governance risk that are poorly captured by existing, largely uniform threat models. To fill the gap, we present V.O.I.C.E, a taxonomy of voice generation risk grounded in a multi-source threat modeling effort with 569 incidents from major AI incident database, FTC and Internet Crime Complaint Center (IC3); 1067 direct incident reports from U.S. based participants across diverse groups (including voice actors, internet personalities, political personnel, and general public); and 2,221 Reddit discussions. Grounded in real-world data, our taxonomy explicitly models how risk emerges, interact with contextual factors such as degree of exposure, social visibility, and the availability of legal protections for various affected groups.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a data-driven V.O.I.C.E taxonomy for synthetic voice risks from incidents and discussions, but the derivation steps remain too thin to evaluate its strength.

read the letter

The main thing to know is that this paper puts forward a new taxonomy called V.O.I.C.E for risks tied to generative voice models. It pulls from 569 documented incidents, 1067 direct reports from U.S. participants including voice actors and public figures, plus 2221 Reddit threads, then maps how those risks shift with exposure, visibility, and legal protections. That empirical base is the clearest advance over the usual high-level threat models in this area. The categories line up with real voice-specific issues like identity and ownership, and the effort to tie risks to context factors gives it more practical flavor than abstract lists. The data collection itself shows some care in mixing official databases with user reports and public forums. That is a step forward for anyone tracking AI harms in HCI or security work. The soft spot sits in the methods. The abstract lists the sources but gives no account of how the data was coded, how categories were refined, or what checks were done for bias or completeness. Reddit threads in particular can skew toward certain voices, and without details on validation or testing the taxonomy against fresh cases, the claims about risk interactions read as asserted rather than demonstrated. The U.S. tilt in the reports also leaves open how well it travels elsewhere. This is worth a look for researchers working on voice AI policy, privacy design, or empirical risk taxonomies. A reader who wants concrete examples to build on would get value from the data summary even if the full analysis needs work. It deserves a serious referee because the topic is timely and the multi-source effort is substantive, though reviewers will need to press on the missing derivation details. I would send it to review with a request to expand the analysis section.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces V.O.I.C.E, a five-category taxonomy (Voice, Ownership, Identity, Control, Expression) for risks arising from synthetic voice generation. It is constructed from a multi-source empirical effort comprising 569 incidents drawn from AI incident databases, FTC, and IC3 records; 1,067 direct reports from U.S.-based participants across voice actors, internet personalities, political figures, and the general public; and 2,221 Reddit discussions. The central claim is that this taxonomy explicitly models how risks emerge and interact with contextual factors such as degree of exposure, social visibility, and availability of legal protections for different affected groups.

Significance. If the empirical construction and interaction modeling are substantiated with transparent methods, the taxonomy would represent a meaningful advance over existing uniform threat models in AI security and privacy. The scale and diversity of the data sources—particularly the inclusion of reports from voice professionals and public figures—provide a concrete empirical base that could inform policy, detection tools, and governance frameworks. This data-driven approach is a clear strength relative to purely conceptual taxonomies in the field.

major comments (2)

[§3 and §4] §3 Data Collection and §4 Taxonomy Derivation: The manuscript provides counts of incidents, reports, and discussions but contains no description of the thematic analysis or coding procedure used to map these sources onto the five V.O.I.C.E categories, nor any mention of inter-rater reliability, duplicate handling, or bias mitigation. This absence directly undermines the central claim that the taxonomy is 'grounded in real-world data' and 'explicitly models' risk emergence.
[§5] §5 Risk Interaction Analysis: The abstract asserts that the taxonomy models interactions between risks and contextual factors (exposure, visibility, legal protections), yet no concrete examples, tables, or diagrams illustrate these interactions for specific groups (e.g., how legal protections differentially affect voice actors versus the general public). Without such evidence the interaction-modeling claim remains unsupported.

minor comments (2)

[Abstract] Abstract: The V.O.I.C.E acronym is expanded but the paper does not provide even one-sentence definitions of each category, which would improve immediate readability.
[Related Work] Related Work: The discussion of prior voice-synthesis risk literature is brief; adding explicit comparison to existing taxonomies (e.g., those focused on deepfakes or biometric privacy) would clarify the claimed novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important opportunities to strengthen the transparency and evidentiary support in our manuscript. We address each major comment below and commit to revisions that will clarify our methods and substantiate our claims without altering the core contributions.

read point-by-point responses

Referee: [§3 and §4] §3 Data Collection and §4 Taxonomy Derivation: The manuscript provides counts of incidents, reports, and discussions but contains no description of the thematic analysis or coding procedure used to map these sources onto the five V.O.I.C.E categories, nor any mention of inter-rater reliability, duplicate handling, or bias mitigation. This absence directly undermines the central claim that the taxonomy is 'grounded in real-world data' and 'explicitly models' risk emergence.

Authors: We agree that the absence of a methods description for the thematic analysis is a significant omission that weakens the presentation of our empirical grounding. In the revised manuscript, we will insert a dedicated subsection in §4 that details the iterative coding process used to map the 569 incidents, 1,067 reports, and 2,221 discussions onto the V.O.I.C.E categories. This will include the initial open coding approach, how duplicates were identified and resolved across sources, and steps taken to mitigate researcher bias through team review. Although formal inter-rater reliability statistics were not calculated in the original exploratory analysis, we will explicitly describe the collaborative validation process and acknowledge this as a limitation. These additions will directly support the claim that the taxonomy is grounded in real-world data. revision: yes
Referee: [§5] §5 Risk Interaction Analysis: The abstract asserts that the taxonomy models interactions between risks and contextual factors (exposure, visibility, legal protections), yet no concrete examples, tables, or diagrams illustrate these interactions for specific groups (e.g., how legal protections differentially affect voice actors versus the general public). Without such evidence the interaction-modeling claim remains unsupported.

Authors: We concur that §5 currently provides insufficient concrete illustrations of the risk interactions with contextual factors, leaving the abstract's claims under-supported. In the revision, we will expand §5 with specific examples drawn from the empirical sources, a new table that contrasts differential impacts (such as legal protections for voice actors versus the general public), and a diagram showing how exposure, visibility, and legal context modulate the V.O.I.C.E categories. These elements will be tied directly to the collected data to make the interaction modeling explicit and verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical taxonomy from external data sources

full rationale

The paper constructs the V.O.I.C.E taxonomy directly from multi-source external data (569 incidents from public databases, 1067 participant reports, 2221 Reddit discussions). No equations, fitted parameters, predictions, or self-referential derivations appear. The central claim is an inductive classification grounded in independent real-world sources rather than any reduction to the paper's own inputs or prior self-citations. This is a standard non-circular empirical effort.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that the collected incidents and discussions sufficiently represent real-world voice synthesis risks and that the five categories plus contextual factors form a complete model.

axioms (1)

domain assumption Multi-source incident and report data is representative of the full range of synthetic voice risks and their interactions with exposure, visibility, and legal protections.
Invoked to support the claim that the taxonomy is 'grounded in real-world data' and 'explicitly models how risk emerges'.

invented entities (1)

V.O.I.C.E taxonomy (Voice, Ownership, Identity, Control, Expression categories) no independent evidence
purpose: To structure and categorize the identified risks.
New framework defined by the authors from the empirical data.

pith-pipeline@v0.9.0 · 5482 in / 1326 out tokens · 75751 ms · 2026-05-08T07:42:42.395684+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

94 extracted references · 16 canonical work pages · 1 internal anchor

[1]

Biometric Information Privacy Act
[2]

Minnesota Consumer Privacy Act
[3]

Nurture Originals, Foster Art, and Keep Entertainment Safe Act of 2025

2025
[4]

Require the Exposure of AI–Led Political Advertisements Act
[5]

Stop Spying Bosses Act
[6]

Strengthening Artificial intelligence Normalization and Diffusion By Oversight and eXperimentation Act
[7]

Texas Responsible Artificial Intelligence Governance Act
[8]

Sound check: Auditing audio datasets

William Agnew,JuliaBarnett,AnnieChu,RachelHong,Michael Feffer, Robin Netzorg, Harry H Jiang, Ezra Awumey, and Sauvik Das. Sound check: Auditing audio datasets. arXiv preprint arXiv:2410.13114, 2024

work page arXiv 2024
[9]

Artificial intelligence risk management framework (ai rmf 1.0).URL: https://nvlpubs.nist

NIST AI. Artificial intelligence risk management framework (ai rmf 1.0).URL: https://nvlpubs.nist. gov/nistpubs/ai/nist.ai, pages 100–1, 2023

2023
[10]

Resemble ai: Generative voice cloning.https: //www.resemble.ai, 2024

Resemble AI. Resemble ai: Generative voice cloning.https: //www.resemble.ai, 2024. Accessed: 2024-05-20

2024
[11]

Artificial intelligence incident database (aiid)

AI Incident Database. Artificial intelligence incident database (aiid). https://incidentdatabase.ai/ , 2026. Incident repository indexing real-world AI harms and near-harms

2026
[12]

TheriseoftheAI-clonedvoicescam

JeffreyM.Allen. TheriseoftheAI-clonedvoicescam. American Bar Association, Voice of Experience, September 2025

2025
[13]

Areviewofmodernaudio deepfake detection methods: challenges and future directions

ZaynabAlmutairiandHebahElgibreen. Areviewofmodernaudio deepfake detection methods: challenges and future directions. Algorithms, 15(5):155, 2022

2022
[14]

Amazon Alexa.https://alexa.amazon.com/,2024

Amazon. Amazon Alexa.https://alexa.amazon.com/,2024. Accessed: 2024

2024
[15]

Deep voice 2: multi-speaker neural text-to-speech

SercanÖ.Arık,GregoryDiamos,AndrewGibiansky,JohnMiller, KainanPeng,WeiPing,JonathanRaiman,andYanqiZhou. Deep voice 2: multi-speaker neural text-to-speech. NIPS’17, page 2966–2974, Red Hook, NY, USA, 2017. Curran Associates Inc

2017
[16]

Athleticdirectorusedaitoframeprincipalwith racist remarks, police say, 2024

AssociatedPress. Athleticdirectorusedaitoframeprincipalwith racist remarks, police say, 2024. Accessed: 2026-02-05

2024
[17]

Why speech deepfake detectors won’t generalize: The limits of detection in an open world.arXivpreprint arXiv:2509.20405, 2025

Visar Berisha, Prad Kadambi, and Isabella Lenz. Why speech deepfake detectors won’t generalize: The limits of detection in an open world.arXivpreprint arXiv:2509.20405, 2025

work page arXiv 2025
[18]

Tortoise-TTS: A Multi-Voice TTS System.https: //github.com/neonbjb/tortoise- tts , 2022

James Betker. Tortoise-TTS: A Multi-Voice TTS System.https: //github.com/neonbjb/tortoise- tts , 2022. Accessed: 2024

2022
[19]

Using thematic analysis in psychology.Qualitativeresearchin psychology, 3(2):77–101, 2006

Virginia Braun and Victoria Clarke. Using thematic analysis in psychology.Qualitativeresearchin psychology, 3(2):77–101, 2006

2006
[20]

arXiv:1802.07228, 2018

Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, et al. The malicious use of artificial intel- ligence: Forecasting, prevention, and mitigation.arXivpreprint arXiv:1802.07228, 2018

work page arXiv 2018
[21]

Gendershades:Intersectional accuracy disparities in commercial gender classification

JoyBuolamwiniandTimnitGebru. Gendershades:Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018

2018
[22]

Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice con- version for everyone

Edresson Casanova, Julian Weber, Christopher D Shulby, Ar- naldo Candido Junior, Eren Gölge, and Moacir A Ponti. Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice con- version for everyone. InInternational conference on machine learning, pages 2709–2720. PMLR, 2022

2022
[23]

AI Watch: Global regulatory tracker - United States | White & Case LLP — whitecase.com.https://www

White & Case. AI Watch: Global regulatory tracker - United States | White & Case LLP — whitecase.com.https://www. whitecase.com/insight-our-thinking/ai-watch-globa l-regulatory-tracker-united-states , 2025. [Accessed 29-01-2026]

2025
[24]

Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei- Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio.arXiv preprint arXiv:2106.06909, 2021

work page arXiv 2021
[25]

One-shot voiceconversionbyseparatingspeakerandcontentrepresentations with instance normalization.arXiv preprint arXiv:1904.05742, 2019

Ju-chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee. One-shot voiceconversionbyseparatingspeakerandcontentrepresentations with instance normalization.arXiv preprint arXiv:1904.05742, 2019

work page arXiv 1904
[26]

Between truth and power

Julie E Cohen. Between truth and power. Oxford University Press, 2019

2019
[27]

Congress

U.S. Congress. H.r.5388 – american artificial intelligence lead- ership and uniformity act.https://www.congress.gov/bil l/119th-congress/house-bill/5388 , 2025. Introduced in 119th Congress; referred to committee

2025
[28]

Coqui TTS: Deep Learning Toolkit for Text-to-Speech

Coqui AI. Coqui TTS: Deep Learning Toolkit for Text-to-Speech. https://github.com/coqui-ai/TTS, 2024. Accessed: 2024

2024
[29]

The us approach to ai regulation: federal laws, policies,andstrategiesexplained

Tatevik Davtyan. The us approach to ai regulation: federal laws, policies,andstrategiesexplained. JournalofLaw,Technology,& the Internet, 16(2):223, 2025

2025
[30]

Defending speaker verification against deepfake: A multi-step approach

Thien-Phuc Doan, Hung Dinh-Xuan, Inho Kim, Woongjae Lee, SeongKyu Han, and Souhwan Jung. Defending speaker verification against deepfake: A multi-step approach. In Proceedingsofthe1stACMWorkshoponDeepfake,Deception, andDisinformationSecurity, pages 44–51, 2025

2025
[31]

Oxford Uni- versity Press, 2020

Sam Dubberley, Alexa Koenig, and Daragh Murray.Digital witness: using open source information for human rights investigation,documentation, and accountability. Oxford Uni- versity Press, 2020

2020
[32]

Trump says he’ll sign executive order blocking state AI regulations, despite safety fears | CNN Business

Clare Duffy. Trump says he’ll sign executive order blocking state AI regulations, despite safety fears | CNN Business
[33]

ElevenLabs: Prime Voice AI.https://elevenla bs.io/, 2024

ElevenLabs. ElevenLabs: Prime Voice AI.https://elevenla bs.io/, 2024. Accessed: 2024

2024
[34]

General data protection regulation (gdpr).https: //gdpr-info.eu/, 2018

EU GDPR. General data protection regulation (gdpr).https: //gdpr-info.eu/, 2018. Official consolidated text of the EU data protection regulation; accessed 2026-01-29

2018
[35]

European Commission. Regulation (eu) 2024/1689 of the eu- ropean parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence and amending regulations(ec)no300/2008. http://data.europa.eu/eli/r eg/2024/1689/oj, 2024. Addition to the GDPR about AI from the EU

2024
[36]

Preventing Woke AI in the Federal Government

Executive Office of the President. Preventing Woke AI in the Federal Government
[37]

Promoting the Export of the American AI Technology Stack

Executive Office of the President. Promoting the Export of the American AI Technology Stack
[38]

Creating, using, misusing, and detecting deep fakes

Hany Farid. Creating, using, misusing, and detecting deep fakes. Journal ofOnlineTrustandSafety, 2022

2022
[39]

Internet crime complaint center (ic3), n.d

Federal Bureau of Investigation. Internet crime complaint center (ic3), n.d. Accessed: 2026-02-05. 14

2026
[40]

Ftc voice cloning challenge.https: //www.ftc.gov/news-events/contests/ftc-voice-clo ning-challenge , 2026

Federal Trade Commission. Ftc voice cloning challenge.https: //www.ftc.gov/news-events/contests/ftc-voice-clo ning-challenge , 2026. FTC initiative addressing consumer harms from AI-enabled voice cloning

2026
[41]

Timit acoustic-phonetic continuous speech corpus.(NoTitle), 1993

JohnSGarofolo,LoriFLamel,WilliamMFisher,DavidSPallett, Nancy L Dahlgren, Victor Zue, and Jonathan G Fiscus. Timit acoustic-phonetic continuous speech corpus.(NoTitle), 1993

1993
[42]

Google Cloud Text-to-Speech.https://cloud.goog le.com/text-to-speech, 2024

Google. Google Cloud Text-to-Speech.https://cloud.goog le.com/text-to-speech, 2024. Accessed: 2024

2024
[43]

Pressprotect: Helping journalists navigate social media in the face of online harassment

Catherine Han, Anne Li, Deepak Kumar, and Zakir Durumeric. Pressprotect: Helping journalists navigate social media in the face of online harassment. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW2):1–34, 2024

2024
[44]

Not myvoice!ataxonomyofethicalandsafetyharmsofspeechgener- ators

Wiebke Hutiri, Orestis Papakyriakopoulos, and Alice Xiang. Not myvoice!ataxonomyofethicalandsafetyharmsofspeechgener- ators. InProceedingsofthe2024ACMConferenceonFairness, Accountability,andTransparency,FAccT’24,page359–376,New York, NY, USA, 2024. Association for Computing Machinery

2024
[45]

Transfer learning from speaker verification to multispeaker text-to-speech synthesis

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in neural informationprocessingsystems, 31, 2018

2018
[46]

Libri-light: A benchmark for asr with limited or no supervision

Jacob Kahn, Morgane Riviere, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky,RonanCollobert,ChristianFuegen,etal. Libri-light: A benchmark for asr with limited or no supervision. InICASSP 2020-2020IEEEInternationalConferenceonAcoustics,Speech andSignalProcessing(ICASSP), pages 7669–7673. IEEE, 2020

2020
[47]

Librivox: Free public domain audiobooks, 2014

Jodi Kearns. Librivox: Free public domain audiobooks, 2014

2014
[48]

Conditional varia- tionalautoencoderwithadversariallearningforend-to-endtext-to- speech

Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional varia- tionalautoencoderwithadversariallearningforend-to-endtext-to- speech. InInternationalConferenceonMachineLearning,pages 5530–5540. PMLR, 2021

2021
[49]

Thesimpleeconomicsofcybercrimes

NirKshetri. Thesimpleeconomicsofcybercrimes. IEEESecurity &Privacy, 4(1):33–39, 2006

2006
[50]

Voicebox: Text-guided multilingual univer- sal speech generation at scale.Advancesin neural information processingsystems, 36:14005–14034, 2023

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, RashelMoritz,MaryWilliamson,VimalManohar,YossiAdi,Jay Mahadeokar, et al. Voicebox: Text-guided multilingual univer- sal speech generation at scale.Advancesin neural information processingsystems, 36:14005–14034, 2023

2023
[51]

A hierarchical speaker representation framework for one-shot singing voice conversion

Xu Li, Shansong Liu, and Ying Shan. A hierarchical speaker representation framework for one-shot singing voice conversion. arXivpreprint arXiv:2206.13762, 2022

work page arXiv 2022
[52]

Fish-speech:Leveraginglargelan- guage models for advanced multilingual text-to-speech synthesis,

ShijiaLiao,YuxuanWang,TianyuLi,YifanCheng,RuoyiZhang, RongzhiZhou,andYijinXing. Fish-speech:Leveraginglargelan- guage models for advanced multilingual text-to-speech synthesis,
[53]

URLhttps://arxiv.org/abs/2411.01156

work page arXiv
[54]

Autoregressive diffusion transformer for text-to-speech synthesis

Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, and Haizhou Li. Autoregressive diffusion transformer for text-to-speech synthesis. arXivpreprint arXiv:2406.05551, 2024

work page arXiv 2024
[55]

Warning: Humans cannot reliably detect speech deepfakes.Plos one, 18(8):e0285333, 2023

Kimberly T Mai, Sergi Bray, Toby Davies, and Lewis D Griffin. Warning: Humans cannot reliably detect speech deepfakes.Plos one, 18(8):e0285333, 2023

2023
[56]

The creation and detection of deepfakes: A survey.ACMcomputingsurveys(CSUR),54(1):1– 41, 2021

Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey.ACMcomputingsurveys(CSUR),54(1):1– 41, 2021

2021
[57]

The economics of cybersecurity: Principles and policy options

Tyler Moore. The economics of cybersecurity: Principles and policy options. International Journal of Critical Infrastructure Protection, 3(3-4):103–117, 2010

2010
[58]

Does Audio Deepfake Detection Generalize?

Nicolas M Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, and Konstantin Böttinger. Does audio deepfake detec- tion generalize?arXiv preprint arXiv:2203.16263, 2022

work page arXiv 2022
[59]

Humanper- ceptionofaudiodeepfakes

NicolasMMüller,KarlaPizzi,andJenniferWilliams. Humanper- ceptionofaudiodeepfakes. In Proceedingsofthe1stinternational workshopondeepfakedetectionforaudiomultimedia,pages85– 91, 2022

2022
[60]

V oxceleb: a large-scale speaker identification dataset,

Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Vox- celeb: a large-scale speaker identification dataset.arXivpreprint arXiv:1706.08612, 2017

work page arXiv 2017
[61]

Aim: Ai incidents and hazards monitor (incidents portal).https://oecd.ai/en/incidents,

OECD AI Policy Observatory. Aim: Ai incidents and hazards monitor (incidents portal).https://oecd.ai/en/incidents,
[62]

OECD portal documenting AI incidents and hazards
[63]

OpenAI Text-to-Speech API.https://platform.o penai.com/docs/guides/text-to-speech, 2024

OpenAI. OpenAI Text-to-Speech API.https://platform.o penai.com/docs/guides/text-to-speech, 2024. Accessed: 2024

2024
[64]

Librispeech: an asr corpus based on public domain au- dio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khu- danpur. Librispeech: an asr corpus based on public domain au- dio books. In2015 IEEE international conferenceon acoustics, speechandsignalprocessing(ICASSP),pages5206–5210.IEEE, 2015

2015
[65]

InProceedings of 5th International ConferenceonSpeechProsody(SpeechProsody2010),Chicago, Illinois, 2010

KishorePrahallad,EVeeraRaghavendra,andAlanWBlack.Semi- supervised learning of acoustic driven prosodic phrase breaks for text-to-speech systems. InProceedings of 5th International ConferenceonSpeechProsody(SpeechProsody2010),Chicago, Illinois, 2010

2010
[66]

Openvoice: Versatile instant voice cloning

ZengyiQin,WenliangZhao,XuminYu,andXinSun. Openvoice: Versatileinstantvoicecloning. arXivpreprintarXiv:2312.01479, 2023

work page arXiv 2023
[67]

Qwen3-tts technical report.https://github.com /QwenLM/Qwen3-TTS/blob/main/assets/Qwen3_TTS.pdf

Qwen Team. Qwen3-tts technical report.https://github.com /QwenLM/Qwen3-TTS/blob/main/assets/Qwen3_TTS.pdf . GitHub repository PDF, accessed Feb. 5, 2026

2026
[68]

Robust speech recognition via large-scale weak supervision

AlecRadford,JongWookKim,TaoXu,GregBrockman,Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machinelearning, pages 28492–28518. PMLR, 2023

2023
[69]

Millions}of people are watching{you

Patrawat Samermit, Anna Turner, Patrick Gage Kelley, Tara Matthews, Vanessia Wu, Sunny Consolvo, and Kurt Thomas. {“Millions}of people are watching{you”}: Understanding the {Digital-Safety}needsandpracticesofcreators. In 32ndUSENIX Security Symposium(USENIXSecurity 23), pages 5629–5645, 2023

2023
[70]

Human perception of audio deepfakes: the roleoflanguageandspeakingstyle

Eugenia San Segundo, Aurora López-Jareño, Xin Wang, and Junichi Yamagishi. Human perception of audio deepfakes: the roleoflanguageandspeakingstyle. AvailableatSSRN5954496, 2025

2025
[71]

Verifying humanness: Personhood credentials for the digital identity crisis

Tanusree Sharma. Verifying humanness: Personhood credentials for the digital identity crisis. 2025

2025
[72]

An analysis of phish- ing emails and how the human vulnerabilities are exploited

Tanusree Sharma and Masooda Bashir. An analysis of phish- ing emails and how the human vulnerabilities are exploited. In International Conference on Applied Human Factors and Ergonomics, pages 49–55. Springer, 2020

2020
[73]

Aligning ai with public values: Deliberation and decision-making for governing multimodal llms in political video analysis

Tanusree Sharma, Yujin Potter, Zachary Kilhoffer, Yun Huang, Dawn Song, and Yang Wang. Aligning ai with public values: Deliberation and decision-making for governing multimodal llms in political video analysis. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 2345– 2359, 2025

2025
[74]

Disability- first design and creation of a dataset showing private visual information collected with people who are blind

Tanusree Sharma, Abigale Stangl, Lotus Zhang, Yu-Yun Tseng, InanXu,LeahFindlater,DannaGurari,andYangWang. Disability- first design and creation of a dataset showing private visual information collected with people who are blind. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2023. 15

2023
[75]

be- fore, i asked my mom, now i ask chatgpt

TanusreeSharma,Yu-YunTseng,LotusZhang,AyaeIde,KellyAv- ery Mack, Leah Findlater, Danna Gurari, and Yang Wang. “be- fore, i asked my mom, now i ask chatgpt”: Visual privacy management with generative ai for blind and low-vision peo- ple. InProceedingsofthe27thInternationalACMSIGACCESS Conferenceon Computersand Accessibility, pages 1–14, 2025

2025
[76]

Prac3(privacy, reputation, accountability, consent, credit, compensation): Voice actors in ai data-economy

TanusreeSharma,YihaoZhou,andVisarBerisha. Prac3(privacy, reputation, accountability, consent, credit, compensation): Voice actors in ai data-economy. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 2360– 2372, 2025

2025
[77]

Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction

Renee Shelby, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul Nicholas, N’Mah Yilla-Akbari, Jess Gallegos, Andrew Smart, Emilio Garcia, et al. Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. InProceedings of the 2023 AAAI/ACMConference onAI, Ethics,andSociety, pages 723–741, 2023

2023
[78]

Tacolm: Gated attention equipped codec language model are efficient zero-shot text to speech synthesizers

Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, and Xie Chen. Tacolm: Gated attention equipped codec language model are efficient zero-shot text to speech synthesizers. arXivpreprint arXiv:2406.15752, 2024

work page arXiv 2024
[79]

Data is the new what? popularmetaphors&professionalethicsinemergingdataculture

Luke Stark and Anna Lauren Hoffmann. Data is the new what? popularmetaphors&professionalethicsinemergingdataculture. Journal of Cultural Analytics, 4(1), 2019

2019
[80]

Natu- ralspeech: End-to-end text-to-speech synthesis with human-level quality

XuTan,JiaweiChen,HaoheLiu,JianCong,ChenZhang,Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, et al. Natu- ralspeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4234–4245, 2024

2024

Showing first 80 references.