pith. machine review for the scientific record.

arxiv: 2605.02223 · v1 · submitted 2026-05-04 · 💻 cs.SD · cs.CV

Recognition: unknown

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:03 UTC · model grok-4.3

classification 💻 cs.SD cs.CV
keywords speech inpainting forensics · multi-region tampering localization · deepfake detection · audio tampering · MIST dataset · iterative segment analysis · segment-level F1 metric · voice cloning

The pith

Partial speech inpainting at word granularity evades existing deepfake detectors, but a new iterative method recovers the tampered regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the MIST dataset of multilingual speech utterances containing one to three independently inpainted word segments that form only two to seven percent of the total content, generated by LLM-guided semantic replacement and neural voice cloning. It presents the ISA framework that runs coarse-to-fine sliding-window classification, proposes candidate regions while tolerating gaps, and refines boundaries to locate all tampered segments without knowing their number in advance. Existing utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to these partially altered samples, showing that current detectors miss this form of manipulation. The work also introduces the SF1@tau metric, which scores both the accuracy of the number of regions found and the precision of their time locations using temporal IoU matching. This addresses a gap where small meaning-altering edits can bypass binary detection while remaining forensically relevant.

Core claim

Existing deepfake detectors trained on fully synthesized speech assign near-zero fake probability to MIST utterances containing only 2-7% manipulated content at the word level. The ISA method, which performs iterative segment analysis via gap-tolerant region proposal and boundary refinement, recovers all tampered regions without prior knowledge of their count and outperforms non-iterative baselines when evaluated with the SF1@tau segment-level F1 metric based on temporal IoU matching.

What carries the argument

ISA (Iterative Segment Analysis) framework that applies coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to localize an unknown number of inpainted segments.
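As a rough illustration of Stages 1-2 (coarse scan plus gap-tolerant merging), the following sketch uses the hyperparameters reported in the Figure 9 caption (W=0.5 s, S=0.25 s, δ=0.6, g=2); `score_fn` is a stand-in for whatever backbone detector is plugged in, and the exact merging rule is an assumption for illustration, not the authors' code:

```python
def sliding_scores(wave, sr, score_fn, win=0.5, step=0.25):
    """Stage 1 (coarse scan): score fixed-length windows across the waveform.

    Returns a list of (start_s, end_s, fake_probability) tuples.
    """
    n, s = int(win * sr), int(step * sr)
    return [(i / sr, (i + n) / sr, score_fn(wave[i:i + n]))
            for i in range(0, max(len(wave) - n, 0) + 1, s)]

def propose_regions(scores, delta=0.6, gap=2):
    """Stage 2 (gap-tolerant proposal): threshold the confidence map and
    merge flagged windows separated by at most `gap` unflagged windows."""
    flagged = [k for k, (_, _, p) in enumerate(scores) if p >= delta]
    runs = []
    for k in flagged:
        if runs and k - runs[-1][-1] <= gap + 1:
            runs[-1].append(k)   # within tolerance: extend current region
        else:
            runs.append([k])     # too far from last flag: start a new region
    return [(scores[r[0]][0], scores[r[-1]][1]) for r in runs]
```

Stage 3 would then repeat the same scan inside each candidate at the finer resolution (W′=0.15 s, S′=0.05 s) to refine the boundaries.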

Load-bearing premise

The MIST utterances generated via LLM-guided semantic replacement and neural voice cloning represent realistic adversarial partial tampering, and the gap-tolerant proposals in ISA can recover every region without prior knowledge of how many exist.

What would settle it

Running utterance-level classifiers trained on full synthesis on MIST samples and confirming they assign near-zero fake scores, or measuring whether ISA's SF1@tau scores exceed those of non-iterative baselines on the same test set.

Figures

Figures reproduced from arXiv: 2605.02223 by Cong Tran, Cuong Pham, Hai Nguyen, Tung Vu, Yen Nguyen.

Figure 1. Overview of the MIST generation pipeline. Given a genuine utterance with word-level alignment from either Multilingual LibriSpeech (EN/FR/DE/IT/ES) or LEMAS-Dataset (VI), (1) target words are selected based on duration and spacing constraints, (2) semantically divergent replacements are generated via an LLM, (3) replacement words are synthesized using speaker-conditioned voice cloning (CosyVoice 3 for EN… view at source ↗
Figure 2. Distribution of MIST samples by language and variant type. Each language contributes approximately equal amounts of source data (∼30 GB). The 3-word variant is only generated for utterances ≥10 s, which explains its smaller share. view at source ↗
Figure 3. Duration distributions of MIST audio. Left: original utterance durations per language. Right: inpainted utterance durations by variant. view at source ↗
Figure 4. Proportional breakdown of the MIST fake subset. (a) Distribution by language. (b) Distribution by variant. view at source ↗
Figure 5. Dataset size by language (hours). Grey bars: original (real) audio. Red bars: inpainted (fake) audio. view at source ↗
Figure 6. Distribution of fake ratio (%) by variant and language. view at source ↗
Figure 7. Duration distribution of individual replacement (fake) word segments across all languages and variants. view at source ↗
Figure 8. Mel-spectrogram comparison for an English utterance with 2-word inpainting (fake2w variant). Top: original utterance. Bottom: inpainted utterance; red boxes mark the tampered regions. view at source ↗
Figure 9. Iterative Segment Analysis (ISA) pipeline illustrated on a 2-word inpainted utterance. Stage 1: A sliding window (W=0.5 s, S=0.25 s) produces a coarse confidence map; windows exceeding δ=0.6 are flagged (red). Stage 2: Flagged windows are merged with gap tolerance g=2, yielding candidate regions (orange boxes). Stage 3: Each candidate is re-analyzed with finer windows (W′=0.15 s, S′=0.05 s) and threshold … view at source ↗
Figure 10. Illustrative example of SF1@τ computation. An utterance with N=2 ground-truth segments (red) receives N̂=3 predictions (blue). Left: temporal alignment showing IoU overlaps. Right: greedy matching at τ=0.5: σ̂₁ matches σ*₁ (IoU = 0.78), σ̂₂ fails to match (IoU = 0.18 < 0.5), and σ̂₃ is a pure false positive. Result: TP=1, FP=2, FN=1, SF1@0.5=0.40. view at source ↗
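The SF1@τ computation that Figure 10 describes (greedy one-to-one matching of predictions to ground-truth segments at temporal IoU ≥ τ, then a segment-level F1) can be sketched in a few lines; the interval representation and the greedy ordering are assumptions filled in for illustration, not the paper's reference implementation:

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start_s, end_s) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def sf1(preds, truths, tau=0.5):
    """Segment-level F1: greedily match each prediction to the best
    still-unmatched ground-truth segment; a match counts iff IoU >= tau."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        if not unmatched:
            break  # remaining predictions are all false positives
        iou, i = max((temporal_iou(p, t), i) for i, t in enumerate(unmatched))
        if iou >= tau:
            tp += 1
            unmatched.pop(i)
    fp, fn = len(preds) - tp, len(unmatched)
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```

On the Figure 10 example (N=2 ground-truth segments, N̂=3 predictions, of which only one overlaps a truth at IoU ≥ 0.5) this yields TP=1, FP=2, FN=1, hence SF1@0.5 = 2/(2+2+1) = 0.40.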
read the original abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the MIST dataset, a large-scale multilingual collection of speech utterances with 1-3 independently inpainted word-level segments (constituting 2-7% of content) generated via LLM-guided semantic replacement and neural voice cloning. It proposes the ISA framework, a backbone-agnostic iterative method using coarse-to-fine sliding-window classification, gap-tolerant region proposal, and boundary refinement to localize an unknown number of tampered regions. It also defines the SF1@tau metric, a segment-level F1 score based on temporal IoU matching that evaluates both region count accuracy and localization precision. Zero-shot experiments show existing utterance-level deepfake detectors assign near-zero fake probability to MIST samples, while ISA outperforms non-iterative baselines; the dataset, code, and toolkit are released publicly.

Significance. If the central claims hold, the work fills an important gap in speech forensics by shifting focus from utterance-level binary detection or single-region tampering to multi-region word-granularity localization under realistic partial manipulation. The public release of MIST, ISA implementation, and SF1@tau evaluation toolkit is a clear strength that supports reproducibility and future benchmarking. The empirical demonstration that current detectors fail on low-percentage inpainting highlights a practical limitation and motivates fine-grained approaches.

major comments (2)
  1. [§3] §3 (MIST Dataset Generation): The claim that partial inpainting at word granularity remains unsolved by existing detectors rests on MIST being a faithful proxy for realistic adversarial tampering. However, the generation pipeline (LLM semantic replacement + neural voice cloning) is presented without human naturalness ratings, ablations across cloning backbones, or comparisons against real-world editing tools. This is load-bearing for interpreting the near-zero detection rates as evidence of an inherent limitation rather than a possible artifact of the specific synthesis process.
  2. [§5] §5 (Evaluation and Baselines): The abstract states that ISA consistently outperforms non-iterative baselines and that existing detectors fail, yet the manuscript provides insufficient quantitative tables, error analysis, or implementation details for the baselines and potential dataset biases. Without these, the central empirical claims cannot be fully verified or reproduced from the reported results.
minor comments (1)
  1. [§4] The definition and computation of SF1@tau (including the role of the tau threshold) could be clarified with an explicit equation or pseudocode to avoid ambiguity in the IoU matching procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the importance of addressing multi-region word-level speech inpainting. We respond to each major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: The claim that partial inpainting at word granularity remains unsolved by existing detectors rests on MIST being a faithful proxy for realistic adversarial tampering. However, the generation pipeline (LLM semantic replacement + neural voice cloning) is presented without human naturalness ratings, ablations across cloning backbones, or comparisons against real-world editing tools. This is load-bearing for interpreting the near-zero detection rates as evidence of an inherent limitation rather than a possible artifact of the specific synthesis process.

    Authors: We acknowledge that additional validation would further support the claim that low detection rates reflect the inherent difficulty of partial manipulations. The MIST pipeline uses established, reproducible components (LLM-guided replacement and neural cloning) to create semantically coherent, speaker-consistent edits at 2-7% content. The near-zero probabilities are consistent across multiple independent detectors. In revision we will add human naturalness ratings on a sampled subset, ablations across cloning backbones, and a discussion relating the pipeline to real-world editing tools. These additions will clarify that the results are not artifacts of the specific synthesis process. revision: yes

  2. Referee: The abstract states that ISA consistently outperforms non-iterative baselines and that existing detectors fail, yet the manuscript provides insufficient quantitative tables, error analysis, or implementation details for the baselines and potential dataset biases. Without these, the central empirical claims cannot be fully verified or reproduced from the reported results.

    Authors: We agree that expanded empirical details will improve verifiability. Section 5 reports the main zero-shot and comparative results, but we will augment it with additional quantitative tables (including per-language and per-region-count breakdowns), systematic error analysis (e.g., failure modes by region size and count), full hyper-parameter and implementation details for all baselines, and an explicit discussion of potential dataset biases such as language effects or cloning artifacts. The already-released code and toolkit enable exact reproduction of the reported experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical dataset, method, and metric contributions are self-contained

full rationale

The paper's core contributions consist of constructing the MIST dataset via LLM-guided replacement and neural voice cloning, proposing the ISA iterative framework for multi-region localization, and defining the SF1@tau metric as a segment-level F1 based on temporal IoU matching. These are presented as engineering and evaluation steps rather than a derivation chain. No equations, fitted parameters renamed as predictions, or self-citations appear as load-bearing elements in the provided text. The zero-shot failure claim on existing detectors and ISA's outperformance are tied directly to experimental results on the new dataset and baselines, without reducing to self-definition or imported uniqueness theorems. The metric follows a standard IoU formulation without claiming novel derivation. This is a typical empirical forensics paper with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the LLM+voice-cloning generation process for real attacks and on standard assumptions in sliding-window detection; no major free parameters or invented entities are introduced beyond typical method hyperparameters and the definition of the new metric.

free parameters (1)
  • tau threshold in SF1@tau
    IoU matching threshold for segment-level scoring, chosen to balance count accuracy and localization precision.
axioms (1)
  • domain assumption: LLM-guided semantic replacement combined with neural voice cloning produces inpainted segments that mimic realistic adversarial tampering.
    Invoked in the dataset construction and zero-shot evaluation sections of the abstract.

pith-pipeline@v0.9.0 · 5594 in / 1461 out tokens · 98978 ms · 2026-05-08T03:03:36.029138+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset

    Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, and Kalin Stefanov. AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset. arXiv preprint arXiv:2311.15308.

  2. [2]

    CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407.

  3. [3]

    CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training

    Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, et al. CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training. arXiv preprint arXiv:2505.17589, 2025.

  4. [4]

    LlamaPartialSpoof: An LLM-driven fake speech dataset simulating disinformation generation

    Hieu-Thi Luong, Haoqin Chua, Junlin Lee, Haibin Lin, et al. LlamaPartialSpoof: An LLM-driven fake speech dataset simulating disinformation generation. arXiv preprint arXiv:2409.14743.

  5. [5]

    ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech

    Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, and Kong Aik Lee. ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech. IEEE Transactions on Biometrics, Behavior, and Identity Science, 3(2):252–265.

  6. [6]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

    Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. arXiv preprint arXiv:2202.12233.

  7. [7]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Neil Houlsby, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

  8. [8]

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language, 64:101114.

  9. [9]

    ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection

    Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, and Héctor Delgado. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. In Proc. ASVspoof Workshop.

  10. [10]

    ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection

    Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, and Héctor Delgado. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection.

  11. [11]

    Audio deepfake detection: A survey

    Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, et al. Audio deepfake detection: A survey. arXiv preprint arXiv:2308.14970.

  12. [12]

    LEMAS: A 150k-hour large-scale extensible multilingual audio suite with generative speech models

    Zhiyuan Zhao, Lijian Lin, Ye Zhu, Kai Xie, Yunfei Liu, and Yu Li. LEMAS: A 150k-hour large-scale extensible multilingual audio suite with generative speech models. arXiv preprint arXiv:2601.04233, 2026.