Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition

Hsuan Su; Hung-yi Lee; Shang-Tse Chen; Tzu-Quan Lin; Yi-Cheng Lin; Yu-Hsuan Li Liang; Yun-Nung Chen

arXiv: 2510.08047 · v2 · submitted 2025-10-09 · eess.AS · cs.CL

Yi-Cheng Lin, Yu-Hsuan Li Liang, Hsuan Su, Tzu-Quan Lin, Shang-Tse Chen, Yun-Nung Chen, Hung-yi Lee

Pith reviewed 2026-05-18 09:17 UTC · model grok-4.3

keywords: pseudo-label correction · automatic speech recognition · domain adaptation · task arithmetic · accent robustness · word error rate · Whisper model

The pith

The difference between two models trained on real versus pseudo labels forms a correction vector that fixes systematic errors when applied to speech recognition models on new accents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in a source domain with both accurate and approximate labels, the parameter difference between two models fine-tuned from the same starting point captures the consistent biases introduced by pseudo-labeling. Applying this difference as a correction to a model trained only on pseudo labels in a target domain (adding the ground-truth-minus-pseudo-label weights, which is equivalent to subtracting the pseudo-label bias) reduces word error rates on speech with unseen accents. A reader would care because this offers a way to improve automatic speech recognition systems that encounter new speaking styles without collecting expensive ground-truth transcripts for every domain. The approach treats the correction as a simple arithmetic operation in weight space rather than retraining or filtering data.

Core claim

In a source domain containing both real and pseudo-labeled data, two ASR models are fine-tuned from the same initialization, one on ground-truth labels and the other on pseudo-labels, and their weight difference forms a correction vector that captures pseudo-label biases. When applied to a pseudo-labeled target model, this vector enhances recognition, achieving up to a 35% relative Word Error Rate (WER) reduction on AfriSpeech-200 across ten African accents with the Whisper tiny model.

What carries the argument

The correction vector: the weight difference between a ground-truth fine-tuned model and a pseudo-label fine-tuned model, both started from the same initialization in the source domain. It is added to the target model's parameters to offset pseudo-label biases.
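
The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Weights` type, the function names, and the toy numbers are all assumptions, with plain Python lists standing in for real model tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def correction_vector(theta_gt_src: Weights, theta_pseudo_src: Weights) -> Weights:
    """Ground-truth fine-tune minus pseudo-label fine-tune, per parameter."""
    return {
        name: [g - p for g, p in zip(theta_gt_src[name], theta_pseudo_src[name])]
        for name in theta_gt_src
    }


def apply_correction(theta_pseudo_tgt: Weights, delta: Weights) -> Weights:
    """Add the source-domain correction vector to the pseudo-labeled target model."""
    return {
        name: [w + d for w, d in zip(theta_pseudo_tgt[name], delta[name])]
        for name in theta_pseudo_tgt
    }


# Toy weights, invented purely for illustration (not from the paper).
theta_gt_src = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
theta_pseudo_src = {"layer.weight": [0.5, 2.5], "layer.bias": [0.25]}
theta_pseudo_tgt = {"layer.weight": [2.0, 3.0], "layer.bias": [1.0]}

delta = correction_vector(theta_gt_src, theta_pseudo_src)
theta_corrected = apply_correction(theta_pseudo_tgt, delta)
print(theta_corrected)
```

Both source models must share the same architecture and initialization for the per-parameter subtraction to be meaningful, which is why the sketch iterates over one model's parameter names and indexes the other by the same keys.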

If this is right

The correction improves recognition on target data without requiring any ground-truth labels from that domain.
The method produces up to 35 percent relative WER reduction across ten different African accents using the Whisper tiny model on AfriSpeech-200.
The same vector derived from a source domain can be reused on multiple target domains that share similar pseudo-labeling procedures.
Task arithmetic in parameter space transfers bias corrections across accent shifts without retraining the full model.
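
Because the headline figure is a relative reduction, it helps to see how that differs from an absolute drop in WER. The sketch below uses a hypothetical helper name and made-up WER values chosen only to keep the arithmetic simple; they are not results from the paper.

```python
def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Fraction of the original WER removed by the correction."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before


# Invented example values (percent WER), not taken from the paper.
wer_before, wer_after = 50.0, 40.0

absolute_drop = wer_before - wer_after                       # in WER points
relative_drop = relative_wer_reduction(wer_before, wer_after)

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.0%}")
```

In this toy case a 10-point absolute drop corresponds to a 20% relative reduction, so a relative figure alone does not reveal the starting or final WER.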

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subtraction approach could be tested on other sequence tasks such as machine translation where pseudo-labeling also introduces recurring errors.
One could check whether averaging correction vectors from several source domains produces a more stable fix than a single source.
The result suggests that pseudo-label errors often point in a consistent direction in model space that can be isolated and subtracted rather than requiring per-domain retraining.
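
The multi-source idea above could be prototyped as an element-wise mean over correction vectors. This is a speculative sketch of that editorial extension, not something the paper reports: the `Weights` type, the function name, and the toy vectors are assumptions.

```python
from typing import Dict, List

# Hypothetical stand-in for a correction vector: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def average_vectors(vectors: List[Weights]) -> Weights:
    """Element-wise mean of correction vectors that share the same parameter names."""
    if not vectors:
        raise ValueError("need at least one correction vector")
    n = len(vectors)
    return {
        name: [sum(vals) / n for vals in zip(*(v[name] for v in vectors))]
        for name in vectors[0]
    }


# Toy correction vectors from two imagined source domains.
delta_a = {"w": [1.0, -1.0]}
delta_b = {"w": [3.0, 1.0]}

delta_avg = average_vectors([delta_a, delta_b])
print(delta_avg)
```

Components where the sources disagree in sign shrink toward zero under averaging, while components they agree on are preserved, which is one reason an averaged vector might be more stable than a single-source one.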

Load-bearing premise

The biases captured by the weight difference in one domain will match and correct the biases that appear when the same kind of pseudo-labeling is used in a completely different target domain.

What would settle it

Applying the correction vector to the target-domain pseudo-labeled model produces no reduction or an increase in word error rate on the unseen accents.

Original abstract

Robust ASR under domain shift is crucial because real-world systems encounter unseen accents and domains with limited labeled data. Although pseudo-labeling offers a practical workaround, it often introduces systematic, accent-specific errors that filtering fails to fix. We ask: How can we correct these recurring biases without target ground truth? We propose a simple parameter-space correction: in a source domain containing both real and pseudo-labeled data, two ASR models are fine-tuned from the same initialization, one on ground-truth labels and the other on pseudo-labels, and their weight difference forms a correction vector that captures pseudo-label biases. When applied to a pseudo-labeled target model, this vector enhances recognition, achieving up to a 35% relative Word Error Rate (WER) reduction on AfriSpeech-200 across ten African accents with the Whisper tiny model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Pseudo2Real, a parameter-space correction technique for ASR pseudo-labeling under domain shift. In a source domain with both ground-truth and pseudo-labeled data, two models are fine-tuned from the same initialization; their weight difference defines a correction vector that is added to a pseudo-labeled model trained on an unseen target domain. The central empirical claim is that this vector corrects systematic accent-specific biases, yielding up to a 35% relative WER reduction on the AfriSpeech-200 corpus across ten African accents when applied to Whisper-tiny.

Significance. If the reported gains prove robust, the method supplies a lightweight, training-free correction that exploits task-arithmetic ideas to improve pseudo-labeling without target ground truth. This could be practically valuable for low-resource accent adaptation in ASR, where collecting labeled data remains expensive. The construction is simple and does not introduce new free parameters beyond the choice of source domain.

major comments (3)

[§4] §4 (Experiments): The abstract and results claim a 35% relative WER reduction on AfriSpeech-200, yet the manuscript supplies no baseline comparisons (e.g., standard pseudo-label filtering, self-training, or domain-adversarial methods), no statistical significance tests, no error bars across multiple runs, and no ablation on the magnitude or direction of the correction vector. These omissions make it impossible to determine whether the gain is attributable to the proposed vector or to other unstated factors.
[§3.2] §3.2 (Correction Vector Construction): The central assumption—that the source-domain difference θ_GT_source − θ_pseudo_source aligns with the dominant error modes induced by pseudo-labeling on African accents—is not supported by any analysis of phoneme-confusion matrices, embedding-space distances, or error-pattern overlap between source (standard English read speech) and target domains. If the source and target bias subspaces differ, the vector corrects the wrong directions and the reported improvement may not generalize.
[§4.3] §4.3 (Target Domain Results): No cross-validation or sensitivity study is presented on the choice of source domain or the number of fine-tuning steps used to compute the correction vector. Without such controls, it remains unclear whether the 35% figure is stable or an artifact of a particularly favorable source-target pairing.

minor comments (2)

[§3] Notation for the correction vector is introduced without an explicit equation number; adding a numbered definition would improve traceability when the vector is later added to target-model weights.
[§4.1] The manuscript does not state the exact source-domain corpus or the pseudo-label generation procedure used to train the source pseudo-model; these details should be added for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and outline the revisions we will make to improve the manuscript.

point-by-point responses

Referee: [§4] §4 (Experiments): The abstract and results claim a 35% relative WER reduction on AfriSpeech-200, yet the manuscript supplies no baseline comparisons (e.g., standard pseudo-label filtering, self-training, or domain-adversarial methods), no statistical significance tests, no error bars across multiple runs, and no ablation on the magnitude or direction of the correction vector. These omissions make it impossible to determine whether the gain is attributable to the proposed vector or to other unstated factors.

Authors: We agree that the original submission would benefit from additional baselines, statistical validation, and ablations to more rigorously attribute the observed gains to the correction vector. In the revised manuscript we will add comparisons to standard pseudo-label filtering and self-training baselines. We will also report mean WER with standard deviation across three independent runs with different random seeds and include paired t-tests for statistical significance. Finally, we will include an ablation varying the scaling factor applied to the correction vector to demonstrate its contribution and directionality.
revision: yes

Referee: [§3.2] §3.2 (Correction Vector Construction): The central assumption—that the source-domain difference θ_GT_source − θ_pseudo_source aligns with the dominant error modes induced by pseudo-labeling on African accents—is not supported by any analysis of phoneme-confusion matrices, embedding-space distances, or error-pattern overlap between source (standard English read speech) and target domains. If the source and target bias subspaces differ, the vector corrects the wrong directions and the reported improvement may not generalize.

Authors: We acknowledge that an explicit analysis of error-pattern alignment would strengthen the justification for the method. In the revision we will add a supplementary analysis comparing phoneme-level confusion matrices and common substitution patterns between the source-domain pseudo-label errors and the target-domain errors (both before and after applying the correction vector). This will provide direct evidence that the dominant biases overlap. The consistent relative gains across ten phonetically diverse African accents already suggest that the vector captures general pseudo-labeling artifacts rather than source-specific idiosyncrasies, but the added analysis will make this explicit.
revision: yes

Referee: [§4.3] §4.3 (Target Domain Results): No cross-validation or sensitivity study is presented on the choice of source domain or the number of fine-tuning steps used to compute the correction vector. Without such controls, it remains unclear whether the 35% figure is stable or an artifact of a particularly favorable source-target pairing.

Authors: We agree that sensitivity to source-domain choice and fine-tuning duration is an important robustness check. In the revised version we will add a sensitivity study using an alternative source domain (LibriSpeech) and will vary the number of fine-tuning steps (from 1k to 10k) used to derive the correction vector, reporting the resulting WER reductions on AfriSpeech-200. These controls will demonstrate that the reported gains are not tied to a single favorable pairing.
revision: yes
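
The scaling-factor ablation and the multi-seed reporting promised in the responses above might look roughly like the following. This is a sketch under stated assumptions: the `scale` parameter, the function name, the toy weights, and the per-seed WER values are all hypothetical, and in a real ablation each scaled model would be evaluated on held-out target speech.

```python
from statistics import mean, stdev
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def apply_scaled(theta: Weights, delta: Weights, scale: float) -> Weights:
    """theta + scale * delta; scale=0 leaves the model uncorrected, scale=1 is plain addition."""
    return {
        name: [w + scale * d for w, d in zip(theta[name], delta[name])]
        for name in theta
    }


# Toy weights and correction vector, invented for illustration.
theta_pseudo_tgt = {"w": [2.0, 3.0]}
delta = {"w": [0.5, -0.5]}

# Sweep a few scaling factors; a negative scale would test the reversed direction.
sweep = {s: apply_scaled(theta_pseudo_tgt, delta, s) for s in (0.0, 0.5, 1.0)}

# Invented per-seed WERs (percent) for one configuration, not from the paper.
wers_by_seed = [30.0, 31.0, 29.0]
wer_mean, wer_std = mean(wers_by_seed), stdev(wers_by_seed)
print(sweep, wer_mean, wer_std)
```

Here `stdev` is the sample standard deviation, which is the usual choice when only a handful of seeds are available.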

Circularity Check

0 steps flagged

No significant circularity; empirical construction remains independent of its measured outcomes

full rationale

The paper defines a correction vector explicitly as the parameter difference between two source-domain fine-tunes (ground-truth vs. pseudo-label) and applies it to a separate target-domain model. This construction is stated directly in the abstract and method description without any equation that equates the vector to the target improvement by definition. No self-citation is invoked as a uniqueness theorem or load-bearing premise for the central claim; the reported WER reductions are presented as empirical observations on AfriSpeech-200 rather than algebraic consequences of the vector definition. The derivation chain therefore consists of an observable computation followed by external validation and does not reduce to a fitted input renamed as prediction or to any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard transfer-learning assumptions rather than new free parameters or invented entities; the central premise is that weight-space differences isolate label-type biases in a transferable way.

axioms (1)

Domain assumption: Fine-tuning two models from identical initialization on ground-truth versus pseudo-labels isolates the systematic bias attributable to pseudo-labeling.
This premise is invoked to justify forming the correction vector from the weight difference.

pith-pipeline@v0.9.0 · 5693 in / 1407 out tokens · 37495 ms · 2026-05-18T09:17:40.169608+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI
eess.AS · 2026-05 · accept · novelty 7.0

The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to th...