Kinship Verification Using Voice

Jagabandhu Mishra; Tomi H. Kinnunen

arxiv: 2606.01704 · v1 · pith:MQ4552NXnew · submitted 2026-06-01 · 📡 eess.AS

Pith reviewed 2026-06-28 13:01 UTC · model grok-4.3

classification 📡 eess.AS

keywords kinship verification · speaker embeddings · voice biometrics · familial cues · equal error rate · zero-shot verification · family-disjoint evaluation · audio-visual dataset

The pith

Speaker embeddings encode familial cues that allow kinship verification from voice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that voice recordings hold detectable signals of biological relatedness and that standard speaker embedding models can extract them for kinship verification. It creates a controlled test protocol on a large audio-visual collection using family-disjoint splits to limit interference from age, gender, or recording conditions. A sympathetic reader would care because this turns an existing biometric tool into a new relational detector. The work finds that genealogical closeness affects speaker verification and kinship tasks in opposite ways. If the claim holds, voice data becomes a source of family information that current systems already capture without explicit training for it.

Core claim

Genealogical similarity of speaker pairs plays opposite roles in speaker verification and kinship verification. Neural speaker embedding extractors applied to speech from the KAN-AV dataset under a revised family-disjoint protocol show that embeddings carry familial cues: the best zero-shot system (ReDimNet) reaches an equal error rate of 20.8 percent when same-speaker target trials are included and 39.7 percent when they are excluded. The best trainable back-end, which processes the two embeddings of a pair asymmetrically to reduce age-difference effects, reaches 32.0 percent on the strict trials and 18.6 percent with same-speaker target trials included.

What carries the argument

Neural speaker embedding extractors combined with back-end processors for embedding pairs, evaluated under a family-disjoint train-test split that controls for identity overlap and other confounders.
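
A family-disjoint split can be illustrated with a short sketch. The version below assigns whole families to either the training or the test side, so that no family seen in training reappears at test time. The function name, the `speaker_to_family` mapping, and the 20 percent test fraction are illustrative assumptions, not details taken from the paper.

```python
import random

def family_disjoint_split(speaker_to_family, test_fraction=0.2, seed=0):
    """Split speakers so that no family appears in both train and test."""
    families = sorted(set(speaker_to_family.values()))
    rng = random.Random(seed)
    rng.shuffle(families)
    # Hold out whole families, never individual speakers.
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [s for s, f in speaker_to_family.items() if f not in test_families]
    test = [s for s, f in speaker_to_family.items() if f in test_families]
    return train, test

# Hypothetical toy data: five families with two speakers each.
speaker_to_family = {
    "spk_a1": "fam_a", "spk_a2": "fam_a",
    "spk_b1": "fam_b", "spk_b2": "fam_b",
    "spk_c1": "fam_c", "spk_c2": "fam_c",
    "spk_d1": "fam_d", "spk_d2": "fam_d",
    "spk_e1": "fam_e", "spk_e2": "fam_e",
}
train_speakers, test_speakers = family_disjoint_split(speaker_to_family)
train_families = {speaker_to_family[s] for s in train_speakers}
test_families = {speaker_to_family[s] for s in test_speakers}
print("train families:", sorted(train_families))
print("test families:", sorted(test_families))
```

Splitting at the family level rather than the speaker level is what makes the evaluation open-set: every kin pair in the test set comes from a family the back-end has never trained on.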

If this is right

Existing speaker verification pipelines already contain information usable for kinship decisions without retraining the front-end.
Back-end designs that treat embedding pairs asymmetrically can offset age-related mismatches in relational tasks.
Kinship verification becomes feasible in zero-shot settings using models trained only for individual speaker discrimination.
Strict trial definitions that exclude same-speaker pairs expose the remaining difficulty of pure cross-speaker family detection.
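
As a rough illustration of what zero-shot kinship scoring might look like, the sketch below compares two speaker embeddings with cosine similarity and thresholds the score. Cosine scoring is assumed here as a common choice for speaker embeddings, not confirmed as the paper's scoring rule. The three-dimensional vectors and the threshold are toy values; real extractors such as ECAPA-TDNN or ReDimNet produce much higher-dimensional embeddings.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def zero_shot_kinship_decision(emb_a, emb_b, threshold):
    """Accept the pair as kin if the speaker-similarity score is high enough.

    No kinship-specific training is involved: the embeddings come from a
    model trained only for speaker discrimination (assumed setup).
    """
    score = cosine_similarity(emb_a, emb_b)
    return score, score >= threshold

# Toy embeddings (hypothetical values for illustration only).
emb_parent = [0.9, 0.1, 0.3]
emb_child = [0.8, 0.2, 0.4]
emb_stranger = [-0.2, 0.9, -0.4]

kin_score, kin_accept = zero_shot_kinship_decision(emb_parent, emb_child, threshold=0.5)
non_score, non_accept = zero_shot_kinship_decision(emb_parent, emb_stranger, threshold=0.5)
print(f"parent vs child:    score={kin_score:.3f}, accepted={kin_accept}")
print(f"parent vs stranger: score={non_score:.3f}, accepted={non_accept}")
```

In practice the threshold would be set on held-out trials, and the reported equal error rates suggest that the kin and non-kin score distributions overlap heavily once same-speaker pairs are removed.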

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Public voice datasets may unintentionally expose family relationships if embedding models are applied without safeguards.
The same embedding space could be probed for other relational signals such as shared environment or accent inheritance.
Combining the voice approach with face-based kinship methods might produce higher accuracy through complementary cues.
Testing the protocol on datasets from different languages or recording environments would check whether the familial cues generalize beyond the current collection.

Load-bearing premise

The audio-visual dataset supplies accurate biological kinship labels and its family-disjoint split plus other controls fully remove age, gender, recording conditions, and speaker identity as alternative explanations for any detected patterns.

What would settle it

Re-running the embedding extractors and back-ends on the same pairs after randomly reassigning the kinship labels while preserving all metadata would show whether detection rates fall to chance.
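
The label-reassignment check can be sketched on synthetic scores. The code below computes an approximate EER with the true labels, then repeats the computation after shuffling the labels; if the scores only carry signal through the labels, the shuffled EER should sit near 50 percent. All scores are simulated from Gaussians chosen for illustration, and the helper names are hypothetical.

```python
import random

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over the observed scores."""
    best_gap, best_eer = None, None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if best_gap is None or abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer

def eer_for_labels(scores, labels):
    """Split scores by label (1 = kin, 0 = non-kin) and compute the EER."""
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    return equal_error_rate(targets, nontargets)

rng = random.Random(0)
# Synthetic scores (assumption): kin pairs score higher on average.
scores = [rng.gauss(1.5, 1.0) for _ in range(200)]
scores += [rng.gauss(0.0, 1.0) for _ in range(200)]
labels = [1] * 200 + [0] * 200

real_eer = eer_for_labels(scores, labels)

# Shuffle the labels while keeping the scores (and class counts) fixed.
permuted_eers = []
for _ in range(20):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    permuted_eers.append(eer_for_labels(scores, shuffled))
permuted_mean = sum(permuted_eers) / len(permuted_eers)

print(f"EER with true labels:          {real_eer:.3f}")
print(f"Mean EER with shuffled labels: {permuted_mean:.3f}")
```

A real version of this check would shuffle labels within strata of age difference, gender, and recording condition, so that only the kinship information is destroyed.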

Figures

Figures reproduced from arXiv: 2606.01704 by Jagabandhu Mishra, Tomi H. Kinnunen.

**Figure 2.** Pipeline for constructing the curated speech subset from KAN-AV. The filtering stages include language and quality selection, single-speaker filtering, …

**Figure 3.** Distributions of five trial-level confounding factors for kin-target and kin non-target pairs in the test set.

**Figure 4.** Dataset split and trial design. Figure reports the number of utterances, …

**Figure 5.** Proposed Asymmetric Affine Projection (AS-AP) trainable backend.

**Figure 6.** Score distributions and DET plot of ReDimNet SV system with …

**Figure 7.** Score distributions (SD) and DET plots of the ReDimNet SV system …

**Figure 8.** DET curve: comparison with zero-shot and trainable backends.

**Figure 9.** EER of KV∗ systems with constrained (≤) target-trial age difference; shaded regions denote the ±95% confidence interval.

Original abstract

Kinship verification (KV) from voice, the task of determining whether two speakers are biologically related, has received only little attention. Our work establishes a foundational basis for this emerging frontier, contributing to both performance evaluation and detection methodologies. First, leveraging the speech recordings of the large-scale audio-visual dataset, KAN-AV, we propose a revised evaluation protocol that controls for various confounders and adopts a family-disjoint train--test split to address open-set KV. Second, we analyze the close connection between speaker verification and KV, showing that genealogical similarity of speaker pairs plays opposite roles in the two tasks. Third, we tackle KV using three neural speaker embedding extractors (ECAPA-TDNN, WavLM-ECAPA, and ReDimNet) combined with various back-ends. In zero-shot KV including same-speaker target trials, ReDimNet achieves the lowest equal error rate (EER) of $20.8\%$; however, performance degrades to $39.7\%$ under strict kin trials, where same-speaker target trials are excluded. Our best trainable back-end, which applies asymmetric processing of the embedding pair to mitigate age-difference effects, obtains an EER of $32.0\%$ ($18.6\%$ with speaker target trials included). These results highlight the difficulty of KV while showing that speaker embeddings encode familial cues, offering a promising foundation for voice-based kinship analysis.
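
Because every headline number in the abstract is an equal error rate, a small worked example of the metric may help. The sketch below sweeps thresholds over the observed scores and reports the operating point where the false-rejection rate (targets scored below the threshold) and the false-acceptance rate (non-targets scored at or above it) are closest. The scores are invented for illustration, and evaluation toolkits typically interpolate the DET curve rather than use this simple sweep.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (EER, threshold) where FRR and FAR are closest to each other."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a target (kin) trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, t)
    return best[1], best[2]

# Toy scores (hypothetical): one kin pair scores low, one non-kin pair scores high.
target_scores = [0.9, 0.8, 0.7, 0.4]
nontarget_scores = [0.6, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(target_scores, nontarget_scores)
print(f"EER = {eer:.2%} at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

On this scale, 50 percent corresponds to chance, which is why the 39.7 percent figure under strict kin trials indicates a weak but non-zero signal.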

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper sets the first baseline for voice kinship verification with a family-disjoint protocol and shows embeddings carry some familial signal, but the results may still reflect dataset artifacts rather than pure kinship.

The one thing to know is that this is the first work to report concrete EER numbers for voice-based kinship verification on KAN-AV under a family-disjoint split, along with an analysis that genealogical similarity works in opposite directions for speaker verification versus kinship verification.

What is actually new is the task framing itself, plus the revised protocol that targets open-set KV and the explicit contrast between the two tasks. The authors test ECAPA-TDNN, WavLM-ECAPA, and ReDimNet, then combine them with back-ends. ReDimNet reaches 20.8% EER zero-shot (including same-speaker target trials) and 39.7% on strict kin trials; the best trainable asymmetric back-end reaches 32.0% on strict kin trials. The paper does a straightforward job of documenting the difficulty and showing the performance drop when same-speaker trials are removed.

The soft spot is the load-bearing assumption that the KAN-AV labels and family-disjoint split plus other controls fully remove age, gender, channel, and speaker-identity confounds. The abstract states they control for these but gives no ablations, error bars, or verification that the distributions match between kin and non-kin pairs after the split. If residual correlations remain, the gap above chance no longer demonstrates familial encoding in the embeddings. That concern is real and directly affects how much weight the numbers can carry.

This is for people already working on speaker embeddings who want to explore a narrow biometric side application. A reader in audio biometrics or forensics gets a usable starting protocol and baseline numbers. It deserves a serious referee because the empirical setup is new and the results are falsifiable, even though the data controls need more scrutiny.

I would send it to peer review with a request to show the age and channel distributions before and after the split and to add at least one ablation on the controls.

Referee Report

2 major / 2 minor

Summary. The paper claims to establish a foundational basis for kinship verification (KV) from voice by introducing a revised protocol on the KAN-AV dataset that uses a family-disjoint train-test split and controls for confounders. It analyzes the opposing roles of genealogical similarity in speaker verification versus KV, then evaluates three embedding extractors (ECAPA-TDNN, WavLM-ECAPA, ReDimNet) with multiple back-ends, reporting concrete EERs such as 20.8% (ReDimNet zero-shot including same-speaker trials), 39.7% (strict-kin trials), and 32.0% (best trainable asymmetric back-end).

Significance. If the central empirical results hold after verification that residual confounders have been removed, the work supplies the first substantial set of reproducible EER numbers on multiple extractors and back-ends for an open-set KV task, demonstrating that speaker embeddings encode familial cues while underscoring the task's difficulty relative to speaker verification.

major comments (2)

[Evaluation Protocol] Evaluation Protocol section: the claim that the family-disjoint split plus unspecified controls for age/gender/channel suffice to attribute EERs below 50% to genealogical similarity rather than residual correlations is load-bearing; without explicit checks (e.g., Kolmogorov-Smirnov tests or histograms comparing age-difference distributions between kin and non-kin pairs post-split), the performance gap cannot be interpreted as evidence of familial encoding.
[Experiments] Experiments section, results tables: the reported EER values (20.8%, 39.7%, 32.0%) lack error bars, bootstrap intervals, or multiple-run statistics, so the reliability of the ReDimNet and asymmetric back-end claims cannot be assessed.
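
The distribution check requested in the first major comment could look something like the sketch below, which computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) for the age differences of kin and non-kin pairs. The age-difference values are invented for illustration. A library routine such as `scipy.stats.ks_2samp` would additionally return a p-value; the pure-Python version here only reports the statistic.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    max_gap = 0.0
    for x in points:
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical age differences (years) between the two speakers of each trial.
kin_age_diffs = [25, 28, 30, 32, 27]      # e.g. mostly parent-child pairs
nonkin_age_diffs = [2, 5, 28, 30, 4]      # e.g. mostly same-generation pairs

ks = ks_statistic(kin_age_diffs, nonkin_age_diffs)
print(f"KS statistic for age difference: {ks:.2f}")  # 0.60
```

A statistic near zero would indicate that kin and non-kin trials are matched on that factor; a large value, as in this toy case, would mean a classifier could exploit age difference as a shortcut for kinship.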

minor comments (2)

[Abstract] Abstract and Experiments: the asymmetric back-end is described as mitigating age-difference effects, yet no ablation isolating this component versus a symmetric baseline is shown.
[Experiments] Notation and tables: the distinction between 'zero-shot KV including same-speaker target trials' and 'strict kin trials' is used throughout but would benefit from an explicit definition table or equation clarifying trial composition.
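
The trial-composition distinction raised in the last minor comment can be made concrete with a sketch. It assumes, following the abstract's wording, that the inclusive condition counts same-speaker pairs as targets alongside kin pairs, while the strict condition drops same-speaker pairs entirely. The tuple layout, the function name, and the treatment of every same-family pair as kin are simplifying assumptions.

```python
from itertools import combinations

def build_trials(utterances, include_same_speaker_targets):
    """Label every utterance pair as a target or non-target kinship trial.

    utterances: list of (utterance_id, speaker_id, family_id) tuples.
    """
    trials = []
    for (u1, s1, f1), (u2, s2, f2) in combinations(utterances, 2):
        if s1 == s2:
            # Same speaker: a target only in the inclusive condition.
            if not include_same_speaker_targets:
                continue
            label = "target"
        elif f1 == f2:
            # Different speakers from the same family: a kin target.
            label = "target"
        else:
            label = "nontarget"
        trials.append((u1, u2, label))
    return trials

# Hypothetical utterances: speaker a1 has two recordings.
utterances = [
    ("a1_u1", "a1", "fam_a"),
    ("a1_u2", "a1", "fam_a"),
    ("a2_u1", "a2", "fam_a"),
    ("b1_u1", "b1", "fam_b"),
]

inclusive_trials = build_trials(utterances, include_same_speaker_targets=True)
strict_trials = build_trials(utterances, include_same_speaker_targets=False)
print(len(inclusive_trials), "inclusive trials;", len(strict_trials), "strict trials")
```

Same-speaker pairs are the easiest targets for a speaker embedding, which is consistent with the lower EER reported when they are included.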

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements.

Referee: [Evaluation Protocol] Evaluation Protocol section: the claim that the family-disjoint split plus unspecified controls for age/gender/channel suffice to attribute EERs below 50% to genealogical similarity rather than residual correlations is load-bearing; without explicit checks (e.g., Kolmogorov-Smirnov tests or histograms comparing age-difference distributions between kin and non-kin pairs post-split), the performance gap cannot be interpreted as evidence of familial encoding.

Authors: We agree that explicit verification of the controls is necessary to support the attribution of performance to genealogical similarity. The manuscript describes the family-disjoint split and controls for age, gender, and channel, but does not include statistical confirmation of balance. In the revised version, we will add Kolmogorov-Smirnov tests and histograms comparing age-difference distributions (and similarly for gender and channel where applicable) between kin and non-kin pairs after the split.

Revision: yes

Referee: [Experiments] Experiments section, results tables: the reported EER values (20.8%, 39.7%, 32.0%) lack error bars, bootstrap intervals, or multiple-run statistics, so the reliability of the ReDimNet and asymmetric back-end claims cannot be assessed.

Authors: We acknowledge that the absence of uncertainty estimates limits assessment of result reliability. We will compute and report bootstrap confidence intervals for the EER values (or statistics from multiple independent runs where feasible) in the revised experiments section and tables.

Revision: yes
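
A percentile bootstrap of the kind promised in the second response might be sketched as follows. Target and non-target scores are resampled with replacement, the EER is recomputed on each resample, and the 2.5th and 97.5th percentiles form the interval. The scores are simulated, the function names are hypothetical, and resampling trials independently ignores the fact that real trials share speakers, so this simple version likely understates the true uncertainty.

```python
import random

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over the observed scores."""
    best_gap, best_eer = None, None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if best_gap is None or abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer

def bootstrap_eer_interval(targets, nontargets, n_resamples=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the EER."""
    rng = random.Random(seed)
    eers = []
    for _ in range(n_resamples):
        # Resample each class with replacement, keeping its size fixed.
        tar = rng.choices(targets, k=len(targets))
        non = rng.choices(nontargets, k=len(nontargets))
        eers.append(equal_error_rate(tar, non))
    eers.sort()
    lower = eers[int((alpha / 2) * n_resamples)]
    upper = eers[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

rng = random.Random(1)
# Synthetic scores (assumption): kin pairs score higher on average.
targets = [rng.gauss(1.5, 1.0) for _ in range(60)]
nontargets = [rng.gauss(0.0, 1.0) for _ in range(60)]

point_eer = equal_error_rate(targets, nontargets)
lower, upper = bootstrap_eer_interval(targets, nontargets)
print(f"EER = {point_eer:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

A family-level or speaker-level bootstrap, where whole families are resampled instead of individual trials, would respect the dependence structure better.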

Circularity Check

0 steps flagged

No circularity: empirical EER measurements on held-out family-disjoint data

The paper reports direct experimental results from applying existing embedding extractors (ECAPA-TDNN, WavLM-ECAPA, ReDimNet) and back-ends to the KAN-AV dataset under a family-disjoint protocol. No equations, predictions, or uniqueness claims are presented that reduce by construction to fitted inputs, self-citations, or ansatzes. All reported EER figures (20.8%, 39.7%, 32.0%) are measured outcomes on test pairs, not derived quantities. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that KAN-AV kinship labels are accurate and that the listed confounders are controlled by the family-disjoint split; no free parameters are explicitly fitted in the abstract, and no new entities are postulated.

axioms (2)

[domain assumption] KAN-AV dataset contains accurate biological kinship labels for speaker pairs. Invoked when defining the evaluation protocol and reporting EERs on kin trials.
[domain assumption] Family-disjoint train-test split plus other controls remove identity, age, gender, and recording confounders. Central to the claim that the protocol addresses open-set KV.


2019

[16] [16]

Effect of identical twins on deep speaker embeddings based forensic voice comparison,

M. H. Alsalihi and D. Sztah ´o, “Effect of identical twins on deep speaker embeddings based forensic voice comparison,”Int. J. Speech Technol., vol. 27, no. 2, p. 341–351, Jun. 2024. JOURNAL OF CLASS FILES, VOL. 14, NO. 8, AUGUST 2023 13

2024

[17] [17]

Principles of linguistic change. volume 2: Social factors,

L. William, “Principles of linguistic change. volume 2: Social factors,” 2001

2001

[18] [18]

Two decades of speaker recognition evaluation at the national institute of standards and technology,

C. S. Greenberg, L. P. Mason, S. O. Sadjadi, and D. A. Reynolds, “Two decades of speaker recognition evaluation at the national institute of standards and technology,”Computer Speech and Language, vol. 60, p. 101032, 2020

2020

[19] [19]

Speaker recognition—identifying people by their voices,

G. R. Doddington, “Speaker recognition—identifying people by their voices,”Proceedings of the IEEE, vol. 73, pp. 1651–1664, 1985

1985

[20] [20]

Ecapa-tdnn: Em- phasized channel attention, propagation and aggregation in tdnn based speaker verification,

B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Em- phasized channel attention, propagation and aggregation in tdnn based speaker verification,” inProc. Interspeech 2020, 2020, pp. 3830–3834

2020

[21] [21]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen and et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022

[22] [22]

Reshape Dimensions Network for Speaker Recognition,

I. Yakovlev, R. Makarov, A. Balykin, P. Malov, A. Okhotnikov, and N. Torgashov, “Reshape Dimensions Network for Speaker Recognition,” inInterspeech 2024, 2024, pp. 3235–3239

2024

[23] [23]

Identification of correlation between blood relations using speech signal,

P. Padmini, S. Tripathi, and K. Bhowmick, “Identification of correlation between blood relations using speech signal,” in2017 IEEE Interna- tional Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES). IEEE, 2017, pp. 1–6

2017

[24] [24]

Audio- visual kinship verification in the wild,

X. Wu, E. Granger, T. H. Kinnunen, X. Feng, and A. Hadid, “Audio- visual kinship verification in the wild,” in2019 international conference on biometrics (ICB). IEEE, 2019, pp. 1–8

2019

[25] [25]

Audio-visual kinship verification: a new dataset and a unified adaptive adversarial multimodal learning approach,

X. Wu, X. Zhang, X. Feng, M. B. Lopez, and L. Liu, “Audio-visual kinship verification: a new dataset and a unified adaptive adversarial multimodal learning approach,”IEEE Transactions on Cybernetics, vol. 54, no. 3, pp. 1523–1536, 2022

2022

[26] [26]

Audio- based kinship verification using age domain conversion,

Q. Sun, A. Akman, X. Jing, M. Milling, and B. W. Schuller, “Audio- based kinship verification using age domain conversion,”IEEE Signal Processing Letters, 2024

2024

[27] [27]

Front- end factor analysis for speaker verification,

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front- end factor analysis for speaker verification,”IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010

2010

[28] [28]

V oxceleb2: Deep speaker recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “V oxceleb2: Deep speaker recognition,” inInterspeech 2018, 2018, pp. 1086–1090

2018

[29] [29]

Pyannote. audio: neu- ral building blocks for speaker diarization,

H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote. audio: neu- ral building blocks for speaker diarization,” inICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7124–7128

2020

[30] [30]

X- vectors: Robust dnn embeddings for speaker recognition,

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X- vectors: Robust dnn embeddings for speaker recognition,” in2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 5329–5333

2018

[31] [31]

CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Con- version,

T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Con- version,” inInterspeech 2020, 2020, pp. 2017–2021

2020

[32] [32]

NIST 2024 speaker recognition evaluation plan,

National Institute of Standards and Technology, “NIST 2024 speaker recognition evaluation plan,” National Institute of Standards and Technology, Gaithersburg, MD, USA, Evaluation Plan, 2024, accessed: 2026-05-25. [Online]. Available: https://www.nist.gov/itl/iad/ mig/speaker-recognition

2024

[33] [33]

Speaker identification and verification using gaussian mixture speaker models,

D. A. Reynolds, “Speaker identification and verification using gaussian mixture speaker models,”Speech Communication, vol. 17, no. 1, pp. 91–108, 1995

1995

[34] [34]

Technical forensic speaker recognition: Evaluation, types and testing of evidence,

P. Rose, “Technical forensic speaker recognition: Evaluation, types and testing of evidence,”Computer Speech and Language, vol. 20, no. 2, pp. 159–191, 2006, odyssey 2004: The speaker and Language Recognition Workshop

2006

[35] [35]

Consensus on validation of forensic voice comparison,

G. S. Morrison, E. Enzinger, V . Hughes, M. Jessen, D. Meuwly, C. Neumann, S. Planting, W. C. Thompson, D. van der Vloed, R. J. Ypma, C. Zhang, A. Anonymous, and B. Anonymous, “Consensus on validation of forensic voice comparison,”Science and Justice, vol. 61, no. 3, pp. 299–309, 2021

2021

[36] [36]

The relevant population in forensic voice comparison: Effects of varying delimitations of social class and age,

V . Hughes and P. Foulkes, “The relevant population in forensic voice comparison: Effects of varying delimitations of social class and age,” Speech Communication, vol. 66, pp. 218–230, 2015

2015

[37] [37]

Automatic speaker recognition of Spanish siblings: (monozygotic and dizygotic) twins and non-twin brothers,

E. S. Segundo and H. K ¨unzel, “Automatic speaker recognition of Spanish siblings: (monozygotic and dizygotic) twins and non-twin brothers,” Loquens, vol. 2, no. 2, July 2015

2015

[38] [38]

Euclidean distances as measures of speaker similarity including identical twin pairs: A forensic investigation using source and filter voice characteristics,

E. San Segundo, A. Tsanas, and P. G ´omez-Vilda, “Euclidean distances as measures of speaker similarity including identical twin pairs: A forensic investigation using source and filter voice characteristics,”Forensic Science International, vol. 270, pp. 25–38, 2017

2017

[39] [39]

Discrimination of voices of twins and siblings for speaker verification,

M. M. Homayounpour and G. Chollet, “Discrimination of voices of twins and siblings for speaker verification,” in4th European Conference on Speech Communication and Technology (Eurospeech 1995), 1995, pp. 345–348

1995

[40] [40]

A test of the effectiveness of speaker verification for differentiating between identical twins,

A. Ariyaeeinia, C. Morrison, A. Malegaonkar, and S. Black, “A test of the effectiveness of speaker verification for differentiating between identical twins,”Science and Justice, vol. 48, no. 4, pp. 182–186, 2008

2008

[41] [41]

Shortcut learning in deep neural networks,

R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,”Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020

2020

[42] [42]

Pearl,Causality: Models, Reasoning, and Inference, 2nd ed

J. Pearl,Causality: Models, Reasoning, and Inference, 2nd ed. Cam- bridge: Cambridge University Press, 2009

2009

[43] [43]

M. A. Hern ´an and J. M. Robins,Causal Inference: What If. Boca Raton, FL: Chapman and Hall/CRC, 2020, available at https://www. hsph.harvard.edu/miguel-hernan/causal-inference-book/

2020

[44] [44]

Investigating bias in deep face analysis: The kanface dataset and empirical study,

M. Georgopoulos, Y . Panagakis, and M. Pantic, “Investigating bias in deep face analysis: The kanface dataset and empirical study,”Image and vision computing, vol. 102, p. 103954, 2020

2020

[45] [45]

pyannote. audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,

H. Bredin, “pyannote. audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” in24th INTERSPEECH Conference (INTER- SPEECH 2023). ISCA, 2023, pp. 1983–1987

2023

[46] [46]

Ast: Audio spectrogram trans- former,

Y . Gong, Y .-A. Chung, and J. Glass, “Ast: Audio spectrogram trans- former,” inInterspeech 2021, 2021, pp. 571–575

2021

[47] [47]

Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,

C. Kim and R. M. Stern, “Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis,” inInterspeech 2008, 2008, pp. 2598–2601

2008

[48] [48]

Shortcut learning in binary classifier black boxes: Applications to voice anti-spoofing and biometrics,

M. Sahidullah, H.-j. Shim, R. G. Hautam ¨aki, and T. H. Kinnunen, “Shortcut learning in binary classifier black boxes: Applications to voice anti-spoofing and biometrics,”IEEE Journal of Selected Topics in Signal Processing, 2025

2025

[49] [49]

rvad: An unsupervised segment-based robust voice activity detection method,

Z.-H. Tan, N. Dehaket al., “rvad: An unsupervised segment-based robust voice activity detection method,”Computer speech and language, vol. 59, pp. 1–21, 2020

2020

[50] [50]

Overview of speaker modeling and its applications: From the lens of deep speaker representation learning,

S. Wang, Z. Chen, K. A. Lee, Y . Qian, and H. Li, “Overview of speaker modeling and its applications: From the lens of deep speaker representation learning,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4971–4998, 2024

2024

[51] [51]

Deep learning on small datasets without pre- training using cosine loss,

B. Barz and J. Denzler, “Deep learning on small datasets without pre- training using cosine loss,” inProceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1371–1380

2020

[52] [52]

Dimensionality reduction by learning an invariant mapping,

R. Hadsell, S. Chopra, and Y . LeCun, “Dimensionality reduction by learning an invariant mapping,” in2006 IEEE computer society con- ference on computer vision and pattern recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742

2006

[53] [53]

Metric learning: A survey,

B. Kulis, “Metric learning: A survey,”Foundations and Trends® in Machine Learning, vol. 5, no. 4, pp. 287–364, 2013

2013

[54] [54]

Information- theoretic metric learning,

J. V . Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information- theoretic metric learning,” inProceedings of the 24th international conference on Machine learning, 2007, pp. 209–216

2007

[55] [55]

V oxceleb: Large- scale speaker verification in the wild,

A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “V oxceleb: Large- scale speaker verification in the wild,”Computer Speech and Language, vol. 60, p. 101027, 2020

2020

[56] [56]

Speaker verification using adapted gaussian mixture models,

D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted gaussian mixture models,”Digital signal processing, vol. 10, no. 1-3, pp. 19–41, 2000

2000

[57] [57]

Phoneme recognition using time-delay neural networks,

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,”IEEE Trans- actions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989

1989

[58] [58]

Reshape Dimensions Network for Speaker Recognition,

I. Yakovlev, R. Makarov, A. Balykin, P. Malov, A. Okhotnikov, and N. Torgashov, “Reshape Dimensions Network for Speaker Recognition,” inProc. Interspeech 2024, 2024, pp. 3235–3239

2024

[59] [59]

A statistical significance test for person authentication,

S. Bengio and J. Mari ´ethoz, “A statistical significance test for person authentication,” inProceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, Toledo, Spain, 2004

2004