ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Aditya Kommineni; Anfeng Xu; Catherine Lord; Daniel Messinger; Helen Tager-Flusberg; Lynn K. Perry; Megan Micheletti; Mi Zhang; Shakhrul Iman Siam; Shrikanth Narayanan

arxiv: 2605.29257 · v1 · pith:DGC77HNRnew · submitted 2026-05-28 · 💻 cs.SD

Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan

Pith reviewed 2026-06-29 06:06 UTC · model grok-4.3

classification 💻 cs.SD

keywords ChildVox · child audio benchmark · developmental acoustics · vocalization classification · speech recognition · audio-language models · physiological sounds

The pith

The ChildVox benchmark shows that audio models achieve high performance in recognizing sounds made by children from birth through school age.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChildVox as a benchmark spanning the full developmental range of children's acoustic signals. It unifies 17 datasets into more than 20 sub-tasks that cover physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. Evaluations of self-supervised, ASR-oriented, and large audio-language models on classification, modeling, and recognition tasks produce strong results. These outcomes point to practical uses in assessing children's language levels and monitoring speech changes with age.

Core claim

ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. It integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets to enable systematic cross-corpus and cross-domain comparison. Evaluation of self-supervised, ASR-oriented, and large audio-language models on physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition shows high performance in recognizing acoustic signals from children.

What carries the argument

ChildVox benchmark, which unifies 17 datasets and over 20 sub-tasks along the developmental trajectory from physiological sounds to spoken language.

If this is right

High-performance models from the benchmark can be applied to characterize children's language levels.
The same models can track changes in speech production as children grow older.
Systematic comparisons across different child audio datasets and task domains become feasible.
The benchmark identifies models suitable for pediatric audio analysis applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could support development of clinical tools that flag early speech production differences.
Adding datasets from more varied cultural or linguistic backgrounds would test whether current results hold more broadly.
Pairing the audio tasks with age-matched language milestone data might strengthen links to developmental assessment.

Load-bearing premise

The 17 selected datasets together cover the full developmental trajectory from birth through school age in a representative way without major gaps in physiological sounds, non-linguistic vocalizations, or spoken language.

What would settle it

A new collection of infant physiological sounds where models fine-tuned on the ChildVox tasks show low accuracy would indicate that the selected datasets leave significant gaps.

Figures

Figures reproduced from arXiv: 2605.29257 by Aditya Kommineni, Anfeng Xu, Catherine Lord, Daniel Messinger, Helen Tager-Flusberg, Lynn K. Perry, Megan Micheletti, Mi Zhang, Shakhrul Iman Siam, Shrikanth Narayanan, Tiantian Feng, Xuan Shi, Zhonghao Shi.

**Figure 2.** Training data distribution of each sub-dataset.

**Figure 3.** LoRA fine-tune in encoder-based pre-trained models.

**Figure 4.** Macro-F1 on ChildVox-Balanced test set. Zero-shot proprietary models (Gemini 2.5/3.5 Flash) underperform the ChildVox-trained encoder and Qwen2-Audio baselines across all five datasets.

**Figure 5.** Utterance rate (utterances per minute) derived …
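The macro-F1 metric named in the Figure 4 caption averages per-class F1 without weighting by class frequency, so rare sound classes count as much as common ones. The following is a minimal sketch of that computation, not the authors' evaluation code; the `macro_f1` helper and the example labels (`cry`, `babble`, `laugh`) are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class that is never predicted correctly contributes F1 = 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels, for illustration only.
y_true = ["cry", "cry", "babble", "laugh"]
y_pred = ["cry", "babble", "babble", "laugh"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a model that ignores a small class is penalized more under macro-F1 than under plain accuracy, which matters when class sizes differ widely across sub-datasets.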

Original abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ChildVox unifies 17 existing child audio datasets into a multi-task benchmark, but the abstract supplies no methods, splits, or numbers, so the coverage and performance claims cannot be checked.

ChildVox combines 17 child audio datasets into more than 20 tasks that run from physiological sounds at birth through spoken language at school age. That unification for cross-corpus comparison is the actual new piece.

The work is useful in one narrow sense: it lists a coherent set of child-specific tasks and applies a range of current audio and audio-language models to them. Anyone who already works on child speech or developmental monitoring will recognize the task categories and see why a shared testbed could reduce duplicated effort.

The soft spots are straightforward. The abstract gives no information on dataset selection criteria, age sampling balance, demographic coverage, or how the sub-tasks were defined and split. Without those details the claim that the collection represents the full trajectory without major gaps cannot be evaluated, and the stress-test concern about under-represented early physiological sounds remains open. The statement that the benchmark yields “high-performance models” is also unsupported here because no metrics, baselines, or error bars appear. The paper is therefore an empirical benchmark paper whose central assertions rest on work that is not shown.

This is the sort of contribution that matters to a small group of researchers who need standardized child-audio evaluation. A reader already familiar with the 17 source datasets will get the task inventory; everyone else will need the full methods and results sections before the paper becomes usable.

It deserves peer review. The idea of a unified child benchmark is worth referee time provided the authors supply the missing curation details, data splits, and quantitative results.

Referee Report

1 major / 1 minor

Summary. The paper introduces ChildVox, a benchmark integrating 17 child-centered audio and speech datasets spanning more than 20 sub-tasks across the developmental trajectory from birth to school age. It covers physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language, and evaluates self-supervised, ASR-oriented, and large audio-language models on tasks including physiological sound classification, vocalization modeling, and speech quality assessment. The central claim is that the benchmark yields high-performance models for recognizing child acoustic signals, thereby supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Significance. If the datasets provide representative coverage without major gaps or biases, ChildVox would fill an important gap by enabling systematic cross-corpus evaluation of audio models on pediatric signals, where existing benchmarks are limited. The multi-domain evaluation across foundation model families is a constructive contribution that could guide model selection for child-specific applications. No machine-checked proofs or parameter-free derivations are present, as expected for an empirical benchmark paper.

major comments (1)

[Abstract and dataset integration section] The claim that the 17 datasets together provide representative, unbiased coverage of the full birth-to-school-age trajectory (including physiological sounds and non-linguistic vocalizations) is load-bearing for the downstream-application assertions. No explicit analysis of age sampling density, demographic balance, or gaps in early-infancy physiological data is referenced, so the generalization to 'characterizing children's language levels' does not automatically follow from per-task performance numbers.

minor comments (1)

[Abstract] The phrase 'more than 20 sub-tasks' is used without enumeration; a summary table listing task names, dataset sources, and metrics would improve immediate readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the need for explicit coverage analysis. We address the major comment below and will revise the manuscript to incorporate additional analysis of dataset demographics and age distributions.

Referee: [Abstract and dataset integration section] The claim that the 17 datasets together provide representative, unbiased coverage of the full birth-to-school-age trajectory (including physiological sounds and non-linguistic vocalizations) is load-bearing for the downstream-application assertions. No explicit analysis of age sampling density, demographic balance, or gaps in early-infancy physiological data is referenced, so the generalization to 'characterizing children's language levels' does not automatically follow from per-task performance numbers.

Authors: We agree that an explicit analysis of age sampling density, demographic balance, and gaps (particularly in early-infancy physiological data) would strengthen the load-bearing claims about representative coverage and better support assertions regarding downstream applications. In the revised manuscript, we will add a new subsection in the dataset integration section that provides: aggregated and per-dataset age histograms or summary statistics spanning birth to school age; available demographic metadata (e.g., gender or other reported attributes); and an explicit discussion of limitations and gaps. This will be derived from the source dataset metadata and will clarify the basis for generalizations to tasks such as characterizing language levels and tracking speech production with age.

Revision: yes
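An age-coverage analysis of the kind discussed in this exchange could start from a simple tally of recordings per dataset and age bin. This is a minimal sketch, assuming each recording carries a dataset name and a child age in months; the bin edges, the `age_coverage` helper, and the corpus names are hypothetical and not taken from the paper.

```python
from collections import defaultdict

# Hypothetical age bins in months, each covering [lower, upper).
AGE_BINS = [(0, 6), (6, 12), (12, 24), (24, 48), (48, 72), (72, 144)]


def age_coverage(records, bins=AGE_BINS):
    """Count recordings per (dataset, age bin) and report bins with no data.

    records: iterable of (dataset_name, age_in_months) pairs.
    Ages that fall outside every bin are ignored.
    """
    counts = defaultdict(lambda: [0] * len(bins))
    for dataset, age_months in records:
        for i, (lower, upper) in enumerate(bins):
            if lower <= age_months < upper:
                counts[dataset][i] += 1
                break

    # Aggregate over datasets, then flag age bins that no dataset covers.
    totals = [sum(row[i] for row in counts.values()) for i in range(len(bins))]
    gaps = [bins[i] for i, n in enumerate(totals) if n == 0]
    return dict(counts), totals, gaps


# Made-up metadata, for illustration only.
records = [
    ("corpus_a", 2), ("corpus_a", 9),
    ("corpus_b", 30), ("corpus_b", 60), ("corpus_b", 100),
]
per_dataset, totals, gaps = age_coverage(records)
```

In this toy input the 12 to 24 month bin is empty, which is the kind of gap the referee asks the authors to rule out or acknowledge.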

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivations or fitted predictions

The paper aggregates 17 existing child audio datasets into more than 20 sub-tasks and evaluates off-the-shelf audio, speech, and large audio-language models on them. No equations, parameter fitting, or first-principles derivations are claimed; results are direct empirical measurements. The central claim (high-performance models support downstream uses) rests on the benchmark numbers themselves rather than any reduction to inputs by construction. No load-bearing self-citations or smuggled ansatz appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that repurposed existing datasets can be combined without introducing major domain-specific biases, but introduces no new free parameters, invented entities, or non-standard axioms beyond typical machine learning evaluation practices.

axioms (1)

Domain assumption: Existing child audio datasets from separate studies can be aggregated into a coherent benchmark without significant labeling inconsistencies or selection effects.
Invoked when the abstract states integration of 17 datasets into more than 20 sub-tasks for cross-corpus comparison.
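
One way to probe the labeling-consistency part of this assumption is to map each corpus's local labels onto a shared label set and list whatever fails to map. The sketch below is illustrative only: the `harmonize` helper, the corpus names, and every label in `LABEL_MAP` are hypothetical and do not come from the paper or its datasets.

```python
# Hypothetical mapping from (corpus, local label) to a shared label.
LABEL_MAP = {
    ("corpus_a", "crying"): "cry",
    ("corpus_a", "fuss"): "fuss",
    ("corpus_b", "CRY"): "cry",
    ("corpus_b", "BAB"): "babble",
}


def harmonize(samples, label_map=LABEL_MAP):
    """Translate (corpus, local_label) pairs into the shared label set.

    Returns the mapped samples and a sorted list of (corpus, label)
    pairs that have no entry in the mapping.
    """
    mapped, unmapped = [], set()
    for corpus, label in samples:
        key = (corpus, label)
        if key in label_map:
            mapped.append((corpus, label_map[key]))
        else:
            unmapped.add(key)
    return mapped, sorted(unmapped)


# Made-up samples, for illustration only.
samples = [("corpus_a", "crying"), ("corpus_b", "CRY"), ("corpus_b", "LAU")]
mapped, unmapped = harmonize(samples)
```

Any non-empty `unmapped` list points to a label that would need an explicit curation decision before the corpora could be compared on a shared task.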

pith-pipeline@v0.9.1-grok · 5735 in / 1394 out tokens · 35533 ms · 2026-06-29T06:06:23.353666+00:00 · methodology
