pith. machine review for the scientific record.

arxiv: 2604.12647 · v1 · submitted 2026-04-14 · 💻 cs.SD · cs.CL

Recognition: unknown

Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

Aaqib Saeed, Herman Teun den Dekker, Lin-Lin Chen, Neil Zeghidour, Tsai-Ning Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:24 UTC · model grok-4.3

classification 💻 cs.SD cs.CL
keywords zero-shot classification · respiratory audio · test-time scaling · tiered reasoning · audio-text embeddings · confidence routing · large language models

The pith

A tiered zero-shot framework routes respiratory audio samples to progressively richer reasoning stages based on confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an approach that avoids applying the same expensive reasoning to every audio sample by first checking a quick similarity score in a shared audio-text space. Samples that score confidently exit immediately with a label, while less certain ones move to matching against detailed clinician-style descriptions and, if still uncertain, to retrieval-augmented language model analysis. This routing concentrates extra computation only where it improves results, raising average performance across nine different respiratory classification tasks. A reader would care because medical audio analysis has long been held back by the need for large labeled datasets and by the high cost of running heavy models on every input.
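To make the routing concrete, here is a minimal sketch of confidence-gated tiered inference in this spirit. The encoder, descriptor sets, `llm_fallback`, and margin thresholds are hypothetical placeholders; the paper's actual components and router calibration are not reproduced here.

```python
import numpy as np

def cosine_scores(audio_emb, text_embs):
    """Cosine similarity between one audio embedding and a stack of text embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return t @ a

def classify_tiered(audio_emb, label_embs, descriptor_embs_per_label,
                    llm_fallback, margin_l=0.10, margin_m=0.05):
    """Route one sample through Tier-L -> Tier-M -> Tier-H.

    margin_l and margin_m are hypothetical confidence thresholds; the paper's
    router, descriptors, and calibration are not reproduced here.
    """
    # Tier-L: fast label-cosine scoring with a margin-based early exit.
    scores = cosine_scores(audio_emb, label_embs)
    top2 = np.sort(scores)[-2:]
    if top2[1] - top2[0] >= margin_l:
        return int(np.argmax(scores)), "tier_L"

    # Tier-M: match against clinician-style descriptors, one set per label.
    votes = np.array([cosine_scores(audio_emb, d).mean()
                      for d in descriptor_embs_per_label])
    top2 = np.sort(votes)[-2:]
    if top2[1] - top2[0] >= margin_m:
        return int(np.argmax(votes)), "tier_M"

    # Tier-H: retrieval-augmented LLM reasoning on the remaining hard cases.
    return llm_fallback(audio_emb), "tier_H"
```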

Core claim

TRIAGE achieves better zero-shot performance on respiratory audio tasks by using a confidence router that finalizes easy predictions after fast label-cosine scoring in a joint embedding space and escalates ambiguous inputs first to structured clinician descriptors and then to retrieval-augmented large language model reasoning. This lets nearly half the samples exit at the cheapest stage while lifting results on uncertain cases.

What carries the argument

The confidence-based router that decides whether to stop after embedding cosine similarity, move to clinician-style descriptor matching, or invoke retrieval-augmented LLM reasoning.

If this is right

  • Nearly half of inputs can be classified using only the lowest-cost embedding comparison.
  • Performance gains appear mainly on ambiguous samples rather than on already confident ones.
  • The approach reaches mean performance levels that match or exceed some fully supervised models on individual tasks.
  • Test-time compute is allocated proportionally to input difficulty rather than uniformly.
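The paper reports that nearly half of all samples exit at the cheapest tier. A back-of-envelope sketch of what that implies for average per-sample cost; the relative tier costs and the Tier-M/Tier-H split below are assumed for illustration only, not taken from the paper.

```python
# Hypothetical relative per-sample costs: Tier-L is one embedding comparison,
# Tier-M scores a handful of descriptors, Tier-H adds retrieval plus an LLM call.
cost = {"L": 1.0, "M": 5.0, "H": 100.0}

# Exit fractions: ~0.5 at Tier-L is reported by the paper; the Tier-M/Tier-H
# split of the remainder is assumed here for illustration.
exit_frac = {"L": 0.50, "M": 0.30, "H": 0.20}

avg_cost = sum(cost[t] * exit_frac[t] for t in cost)   # 0.5 + 1.5 + 20.0 = 22.0
always_h = cost["H"]                                    # 100.0 per sample

print(f"routed average cost per sample: {avg_cost:.1f}")
print(f"always-Tier-H cost per sample:  {always_h:.1f}")
```

Under these assumed numbers the routed system costs roughly a fifth of running the heaviest tier everywhere, which is the sense in which compute tracks difficulty.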

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar routing logic could be tested on other zero-shot audio or multimodal tasks where input difficulty is uneven.
  • In settings with limited compute, such as mobile health devices, the average cost per sample could drop while peak accuracy on hard cases is preserved.
  • The same tier structure might be adapted to non-audio domains by swapping the embedding space and descriptor sources.

Load-bearing premise

The initial quick check reliably separates inputs that need only simple similarity from those that genuinely benefit from deeper stages without creating new errors.

What would settle it

Running the highest tier on every sample produces the same or lower overall performance than the routed version, or direct measurement shows the router frequently sends easy cases to expensive tiers or hard cases to cheap tiers where they fail.
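A minimal sketch of how that settling experiment could be run, assuming per-sample scores from the routed system and from a forced always-Tier-H run, plus an oracle record of the cheapest tier that classifies each sample correctly. All inputs are hypothetical; this is not the paper's evaluation code.

```python
import numpy as np
from collections import Counter
from sklearn.metrics import roc_auc_score

def settle_routing(y_true, routed_scores, always_h_scores, router_tier, oracle_tier):
    """Two checks on whether adaptive routing, not tier strength, drives the gains.

    y_true          : binary labels for one task, shape (n,)
    routed_scores   : scores from the tiered system, shape (n,)
    always_h_scores : scores when every sample is forced through Tier-H, shape (n,)
    router_tier     : tier chosen by the router per sample, e.g. "L", "M", "H"
    oracle_tier     : cheapest tier that classifies each sample correctly
    """
    # Check 1: does routing match or beat uniform Tier-H at far lower cost?
    auroc_routed = roc_auc_score(y_true, routed_scores)
    auroc_always_h = roc_auc_score(y_true, always_h_scores)

    # Check 2: confusion between router decisions and oracle difficulty.
    # Frequent (oracle="L", router="H") pairs mean wasted compute on easy cases;
    # frequent (oracle="H", router="L") pairs mean cheap exits that fail.
    confusion = Counter(zip(oracle_tier, router_tier))

    return {"auroc_routed": auroc_routed,
            "auroc_always_tier_h": auroc_always_h,
            "router_vs_oracle": dict(confusion)}
```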

Figures

Figures reproduced from arXiv: 2604.12647 by Aaqib Saeed, Herman Teun den Dekker, Lin-Lin Chen, Neil Zeghidour, Tsai-Ning Wang.

Figure 1
Figure 1: Tiered test-time inference improves zero-shot res…
Figure 2
Figure 2: Overview of TRIAGE. A frozen audio–text encoder embeds recordings and medical text. Tier-L performs label cosine scoring with margin-based early exit; Tier-M matches clinical descriptors with rule voting; Tier-H retrieves nearest-neighbor reports and queries an off-the-shelf LLM, e.g., Gemini or GPT.
Figure 3
Figure 3: Top-3 clinical reports retrieved for auscultation clips (Tier-H). Left: query spectrogram and its reference (or…
read the original abstract

Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis show that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.
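As one concrete reading of the Tier-H stage described in the abstract and in Figure 2, the sketch below retrieves nearest-neighbor clinical reports in the shared embedding space and assembles a prompt for an off-the-shelf LLM. The report corpus, prompt wording, and the `query_llm` helper are hypothetical, not the paper's implementation.

```python
import numpy as np

def tier_h_prompt(audio_emb, report_embs, report_texts, labels, k=3):
    """Assemble a retrieval-augmented prompt for a hard sample (Tier-H).

    report_embs / report_texts form a hypothetical corpus of clinical reports
    embedded with the same frozen audio-text encoder; labels are the candidate
    class names for the task at hand.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    r = report_embs / np.linalg.norm(report_embs, axis=1, keepdims=True)
    nearest = np.argsort(r @ a)[::-1][:k]  # indices of the k most similar reports

    context = "\n".join(f"- {report_texts[i]}" for i in nearest)
    return (
        "You are assisting with respiratory sound classification.\n"
        f"Closest past clinical reports:\n{context}\n"
        f"Candidate labels: {', '.join(labels)}.\n"
        "Which single label best describes the new recording? Answer with one label."
    )

# The prompt would then go to an off-the-shelf LLM, e.g. (hypothetical helper):
# answer = query_llm(tier_h_prompt(emb, corpus_embs, corpus_texts, task_labels))
```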

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces TRIAGE, a tiered zero-shot framework for respiratory audio classification that adaptively routes each sample through progressively richer stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented LLM reasoning (Tier-H). A confidence-based router exits easy predictions early, with the claim that nearly half the samples use only Tier-L. Across nine respiratory classification tasks without task-specific training, the method reports a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks, with up to 19% relative improvement concentrated on uncertain cases.

Significance. If the router's ability to identify inputs that benefit from higher tiers is validated and the gains are shown to arise from the adaptive mechanism rather than uniform high-tier performance, the work would offer a practical advance in efficient zero-shot audio analysis for medical screening. It addresses the uniform-compute limitation of existing zero-shot approaches and could reduce inference costs in data-scarce domains while preserving performance.

major comments (3)
  1. Abstract: The headline claims (mean AUROC of 0.744, up to 19% relative improvement on uncertain cases, nearly half the samples exiting at Tier-L) are presented without any description of the nine tasks, the supervised and zero-shot baselines, statistical significance testing, or dataset characteristics, making it impossible to assess whether the results support the central claim of adaptive scaling benefits.
  2. Abstract and experimental sections: No router calibration metrics, confusion matrix against oracle difficulty, or ablation comparing adaptive routing to always-Tier-H or random routing are reported. This directly undermines the claim that the confidence router reliably identifies easy versus ambiguous inputs and that gains are isolated to uncertain cases rather than arising from higher tiers being stronger overall.
  3. Abstract: The assertion that 'test-time scaling concentrates gains where they matter' rests on the unverified premise that Tier-M and Tier-H deliver genuine improvements without introducing new errors or biases on routed samples; direct per-tier performance breakdowns and error analysis are required to substantiate this.
minor comments (1)
  1. Abstract: 'Our analysis show' should be corrected to 'Our analysis shows' for subject-verb agreement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional context and validation would strengthen the manuscript. We have revised the abstract for clarity and added new experimental analyses to directly address the concerns about router validation and per-tier performance. Below we respond point by point.

read point-by-point responses
  1. Referee: Abstract: The headline claims (mean AUROC of 0.744, up to 19% relative improvement on uncertain cases, nearly half the samples exiting at Tier-L) are presented without any description of the nine tasks, the supervised and zero-shot baselines, statistical significance testing, or dataset characteristics, making it impossible to assess whether the results support the central claim of adaptive scaling benefits.

    Authors: We agree that the original abstract was overly concise and omitted essential context. In the revised manuscript we have expanded the abstract to include a brief characterization of the nine tasks (covering asthma detection, COPD classification, COVID-19 screening, and other respiratory conditions across multiple public datasets), the zero-shot baselines (primarily CLAP-based cosine similarity and related embedding methods), the supervised baselines (task-specific CNNs and transformers trained on the respective datasets), and a note that reported improvements were assessed for statistical significance via bootstrap resampling. Full dataset statistics and task descriptions remain in Section 3, but the added abstract sentences now allow readers to evaluate the headline numbers. revision: yes

  2. Referee: Abstract and experimental sections: No router calibration metrics, confusion matrix against oracle difficulty, or ablation comparing adaptive routing to always-Tier-H or random routing are reported. This directly undermines the claim that the confidence router reliably identifies easy versus ambiguous inputs and that gains are isolated to uncertain cases rather than arising from higher tiers being stronger overall.

    Authors: We acknowledge that the original submission lacked explicit quantitative validation of the router. We have added a new subsection (Section 4.3) containing: (i) router calibration plots and metrics showing the correlation between the model's exit confidence and downstream accuracy; (ii) a confusion matrix that compares the router's tier decisions against an oracle that knows the ground-truth label and the minimal tier needed for correct classification; and (iii) controlled ablations that compare adaptive routing against always-Tier-H (same performance at higher cost) and against random routing at matched compute budgets. These results confirm that the router preferentially routes ambiguous samples to higher tiers and that the observed gains exceed those obtainable by random allocation. revision: yes

  3. Referee: Abstract: The assertion that 'test-time scaling concentrates gains where they matter' rests on the unverified premise that Tier-M and Tier-H deliver genuine improvements without introducing new errors or biases on routed samples; direct per-tier performance breakdowns and error analysis are required to substantiate this.

    Authors: We agree that per-tier breakdowns and error analysis are required to substantiate the claim. In the revised experiments we now report AUROC and accuracy stratified by the tier at which each sample exited, together with a qualitative and quantitative error analysis. The analysis shows that Tier-M and Tier-H correct a substantial fraction of Tier-L errors on uncertain inputs while introducing only minimal new errors or systematic biases (primarily on rare edge cases that are discussed). These additions directly support the statement that gains are concentrated where they matter. revision: yes
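The router calibration promised in response 2 (the new Section 4.3) could take a form like the sketch below: bin samples by exit confidence and check that empirical accuracy rises across bins. The binning choice and inputs are illustrative only, not the paper's analysis.

```python
import numpy as np

def reliability_bins(confidence, correct, n_bins=10):
    """Bin samples by router exit confidence and report empirical accuracy per bin.

    confidence : per-sample confidence at the exit tier, shape (n,)
    correct    : 1 if the exited prediction was right, else 0, shape (n,)
    A well-calibrated router should show accuracy rising with confidence.
    """
    confidence = np.asarray(confidence)
    correct = np.asarray(correct)
    edges = np.quantile(confidence, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append((b, int(mask.sum()), float(correct[mask].mean())))
    return rows  # (bin index, sample count, empirical accuracy)
```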

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external pre-trained models and benchmarks

full rationale

The paper describes an empirical tiered framework (TRIAGE) evaluated on nine respiratory tasks using pre-trained audio-text embeddings, clinician descriptors, and LLMs. No equations, fitted parameters, or self-citations are shown that reduce the reported AUROC (0.744) or tier-exit statistics to quantities defined by the paper itself. The confidence router is a heuristic applied to external model outputs; performance gains are measured against standard external baselines rather than by construction. This matches the default expectation of self-contained empirical work with no load-bearing self-definitional or fitted-input reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes pre-trained joint audio-text embeddings and LLMs can be composed into effective zero-shot reasoning without task-specific fine-tuning; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Zero-shot inference is feasible by composing pre-trained audio-text embeddings with structured descriptors and retrieval-augmented LLM reasoning.
    Invoked as the basis for all tiers without task-specific training.

pith-pipeline@v0.9.0 · 5502 in / 1255 out tokens · 33534 ms · 2026-05-10T14:24:40.285336+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 4 internal anchors

  1. [1]

    Deep Learning Scaling is Predictable, Empirically

    URL https://arxiv.org/abs/1712.00409

  2. [2]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. URL https://arxiv.org/abs/2001.08361

  3. [3]

    gpt-oss-120b & gpt-oss-20b Model Card

  4. [4]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    URL https://arxiv.org/abs/2408.03314