Mixture of Experts for Recognizing Depression from Interview and Reading Tasks

Dimitris Askounis; Loukas Ilias

arxiv: 2502.20213 · v2 · submitted 2025-02-27 · 💻 cs.LG · cs.CY

Pith reviewed 2026-05-23 01:18 UTC · model grok-4.3

keywords depression recognition · mixture of experts · speech analysis · multimodal fusion · spontaneous speech · read speech · log-Mel spectrogram · AlexNet

The pith

Combining representations from both spontaneous interview speech and read speech via multimodal fusion and a Mixture of Experts layer reaches 87 percent accuracy on depression detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that depression recognition from speech improves when a single neural network processes audio from both interview tasks and reading tasks together. It converts each audio file to log-Mel spectrogram plus delta and delta-delta features, routes them through shared AlexNet models, applies multimodal fusion, and then routes the fused vector through a Mixture of Experts module. Prior methods are limited to spontaneous speech only and often depend on transcripts. A sympathetic reader would care because the approach removes the need for error-prone transcripts while using readily available audio, potentially enabling more reliable early screening. The reported results are 87.00 percent accuracy and 86.66 percent F1-score on the Androids corpus.

Core claim

This is the first study in the depression recognition task that obtains representations of both spontaneous and read speech, utilizes multimodal fusion methods, and employs Mixture of Experts models inside a single deep neural network. Audio files from interview and reading tasks are converted into log-Mel spectrogram, delta, and delta-delta representations; these image-like inputs pass through shared AlexNet models whose outputs are fused multimodally; the resulting vector then enters a MoE module using either sparsely-gated or multilinear variants based on factorization. The approach produces 87.00 percent accuracy and 86.66 percent F1-score on the Androids corpus.
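
The feature step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common regression-based delta formula, a window half-width of 2, and edge padding by repeating the first and last frames, none of which are stated in the abstract. The names `delta` and `three_channel` are hypothetical.

```python
# Illustrative sketch: delta and delta-delta features over a log-Mel matrix.
# Assumptions (not from the paper): regression-based deltas, half-width n=2,
# and edge handling by repeating the first/last frame.

def delta(frames, n=2):
    """frames: list of time steps, each a list of mel-band values."""
    t_len = len(frames)
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(t_len):
        row = []
        for b in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                nxt = frames[min(t + k, t_len - 1)][b]  # clamp at the last frame
                prv = frames[max(t - k, 0)][b]          # clamp at the first frame
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def three_channel(log_mel, n=2):
    """Stack log-Mel, delta, and delta-delta as three image-like channels."""
    d1 = delta(log_mel, n)
    d2 = delta(d1, n)  # delta-delta is the delta of the delta
    return [log_mel, d1, d2]
```

Stacking the three matrices as channels is what lets an image model such as AlexNet consume the audio as a three-channel input.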

What carries the argument

The Mixture of Experts (MoE) module placed after multimodal fusion of AlexNet outputs from log-Mel spectrograms of both tasks, which performs input-conditional computation using sparsely-gated or multilinear factorization-based variants.
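
Input-conditional computation in a sparsely-gated MoE can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's implementation: experts are plain linear maps, the gate is a linear layer followed by top-k selection and a softmax over the selected logits, and the noise term often used in sparsely-gated MoE training is omitted. The multilinear variant is not shown. All function names are hypothetical.

```python
import math

# Illustrative sketch of top-k sparse gating. Assumptions (not from the paper):
# linear experts, a linear gate, k=2, and no gating noise.

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other gates are exactly zero."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


def moe_forward(x, gate_w, experts, k=2):
    """Return the gate-weighted sum of the selected experts' outputs."""
    gates = top_k_gate(linear(gate_w, x), k)
    out = [0.0] * len(experts[0])
    for g, expert_w in zip(gates, experts):
        if g == 0.0:
            continue  # experts with a zero gate are skipped entirely
        y = linear(expert_w, x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates
```

Because unselected experts are skipped, the cost per input depends on `k` rather than on the total number of experts, which is the sense in which the computation is conditional on the input.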

If this is right

Audio from both spontaneous interview speech and structured reading tasks supplies complementary information that multimodal fusion can exploit.
Avoiding transcripts altogether removes dependence on manual annotation or automatic speech recognition with high error rates.
Placing a Mixture of Experts layer after fusion enables the network to route different input combinations to specialized sub-networks.
The same architecture can be trained end-to-end without separate feature engineering steps for each speech type.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the dual-task input proves robust, the same pipeline could be adapted to detect other conditions whose speech signatures appear in both spontaneous and read contexts, such as anxiety or cognitive decline.
Real-world deployment would require checking whether the MoE component still adds value when the model is compressed for mobile devices.
A controlled ablation that removes either the reading-task branch or the MoE layer on the same corpus would clarify which element drives the reported gain.

Load-bearing premise

The Androids corpus is a representative and unbiased collection of speech recordings whose performance numbers reflect genuine generalization rather than overfitting to the particular audio features or participant pool.

What would settle it

Retraining and testing the identical pipeline on an independent speech dataset collected under different conditions or from a different population and obtaining accuracy or F1-score below 70 percent would show the reported performance does not generalize.
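
The check described above amounts to computing accuracy and F1 on the independent dataset and comparing them with the 70 percent line. The sketch below assumes binary labels and a binary F1 for the positive (depressed) class; the abstract does not say whether the paper's F1 is binary, macro, or weighted, so that choice is an assumption. The function names are hypothetical.

```python
# Illustrative sketch: accuracy and binary F1 from predictions, plus the
# 70 percent criterion. Assumption: F1 is computed for the positive class only.

def accuracy_and_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def below_threshold(acc, f1, threshold=0.70):
    """True if either metric falls under the threshold."""
    return acc < threshold or f1 < threshold
```

For example, `below_threshold(0.87, 0.8666)` is false for the figures reported on the Androids corpus, while a replication scoring 0.67 on either metric would be flagged.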

Original abstract

Depression is a mental disorder and can cause a variety of symptoms, including psychological, physical, and social. Speech has been proved an objective marker for the early recognition of depression. For this reason, many studies have been developed aiming to recognize depression through speech. However, existing methods rely on the usage of only the spontaneous speech neglecting information obtained via read speech, use transcripts which are often difficult to obtain (manual) or come with high word-error rates (automatic), and do not focus on input-conditional computation methods. To resolve these limitations, this is the first study in depression recognition task obtaining representations of both spontaneous and read speech, utilizing multimodal fusion methods, and employing Mixture of Experts (MoE) models in a single deep neural network. Specifically, we use audio files corresponding to both interview and reading tasks and convert each audio file into log-Mel spectrogram, delta, and delta-delta. Next, the image representations of the two tasks pass through shared AlexNet models. The outputs of the AlexNet models are given as input to a multimodal fusion method. The resulting vector is passed through a MoE module. In this study, we employ three variants of MoE, namely sparsely-gated MoE and multilinear MoE based on factorization. Findings suggest that our proposed approach yields an Accuracy and F1-score of 87.00% and 86.66% respectively on the Androids corpus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Abstract claims first use of spontaneous plus read speech with MoE fusion for 87% depression detection accuracy, but gives no splits, baselines or corpus stats to support it.

The main thing here is an empirical claim: 87% accuracy and 86.66% F1 on the Androids corpus by running log-Mel spectrograms from both interview and reading tasks through shared AlexNets, fusing the outputs, and routing them through one of three MoE variants. The authors say this is the first network to handle both speech types plus multimodal fusion plus MoE together. That combination is the stated novelty.

The paper does a reasonable job naming real limitations in prior work (most studies stick to spontaneous speech and often need transcripts), and it sketches a direct audio pipeline that avoids those issues. Using delta and delta-delta features alongside the base spectrogram is a standard but sensible choice for capturing dynamics.

The soft spots are large and central. The abstract contains no dataset size, no subject count, no train/test protocol, no statement on whether splits are subject-independent, no baselines, no error bars, and no ablation results. Without those, the 87% number cannot be read as evidence that the MoE or the read-speech addition actually helps rather than reflecting corpus artifacts or an unstated split. The novelty claim also cannot be checked against the literature from the abstract alone.

This work is aimed at researchers already doing audio-based depression detection who might want to test whether adding read speech changes performance. A reader in that narrow area could extract the basic architecture idea, but the missing controls mean the result is not usable or citable yet. I would send the full manuscript to peer review if it supplies proper validation, subject-independent splits, and direct comparisons; the underlying idea of mixing the two speech types is worth testing even if this version is incomplete.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to be the first study to combine representations from both spontaneous (interview) and read speech tasks for depression recognition, using multimodal fusion and Mixture of Experts (MoE) models within a single deep neural network. Audio files from both tasks are converted to log-Mel spectrograms, delta, and delta-delta features; these pass through shared AlexNet models, followed by multimodal fusion and one of three MoE variants (sparsely-gated or multilinear based on factorization). The central empirical result is 87.00% accuracy and 86.66% F1-score on the Androids corpus.

Significance. If validated with appropriate controls, the integration of spontaneous and read speech with input-conditional MoE computation could represent a meaningful architectural advance over prior single-task or non-MoE approaches in speech-based depression detection. The work explicitly positions itself as novel in combining these elements, which would be a strength if the performance gain is shown to be attributable to the proposed components rather than corpus artifacts.

major comments (2)

[Abstract] Abstract: the central claim rests on an empirical performance result (87.00% Acc / 86.66% F1) but supplies no dataset statistics, train/test protocol (including whether splits are subject-independent), baselines, error bars, or statistical tests. Without these, it is impossible to determine whether the reported numbers support the architectural contribution or arise from data leakage, weak controls, or overfitting.
[Abstract] Abstract: no description is provided of the Androids corpus (participant count, label distribution, task durations, or how interview and reading audio are paired or fused), which is load-bearing for assessing whether the multimodal + MoE design generalizes beyond the specific data used.
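
The subject-independent protocol the referee asks about can be sketched as below. This is a generic illustration, not the paper's actual protocol, which the abstract does not describe: folds are assigned per participant, so a participant's interview and reading recordings never straddle the train/test boundary. The function name, the fold count, and the subject IDs are hypothetical.

```python
import random

# Illustrative sketch: cross-validation folds assigned by subject, so no
# participant contributes recordings to both train and test.
# Assumptions: one subject ID per recording; n_folds and seed are arbitrary.

def subject_independent_folds(subject_ids, n_folds=5, seed=0):
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for f in range(n_folds):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == f]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != f]
        folds.append((train, test))
    return folds
```

Splitting by recording instead of by subject would let the model see the same speaker in training and testing, which is the leakage risk the comment raises.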

minor comments (1)

[Abstract] Abstract: the sentence beginning 'Findings suggest that our proposed approach yields' is vague; results should be stated directly as observed values rather than hedged.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and agree to revise the abstract to include the requested contextual information.

Referee: [Abstract] Abstract: the central claim rests on an empirical performance result (87.00% Acc / 86.66% F1) but supplies no dataset statistics, train/test protocol (including whether splits are subject-independent), baselines, error bars, or statistical tests. Without these, it is impossible to determine whether the reported numbers support the architectural contribution or arise from data leakage, weak controls, or overfitting.

Authors: We agree that the abstract would benefit from additional context to support the reported performance metrics. In the revised manuscript, we will expand the abstract to briefly describe the Androids corpus (e.g., number of participants and label distribution), the train/test protocol including subject-independent splits, and mention the comparison to baselines with statistical tests where applicable. This will help readers evaluate the results more effectively. The full details are already present in the methods and results sections of the paper.
revision: yes

Referee: [Abstract] Abstract: no description is provided of the Androids corpus (participant count, label distribution, task durations, or how interview and reading audio are paired or fused), which is load-bearing for assessing whether the multimodal + MoE design generalizes beyond the specific data used.

Authors: We acknowledge the need for a concise description of the dataset in the abstract. We will add a sentence summarizing the key characteristics of the Androids corpus, including participant numbers, label balance, task durations, and the nature of the interview and reading tasks with how audio is paired and fused. Details on audio pairing and fusion are described in the methodology section, but a high-level overview will be included in the abstract for completeness.
revision: yes

Circularity Check

0 steps flagged

No derivation chain; purely empirical performance report

The paper contains no mathematical derivations, equations, predictions, or first-principles results. It describes a pipeline (spectrograms through AlexNet, multimodal fusion, MoE variants) and reports an empirical accuracy/F1 on the Androids corpus. None of the enumerated circularity patterns apply because there are no claimed derivations that could reduce to inputs by construction, no fitted parameters renamed as predictions, and no self-citation chains invoked to justify uniqueness or ansatzes. The result is a standard empirical claim whose validity depends on external factors like data splits and baselines, not internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no access to methods section to identify fitted parameters, background axioms, or new entities introduced. The approach relies on standard deep learning components like AlexNet and MoE without introducing new entities.

pith-pipeline@v0.9.0 · 5757 in / 1305 out tokens · 70056 ms · 2026-05-23T01:18:52.981667+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

employing Mixture of Experts (MoE) models in a single deep neural network... yields an Accuracy and F1-score of 87.00% and 86.66% respectively on the Androids corpus

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.