Do Multimodal Large Language Models Need Reasoning to Classify Dementia from Speech?

Bradford C. Dickerson; James Glass; Liming Wang; Neguine Rezaii

arxiv: 2607.00260 · v2 · pith:FMW2Z3KWnew · submitted 2026-06-30 · 📡 eess.AS

Pith reviewed 2026-07-02 16:30 UTC · model grok-4.3

classification 📡 eess.AS

keywords dementia classification · speech analysis · multimodal large language models · reasoning models · adaptor framework · reinforcement learning · automatic diagnosis · internal representations

The pith

Internal representations from reasoning MLLMs improve dementia classification from speech when accessed via adaptor and reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether reasoning capabilities in multimodal large language models benefit automatic dementia classification from voice recordings. It demonstrates that generating text-based rationales often produces hallucinations and inconsistent diagnoses, resulting in performance worse than LLM-free baselines. The authors introduce DeTAiL, which instead extracts signals directly from the models' internal representations using a nonlinear adaptor and reinforcement learning. This method outperforms strong baselines and text-rationale approaches on two dementia datasets that differ in test format and label detail. Readers would care because it identifies a practical way to use advanced models for medical tasks while sidestepping the unreliability of their generated explanations.

Core claim

Naive strategies such as relying on text-based rationales from reasoning MLLMs lead to hallucinated and inconsistent rationales for diagnosis and yield inferior automatic dementia classification performance compared with LLM-free baselines. DeTAiL, an adaptor-based framework that exploits the internal representations of reasoning MLLMs, consistently outperforms strong baselines and methods that rely on text-based rationales across two dementia datasets with distinct test formats and label granularities.

What carries the argument

DeTAiL, an adaptor-based framework using a nonlinear adaptor and reinforcement learning to extract dementia-relevant signals from the internal representations of reasoning MLLMs.
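As a rough illustration of what a nonlinear adaptor on internal representations might look like, the sketch below runs a one-hidden-layer MLP over a single hidden-state vector and returns class probabilities. It is not the authors' implementation: the vector size, hidden width, ReLU activation, random initialisation, and the names `make_mlp` and `adaptor_forward` are all assumptions made for the example, and no training loop is shown.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, n_classes, seed=0):
    """Randomly initialised one-hidden-layer MLP (all sizes are hypothetical)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": mat(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "W2": mat(n_classes, hidden_dim), "b2": [0.0] * n_classes,
    }


def affine(W, b, x):
    """Compute W @ x + b with plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def adaptor_forward(params, hidden_state):
    """Nonlinear adaptor: affine -> ReLU -> affine -> softmax."""
    h = [max(0.0, v) for v in affine(params["W1"], params["b1"], hidden_state)]
    return softmax(affine(params["W2"], params["b2"], h))


# Stand-in for one pooled hidden-state vector taken from a frozen MLLM layer.
hidden_state = [0.5, -1.2, 0.3, 0.8]
params = make_mlp(in_dim=4, hidden_dim=8, n_classes=2)
probs = adaptor_forward(params, hidden_state)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Dropping the hidden layer and ReLU would turn this into a linear probe, which is the control the referee report asks for.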

If this is right

DeTAiL achieves higher accuracy than baselines and text-rationale methods on speech-based dementia classification.
Internal representations avoid the hallucinations and inconsistencies of generated text rationales.
The framework maintains gains across datasets that vary in recording format and diagnostic label granularity.
Reasoning MLLMs contribute to the task through their hidden states rather than their explicit reasoning chains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar adaptor extraction could be tested on other voice-based medical classification tasks to check whether the pattern generalizes.
The results suggest that for some diagnostic applications the value of MLLMs may lie more in their embeddings than in their generated rationales.
One could examine whether the same internal-representation approach works with MLLMs that were not explicitly trained for reasoning.

Load-bearing premise

Internal representations of reasoning MLLMs contain dementia-relevant signals that can be extracted by a nonlinear adaptor and reinforcement learning without the hallucinations seen in text outputs.

What would settle it

A replication study on a third dementia dataset where DeTAiL shows no improvement over strong baselines or where the extracted internal signals fail to correlate with clinical labels.

Figures

Figures reproduced from arXiv: 2607.00260 by Bradford C. Dickerson, James Glass, Liming Wang, Neguine Rezaii.

**Figure 1.** Overall architecture of DeTAiL. (a) In the distillation and GRPO stages, the MLLM learns to generate both the cognitive label and the textual rationale that explains its prediction; (b) in the MLP adaptor stage, a small MLP classifier is trained on the hidden representation of the MLLM given the prompt and the generated rationale. Although a pretrained LLM can be prompted as an ADC, it often underperforms …
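The caption names a GRPO stage without detail. GRPO (group relative policy optimization) scores each sampled response against the other responses drawn for the same prompt. The sketch below shows only that normalisation step. The 0/1 label-correctness reward, the use of the population standard deviation, and the names `label_reward` and `group_relative_advantages` are assumptions for illustration; this page does not describe the reward the paper actually uses.

```python
from statistics import mean, pstdev


def label_reward(predicted_label, true_label):
    """Hypothetical reward: 1.0 when the generated cognitive label is correct."""
    return 1.0 if predicted_label == true_label else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one recording whose reference label is "dementia".
sampled_labels = ["dementia", "control", "dementia", "dementia"]
rewards = [label_reward(p, "dementia") for p in sampled_labels]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group mean get a positive advantage and the rest get a negative one. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.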

**Figure 3.** Reliability of the most frequent evidence types in the rationale for DeTAiL on ADReSS. Reliability is estimated by computing the percentage of correct predictions using a given evidence. For the two-class setting, earlier layers tend to work better, suggesting that relatively low-level representations are sufficient for distinguishing controls from cognitively impaired participants. The trend differs for …
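The reliability measure in the Figure 3 caption can be read as a per-evidence accuracy: among the predictions whose rationale cites a given evidence type, what percentage were correct. A minimal sketch under that reading follows. The evidence names and records are invented for the example, and `evidence_reliability` is a hypothetical helper, not code from the paper.

```python
from collections import defaultdict


def evidence_reliability(records):
    """records: iterable of (evidence_types, is_correct) pairs, one per prediction.

    Returns, for each evidence type, the percentage of predictions citing it
    that were correct.
    """
    used = defaultdict(int)
    correct = defaultdict(int)
    for evidence_types, is_correct in records:
        for ev in set(evidence_types):  # count each type once per rationale
            used[ev] += 1
            correct[ev] += int(is_correct)
    return {ev: 100.0 * correct[ev] / used[ev] for ev in used}


# Invented example: evidence types cited in four rationales and whether
# the accompanying prediction matched the reference label.
records = [
    ({"pauses", "word-finding difficulty"}, True),
    ({"pauses"}, False),
    ({"repetition"}, True),
    ({"pauses", "repetition"}, True),
]
reliability = evidence_reliability(records)
```

In this toy input, "pauses" is cited three times and is correct twice, so its reliability is about 66.7%, while the other two types score 100%.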

Original abstract

Multimodal large language models (MLLMs) have emerged as a promising approach for improving the accuracy, transferability, and explainability of automatic dementia classification (ADC) systems from voice recordings. Yet it remains unclear whether their reasoning capabilities are beneficial for ADC, and how such capabilities should be leveraged. In this paper, we conduct a careful evaluation of reasoning MLLMs for ADC and show that naive strategies, such as relying on text-based rationales, can lead to hallucinated and inconsistent rationales for diagnosis and yield inferior ADC performance compared with LLM-free baselines. To overcome this limitation, we propose Dementia Thinker with Nonlinear Adaptor and Reinforcement Learning (DeTAiL), an adaptor-based framework that exploits the internal representations of reasoning MLLMs for improved dementia classification. Across two dementia datasets with distinct test formats and label granularities, DeTAiL consistently outperforms strong baselines and methods that rely on text-based rationales. Code and demo will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows text rationales from reasoning MLLMs hallucinate and underperform on dementia speech classification while DeTAiL extracts better signals from internal representations via adaptor and RL, but lacks ablations to confirm the reasoning component drives the gains.

The main takeaway is that asking reasoning MLLMs for text rationales on dementia classification from speech produces hallucinations and worse results than simple baselines, while DeTAiL pulls from internal states with a nonlinear adaptor plus RL and beats those baselines on two datasets with different formats and label types.

They do a solid job documenting the failure mode of the text route in this specific medical task and offering a practical adaptor-based workaround that delivers consistent gains. That part is useful for anyone trying to apply these models to clinical speech data.

The soft spot is exactly the one in the stress-test note: no comparison to non-reasoning multimodal models or even linear probes on the same hidden states. Without those controls, the performance edge could come from generic multimodal features rather than anything tied to reasoning. The abstract also gives no numbers on effect sizes, variance, or statistical tests, so the strength of the outperformance claim is difficult to judge from what's visible.

This is aimed at the clinical speech processing crowd and people building multimodal tools for medical screening. A reader already working on voice-based dementia detection would pick up the caution about rationales and the adaptor idea.

It deserves peer review because the task is concrete, the proposed fix is straightforward to test, and the experiments need scrutiny on the missing ablations and details before the central claim can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper claims that text-based rationales from reasoning MLLMs for automatic dementia classification (ADC) from speech are prone to hallucinations and yield inferior performance compared to LLM-free baselines. It proposes DeTAiL, an adaptor framework that extracts from internal representations of reasoning MLLMs using a nonlinear adaptor and reinforcement learning, and reports that this approach consistently outperforms baselines and text-rationale methods across two dementia datasets with different test formats and label granularities.

Significance. If the central empirical claim holds after addressing experimental gaps, the work would be significant for ADC by demonstrating a practical way to leverage MLLM reasoning capabilities without the documented problems of text rationales. The cross-dataset evaluation with distinct characteristics is a positive. The commitment to release code upon acceptance supports reproducibility, which is a strength in this empirical domain.

major comments (2)

[Experiments (results and ablations)] The central claim that performance gains derive specifically from reasoning-derived internal representations (rather than generic multimodal features) requires ablations against non-reasoning MLLMs and linear probes on the same hidden states; without these, gains could be explained by the adaptor/RL alone. This is load-bearing for the title and abstract claim.
[Results tables] Table reporting cross-dataset results: the claim of consistent outperformance needs accompanying statistical tests (e.g., p-values, confidence intervals) and effect sizes to establish that gains are not due to post-hoc choices or dataset-specific tuning.

minor comments (2)

[Method] Notation for the nonlinear adaptor and RL components should be defined explicitly with equations in the method section for clarity.
[Abstract] The abstract states code will be released upon acceptance; the manuscript should include a footnote or URL placeholder for the promised demo.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The two major comments identify important gaps in the experimental validation of our central claims. We will perform a major revision to address both points with additional experiments and statistical reporting, as detailed below. These changes will strengthen the evidence that performance gains stem from reasoning-derived internal representations.

Referee: [Experiments (results and ablations)] The central claim that performance gains derive specifically from reasoning-derived internal representations (rather than generic multimodal features) requires ablations against non-reasoning MLLMs and linear probes on the same hidden states; without these, gains could be explained by the adaptor/RL alone. This is load-bearing for the title and abstract claim.

Authors: We agree that the current experiments do not fully isolate the contribution of reasoning-derived representations from the adaptor and RL components. To address this, the revised manuscript will include new ablations: (1) direct comparisons against non-reasoning MLLMs (e.g., standard multimodal encoders without chain-of-thought or reasoning prompts) using the same DeTAiL adaptor framework, and (2) linear probes applied to the identical hidden states extracted from the reasoning MLLMs. These additions will clarify whether the observed gains require the reasoning process or can be achieved with generic multimodal features. revision: yes

Referee: [Results tables] Table reporting cross-dataset results: the claim of consistent outperformance needs accompanying statistical tests (e.g., p-values, confidence intervals) and effect sizes to establish that gains are not due to post-hoc choices or dataset-specific tuning.

Authors: We concur that statistical validation is necessary to support claims of consistent outperformance. In the revised tables, we will report paired statistical tests (e.g., Wilcoxon signed-rank or McNemar tests appropriate for the classification setting), p-values, 95% confidence intervals, and effect sizes (Cohen's d or similar) for all cross-dataset comparisons. This will be applied to both the primary results and the new ablation experiments. revision: yes
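For readers unfamiliar with the McNemar test named in the rebuttal, the sketch below shows its exact (binomial) form for two classifiers evaluated on the same test items. Only the items where exactly one classifier is correct enter the test. The labels and predictions are invented, and `mcnemar_exact` is a hypothetical helper written for this example rather than a library call.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (only_a, only_b, p_value), where only_a counts items that A gets
    right and B gets wrong, and only_b counts the reverse.
    """
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the classifiers never disagree in correctness
    k = min(only_a, only_b)
    # Under the null hypothesis, discordant items split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2.0 * tail)


# Invented example: eight test recordings, two systems.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 1, 0, 0, 1, 0, 1, 1]
pred_b = [1, 0, 0, 1, 0, 0, 1, 1]
only_a, only_b, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

With only three discordant items the smallest attainable two-sided p-value is 0.25, which illustrates why small clinical test sets make significance claims hard to support.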

Circularity Check

0 steps flagged

No circularity: empirical evaluation with no derivation chain or self-referential predictions.

The paper is an empirical study proposing DeTAiL (nonlinear adaptor + RL on MLLM internals) and reporting performance gains on two datasets. No equations, first-principles derivations, or 'predictions' are presented that could reduce to inputs by construction. Claims rest on experimental comparisons rather than any mathematical reduction or self-citation load-bearing step. Absence of ablations is a methodological concern but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described or extractable.
