pith. machine review for the scientific record.

arXiv: 2601.22197 · v3 · submitted 2026-01-29 · 💻 cs.LG · cs.AI · eess.SP

Recognition: 2 theorem links

· Lean Theorem

Neural Signals Generate Clinical Notes in the Wild


Pith reviewed 2026-05-16 10:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · eess.SP
keywords EEG · clinical report generation · foundation models · multimodal learning · neurology AI · signal-to-text · medical report automation

The pith

A foundation model translates extended EEG recordings into coherent clinical reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pairing pretrained EEG encoders with language models enables direct generation of structured clinical notes from long, variable-length EEG recordings. A sympathetic reader would care because manual summarization of EEG data is time-intensive and limits access to expert interpretation in neurology clinics. CELM is trained on 9,922 paired EEG-report examples covering approximately 11,000 hours of recordings from 9,048 patients, and it is reported to outperform prior methods on automated metrics while also scoring higher on expert human ratings of coherence and diagnostic alignment. The model and the automated report-structuring pipeline for the dataset are released to support further scaling of this capability.

Core claim

CELM is the first end-to-end EEG-to-language foundation model that ingests long, variable-duration EEG recordings and outputs clinically structured reports summarizing abnormal patterns, diagnostic findings, and interpretations; it is trained on a newly curated dataset of 9,922 reports paired with approximately 11,000 hours of EEG from 9,048 patients and outperforms prior methods on both automated and expert human evaluations.

What carries the argument

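The CELM architecture, which fuses pretrained EEG foundation models with language models for scalable multimodal summarization of raw EEG signals into text reports. Figure 1 names its three stages: Epoch-Aggregated Tokenization, Sequence-Aware Alignment, and Prompt Fusion and Generation.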
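To make the data flow concrete, here is a minimal sketch of how EEG features could be turned into tokens and placed in front of text embeddings. It is not the authors' implementation: the function names (`aggregate_epochs`, `project`, `fuse`), the mean pooling, and the fixed weight matrix are all assumptions chosen for illustration. A plain linear projection stands in for the alignment stage, so the sketch deliberately omits any modeling of dependencies across epochs.

```python
from typing import List

Vector = List[float]


def aggregate_epochs(frame_features: List[Vector], frames_per_epoch: int) -> List[Vector]:
    """Mean-pool consecutive frame features into one token per epoch.

    Hypothetical stand-in for an epoch-level tokenizer: a recording of any
    length yields ceil(len / frames_per_epoch) tokens.
    """
    tokens: List[Vector] = []
    for start in range(0, len(frame_features), frames_per_epoch):
        chunk = frame_features[start:start + frames_per_epoch]
        dim = len(chunk[0])
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens


def project(token: Vector, weights: List[Vector]) -> Vector:
    """Linear map from EEG feature width to the language model embedding width."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]


def fuse(eeg_tokens: List[Vector], text_embeddings: List[Vector]) -> List[Vector]:
    """Prepend projected EEG tokens to the text token embeddings."""
    return eeg_tokens + text_embeddings


# Toy input: five frames of 2-dimensional features (illustrative values only).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps width 2 to width 3

epoch_tokens = aggregate_epochs(frames, frames_per_epoch=2)
projected = [project(t, weights) for t in epoch_tokens]
sequence = fuse(projected, text_embeddings=[[0.0, 0.0, 0.0]])
```

Because the number of EEG tokens depends only on the recording length, the same code path handles short and long recordings; the language model simply sees a longer or shorter prefix.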

If this is right

  • Clinical report generation for long-term EEG monitoring becomes automated at scale.
  • Neurology workflows gain the ability to produce structured notes without manual review of every recording segment.
  • A public benchmark with an automated structuring pipeline is now available for comparing future EEG-to-text models.
  • Integration of existing EEG encoders with language models proves sufficient for clinically usable output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion approach could be tested on shorter or ambulatory EEG datasets to check whether duration is the main limiting factor.
  • If the model generalizes across hospitals, it could support remote or resource-limited settings where neurologists are scarce.
  • Extending the output to include suggested follow-up tests or medication notes would be a direct next step not explored here.

Load-bearing premise

The collected set of paired EEG recordings and clinical reports is representative of the range of real-world EEG patterns and expert interpretations.
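One way to probe this premise is to tabulate the cohort by a few attributes and look for thin or missing strata. The sketch below assumes each patient is a dictionary with hypothetical keys such as `sex` and `site`; the field names and the toy records are illustrative and are not taken from the dataset.

```python
from collections import Counter
from typing import Dict, List, Tuple


def stratify(records: List[dict], key: str) -> Dict[str, Tuple[int, float]]:
    """Return {value: (count, proportion)} for one attribute.

    Records missing the attribute are grouped under "unknown" so that
    gaps in the metadata stay visible instead of being silently dropped.
    """
    values = [str(r.get(key, "unknown")) for r in records]
    counts = Counter(values)
    total = len(values)
    return {value: (n, n / total) for value, n in sorted(counts.items())}


# Toy records with hypothetical fields, for illustration only.
toy_cohort = [
    {"sex": "F", "site": "A"},
    {"sex": "M", "site": "A"},
    {"sex": "F", "site": "B"},
    {"sex": "F"},
]

by_sex = stratify(toy_cohort, "sex")
by_site = stratify(toy_cohort, "site")
```

A heavily skewed or largely "unknown" stratum would be a signal that the representativeness assumption deserves a closer look.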

What would settle it

Expert clinicians performing blinded pairwise comparisons on new, unseen long EEG recordings consistently rate CELM-generated reports as less coherent or diagnostically accurate than manually written reports.
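A blinded pairwise comparison of this kind can be scored with a simple sign test. The sketch below is one possible analysis, not a protocol described in the paper: `blind_pair` hides which report is generated, and `sign_test_p_value` computes an exact two-sided binomial p-value from win and loss counts, with ties assumed to be dropped beforehand. Both function names are hypothetical.

```python
import random
from math import comb
from typing import Tuple


def blind_pair(generated: str, reference: str, rng: random.Random) -> Tuple[str, str, str]:
    """Return the two reports in random order plus a hidden key.

    The key ("left" or "right") records which side holds the generated
    report and is kept away from the rater.
    """
    if rng.random() < 0.5:
        return generated, reference, "left"
    return reference, generated, "right"


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test under the null of no preference (p = 0.5)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Example: raters preferred the manual report in 8 of 10 untied comparisons.
p_value = sign_test_p_value(wins=2, losses=8)
```

With only ten comparisons the example does not reach conventional significance, which illustrates why the number of rated pairs matters as much as the direction of the preference.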

Figures

Figures reproduced from arXiv: 2601.22197 by Jathurshan Pradeepkumar, Jimeng Sun, Zheng Chen.

Figure 1
Figure 1: Overview of our framework. (a) EEG–Report benchmark construction pipeline, including clinical report structuring (Section 3.1), matching reports to EEG sessions, and examples of standardized report sections. (b) The proposed Clinical EEG Language Model (CELM) comprises (b.1) Epoch-Aggregated Tokenization, (b.2) Sequence-Aware Alignment, and (b.3) Prompt Fusion and Generation. Harvard Electroencephalography…
Figure 3
Figure 3: (a) Report generation performance of different projector variants. (b) Training dynamics of each variant, including training loss, validation loss, and perplexity curves.
5.3. Performance Analysis by Report Section. Clinical EEG reports comprise multiple sections, each serving a distinct purpose: EEG description/details provides a detailed narrative of observed waveforms and patterns, impression/interpreta…
Figure 2
Figure 2: Section-wise comparison of report generation performance between CELM and the best-performing baselines from different LLM families.
… for S0002, performance improves from 0.2886 to 0.6408 relative to the strongest baseline. Although CELM-SCC outperforms the baselines, its performance gap relative to the non-compressed model (0.4487 vs 0.6408) underscores an important open challenge. This leads to an import…
Figure 4
Figure 4: Examples of generated reports on S0002. Comparisons between the unimodal baseline (text + EEG features), the linear-projector alignment variant, CELM, and the ground-truth reports. (a) EEG description/details; (b) Impression/interpretation.
… achieves the best performance across all evaluation metrics. Analysis of training dynamics in Figure 3b further highlights the importance of modeling inter-epoch tempo…
Figure 5
Figure 5: Dataset statistics of the filtered and constructed EEG-Report Benchmark used in our study.
Figure 6
Figure 6: Prompt for unimodal + text only baselines.
Figure 7
Figure 7: Prompt for unimodal + text and EEG features as input baselines.
Figure 8
Figure 8: Prompt for CELM.
Figure 9
Figure 9: Distribution of various metrics for overall report generation in the S0002 dataset.
Figure 10
Figure 10: Distribution of various metrics for overall report generation in the S0001 dataset.
Figure 11
Figure 11: caption not recovered. Accompanying text: in Figures 11, 12, 13, and 14, we compare generated reports from the unimodal baseline, CELM-SCC, and CELM against ground-truth reports across multiple EEG report sections and datasets. Columns: Qwen3-4B (Unimodal + Text + EEG Features), Ours - SCC, Ours, Ground Truth; panels (a) to (d).
Figure 12
Figure 12: Section-wise generation examples on S0001. (a) Epileptiform abnormalities, (b) Interictal epileptiform abnormalities, (c) Seizures, and (d) Background activity. We compare outputs from the unimodal baseline (text + EEG features), CELM-SCC, CELM, and the ground-truth reports.
Figure 13
Figure 13: caption not recovered. Columns: Qwen3-4B (Unimodal + Text + EEG Features), Ours - SCC, Ours, Ground Truth; panels (a) to (d).
Figure 14
Figure 14: EEG description/details generation examples on S0002. Examples are ordered from (a) to (d) by decreasing ROUGE-1 score. We compare outputs from the unimodal baseline (text + EEG features), CELM-SCC, CELM, and the ground-truth reports.
Original abstract

Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term EEG recordings remains labor-intensive. We present CELM, the first clinical EEG-to-Language foundation model capable of summarizing long-duration, variable-length EEG recordings and performing end-to-end clinical report generation at multiple scales. CELM integrates pretrained EEG foundation models with language models to enable scalable multimodal learning. We curate a large-scale clinical EEG dataset containing 9,922 reports paired with approximately 11,000 hours of EEG recordings from 9,048 patients to train CELM, and release the benchmark with an automated report-structuring pipeline to facilitate future research. Experimental results show that CELM consistently outperforms existing methods across all evaluation settings. Importantly, we further conduct human evaluation with clinical experts, demonstrating that CELM generates reports that are more clinically coherent, diagnostically reliable, and better aligned with expert interpretation. We release our model and benchmark construction pipeline at https://github.com/Jathurshan0330/CELM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces CELM, the first clinical EEG-to-Language foundation model that integrates pretrained EEG foundation models with language models for end-to-end generation of clinical reports from long-duration, variable-length EEG recordings. It curates and releases a benchmark dataset of 9,922 paired reports and ~11,000 hours of EEG from 9,048 patients, reports that CELM consistently outperforms existing methods across evaluation settings, and presents human evaluations by clinical experts indicating superior clinical coherence, diagnostic reliability, and alignment with expert interpretation.

Significance. If the performance and human-evaluation claims hold after detailed verification, the work could meaningfully advance automated clinical reporting in neurology by reducing the labor of summarizing long EEG recordings. The public release of the model weights and benchmark-construction pipeline is a concrete strength that would support reproducibility and extension by the community.

major comments (3)
  1. [Dataset curation] Dataset curation section: the 9,048-patient cohort is described only with aggregate counts (9,922 reports, ~11k hours); no stratification by age, sex, diagnosis prevalence, or single- versus multi-center sourcing is provided. This information is load-bearing for the claim that CELM generalizes and outperforms baselines on representative clinical data.
  2. [Experimental results] Experimental results section: the central claim that CELM 'consistently outperforms existing methods across all evaluation settings' is stated without accompanying tables or text reporting concrete metrics (BLEU, ROUGE, clinical accuracy), baseline model specifications, statistical tests, or confidence intervals. These details are required to assess the magnitude and reliability of the reported gains.
  3. [Human evaluation] Human evaluation section: the expert ratings of clinical coherence and diagnostic reliability are presented without any description of rater blinding, inter-rater agreement statistics, or the precise rating protocol. Absence of blinding details directly affects the credibility of the claim that generated reports are 'better aligned with expert interpretation.'
minor comments (1)
  1. [Abstract and conclusion] The GitHub link for model and pipeline release should be confirmed to contain the full benchmark-construction code and any preprocessing scripts referenced in the text.
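The metrics and confidence intervals requested in major comment 2 can be illustrated with a small sketch. It assumes a simplified ROUGE-1 F1 over whitespace tokens (no stemming or other normalization beyond lowercasing) and a paired percentile bootstrap over per-report score differences; the function names and the example strings are hypothetical, and the paper's own evaluation code may differ.

```python
import random
from collections import Counter
from typing import List, Tuple


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated report and its reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_ci(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean per-report difference a - b."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative strings only; not taken from the benchmark.
score = rouge1_f1("normal awake background", "normal awake and drowsy background")
```

An interval for the difference that excludes zero would support a claim of consistent improvement on that metric; an interval that straddles zero would not.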

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional details.

Point-by-point responses
  1. Referee: [Dataset curation] Dataset curation section: the 9,048-patient cohort is described only with aggregate counts (9,922 reports, ~11k hours); no stratification by age, sex, diagnosis prevalence, or single- versus multi-center sourcing is provided. This information is load-bearing for the claim that CELM generalizes and outperforms baselines on representative clinical data.

    Authors: We agree that stratification details are important for supporting claims of generalizability. In the revised manuscript, we will expand the Dataset Curation section with a table and accompanying text providing breakdowns by age, sex, diagnosis prevalence, and data sourcing (single- versus multi-center). These statistics are derivable from the available metadata in the 9,048-patient cohort and will be reported to allow readers to assess representativeness.

    Revision: yes

  2. Referee: [Experimental results] Experimental results section: the central claim that CELM 'consistently outperforms existing methods across all evaluation settings' is stated without accompanying tables or text reporting concrete metrics (BLEU, ROUGE, clinical accuracy), baseline model specifications, statistical tests, or confidence intervals. These details are required to assess the magnitude and reliability of the reported gains.

    Authors: We acknowledge that the current presentation lacks the quantitative details needed for rigorous evaluation. In the revision, we will add tables in the Experimental Results section reporting specific BLEU, ROUGE, and clinical accuracy scores for CELM versus all baselines, along with baseline model specifications, statistical significance tests, and confidence intervals. These metrics were computed during our experiments and will be fully documented to substantiate the performance claims.

    Revision: yes

  3. Referee: [Human evaluation] Human evaluation section: the expert ratings of clinical coherence and diagnostic reliability are presented without any description of rater blinding, inter-rater agreement statistics, or the precise rating protocol. Absence of blinding details directly affects the credibility of the claim that generated reports are 'better aligned with expert interpretation.'

    Authors: We recognize that protocol details are necessary to establish the reliability of the human evaluation. In the revised manuscript, we will expand the Human Evaluation section to describe the rater blinding procedure, inter-rater agreement statistics (such as Cohen's or Fleiss' kappa), and the exact rating scales and instructions provided to the clinical experts. This will ensure full transparency regarding the evaluation process.

    Revision: yes
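For readers unfamiliar with the agreement statistic named in the third response, the following is a minimal sketch of Cohen's kappa for two raters who assign categorical labels to the same reports. The function name and the toy ratings are assumptions for illustration; Fleiss' kappa, which handles more than two raters, is not shown.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently with their own marginals.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings (1 = clinically coherent, 0 = not); illustrative values only.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy example the raters agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5 rather than 0.75.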

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The paper describes an empirical training pipeline: curating 9,922 paired EEG-report examples from 9,048 patients, integrating pretrained EEG and language models into CELM, and reporting performance on held-out quantitative metrics plus human expert ratings. No equations, first-principles derivations, or parameter-fitting steps are presented that reduce the claimed outputs to the inputs by construction. All load-bearing claims rest on standard supervised training and external evaluation rather than self-referential definitions or self-citation chains.
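Since the dataset pairs 9,922 reports with 9,048 patients, some patients contribute more than one report, and a held-out evaluation is only external if those reports stay on one side of the split. The sketch below shows one common way to enforce that, by hashing a patient identifier into a split. It is an assumption for illustration: the function name and the identifier format are hypothetical, and the paper's actual split procedure is not described in this review.

```python
import hashlib


def assign_split(patient_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically map a patient to "train" or "test".

    Every report from the same patient receives the same split, so no
    patient appears on both sides of the evaluation boundary.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Two reports from one hypothetical patient always land in the same split.
first_report_split = assign_split("patient-0001")
second_report_split = assign_split("patient-0001")
```

Hashing avoids storing a split table and keeps assignments stable when new patients are added later.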

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions about transfer learning from pretrained models and the sufficiency of the curated paired dataset for generalization; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: Pretrained EEG foundation models can be integrated with language models to enable effective multimodal report generation.
    Invoked in the description of the CELM architecture and training.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

  2. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI · 2026-05 · conditional · novelty 7.0

    EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
