A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection

Danwei Weng; Guoming Luan; Jingyi Yao; Ke Xu; Miao Liu; Minmin Luo; Min Wang; Ruiyu Wang; Tong Lu; Wenchao Zhang

arxiv: 2604.26379 · v1 · submitted 2026-04-29 · 💻 cs.CV

Tong Lu, Ke Xu, Zimo Zhang, Zitong Zhao, Danwei Weng, Ruiyu Wang, Miao Liu, Zizuo Zhang, Jingyi Yao, Yixuan Zhao, Wenchao Zhang, Min Wang, Guoming Luan, Minmin Luo, Zhifeng Yue

Pith reviewed 2026-05-07 11:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords event · detection · eegvfusion · seizure · sensitivity · accuracy · achieved · alignment

The pith

EEGVFusion reaches a balanced accuracy of 0.9957 on a random-session split and 0.9718 on a single held-out subject, where integrating pre-trained EEG and video features cuts the event false-alarm rate from 2.73 FP/h (EEG-only) to 0.48 FP/h.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Seizures in mouse models are studied by recording brain electrical activity (EEG) and video at the same time, but reviewing these recordings by hand is slow. EEG-only systems are confused by movement artifacts during seizures, while video-only systems mistake normal mouse behavior for seizures. The new system first learns useful patterns from unlabeled EEG data through self-supervised training, then encodes the video for spatial and temporal patterns. It aligns the two data streams with optimal transport and uses bidirectional cross-attention so each modality can inform the other. On a new dataset of 93 sessions from 15 mice, the combined system detects every annotated seizure event in both reported evaluations, and on a mouse never seen during training it produces far fewer false alarms than EEG alone.
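To make the alignment step concrete, here is a minimal sketch of entropic optimal transport solved with Sinkhorn iterations. The abstract does not say which OT formulation or solver EEGVFusion uses, so the solver, the squared time-difference cost, the regularization value `eps`, and the toy timestamps are all assumptions made for illustration.

```python
import math


def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT plan between row weights `a` and column weights `b`.

    Illustrative only: the paper's actual OT formulation is not specified.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: cheaper pairings get larger kernel values.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy inputs: centre times (seconds) of 3 EEG windows and 2 video clips.
eeg_times = [0.0, 1.0, 2.0]
video_times = [0.5, 2.0]
cost = [[(te - tv) ** 2 for tv in video_times] for te in eeg_times]
a = [1 / 3] * 3   # uniform weight on EEG windows
b = [0.5, 0.5]    # uniform weight on video clips
plan = sinkhorn_plan(cost, a, b)
```

Each entry `plan[i][j]` is the share of EEG window `i` softly matched to video clip `j`. In a real model the cost would more plausibly be a distance between learned feature embeddings than between timestamps.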

Core claim

In the random-session split, EEGVFusion achieved a Balanced Accuracy of 0.9957 with perfect event sensitivity and an Event FAR of 0.6250 FP/h; in held-out-subject evaluation it reached 0.9718 balanced accuracy and reduced Event FAR from 2.7250 to 0.4833 FP/h while preserving perfect sensitivity.
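For readers unfamiliar with the two headline metrics, the sketch below uses their standard definitions: balanced accuracy as the mean of sensitivity and specificity, and event FAR as false-positive events per recording hour. The paper's exact event-matching rule is not given in the abstract, and the confusion counts and recording duration here are hypothetical. Only the two FAR values in the last line come from the text above.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (seizure recall) and specificity (non-seizure recall)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def event_far(false_positive_events, recording_hours):
    """Event-level false-alarm rate in false positives per hour (FP/h)."""
    return false_positive_events / recording_hours


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical counts, chosen only to exercise the formulas.
example_ba = balanced_accuracy(tp=95, fn=5, tn=900, fp=100)
example_far = event_far(false_positive_events=3, recording_hours=6.0)

# Held-out-subject Event FAR values quoted above: EEG-only vs. EEGVFusion.
far_reduction = relative_reduction(2.7250, 0.4833)
```

With the quoted figures, `far_reduction` works out to roughly 0.82, meaning about an 82% drop in false alarms per hour on the held-out subject.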

Load-bearing premise

That the expert annotations on the 93 sessions are free of systematic labeling bias and that the 15-mouse cohort captures the variability needed for generalization to new subjects and recording conditions.

Original abstract

Reliable seizure detection in mouse models is essential for preclinical epilepsy research, yet manual review of synchronized video-EEG recordings is labor-intensive and single-modality systems fail for complementary reasons: video-based methods are easily confounded by benign behaviors, whereas EEG-based methods are vulnerable to ictal motion artifacts. We present EEGVFusion, a multimodal framework that combines self-supervised EEG representation learning, spatio-temporal video encoding, optimal-transport alignment, and bidirectional cross-attention to integrate neural and behavioral evidence. We also curate an expert-annotated dataset of synchronized EEG and video recordings comprising 93 sessions from 15 mice for training and evaluation. In the random-session split, EEGVFusion achieved a Balanced Accuracy of 0.9957 with perfect event sensitivity and an Event FAR of 0.6250 FP/h, indicating strong seizure detection performance with a low false-alarm burden. In a single held-out-subject evaluation with Subject 110 reserved for testing, EEGVFusion achieved a Balanced Accuracy of 0.9718 and reduced Event FAR from 2.7250 FP/h for the EEG-only counterpart to 0.4833 FP/h while preserving perfect event sensitivity. Targeted ablations further showed that EEG pre-training and OT alignment help reduce false alarms while preserving event sensitivity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New multimodal EEG-video model and 15-mouse dataset cut false alarms in seizure detection, but single held-out subject gives weak support for cross-mouse generalization.

This paper introduces EEGVFusion, which combines self-supervised EEG pre-training, video encoding, optimal-transport alignment, and bidirectional cross-attention for mouse seizure detection. It also releases a new expert-annotated dataset of 93 sessions from 15 mice. The results show clear gains over EEG-only baselines, especially in dropping event false-alarm rates while keeping perfect sensitivity, on both random-session and one held-out-subject splits. Ablations credit the pre-training and alignment steps for the false-alarm reduction.

That practical angle on cutting manual review time in preclinical epilepsy work is the useful part here. The architecture choices are straightforward extensions of existing multimodal techniques, but the combination with the new data is what makes the numbers interesting.

The main weakness is the evaluation. Only one mouse is held out, and sessions from the same animal share correlated signals, electrode placement, and seizure patterns, so the random split leaks subject identity. No leave-one-subject-out results, no variance across mice, and no error bars or significance tests appear in the abstract. The small cohort of 15 mice also leaves overfitting risk unaddressed, and training details like hyperparameters and exact losses are missing. Without those, the reported balanced accuracies of 0.9957 and 0.9718 are hard to trust as general.

This is aimed at labs doing animal epilepsy modeling who already work with synchronized EEG-video. A reader in that niche could get value from the dataset and the false-alarm improvements if the full methods hold up. It deserves peer review because the problem is real, the new data is concrete, and the multimodal fusion is a reasonable step forward, even if the generalization experiments need strengthening.

Referee Report

3 major / 2 minor

Summary. The manuscript presents EEGVFusion, a multimodal pre-trained network for integrated EEG-video seizure detection in mouse models. It combines self-supervised EEG representation learning, spatio-temporal video encoding, optimal-transport alignment, and bidirectional cross-attention. The authors curate a dataset of 93 synchronized EEG-video sessions from 15 mice and evaluate the model on random-session and held-out-subject splits, reporting balanced accuracies of 0.9957 and 0.9718 respectively, with improvements in false alarm rates over baselines and ablations demonstrating the value of pre-training and alignment.

Significance. This work addresses a practical challenge in preclinical epilepsy research by developing an automated system that integrates complementary EEG and video modalities to improve detection reliability. The reported performance gains, particularly the reduction in event false alarm rate while maintaining perfect sensitivity in the held-out evaluation, suggest potential utility if validated more robustly. The curation of an expert-annotated multimodal dataset is a valuable contribution to the field.

major comments (3)

[Held-out Subject Evaluation] The evaluation reserves only a single mouse (Subject 110) for testing. Given that sessions from the same mouse share correlated seizure phenotypes, recording conditions, and electrode placement, this provides limited evidence for subject-independent generalization. No leave-one-subject-out cross-validation, inter-mouse variance, or results across multiple held-out subjects are reported, which is load-bearing for the central claim of reliable cross-subject performance.
[Methods] Training hyperparameters, exact loss functions, optimization details, and any statistical significance testing or error bars on the performance metrics (e.g., Balanced Accuracy 0.9957 and 0.9718) are not provided. This omission hinders assessment of the robustness and reproducibility of the reported results and ablation studies.
[Dataset Description] There is no discussion of inter-rater reliability for the expert annotations or potential systematic labeling biases, which could affect the validity of the ground truth labels in a small cohort of 15 mice.
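The leave-one-subject-out protocol requested in the first major comment can be sketched as follows. This is a minimal illustration of subject-grouped splitting, not the authors' pipeline; the session and mouse identifiers are hypothetical.

```python
from collections import defaultdict


def leave_one_subject_out(sessions):
    """Yield (held_out_subject, train_ids, test_ids) with no subject overlap.

    `sessions` is a list of (session_id, subject_id) pairs.
    """
    by_subject = defaultdict(list)
    for session_id, subject_id in sessions:
        by_subject[subject_id].append(session_id)
    for held_out in sorted(by_subject):
        test_ids = list(by_subject[held_out])
        # Every session of every other mouse goes into training.
        train_ids = [
            sid
            for subject, ids in by_subject.items()
            if subject != held_out
            for sid in ids
        ]
        yield held_out, train_ids, test_ids


# Hypothetical session table: (session_id, mouse_id).
sessions = [
    ("s01", "m101"), ("s02", "m101"),
    ("s03", "m110"), ("s04", "m110"),
    ("s05", "m115"),
]
folds = list(leave_one_subject_out(sessions))
```

Because each fold keeps all sessions of one mouse out of training, per-fold metrics can be averaged to give the inter-mouse mean and variance the referee asks for. A random-session split, by contrast, can place sessions from the same mouse on both sides.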

minor comments (2)

[Abstract] The abstract could briefly clarify the role of optimal-transport alignment in the multimodal fusion to improve accessibility.
[Figures] Ensure figure captions are fully self-contained and reference all key components of the architecture diagram.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments identify key areas where additional clarity and rigor will strengthen the presentation. We address each major comment point-by-point below, indicating the revisions we intend to incorporate.

Referee: [Held-out Subject Evaluation] The evaluation reserves only a single mouse (Subject 110) for testing. Given that sessions from the same mouse share correlated seizure phenotypes, recording conditions, and electrode placement, this provides limited evidence for subject-independent generalization. No leave-one-subject-out cross-validation, inter-mouse variance, or results across multiple held-out subjects are reported, which is load-bearing for the central claim of reliable cross-subject performance.

Authors: We agree that a single held-out subject offers only preliminary evidence for subject-independent generalization, as intra-mouse correlations in seizure phenotypes, recording conditions, and electrode placement may influence results. Subject 110 was selected as the held-out test case to demonstrate performance on fully unseen data while retaining the largest possible training set from the remaining 14 mice. In the revision we will add an explicit limitations paragraph that qualifies the generalizability claims and discusses the implications of this design choice. We will also compute and report performance across additional randomly selected held-out subjects (where dataset constraints permit) together with inter-mouse variance statistics to provide a more robust picture of cross-subject behavior. revision: partial
Referee: [Methods] Training hyperparameters, exact loss functions, optimization details, and any statistical significance testing or error bars on the performance metrics (e.g., Balanced Accuracy 0.9957 and 0.9718) are not provided. This omission hinders assessment of the robustness and reproducibility of the reported results and ablation studies.

Authors: We apologize for the omission of these essential implementation details. The revised manuscript will contain a dedicated subsection (or appendix) that fully specifies all training hyperparameters, the exact mathematical definitions of every loss term (self-supervised EEG pre-training, optimal-transport alignment, bidirectional cross-attention, and classification losses), the optimizer, learning-rate schedule, batch size, number of epochs, and any regularization or early-stopping criteria. In addition, we will report error bars or confidence intervals on the balanced-accuracy and false-alarm metrics (obtained via multiple independent runs or bootstrapping) and will include statistical significance tests comparing EEGVFusion against the baselines and ablations. revision: yes
Referee: [Dataset Description] There is no discussion of inter-rater reliability for the expert annotations or potential systematic labeling biases, which could affect the validity of the ground truth labels in a small cohort of 15 mice.

Authors: We acknowledge the importance of documenting annotation quality. All 93 sessions were labeled by a single expert neurologist following a standardized protocol for identifying electrographic and behavioral seizures in synchronized mouse EEG-video recordings. The revised manuscript will expand the dataset section to describe the annotation guidelines in detail, the criteria used to resolve ambiguous events, and any procedural steps taken to reduce systematic bias. Because only one rater performed the annotations, inter-rater reliability statistics are unavailable; we will therefore note this as a limitation of the current ground-truth labels and suggest multi-rater validation as a direction for future dataset releases. revision: partial
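The bootstrapped error bars mentioned in the second response could look like the percentile-bootstrap sketch below. The per-session false-alarm rates are hypothetical, and resampling whole sessions is an assumption. With several sessions per mouse, resampling at the mouse level may be the more defensible choice.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample with replacement n_boot times and collect the statistic.
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-session Event FAR values in FP/h.
per_session_far = [0.0, 0.0, 0.5, 1.0, 0.0, 0.5, 1.5, 0.0, 0.5, 1.0]
point_estimate = mean(per_session_far)
ci_low, ci_high = bootstrap_ci(per_session_far, mean)
```

Reporting `point_estimate` together with `(ci_low, ci_high)` for both EEGVFusion and the EEG-only baseline would show whether the false-alarm gap exceeds session-to-session noise.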

Circularity Check

0 steps flagged

No circularity: performance metrics are measured on held-out data splits

The paper's central claims consist of empirical balanced accuracy, sensitivity, and event FAR values obtained by training EEGVFusion on 93 sessions from 15 mice and evaluating on random-session and single held-out-subject splits. These quantities are direct measurements of model output against expert annotations on unseen data; they are not algebraically equivalent to any training objective, fitted parameter, or self-citation by construction. No equations, uniqueness theorems, or ansatzes are invoked that reduce the reported results to the inputs. The architecture (self-supervised pre-training, OT alignment, cross-attention) is described as a design choice whose effectiveness is tested rather than presupposed. This is the normal case for an applied ML evaluation paper whose results remain falsifiable by new subjects or recording conditions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The performance claims rest on standard supervised learning assumptions plus the quality of expert annotations and the representativeness of the 15-mouse cohort; no new physical entities are postulated.

free parameters (1)

model hyperparameters and training schedule
Learning rates, batch sizes, attention dimensions, OT regularization strength, and pre-training epochs are chosen to achieve the reported metrics.

axioms (2)

domain assumption: Expert annotations on synchronized EEG-video are treated as ground truth without reported inter-rater reliability metrics.
Central claim depends on annotation accuracy for both training and evaluation.
domain assumption: The 15-mouse dataset distribution is representative of future recording sessions and subjects.
Held-out subject results are presented as evidence of generalization.
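One common inter-rater reliability metric for the annotation assumption above is Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. The sketch below is illustrative only: the labels are hypothetical, and the source reports no second rater.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical window labels from two raters (1 = seizure, 0 = non-seizure).
rater_a = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
rater_b = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on 8 of 10 windows, but because non-seizure labels dominate, chance agreement is already 0.58, so kappa lands near 0.52 instead of 0.8.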

