pith. machine review for the scientific record.

arxiv: 2605.14569 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords fMRI · video reconstruction · semantic enrichment · neural decoding · hierarchical framework · memory integration · brain signals · action recognition

The pith

CineNeuron reconstructs videos from fMRI signals through bottom-up semantic enrichment followed by top-down memory integration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CineNeuron as a hierarchical framework that maps noisy fMRI signals into rich embeddings capturing textual semantics, image content, actions, and objects in a bottom-up stage. It then applies a top-down Mixture-of-Memories mechanism to dynamically select and fuse relevant prior data for refined video output. This dual approach is intended to close the semantic gap that limits current fMRI-to-video methods. The authors report that the resulting reconstructions surpass existing techniques across metrics on two standard benchmarks. A reader would care because accurate video reconstruction from brain signals could advance both basic understanding of visual neural processing and practical decoding applications.

Core claim

The central claim is that a bottom-up semantic enrichment stage maps fMRI signals to comprehensive embeddings spanning textual, visual, action, and object information, while a subsequent top-down stage uses Mixture-of-Memories to select and integrate relevant prior memories, enabling video reconstructions that capture dynamic cues such as actions more effectively than prior methods.

What carries the argument

The two-stage hierarchical framework: bottom-up semantic enrichment of fMRI signals into multi-aspect embeddings, followed by top-down Mixture-of-Memories integration that dynamically fuses those embeddings with prior data.
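
Figure 2's caption names the moving parts (a Brain Model embedding the fMRI, a router selecting memories, a fusion module); the PyTorch sketch below is one minimal way those parts could fit together. The layer choices, dimensions, top-k routing, and attention-based fusion are this review's assumptions, not the paper's specification.

```python
# Minimal PyTorch sketch of the two-stage pattern; all layer choices,
# dimensions, and the top-k attention fusion are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfMemories(nn.Module):
    """Top-down stage: route to top-k training-set memories, then fuse."""

    def __init__(self, dim: int, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, dim)  # scores the query against memories
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, fmri_emb: torch.Tensor, bank: torch.Tensor) -> torch.Tensor:
        # fmri_emb: (B, dim); bank: (M, dim), built from training data only.
        scores = self.router(fmri_emb) @ bank.T             # (B, M)
        top_scores, top_idx = scores.topk(self.top_k, -1)   # (B, k)
        weights = F.softmax(top_scores, dim=-1).unsqueeze(-1)
        selected = bank[top_idx] * weights                  # (B, k, dim)
        fused, _ = self.fuse(fmri_emb.unsqueeze(1), selected, selected)
        return fused.squeeze(1)                             # refined embedding

class CineNeuronSketch(nn.Module):
    def __init__(self, n_voxels: int, dim: int = 512):
        super().__init__()
        # Bottom-up stage: the "Brain Model" maps voxels into an embedding
        # space that training (not shown) would align with the text, image,
        # action, and category semantics extracted from the video.
        self.brain_model = nn.Sequential(
            nn.Linear(n_voxels, dim), nn.GELU(), nn.Linear(dim, dim))
        self.memories = MixtureOfMemories(dim)

    def forward(self, voxels: torch.Tensor, bank: torch.Tensor) -> torch.Tensor:
        emb = self.brain_model(voxels)   # stage 1: semantic enrichment
        return self.memories(emb, bank)  # stage 2: memory integration

model = CineNeuronSketch(n_voxels=4096)
out = model(torch.randn(2, 4096), bank=torch.randn(100, 512))
print(out.shape)  # torch.Size([2, 512]); would condition a video decoder
```

In this reading, the per-sample softmax over top-k router scores is what makes the memory selection dynamic rather than a fixed retrieval.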

If this is right

  • Reconstructed videos incorporate action and object details more accurately than methods relying on incomplete embeddings.
  • Dynamic selection of prior memories allows the model to refine outputs using previously seen data without fixed tuning.
  • The framework produces superior quantitative and qualitative results on existing fMRI-to-video benchmarks.
  • The dual-stage design mirrors human brain dual-pathway processing to bridge the semantic gap between signals and video content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same enrichment-plus-memory pattern could be tested on other brain-signal modalities such as EEG for cross-modal video decoding.
  • If the memory integration step proves robust, it might extend to real-time reconstruction tasks in brain-computer interface settings.
  • The separation of bottom-up feature mapping from top-down selection offers a template for other multimodal reconstruction problems where prior knowledge must be selectively applied.

Load-bearing premise

The assumption that the bottom-up enrichment and top-down memory stages can reliably extract video-specific cues like actions from noisy fMRI signals without benchmark-specific overfitting.

What would settle it

Failure of CineNeuron to outperform baselines on a new fMRI-to-video dataset containing unseen actions and object categories would indicate the claim does not hold.
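
One hedged way to operationalize that test: repartition a benchmark so that held-out clips carry action labels absent from both the training set and the memory bank. The clip records and label names below are invented for illustration.

```python
# Hypothetical sketch of the held-out test described above: no action label
# in the test split may appear in training (or, by extension, in the bank).
import random

def unseen_action_split(clips, holdout_frac=0.25, seed=0):
    """clips: list of dicts like {'id': ..., 'action': ..., 'category': ...}."""
    rng = random.Random(seed)
    actions = sorted({c["action"] for c in clips})
    held = set(rng.sample(actions, max(1, int(holdout_frac * len(actions)))))
    train = [c for c in clips if c["action"] not in held]
    test = [c for c in clips if c["action"] in held]
    return train, test

clips = [{"id": i, "action": a, "category": "animal"}
         for i, a in enumerate(["run", "swim", "jump", "run", "climb", "swim"])]
train, test = unseen_action_split(clips)
assert not {c["action"] for c in train} & {c["action"] for c in test}
```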

Figures

Figures reproduced from arXiv: 2605.14569 by Biao Gong, Chenglong Ma, Chenhui Wang, Hangjie Yuan, Hongming Shan, Jianxiong Gao, Shiwei Zhang, Shuai Tan, Yujie Wei.

Figure 1: Comparison with previous methods. (a) Previous methods often align fMRI embeddings with limited semantics in an isolated process, relying only on the current stimulus and yielding semantically inaccurate results. (b) Our method enriches the fMRI embeddings with comprehensive video semantics and introduces Mixture-of-Memories to dynamically select and fuse prior knowledge, producing semantically coherent v…

Figure 2: Overview of the proposed CINENEURON. In stage 1, given an input fMRI-video pair, the fMRI signals are first embedded by a Brain Model and enriched with the text, image, action, and category semantics extracted from the video. In stage 2, the proposed Mixture-of-Memories method dynamically selects multimodal embeddings from previously seen data via a router and fuses them with the fMRI embeddings via a fusi…

Figure 3: Qualitative comparison of CINENEURON and baselines on the cc2017 dataset.

Figure 4: Qualitative comparison of CINENEURON and baselines on the CineBrain dataset.

Figure 5: Visualization of voxel weights from the first regression layer. Voxel weights are averaged and normalized to [0, 1], displayed with a 0.25 to 0.75 colorbar. The blue and green dotted lines indicate the lateral occipital and ventral cortex, respectively. (Panel labels comparing GT, loss-ablation variants, and ours did not survive extraction.)

Figure 6: Qualitative ablation of each proposed component.
original abstract

Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in the human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CineNeuron, a hierarchical framework for fMRI-to-video reconstruction inspired by dual-pathway brain processing. It consists of a bottom-up semantic enrichment stage mapping fMRI signals to rich embeddings capturing textual semantics, image contents, action concepts, and object categories, followed by a top-down memory integration stage that employs Mixture-of-Memories to dynamically select and fuse relevant memories from previously seen data. Extensive experiments on two fMRI-to-video benchmarks are reported to show that CineNeuron surpasses state-of-the-art methods across various metrics.

Significance. If the reported gains prove robust under controls for data leakage, the work could advance fMRI-to-video reconstruction by addressing the semantic gap through explicit integration of video-specific cues (actions, dynamics) via brain-inspired stages, building on existing embeddings without introducing circular parameter fitting.

major comments (2)
  1. [top-down memory integration stage] The Mixture-of-Memories method selects and fuses memories from previously seen data, yet the manuscript provides no explicit mechanism (e.g., train/test split rules or disjoint memory bank construction) ensuring the bank excludes test distributions; a sketch of such a protocol follows these comments. This directly undermines the central claim of synergistic improvement, as gains on the two benchmarks could arise from memorization of benchmark statistics rather than general semantic enrichment.
  2. [experimental results] The abstract asserts superiority across metrics without reporting ablation studies isolating the contribution of the bottom-up vs. top-down stages, error bars, or data exclusion rules. Without these, it is impossible to confirm whether the synergistic gains are robust or affected by post-hoc choices, weakening the evidence for the hierarchical framework's effectiveness.
minor comments (1)
  1. [Abstract] The phrase 'surpasses state-of-the-art methods across various metrics' would be clearer if the specific metrics (e.g., PSNR, SSIM, semantic similarity) and quantitative improvements were stated.
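
For concreteness, here is a minimal sketch of the disjoint-bank protocol major comment 1 asks for, assuming per-sample IDs and a conventional train/test split; the paper's actual bank construction is not visible in the material reviewed here.

```python
# Sketch of disjoint memory-bank construction plus a leakage check:
# the bank keeps only training-split embeddings, and an assertion
# verifies that no test sample ID appears in the bank. The ID scheme
# and split convention are assumptions, not the paper's protocol.
import numpy as np

def build_memory_bank(embeddings, sample_ids, train_ids):
    """Keep only embeddings whose sample ID is in the training split."""
    train_ids = set(train_ids)
    mask = np.array([sid in train_ids for sid in sample_ids])
    return embeddings[mask], [sid for sid, m in zip(sample_ids, mask) if m]

def assert_no_leakage(bank_ids, test_ids):
    overlap = set(bank_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"memory bank leaks {len(overlap)} test samples")

emb = np.random.randn(6, 512).astype(np.float32)
ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
bank, bank_ids = build_memory_bank(emb, ids, train_ids=["s1", "s2", "s3"])
assert_no_leakage(bank_ids, test_ids=["s4", "s5", "s6"])
```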

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help clarify key aspects of our framework. We address each major point below and will revise the manuscript to improve transparency and robustness.

point-by-point responses
  1. Referee: [top-down memory integration stage] The Mixture-of-Memories method selects and fuses memories from previously seen data, yet the manuscript provides no explicit mechanism (e.g., train/test split rules or disjoint memory bank construction) ensuring the bank excludes test distributions. This directly undermines the central claim of synergistic improvement, as gains on the two benchmarks could arise from memorization of benchmark statistics rather than general semantic enrichment.

    Authors: We agree that explicit safeguards against data leakage are essential for validating the top-down stage. The memory bank is constructed solely from training-set embeddings with no overlap to test samples, following standard benchmark splits (e.g., subject-wise or video-wise disjoint partitions). However, this protocol was described only at a high level. In revision we will add a dedicated subsection with precise train/test split rules, pseudocode for disjoint bank construction, and confirmation that test distributions are fully excluded from memory selection and fusion. revision: yes

  2. Referee: [experimental results] The abstract asserts superiority across metrics without reporting ablation studies isolating the contribution of the bottom-up vs. top-down stages, error bars, or data exclusion rules. Without these, it is impossible to confirm whether the synergistic gains are robust or affected by post-hoc choices, weakening the evidence for the hierarchical framework's effectiveness.

    Authors: We acknowledge the need for stronger empirical isolation of each stage. The original experiments compare the full model against prior methods but do not include stage-wise ablations or variance estimates. We will expand the experimental section with (i) ablation tables removing bottom-up enrichment or top-down integration, (ii) error bars or standard deviations over multiple random seeds, and (iii) explicit restatement of the data-exclusion rules already used for the memory bank. These additions will directly demonstrate the synergistic contribution of the two stages. revision: yes
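
An illustrative sketch of the reporting pattern promised in response 2: train each ablation variant over several seeds and report mean and standard deviation. Here train_and_eval is a placeholder for the real training/evaluation loop, and the variant names are this review's assumptions.

```python
# Hypothetical stage-wise ablation report: mean ± std over random seeds
# for the full model and two ablated variants. Replace train_and_eval
# with the actual training/evaluation pipeline.
import random
import statistics

def train_and_eval(variant: str, seed: int) -> float:
    # Placeholder: would train the given variant with the given seed and
    # return a metric such as SSIM; here it returns a deterministic fake.
    return random.Random(hash((variant, seed))).uniform(0.4, 0.6)

variants = ["full", "no_bottom_up_enrichment", "no_mixture_of_memories"]
seeds = [0, 1, 2]
for v in variants:
    scores = [train_and_eval(v, s) for s in seeds]
    print(f"{v:>24}: {statistics.mean(scores):.3f} "
          f"± {statistics.stdev(scores):.3f}")
```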

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a hierarchical framework (bottom-up semantic enrichment followed by top-down Mixture-of-Memories integration) that builds on existing embeddings and memory concepts. No equations, derivations, or self-referential reductions appear in the provided text; performance claims rest on benchmark experiments that remain externally falsifiable rather than being forced by fitted parameters or self-citation chains. The approach is self-contained against external benchmarks with no load-bearing self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description of the two stages and the Mixture-of-Memories component.

pith-pipeline@v0.9.0 · 5523 in / 981 out tokens · 29716 ms · 2026-05-15T01:49:20.522330+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
