By merging realistic mechanics with growth and division, the framework tests whether orientation alone creates symmetric body plans.
The orientation of cell division is a major determinant of three-dimensional plant morphogenesis. Whether and how a simple division orientation rule explains the establishment of symmetric body plans is a fundamental question. Testing such hypotheses is facilitated by a modeling framework that combines realistic three-dimensional cell mechanics, irreversible cell-wall growth, and a deformable tissue geometry. We recently introduced such a framework, a 3D mechano-geometric multicellular model of apical stem cell-driven morphogenesis. Here we document how the model is built from physiological and computational perspectives. We describe the triangulated thin-shell representation of cells, the treatment of turgor pressure, cell-wall elasticity and strain-driven wall growth, the cell-division algorithm together with its two pluggable division-rule implementations, and the remeshing operations that keep the triangulation well-conditioned as cells grow, divide, and deform. The aim of this paper is to make the present model accessible to, and customizable by, experimental plant biologists.
Biological systems perform complex multi-step processes in a reproducible way despite underlying stochasticity. The standard explanation is micromanagement by molecular machinery that recognizes and corrects specific errors. Here we study conditioning, a qualitatively different strategy in which attempts failing a coarse criterion are destroyed and do not leave a physical record. The surviving, i.e., conditioned, ensemble is narrower and therefore more ordered. We model conditioning through stochastic resets in a "socks-before-shoes" model of a growing population, where $n$ actions must be completed in any order to replicate and any replication attempt not finished by a threshold time is discarded. We find that resets impose hierarchical temporal ordering of the $n$ actions without microscopic control over which action happens when. When disorder carries a sufficient time penalty, this ordering is free: the fastest-growing population is automatically the most ordered, with no direct selection for order required. Save points, at which verified progress is preserved across resets, allow conditioning to scale to complex multi-step processes. Conditioning provides a minimal route to reliable behavior, requiring only a clock rather than molecular machinery that recognizes specific errors. For the right class of processes, it pays for itself.
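A minimal simulation sketch of the conditioning step, with hypothetical per-action rates and deadline (not the paper's parameters): discarding attempts that miss the deadline narrows the surviving ensemble.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not the paper's parameters): n actions with distinct
# rates, each taking an exponential time; an attempt must finish by `deadline`
# or it is discarded and leaves no record.
n_attempts, deadline = 100_000, 3.0
rates = np.array([2.0, 1.5, 1.0, 0.5])            # slowest action last

durations = rng.exponential(1.0 / rates, size=(n_attempts, len(rates)))
total = durations.sum(axis=1)
survived = total <= deadline                       # the conditioning step

print(f"survival fraction: {survived.mean():.3f}")
print(f"total-time std: all {total.std():.2f}, "
      f"survivors {total[survived].std():.2f} (narrower, i.e. more ordered)")
print(f"slowest action mean time: all {durations[:, -1].mean():.2f}, "
      f"survivors {durations[survived, -1].mean():.2f}")
```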
Closed-loop brain-computer interfaces often require both a forecast of upcoming neural population activity and a readout of the animal's behavioral state. A single Mamba forecaster, trained only on next-step spike counts at Neuropixels scale, can deliver both in one forward pass. A lightweight per-session linear head reading the model's predicted rates decodes behavior better than the same linear classifier reading the raw spike counts, under matched temporal context. We test on the Steinmetz visual-discrimination benchmark, which spans 39 sessions, roughly 27,000 neurons, and 1,994 held-out trials. Across three training seeds, Mamba's predicted rates decode mouse choice at 75.7$\pm$0.2% trial vote, roughly 2.3 times chance level, and stimulus side at 66.1$\pm$0.6%, about twice chance. Compared to a matched 500 ms-context linear decoder on the raw spike counts, Mamba wins at trial vote by 4-6 pp on response and 4-6 pp on stimulus side. A session-start calibration block of about 100-150 trials brings the readout within 1-2 pp of asymptote, and the full pipeline fits inside the 50 ms bin budget on workstation-class GPUs typical of tethered chronic Neuropixels recordings.
Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation $r$ between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large-scale benchmark for causal, autoregressive spike-count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non-diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing-statistics constraints (region $\Delta R^2 = 0.018$ above the firing-statistics covariates). It also exposes a sub-Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL-on-output-rates distillation for ANN-to-SNN transfer in this Poisson count domain.
Understanding what individual neurons encode is a core question in neuroscience. In primary visual cortex (V1), mathematical models (e.g., Gabor functions) capture neural selectivity, but no comparable framework exists for higher areas. We show that natural language can fill this role: across macaque V1 and V4, the selectivity of most neurons is captured by concise, verifiable semantic descriptions. Using digital twins of V1 and V4, we develop a closed-loop framework that translates each neuron's high- and low-activating images into dense captions, generates a semantic hypothesis and synthesized images, and verifies the hypothesis in silico. Descriptions range from oriented edges and spatial frequency in V1 to conjunctions of form, color, and texture in V4. In V4, images generated from activating and suppressing hypotheses drove 96.1% of neurons above the 95th and 97.6% below the 5th percentile of natural-image responses, respectively (vs. ~10% for random images); V1 activation results matched V4, while V1 suppression was less describable in language. Representational similarity analysis reveals partial alignment between neural activity, vision embeddings, and language embeddings, with vision most aligned to neural activity; alignment lost in the text bottleneck is recovered when hypotheses are rendered back into images, showing that linguistic compression is lossy yet semantically faithful. Together, these results show that combining generative models with neural digital twins enables interpretable, testable descriptions of neural function at scale, toward agentic scientific discovery.
Strongly coupled, recurrent, balanced network models have been successful in describing and predicting many phenomena observed in cortical neural recordings. However, most balanced network models use current-based synapse models in place of more realistic, conductance-based models. Conductance-based synapse models predict unrealistically small membrane potential variability. On the other hand, introducing realistic levels of spike time correlations to models with current-based synapses predicts unrealistically large membrane potential variability. We use computer simulations to show that these two effects can cancel: recurrent network models with conductance-based synapses and spike time correlations produce more realistic, moderate levels of membrane potential variability. Consistent with recent work on feedforward networks, our results show that including more realistic modeling assumptions produces more realistic dynamics, but only when the two modeling assumptions are included together.
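For orientation, the standard textbook membrane equations contrast the two synapse conventions (these are generic forms, not the paper's network model):

```latex
% Current-based synapse: input enters as a voltage-independent current.
C_m \frac{dV}{dt} = -g_L\,(V - E_L) + I_{\mathrm{syn}}(t)

% Conductance-based synapse: input scales with the driving force, so strong
% input raises the total conductance and shrinks voltage fluctuations.
C_m \frac{dV}{dt} = -g_L\,(V - E_L) - g_e(t)\,(V - E_e) - g_i(t)\,(V - E_i)
```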
Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and demonstrate the differences between GLM embedding choices.
The spatial and functional organization of the primate visual cortex is a fundamental problem in neuroscience. While recent computational frameworks like the Topographic Deep Artificial Neural Network (TDANN) have successfully modeled spatial organization in the ventral stream, the computational origins of the dorsal stream's distinct topographies, such as direction-selective maps in the middle temporal (MT) area, remain largely unresolved. In this work, we present a spatiotemporal TDANN to investigate whether MT topography is governed by the same universal principles. By training a 3D ResNet on naturalistic videos via a Momentum Contrast (MoCo) self-supervised paradigm alongside a biologically inspired spatial loss, we demonstrate the spontaneous emergence of brain-like direction maps and topological pinwheel structures. Crucially, we reveal that MT tuning properties, characterized by strong direction selectivity paired with a residual axial component, arise from a strict optimization trade-off between task-driven discriminative pressure and spatial regularization. The model's representations quantitatively match in vivo macaque MT physiological baselines, including direction selectivity index, circular variance, and pinwheel density. These findings unify the computational origins of the ventral and dorsal streams, establishing a general mechanism for cortical self-organization.
The inositol 1,4,5-trisphosphate receptor channel (IP$_3$R) is an important calcium channel involved in calcium-induced calcium release, playing a prominent role in intracellular calcium signaling. However, accurately characterizing its gating behavior remains a challenge, particularly because the temporal resolution of patch-clamp techniques is insufficient to detect all short-lived events. This limitation can significantly bias the inference of kinetic models describing the receptor activity. To address this issue, we focused on the quantitative analysis of IP$_3$R gating behavior using patch-clamp data, with particular attention to missed events. We modeled IP$_3$R channel gating using hierarchical Markov chains and used a Bayesian approach that integrates missed-event correction directly into the likelihood function, enabling more accurate parameter inference and model evaluation. We show that accounting for missed events substantially clarifies the multi-modal model that emerges from model selection. In this new model, the Park and Drive modes both consist of the same 3-state Markov model, with mode-dependent kinetic parameters: the Drive mode stabilizes the closed state directly connected to the open one, whereas the Park mode stabilizes the other closed state, which is not connected to the open one. Intermediate Ca$^{2+}$ concentrations are found to strongly depress the Drive-to-Park transition rate, so that the IP$_3$R channel undergoes frequent transitions to the Park mode only for $\sim$50 nM or micromolar Ca$^{2+}$ concentrations. Overall, our approach provides a refined perspective on IP$_3$R channel modeling and highlights the critical importance of accounting for missed events in model selection based on single-channel recordings.
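A minimal illustration of the missed-event problem, using a hypothetical two-state channel and a simple dead-time merging rule rather than the paper's hierarchical Markov models and Bayesian likelihood correction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-state channel: alternating open/closed sojourns; any
# sojourn shorter than the recording dead time is invisible and its duration
# merges into the surrounding event.
tau_open, tau_closed, t_dead, n_events = 1.0, 3.0, 0.2, 200_000

dwells = np.empty(2 * n_events)
dwells[0::2] = rng.exponential(tau_open, n_events)    # open sojourns
dwells[1::2] = rng.exponential(tau_closed, n_events)  # closed sojourns

merged, state = [], []
for i, d in enumerate(dwells):
    s = i % 2                                  # 0 = open, 1 = closed
    if merged and (d < t_dead or state[-1] == s):
        merged[-1] += d                        # invisible: absorbed by neighbour
    elif not merged and d < t_dead:
        continue                               # drop a leading short event
    else:
        merged.append(d)
        state.append(s)

open_apparent = [d for d, s in zip(merged, state) if s == 0]
print(f"true mean open time:     {tau_open:.2f}")
print(f"apparent mean open time: {np.mean(open_apparent):.2f} (biased upward)")
```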
Faster probabilistic inference enables large-scale analysis and hyperparameter optimization in protein inference and omics studies.
NORI performs probabilistic inference to resolve ambiguous mappings between experimental observations and biological entities orders of magnitude faster than state-of-the-art methods. This makes large-scale analysis and extensive hyperparameter optimization possible, and supports a broader range of bioinformatics applications, including protein inference, and taxonomic and functional analysis in omics fields.
Topological data analysis (TDA) has established itself as a useful tool for capturing multiscale structures in complex networks, such as connected components, cycles, and cavities. Although the Vietoris-Rips (VR) filtration is widely used in network analysis, it tends to be computationally expensive, especially for large networks. This work explores a vertex-function-based (VFB) filtration built from network measures, applying persistent homology to identify relevant topological structures in cancer-associated protein networks, and compares its effectiveness with the VR approach. The results show that VFB reproduces the second-order structures (Betti-2) identified by VR, recovering previously reported essential genes. In addition, VFB detected new driver genes, confirmed in databases such as IntOGen and NCG, and allowed analysis of third-order structures (Betti-3) that was not feasible with VR. Thus, VFB represents a scalable alternative to VR, preserving biological interpretability and complementing classical network metrics.
Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert-annotated ground truth. A lightweight cross-validated prompt-refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record-level $F_1 = 0.98$ and transferred to PROTACs by terminology substitution alone, maintaining record-level $F_1 > 0.93$. Applied at scale, it expanded the molecular glue and PROTAC databases by 81% and 92% in record count, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay-context information essential for cross-study potency comparison and condition-aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.
Regression to the Mean and Regression Dilution are often viewed as unrelated issues in the clinical and ecological literatures. In reality, they are different names for the same problem: measurement error in an independent variable that biases the perceived relationship between two factors. This study unifies these traditions by comparing specialized clinical tools, like the Berry correction, with standard structural estimators such as Major Axis and Reduced Major Axis regression. Using an analytical framework, we evaluate how these methods perform across various noise levels and sample sizes. Our results show that the Berry method is a specialized tool designed for clinical scenarios where a 1:1 relationship is expected. However, applying it to ecological trade-offs with negative slopes can lead to severe errors. We provide maps of optimality to identify which estimator most accurately recovers the true biological signal under different conditions. By reconciling these disparate methods, we offer a principled guide for researchers to choose the correct tool based on their data's noise profile rather than their disciplinary tradition.
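A small simulation sketch of the shared problem, with illustrative parameters of our own choosing: measurement error in x attenuates the OLS slope by the reliability ratio, and structural estimators correct it differently.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters (ours, not the study's): a negative true slope, as
# in ecological trade-offs, with measurement error on the predictor.
n, a, b = 10_000, 1.0, -0.8
x_true = rng.normal(0.0, 1.0, n)
x_obs = x_true + rng.normal(0.0, 0.7, n)          # error in the x variable
y = a + b * x_true + rng.normal(0.0, 0.3, n)

# Regression dilution: the OLS slope is attenuated by the reliability ratio.
lam = x_true.var() / x_obs.var()                  # ~ 1 / (1 + 0.7**2)
ols_slope = np.polyfit(x_obs, y, 1)[0]
print(f"true slope {b}, OLS {ols_slope:.3f}, predicted {b * lam:.3f}")

# Reduced Major Axis slope: sign(r) * sd(y) / sd(x). It corrects in a
# different way and is unbiased only under its own error-ratio assumption.
r = np.corrcoef(x_obs, y)[0, 1]
print(f"RMA slope {np.sign(r) * y.std() / x_obs.std():.3f}")
```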
Complex microbial habitats host spatial competition between clonal bacterial populations that switch between different phenotypes. Here, we determine the effect of this subpopulation structure on the invasion of one species by another in a minimal model of two competing species: one species switches, both stochastically and in response to its competitor, to a persister phenotype resilient to competition. Surprisingly, our combined analytical and numerical results show that this phenotypic switching has no effect on the speed of the travelling wave by which the competitors invade the first population. Conversely, we discover that phenotypic switching can speed up the wave by which this population invades its competitors. Our results thus suggest, counterintuitively, that bacterial persistence can be an offensive, rather than defensive, ecological strategy.
The cerebellum and cerebral cortex form tightly coupled circuits thought to support flexible and efficient temporal processing. How this interaction shapes cortical learning dynamics, and whether such heterogeneous modularity can benefit artificial systems, remains unclear. Here, we augment a recurrent neural network (RNN) with a cerebellar-inspired feedforward module and evaluate the resulting architecture on temporal tasks of varying difficulty. The cortico-cerebellar RNN (CB-RNN) learns faster and reaches higher maximum performance than parameter-matched fully recurrent baselines across a variety of regimes. Crucially, freezing the recurrent core after minimal training and delegating subsequent learning to the cerebellar module preserves superior learning efficiency, suggesting the cerebellar module is a primary driver of efficiency and that the cortical network can largely function as a fixed reservoir. Our results suggest that heterogeneous modular architectures can act as a powerful structural inductive bias in neural systems.
Adaptive behavior requires the brain to transition between distinct contexts while maintaining representations of prior experience. The ability to reconfigure neural representations without erasing previously acquired knowledge is central to learning in dynamic environments, yet the neural mechanisms that support this balance remain unclear. Understanding these mechanisms is also critical for addressing catastrophic forgetting in artificial systems designed for lifelong learning. Here, we identify joint sparse coding and temporal dynamics in both the mouse medial prefrontal cortex (mPFC) and computational networks as mechanisms that help preserve prior representations during context transitions. Specifically, sparsity in context-dependent representations reduces cross-context interference, whereas temporal dynamics within the network activity further enhance context separability across time. Strikingly, networks endowed with both properties, such as spiking neural networks, exhibit improved retention during lifelong learning without auxiliary heuristics. These findings establish joint sparse coding and temporal dynamics as a core mechanism supporting flexible context reconfiguration in lifelong learning and, through their activity-constraining nature, as an energy-efficient architectural principle for stable adaptation. Together, they provide a mechanistic framework for understanding how the brain preserves prior knowledge while flexibly adapting to new contexts.
Multimodal models that jointly reason over protein sequences, structures, and function annotations within a unified representation hold immense potential for integrating multimodal data and generating new proteins with designed functional properties. To utilize transformer architectures, such models require a tokenizer that converts protein structure from continuous atomic coordinates into discrete representations suitable for scalable multimodal training. The quality of such models is fundamentally upper-bounded by the fidelity and expressiveness of the underlying tokenized structure. However, existing tokenizers prioritize reconstruction over generative abilities. To address these gaps, we introduce Yeti, a simple and compact protein structure tokenizer based on lookup-free quantization and trained end-to-end with a flow-matching objective for multimodal learning. Compared to existing models, Yeti generally achieves the best codebook utilization and token diversity, and the second-best reconstruction accuracy (with 10x fewer parameters than ESM3) on diverse datasets. To validate Yeti's generative capability, we trained a compact multimodal model jointly over its structure tokens and amino acid sequence entirely from scratch, with no pretrained initialization. The resulting multimodal model generates plausible structures under unconditional cogeneration of protein sequence and structure, achieving results comparable to those of 10x larger models. Together, these results demonstrate that Yeti is a compact and expressive protein structure tokenizer suitable for training multimodal models that cogenerate highly plausible sequences and structures.
TD3B controls protein state transitions to produce binders whose functional bias is independent of binding affinity.
Protein function is often controlled by ligands that bias the direction of state transitions, such as agonists and antagonists, rather than stabilizing a single conformation. This is especially important for clinically relevant G protein-coupled receptors (GPCRs), where therapeutic efficacy depends on functional directionality. Structure-based design methods optimize binding to static conformations and cannot represent non-reversible, directional effects or systematically distinguish agonist from antagonist behavior. To address this gap, we introduce Transition-Directed Discrete Diffusion for Allosteric Binder Design (TD3B), a sequence-based generative framework that designs binders with specified agonist or antagonist behavior via a directional transition control objective. TD3B combines a target-aware Direction Oracle, a soft binding-affinity gate, and amortized fine-tuning of a pre-trained discrete diffusion model, enabling targeted agonist and antagonist generation decoupled from binding affinity and unattainable by equilibrium-based or inference-only guidance baselines. The code and checkpoints are available at https://huggingface.co/ChatterjeeLab/TD3B.
One of the great phytogeographic zones of the world's semi-arid lands is the Kurdistan region of Iraq, which hosts many important fruit species owing to its geographical location and ecology. Mountain hawthorn (Crataegus spp.) is a vital wild edible deciduous fruit tree of the genus Crataegus for the region, highly valued for ornamental, economic, industrial, and medicinal uses. In the present study, morphological, phytochemical, and molecular marker systems were applied to sixty-one hawthorn accessions from different locations in the Iraqi Kurdistan region from April 2022 to September 2023. Phenotypic markers have proven extremely useful in studies of genetic diversity in hawthorn genotypes; the present morphological study identified seven taxa (five species, two hybrids): Crataegus azarolus, Crataegus meyeri, Crataegus monogyna, Crataegus orientalis, Crataegus pentagyna, Crataegus azarolus x Crataegus meyeri, and Crataegus azarolus x Crataegus pentagyna. There was significant variation among ecotypes in plant type, reproductive stage, fruit morphology, and production uses. Analysis of variance of the fruit physio-morphological data revealed a high level of significant variability (P < 0.01) among accessions. The most important characteristics for explaining fruit morphological variability were 11 variables: fruit weight (FW), fruit length (FL), fruit width, seed length (SL), seed width (SW), number of seeds per fruit (NSF), volume solution (VS), fruit fresh weight (WOF), seed weight (WS), potential of hydrogen (pH), and moisture content (MC). All traits measured differed significantly among the studied accessions.
Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
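To make the comparison concrete, a minimal sketch assuming precomputed per-gene embeddings; the symmetric fusion (|e1 - e2|, e1 * e2), embedding width, and hidden size are our assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch (our own, not the paper's code): given precomputed protein
# language model embeddings e1, e2 for a gene pair, compare the unsupervised
# cosine-similarity score with a learned head over the fused embedding.
D = 1280  # assumed embedding width

def cosine_score(e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
    return nn.functional.cosine_similarity(e1, e2, dim=-1)

class SiameseMLPHead(nn.Module):
    """Classifier over a symmetric fusion of the two embeddings."""
    def __init__(self, d: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, e1, e2):
        # Symmetric fusion so the pair order does not matter.
        fused = torch.cat([torch.abs(e1 - e2), e1 * e2], dim=-1)
        return self.net(fused).squeeze(-1)  # logit for "same operon"

e1, e2 = torch.randn(8, D), torch.randn(8, D)
print(cosine_score(e1, e2).shape, SiameseMLPHead(D)(e1, e2).shape)
```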
Adults vary greatly in how effectively they learn a new language, but the signals driving the learning processes and individual differences remain unclear. Over seven days, we tracked behavioral learning and collected fMRI data from 102 adults as they learned an artificial language with corrective feedback. We trained matched transformer models with prediction, feedback, or combined objectives and compared their internal representations to brain activity. Representations derived from the prediction-focused model accounted for the largest share of unique neural variance at the group level, despite the human task being feedback-based. Throughout model training, both objectives showed a shift in brain-model alignment from sensory to higher-order language and associative networks, indicating abstraction processing. Conversely, neural patterns related to the feedback model were most useful for predicting individual generalization outcomes on Day 7. These findings support a multi-signal model of adult language learning, in which prediction shapes a common neural learning architecture across learners, whereas feedback-related mechanisms better explain individual differences over time.
Analytical probability distributions show noise-driven transitions and non-monotonic fear effects that reconcile conflicting ecological data
Traditional population models that include predator-prey interactions attribute demographic changes directly to predation-related effects. However, predator-induced fear in prey has increasingly been recognised as an important factor shaping population dynamics. In this study, we propose a cubic population model in which fear acts through two distinct functional channels for a single-species population exhibiting the Allee effect. In this model, fear reduces the intrinsic growth rate through a multiplicative suppression mechanism while also playing an integrated role in modulating the growth and interaction dynamics by rescaling the saturation structure of the Holling type III interaction term. The stochastic extension of the model is described by a Langevin formalism containing correlated additive and multiplicative Gaussian noise, and the steady state probability distribution (SSPD) is analytically obtained using the corresponding Fokker-Planck equation. The analytical solution is validated by numerical simulations. The SSPD reveals both noise-induced transitions and fear-controlled regime changes between low- and high-density states, with the two-channel effect of fear producing structural competition and non-monotonic changes in the distribution. These are analysed through phenomenological bifurcation (P-bifurcation) diagrams and three-dimensional distribution surfaces. Additionally, statistical properties, parameter sensitivity, and escape dynamics are investigated through normalised moments, Fisher information, and mean first-passage time (MFPT) calculations. Notably, our model treats fear as an independent control parameter and provides a natural explanation for several conflicting empirical findings in the literature on fear-mediated population dynamics, while also offering an analytical basis for conservation biology and ecosystem management.
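For orientation, the textbook special case with a single noise source has a closed-form stationary distribution; the paper's correlated additive-plus-multiplicative setting generalizes this (Stratonovich convention assumed here):

```latex
% Langevin dynamics with one multiplicative noise source,
% \langle \xi(t)\,\xi(t') \rangle = 2D\,\delta(t - t'):
\frac{dx}{dt} = f(x) + g(x)\,\xi(t)

% Stationary solution of the associated Fokker-Planck equation:
P_{\mathrm{st}}(x) = \frac{\mathcal{N}}{g(x)}
  \exp\!\left[\int^{x} \frac{f(x')}{D\,g^{2}(x')}\,dx'\right]
```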
Learning in artificial neural networks usually relies on continuous, externally driven weight updates, in which parameters are modified at every step in response to incoming data, error signals or reward feedback. In this setting, routine and informative inputs contribute similarly to parameter adjustment. We introduce a learning approach in which parameter updates are governed by internally generated events arising from the network's own representational dynamics. During ongoing activity, synaptic interactions are accumulated as latent traces encoding recent coactivation patterns, without immediately modifying the underlying parameters. In parallel, an internal predictive process estimates the evolving latent state, while a scalar measure of discrepancy between predicted and observed states is continuously computed. When discrepancy exceeds an adaptive threshold derived from recent error statistics, a learning event is triggered, inducing a retrospective update selectively integrating past activity into the current configuration. We performed simulations using a minimal neural network exposed to structured sequential inputs with transient perturbations. We found that learning occurs through sparse, temporally localized events associated with increases in prediction error, leading to stepwise changes in synaptic efficacy and discrete transitions in latent state organization. By selectively reorganizing parameters in response to internally detected discrepancies, our episodic updating may reduce unnecessary parameter drift while preserving informative patterns. Potential applications include systems requiring selective adaptation to rare or informative inputs such as physiological, industrial or environmental monitoring, edge computing under limited energy budgets, autonomous systems operating in dynamic conditions and sequential computational data processing.
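A schematic sketch of the update loop as we read the abstract; the trace decay, predictor learning rate, window length, and threshold constant are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)

# Schematic loop: coactivations accumulate in a latent trace; a predictor
# tracks the trace; an update is committed only when the scalar discrepancy
# crosses an adaptive threshold derived from recent error statistics.
dim, steps, eta, k = 16, 2_000, 0.1, 3.0
W = np.zeros((dim, dim))          # committed parameters
trace = np.zeros((dim, dim))      # latent eligibility trace
pred = np.zeros((dim, dim))       # internal prediction of the trace
errors, n_events = [], 0

for t in range(steps):
    x = rng.normal(size=dim)
    if t % 500 == 250:            # transient structured perturbation
        x[:4] += 5.0
    trace = 0.95 * trace + np.outer(x, x)       # accumulate coactivation
    err = np.linalg.norm(trace - pred)          # scalar discrepancy
    pred += 0.1 * (trace - pred)                # slow predictive tracking
    errors.append(err)
    recent = np.array(errors[-200:])
    if len(errors) > 50 and err > recent.mean() + k * recent.std():
        W += eta * trace                        # sparse, retrospective update
        n_events += 1

print(f"learning events: {n_events}/{steps}; |W| = {np.linalg.norm(W):.1f}")
```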
Molecular dynamics shows the AI-predicted structure of the skin adhesion protein remains mostly stable over 500 ns with domain-specific flexibility
Background: BP180, also known as collagen XVII and BPAG2 (bullous pemphigoid antigen 2), is a 180-kDa transmembrane protein within the hemidesmosomal plaque complex and a major antigen in bullous pemphigoid, gestational pemphigoid, cicatricial (mucous membrane) pemphigoid, and linear IgA bullous disease.
Objective: At present, the 3D structure of BP180 is not known. The goal is to predict a reasonable structure for BP180 through machine learning and molecular dynamics.
Methods: In this work, we use the recent Boltz-2 model to predict a putative structure for the intracellular, transmembrane, and proximal extracellular domains, including the NC16A antigenic region and a portion of its first extracellular collagenous domain, Col-15. We computationally embed BP180 in a simple phospholipid bilayer, demonstrate that the putative structure is stable using molecular dynamics, and analyze its allosteric properties.
Results: The structures presented satisfy symmetry and secondary structure properties which are expected from homology modelling. Over three 500 ns trajectories, there is minor instability of the predicted globular head domain, but the homotrimer otherwise stays mostly folded. The putative NC16A domain is stiff, whereas the truncated Col-15 domain is highly flexible. There does not appear to be a nearby stable conformation distinct from the initial state.
Conclusion: The structure presented is a useful starting point for targeting BP180 pharmacologically, for further experimental characterization of BP180, and for generating hypotheses regarding the relevant epitopes contributing to bullous disease. Diffusion models such as Boltz-2 and AlphaFold3 are useful, but their results must be evaluated carefully.
Importance: The 2026 multisociety dyslipidemia guideline recommended the PREVENT equations in place of the PCE equations, introduced 30-year risk assessment as a new treatment pathway, and lowered risk-based treatment thresholds. The net population impact of these concurrent changes on statin recommendations is unknown.
Objective: To estimate changes in statin recommendations under 2026 PREVENT-based dyslipidemia guidelines compared with 2018 PCE-based guidelines.
Design and Participants: Cross-sectional analysis of pooled data from NHANES, spanning 2011-2023 and comprising 24,199 participants aged 30-79 years.
Main Outcomes and Measures: Number and proportion of US adults receiving or recommended for statin therapy.
Results: At the class 1 threshold, the number of US adults receiving or recommended for statin therapy decreased by an estimated 3.0 million (95% CI, 2.3 million to 3.6 million), with larger reductions among Black adults (-4.2 percentage points [pp]), men (-4.0pp), and adults aged 50-69 years (-5.6pp). At the class 2 threshold--which additionally recommends statins for adults aged 30-59 years based on 30-year risk--the number of adults recommended increased by an estimated 20.8 million (95% CI, 19.6 million to 22.0 million), or +11.6pp. The increase was largest among adults aged 50-59 years (+19.7pp) and 40-49 years (+14.8pp).
Conclusions: The net population impact of the 2026 dyslipidemia guidelines depends critically on which recommendation class is applied. At the class 1 threshold, statin recommendations decreased modestly; at the class 2 threshold, inclusion of 30-year risk assessment substantially expanded recommendations, particularly among younger adults. These divergent effects underscore the importance of the 30-year risk criterion as a major driver of new eligibility and the need for outcomes and equity monitoring during guideline implementation.
Protein design aims to compose amino-acid sequences that fold into stable three-dimensional structures while satisfying targeted functional properties. The field is increasingly shifting toward vibe protein design, where a single model is expected to generate novel sequences, engineer existing proteins, and reason about protein characteristics through flexible natural-language constraints. Large language models (LLMs) have emerged as a leading paradigm in this space. However, existing evaluation benchmarks often limit their scope to a partial aspect of protein design, while others restrict design objectives to structured input schemas, lacking an integrated framework that evaluates the broad spectrum of protein design competence under open-ended intents. To this end, we present Vibe Protein design Benchmark (VibeProteinBench), a language-interfaced benchmark that probes generalist capabilities through three complementary stages mirroring a computational protein design workflow: recognition, engineering, and generation. Each stage is grounded in expert-curated mechanistic rationales and multi-faceted in silico validation, to computationally verify whether model outputs are biologically plausible. Evaluations across diverse general-purpose and domain-specialized LLMs reveal that no model achieves strong performance across all three stages, suggesting that generalist protein design remains a substantial open challenge for current LLMs.
Longitudinal studies keep data, metadata and results together so workflows stay transparent and reports generate automatically.
MeTime is an open-source R package for reproducible analysis of longitudinal metabolomics data. It builds upon a central S4 container, metime_analyser, that stores multiple datasets, associated metadata, and analysis outputs, enabling unified handling of complex longitudinal studies. Analyses are constructed by piping modular functions, beginning with data transformations (mod_), followed by calculations (calc_), and optional meta-analysis (meta_), so entire workflows remain transparent and easy to modify. MeTime wraps numerous existing methods within a consistent interface, including sample and metabolite distributions, correlation and distance matrices, dimensionality reduction (PCA, UMAP, t-SNE), random-forest imputation and feature selection via Boruta, eigenmetabolite and WGCNA-based clustering, conservation index analysis, regression models (linear, mixed-effects, and generalized additive), and partial correlation networks. By retaining all intermediate results and provenance within the container, MeTime facilitates iterative exploration and ensures reproducible reporting via automatically generated HTML and PDF outputs. Comprehensive user guides, case studies, and reference documentation accompany the package, making MeTime a versatile platform for longitudinal omics workflows.
Oscillatory activity in auditory cortex is thought to play a central role in auditory and speech processing by synchronizing neural rhythms to external acoustic features of the speech stream. To support this function, cortical oscillators must flexibly phase-lock to inputs spanning a wide range of timescales, including rhythms substantially slower than their intrinsic frequency. Here we identify a general dynamical mechanism by which intrinsic inhibitory currents operating on multiple timescales enable such flexible phase-locking. Using tools from dynamical systems theory, we show that interactions between slow and superslow inhibitory processes generate prolonged post-input recovery delays through delayed Hopf phenomena, thereby substantially expanding the frequency range over which entrainment can occur. We demonstrate this mechanism in a biophysically grounded cortical theta oscillator model for speech segmentation. Specifically, we show that both a theta-timescale (4-8 Hz) inhibitory current $I_m$ and a slower delta-timescale (1-4 Hz) inhibitory potassium current $I_{\rm K_{SS}}$ are crucial for entrainment flexibility. Their interaction creates a three-timescale structure that gives rise to pronounced delay phenomena associated with a delayed Hopf bifurcation (DHB). Interestingly, the superslow $I_{\rm K_{SS}}$ and the associated DHB play little role in the unforced oscillatory dynamics, but are recruited to support phase locking under external forcing. Moreover, the intermediate-timescale current $I_m$, rather than being redundant, further expands the phase-locking range by prolonging delayed recovery along the superslow manifold. Together, these results suggest that coordination among intrinsic inhibitory currents operating on multiple timescales may represent a key mechanism supporting flexible phase locking to rhythmic inputs in the brain.
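A minimal demonstration of the delay phenomenon in the Hopf normal form (a toy, not the biophysical model): when the bifurcation parameter is ramped slowly, the oscillation amplitude stays small well past the bifurcation point.

```python
import numpy as np

# Slow passage through a Hopf bifurcation (normal form): ramp mu slowly from
# -0.2 to 1.0 and record when mu crosses zero versus when the amplitude
# actually grows. The escape is delayed well past the bifurcation point.
eps, omega, dt = 0.01, 1.0, 1e-3
z, mu, t = 1e-3 + 0j, -0.2, 0.0
t_cross = t_escape = None
while mu < 1.0:
    z += dt * ((mu + 1j * omega) * z - abs(z) ** 2 * z)  # Euler step
    mu += eps * dt
    t += dt
    if t_cross is None and mu >= 0.0:
        t_cross = t
    if t_escape is None and abs(z) > 0.1:
        t_escape = t
print(f"mu crosses 0 at t ~ {t_cross:.0f}, amplitude escapes at t ~ {t_escape:.0f}")
```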
Hierarchical propagation through interaction and pathway graphs improves prediction and reveals functional disease mechanisms in multiple cancers
Understanding how molecular alterations propagate across biological systems to drive disease remains a central challenge. Although high-throughput profiling enables comprehensive characterization of tumor states, most models neglect structured biological relationships or lack interpretability across scales. Here we present PPI-Net, a hierarchical graph neural network that integrates protein-protein interaction (PPI) networks with pathway-level representations to model disease from molecular interactions to functional processes. Patient-specific molecular profiles are embedded within a shared interaction network from STRING and propagated through a multi-layer Reactome hierarchy using graph attention, enabling aggregation of gene-level signals into higher-order biological programs. Across RNA-seq data from ten cancer types from The Cancer Genome Atlas, PPI-Net achieves robust predictive performance, with balanced accuracy exceeding 90% in multiple cohorts. Comparative analysis on RNA-Seq data from breast cancer demonstrated that PPI-Net's integration of the Reactome hierarchy improved balanced accuracy by 6.7% relative to a PPI-only model, while hierarchical multi-level supervision improved balanced accuracy by 12.3% relative to using only a single top-level prediction head. Applying a multi-omics approach using RNA-seq and methylation data improves model interpretation, recovering canonical oncogenic modules, including TP53-AKT signaling and stress response pathways, while revealing convergence onto coherent programs such as ion signaling and cellular responses to stimuli. These results demonstrate that integrating interaction networks with pathway hierarchies enables accurate prediction while providing mechanistic insight into cancer biology.
RNA inverse sequence design has broad biological and engineering applications, but computational methods for practical design queries remain limited. Such queries may impose several constraints at once, including target folds or motifs, fixed bases, and coding restrictions, while leaving arbitrary sequence and structure in unspecified regions. Because these constraints may permit many acceptable sequences, we study RNA design as a conditional generative modeling problem. The basic object is a conditional law over RNA sequences given a user-specified condition, with full inverse folding as a special case. We introduce GoForth, a forward-trained RNA language model that conditions on structure, sequence, and coding targets. The formulation separates three ingredients that are often entangled in RNA design: a sequence prior, a forward folding sampler, and a reward or likelihood oracle. We train encoder-decoder models on witnessed folds rather than on outputs from an inverse-design teacher and validate our methodology on full inverse-folding benchmarks, as well as tasks involving constraints on structure, sequence, and coding. The resulting models achieve fast and high-quality candidate generation for mixed RNA design specifications. Moreover they furnish useful semantic embeddings of design tasks and a robust learned notion of designability.
The emergence of a hantavirus variant aboard a commercial cruise ship presents a significant public health concern. This study develops a discrete-time stochastic Susceptible-Exposed-Infectious-Recovered-Dead model to estimate transmission dynamics, hidden exposed infections, and outbreak risk among passengers and crew. Epidemiological parameters and latent disease states were inferred using an Ensemble Adjustment Kalman Filter calibrated to reported case data from WHO and ECDC situation reports. The estimated basic reproduction number was 2.76, with a 95% confidence interval of 2.52-2.99, indicating substantial potential for sustained onboard transmission before strict quarantine measures. Simulations further suggest that several exposed individuals may remain unidentified during the early outbreak phase, creating a hidden reservoir that symptom-based surveillance alone may fail to detect. These findings highlight the importance of rapid surveillance, widespread testing, targeted quarantine, and active monitoring of exposed individuals in confined travel settings. The proposed modeling framework can support timely outbreak assessment and intervention planning for infectious-disease events in similarly dense and spatially constrained populations.
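A minimal discrete-time stochastic SEIRD step consistent with the abstract's description; the rates below are illustrative stand-ins, not the paper's EAKF-calibrated estimates:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative rates only (not the paper's fitted values);
# here beta / (gamma + mu) ~ 3.1.
N, days = 3_000, 60
beta, sigma, gamma, mu = 0.55, 1 / 4, 1 / 6, 0.01
S, E, I, R, D = N - 5, 0, 5, 0, 0

for t in range(days):
    p_inf = 1.0 - np.exp(-beta * I / N)                    # daily infection prob.
    new_E = rng.binomial(S, p_inf)                         # S -> E
    new_I = rng.binomial(E, 1.0 - np.exp(-sigma))          # E -> I
    out_I = rng.binomial(I, 1.0 - np.exp(-(gamma + mu)))   # I -> R or D
    new_D = rng.binomial(out_I, mu / (gamma + mu))         # share of exits dying
    S, E = S - new_E, E + new_E - new_I
    I, R, D = I + new_I - out_I, R + out_I - new_D, D + new_D

print(f"day {days}: S={S} E={E} I={I} R={R} D={D}")
```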
The success of machine learning in drug discovery hinges on learning the relationship between a chemical structure and its biological activity. While DNA-Encoded Library (DEL) technology can generate the massive datasets required for this task, its primary signal -- sequencing read counts -- is an indirect and often noisy proxy for true molecular binding affinity. To address the scarcity of public benchmarks for developing robust models that can overcome this data challenge, we introduce CA-DEL, a multi-dimensional public benchmark featuring screens against three homologous carbonic anhydrase isoforms. While recent benchmarks like KinDEL have introduced 3D poses for kinase targets, CA-DEL distinguishes itself by focusing on the selectivity challenge among homologous Carbonic Anhydrase isoforms (CAII, CAIX, CAXII). Unlike benchmarks relying solely on noisy enrichment scores, CA-DEL integrates a rigorous validation set of experimentally determined binding affinities ($K_i$) from ChEMBL, establishing a critical Sim-to-Real evaluation paradigm: training on noisy DEL screens and testing on high-fidelity biophysical data.
Qualitative models provide crucial instruments for modelling complex biological systems. While advances in automated reasoning and symbolic encodings have enabled rigorous inference of these models from data, the process remains highly fragile. First, biological measurement errors inevitably propagate into formal model specifications. Second, when a specification becomes unsatisfiable, distinguishing between fundamental design flaws and minor technical errors is notoriously difficult. This uncertainty often leads to under-specification, as it is unclear which observations are still "safe" to incorporate. To overcome these challenges, we introduce a robust inference method based on weighted MaxSMT. By encoding uncertain biological observations as weighted soft constraints, our approach enables the solver to identify a model best reflecting the observations, even with some conflicting constraints. Our method allows for Boolean and multi-valued variable domains, alongside observations derived from discretisation (level constraints) and differential expression (ordering constraints). We show our approach can be used to successfully infer neural cell differentiation models from prior-knowledge networks with 200-1300 genes using ordering constraints on all included genes.
A central challenge in the origin of life is understanding how catalytic peptide-like polymers and information-bearing nucleic acid-like polymers emerged as an interdependent system. This study constructs a primordial cognitive model incorporating two internal Lotka-Volterra chemical oscillators to investigate, through simulation, whether a catalytic loop, primordial tRNAs, and nucleic acids that record and amplify them, can form through the interaction of polymers represented by binary (0/1) sequences. In this model, a mechanism was introduced where the synthesis of internal oscillations provides a temporal bias for 0/1 selection during polymer elongation, while generated functional sequences are protected, recorded, and re-amplified. Simulation results demonstrated that the proposed cognitive model significantly outperformed a contrast model based on random 0/1 selection in terms of the establishment rate of catalytic loops, the accumulation of functional molecules, polymer elongation, and the reduction of Shannon entropy in sequence distribution. Furthermore, this superiority was generally maintained across sensitivity analyses, including batch calculations with different random seeds. While this study is a computational model based on abstract binary sequences and simplified translation/replication rules rather than a direct reconstruction of life's origin, it provides a working hypothesis for the interdependent emergence of catalytic function and information retention by demonstrating that internal oscillations can bias sequence exploration within a framework linking autocatalytic networks, recording, and group selection. Future research must verify the generality and empirical validity of this framework by expanding monomer types, evolving into multi-oscillator systems, and establishing correspondences with compartmentalized experimental systems.
The study of cultural evolution seeks to understand the processes by which behavioral variants are chosen in cultures over time, often as the result of large numbers of individual human choices. The selection of new popes, each of whom chooses a papal name -- typically reusing previous names in reference to previous popes -- is among the longest ongoing cultural processes taking place in a single human institution. Here, we use the record of papal names as a setting for long-term analysis of human cultural behavior. Although papal name choices are careful individual decisions, we find that the long-term sequence of papal names accords with predictions of a family of models developed in population genetics and stochastic processes -- Ewens sampling theory and the Chinese restaurant process -- which in the case of papal names amounts to randomly copying an existing name in proportion to its frequency, with the possibility of innovation of new names (mutations). Hence, despite the consideration that enters into choices of individual papal names, aggregate cultural behavior in a 2000-year-old human process can potentially be described with simple laws. We discuss instances in which particular historical events might have caused temporary deviations from the random-copying model.
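The random-copying model is short enough to state in code; a minimal Chinese restaurant process sketch with an arbitrary innovation rate (not fitted to the papal record):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

def chinese_restaurant(n: int, theta: float) -> Counter:
    """Draw n names: copy an existing name with probability proportional to
    its count; innovate a new name with probability theta / (t + theta)."""
    names = []
    for t in range(n):
        if rng.random() < theta / (t + theta):
            names.append(f"name_{len(set(names))}")   # innovation (mutation)
        else:
            names.append(names[rng.integers(t)])      # frequency-proportional copy
    return Counter(names)

# Illustrative run: ~266 popes, innovation rate theta chosen arbitrarily.
counts = chinese_restaurant(266, theta=8.0)
print(f"{len(counts)} distinct names; top 5: {counts.most_common(5)}")
```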
Functional connectivity (FC) derived from resting-state fMRI is widely used to characterize large-scale brain network alterations in neurological and psychiatric disorders. However, FC construction critically depends on the choice of brain atlas, and different parcellations may emphasize distinct organizational features, leading to heterogeneous and sometimes inconsistent representations. Existing multi-atlas approaches partially alleviate this issue but often fuse atlas-derived features or predictions at a relatively shallow level, while single-atlas disentanglement methods do not explicitly address cross-atlas heterogeneity. We propose Multi-Atlas Disentangled Connectivity LEarning (MADCLE), a multi-branch representation learning framework that jointly encodes FC matrices derived from different brain atlases. Rather than introducing a single explicitly shared latent variable across parcellations, MADCLE learns atlas-wise disease-related representations and encourages them to be cross-atlas consistent through distributional alignment. Meanwhile, covariate-related and atlas-dependent residual factors are modeled separately using covariate similarity supervision, atlas-specific reconstruction, and decorrelation constraints, thereby reducing the leakage of non-disease and parcellation-dependent information into the disease-related embeddings. Experiments on the ADNI and ADHD-200 datasets suggest that MADCLE achieves competitive or improved performance compared with single-atlas baselines, multi-atlas GNN/Transformer models, and recent multi-atlas consistency frameworks. These results support the potential value of structured disentanglement for FC-based disorder identification under heterogeneous parcellation schemes.
Our understanding of cell division control in bacteria still relies largely on interpreting correlations between phenomenological variables, with limited connection to the underlying molecular mechanisms.
Here, we analytically solve a stochastic threshold-accumulation model in which a size-dependent divisor protein triggers division upon reaching a noisy, autocorrelated threshold, quantifying within a unified framework the combined effects of intrinsic and extrinsic noise and key mechanistic parameters such as protein reset and threshold memory. We show that incorporating these elements yields behavior far richer than the commonly assumed adder, spanning a continuum of division strategies from timer to sizer while modulating size fluctuations in a nontrivial fashion. Comparison with single-cell E. coli data shows that extrinsic noise and additional mechanistic ingredients are required to account for the observed size fluctuations. The adder emerges when threshold correlations balance protein reset, generalizing the hypothesis that full reset is necessary to maintain adder control.
Our results establish a unified analytical framework linking stochastic molecular processes to emergent division laws, to be used in more complex bacterial cell-cycle models.
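A toy simulation in the spirit of this framework, with illustrative parameters: a divisor protein accumulates in proportion to growth, and division fires at a noisy AR(1) threshold. With full reset and no threshold memory the slope of added size against birth size is near zero (adder); threshold memory shifts it, as described above.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative parameters (ours): protein produced in proportion to growth,
# division at a noisy AR(1) threshold; `reset` = 1 means full protein reset.
n_gen, alpha, reset = 20_000, 1.0, 1.0
phi, sigma_th = 0.5, 0.1                  # threshold memory and noise

births, adds = [], []
s_birth, theta, theta_prev = 1.0, 1.0, 1.0
for _ in range(n_gen):
    p0 = (1.0 - reset) * theta_prev       # stylized protein carry-over
    s_div = s_birth + max(theta - p0, 0.0) / alpha   # divide when p hits theta
    births.append(s_birth)
    adds.append(s_div - s_birth)
    s_birth = 0.5 * s_div                 # symmetric division
    theta_prev = theta
    theta = 1.0 + phi * (theta - 1.0) + sigma_th * rng.normal()

b, a = np.array(births[100:]), np.array(adds[100:])
print(f"added-size vs birth-size slope: {np.polyfit(b, a, 1)[0]:.2f} "
      "(0 = adder, -1 = sizer, +1 = timer-like)")
```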
Trial-to-trial variability of neural responses has been linked to important aspects of neural computation and is essential for understanding how neuronal populations respond. While current overdispersion models treat each neuron's gain as independent of the others, this assumption fails to capture the network statistics of neuronal populations. As no existing model can capture overdispersed, structured spiking gain modulation across a neural population, network-level gain covariance remains largely unstudied. We thus present the Poisson matrix-normal latent variable (PMNLV) model, which extends single-neuron overdispersion to neural populations by placing a matrix-normal prior over the latent gain with a Kronecker-factored covariance. Spike counts are Poisson-distributed with a rate equal to the sum of a per-neuron stimulus tuning term and a matrix-normal gain, passed through a quadratic soft-rectifying link. We derive two complementary estimation algorithms: a variational EM (VEM) with a matrix-normal posterior that recovers dense Kronecker factors without structural assumptions, and a Kernel Tournament Method (KTM) that performs data-driven selection over a biologically motivated kernel dictionary and composite likelihood. On simulated data, both algorithms recover the inter-neuron and temporal covariance factors and accurate tuning curves. Applying VEM to Neuropixels recordings across four cortical regions of the mouse visual hierarchy, we replicate a previous finding that single-neuron marginal variability changes little across cortical areas. We then show that shared population co-variability, invisible to scalar summaries such as the Fano factor, peaks in primary visual cortex and declines in higher visual areas. The PMNLV framework is applicable to any simultaneously recorded population where structured gain covariance is of scientific interest.
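A generative sketch of this model class, with our own illustrative covariance factors and a softplus-squared stand-in for the quadratic soft-rectifying link (assumptions, not the paper's fitted quantities):

```python
import numpy as np

rng = np.random.default_rng(7)

# Matrix-normal gain with Kronecker covariance: row factor over neurons,
# column factor over time bins; counts are Poisson given the gain.
N, T, trials = 30, 40, 200
U = np.linalg.cholesky(0.3 * np.ones((N, N)) + 0.7 * np.eye(N))   # neuron factor
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
V = np.linalg.cholesky(0.8 ** lags)                               # temporal factor
tuning = rng.uniform(0.5, 1.5, size=(N, 1))                       # stimulus drive

def softplus_sq(x):                     # stand-in quadratic soft-rectifier
    return np.log1p(np.exp(x)) ** 2

counts = np.empty((trials, N, T), dtype=int)
for k in range(trials):
    G = U @ rng.normal(size=(N, T)) @ V.T   # matrix-normal gain draw
    counts[k] = rng.poisson(softplus_sq(tuning + 0.5 * G))

fano = counts.var(axis=0).mean() / counts.mean(axis=0).mean()
print(f"pooled Fano-like ratio = {fano:.2f} (> 1: shared-gain overdispersion)")
```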
Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide polymorphism (SNP) data. We evaluated LiT-G2P on a panel of diverse grape accessions, genotyped with SNP markers and measured for phenotypes across two consecutive years. Target phenotypic traits include leaf hair density and trichome density of grapevines. Across both single-year and cross-year testing scenarios, LiT-G2P consistently improves prediction performance compared with baseline models. For hair density, LiT-G2P achieves the lowest error in both single-year and cross-year evaluations, with RMSEs of 0.469 and 0.454, respectively, while maintaining strong tolerance accuracies of 79.2% and 74.6%, respectively. For trichome density, LiT-G2P also presents the best overall G2P performance. In addition, we extract model-prioritized SNPs from attention weights and apply genotype-stratified analysis to provide interpretable candidate markers for downstream validation. These results demonstrate that integrating stable additive effects with learned interaction patterns can enhance cross-year robustness and support practical SNP-based predictive modeling for genomic selection.
In the early stages of development, Drosophila melanogaster embryos possess very fast and well-coordinated cell cycles. In the cell cycle, CDK activity is essentially regulated by the binding of CDK and CycB to form an active complex, and by the removal of inhibitory phosphorylation of CDK via the phosphatase CDC25 and its addition via the kinase Wee1. We develop a mathematical model for the embryonic cell cycle which is biochemically sound and which can be rigorously analysed after a model reduction. We show that there exists a region in the parameter space where the model describes oscillations. We then focus on the role of two parameters: the CycB synthesis rate and the activation coefficient of APC. Our main biological hypothesis is that the first is responsible for the experimentally observed period lengthening over the first 14 cycles, and this hypothesis is supported by numerical simulations of our model: if the CycB synthesis rate is made time-dependent with prescribed dynamics, our simulations qualitatively reproduce the behavior of experimental data reported in the literature.
Brain-DNN alignment is usually assessed through stimulus-level correspondence or stimulus-set geometry. Inspired by category theory, we operationalize a different question: do brain and model preserve the same candidate transformations among stimuli? We formalize this as approximate naturality: if a proxy-defined stimulus change is propagated through the brain side and then translated to the model side, the result should match translating first and then propagating, so that the naturality square approximately commutes. We quantify deviations from commutativity by a Naturality Violation Score (NVS) normalized to a permutation null, shifting alignment from per-stimulus sameness to preservation of structure under an explicitly chosen comparison map. As a proof of concept, a controlled five-factor synthetic setting shows that NVS separates complementary alignment failures that aggregate object- and geometry-level scalars cannot resolve. Applied to fMRI responses from the GOD dataset (5 subjects), 3 vision DNNs, and 3 World-Model proxy embeddings, the axis-resolved analysis reveals a hierarchy crossover: semantic axes align most strongly toward HVC and deeper DNN layers ($\mathrm{NVS}^{\mathrm{animacy}} = 0.39$ vs 0.52 for the next-best axis and 1.0 for the permutation-null baseline), whereas low- and mid-level visual axes align toward earlier visual cortex and shallower layers. Supporting analyses (a 15-axis appendix atlas, dissociation tests against RSA/CKA and encoding/decoding accuracy, and a W-less anchor-ablation control) confirm that the alignment is selective over candidate morphism families rather than uniform. NVS thereby turns brain-DNN comparison into a test of jointly preserved candidate transformations, relative to an explicit proxy space and permutation null, and opens a path to richer proxy spaces and controlled world-side transformations.
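A toy rendering of the naturality square, with all three maps (brain-side propagation, model-side propagation, comparison map) estimated by least squares on synthetic paired responses. Because the synthetic model is built to exactly mirror the brain side, the square should commute and the permutation-normalized score should be near zero. This is a loose illustrative reading, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, db, dm = 200, 30, 40                      # stimuli, brain dim, model dim

# Synthetic paired responses before/after a proxy-defined stimulus change
Xb = rng.normal(size=(n, db))                # brain, before
R = np.eye(db) + 0.1 * rng.normal(size=(db, db))
Xbt = Xb @ R                                 # brain, after the change
Wt = rng.normal(size=(db, dm)) / np.sqrt(db)
Xm, Xmt = Xb @ Wt, Xbt @ Wt                  # a model that mirrors the brain

ls = lambda A, B: np.linalg.lstsq(A, B, rcond=None)[0]
Pb = ls(Xb, Xbt)    # brain-side propagation of the candidate transformation
Pm = ls(Xm, Xmt)    # model-side propagation
W  = ls(Xb, Xm)     # comparison (translation) map, brain -> model

path1 = Xb @ Pb @ W          # propagate through the brain, then translate
path2 = Xb @ W @ Pm          # translate first, then propagate in the model
raw = np.linalg.norm(path1 - path2)

# Permutation null: break the stimulus correspondence between the two paths
null = np.mean([np.linalg.norm(path1[rng.permutation(n)] - path2)
                for _ in range(200)])
print("NVS-like score:", raw / null)         # near 0: the square commutes
```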
Understanding how neural population responses represent sensory information is a central problem in systems neuroscience. One approach is to define a representational geometry on stimulus space in which distances reflect how reliably stimuli can be distinguished from neural activity. However, different constructions of these distances can lead to qualitatively different conclusions about the neural code. Here, we show that a unique Riemannian representational geometry emerges from first principles governing how distances contract as stimulus resolution is lost through coarse-graining. This results in a multi-scale extension of the Fisher information metric, capturing encoding structure from fine stimulus details to coarse global distinctions. The resulting geometry is exactly related to the mutual information encoded by the population: well encoded stimulus directions - those contributing more to mutual information - are expanded, whereas poorly encoded directions are contracted. The metric tensor can be estimated using diffusion models, making the framework practical for large neural populations and high-dimensional stimuli. Applied to visual cortical responses to natural images, the eigenvectors of the metric tensor identify stimulus variations that contribute most to information transmission, yielding interpretable features that are robust to modelling choices. Together, these results provide a principled, information-theoretic framework for characterising neural population codes.
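For reference, the classical object that this construction extends is the Fisher information metric on stimulus space: for a population response distribution $p(r \mid s)$ over stimuli $s$, the base metric is the standard expression below (the paper's multi-scale, coarse-graining extension is not reproduced here).

```latex
g_{ij}(s) \;=\; \mathbb{E}_{p(r \mid s)}\!\left[
  \frac{\partial \log p(r \mid s)}{\partial s_i}\,
  \frac{\partial \log p(r \mid s)}{\partial s_j}
\right]
```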
Higher-order interactions are increasingly recognized as a key component of ecological dynamics. However, we show that higher-order Lotka-Volterra dynamics can, in some scenarios, be accurately reproduced by effective pairwise models fitted to the same abundance time series. Consequently, higher-order interactions cannot, in general, be inferred from time-series data alone. We further identify a fundamental problem of mechanistic identifiability, whereby distinct interaction mechanisms generate nearly indistinguishable dynamics, potentially leading to accurate yet misleading ecological interpretations. Our results highlight the need to complement time-series data with additional ecological information to infer interaction structure reliably.
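A self-contained illustration of the identifiability problem described above: simulate Lotka-Volterra dynamics with three-way (higher-order) terms, then fit a purely pairwise model to per-capita growth rates extracted from the same time series; in this regime the pairwise fit is nearly perfect. Parameter scales are arbitrary choices for the sketch.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(2)
S = 4
r = rng.uniform(0.5, 1.0, S)                           # intrinsic growth rates
A = -0.5 * np.eye(S) + 0.05 * rng.normal(size=(S, S))  # pairwise interactions
B = 0.02 * rng.normal(size=(S, S, S))                  # three-way (higher-order) terms

def hoi_lv(t, x):
    # dx_i/dt = x_i (r_i + sum_j A_ij x_j + sum_jk B_ijk x_j x_k)
    return x * (r + A @ x + np.einsum('ijk,j,k->i', B, x, x))

sol = solve_ivp(hoi_lv, (0, 50), rng.uniform(0.1, 1.0, S),
                t_eval=np.linspace(0, 50, 500))
x, t = sol.y.T, sol.t

# Effective pairwise fit to the same series:
# per-capita growth d(log x_i)/dt ~ r_i_eff + sum_j A_ij_eff x_j
dlogx = np.gradient(np.log(x), t, axis=0)
design = np.hstack([np.ones((len(t), 1)), x])
coef, *_ = np.linalg.lstsq(design, dlogx, rcond=None)
resid = dlogx - design @ coef
print("pairwise-model R^2:", 1 - resid.var() / dlogx.var())
```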
Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context. We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.
Multiple stable states - the coexistence of two or more distinct ecological configurations under identical environmental conditions - have attracted sustained interest in ecology, yet the field still lacks a unified framework connecting ecological mechanisms to dynamical models. Here, we review empirical and theoretical approaches to multiple stable states, synthesising perspectives on stability, tipping, hysteresis, and transient dynamics, and contextualise these within a common mathematical framework. Drawing on examples of well-known ecosystem models, we highlight the central and necessary role of positive feedback loops and identify other common, unifying features of ecological systems that exhibit multiple stable states. We further discuss the relationship between stable and transient dynamics, the roles of spatial and temporal scales in feedback identification, and the implications for ecological restoration and management. We conclude with open questions and challenges for the field, including extending multistability theory to persistent-transient frameworks and harnessing emerging data-collection technologies to sharpen empirical inference.
Decoding approaches are widely used in neuroscience and machine learning to compare stimulus representations across neural systems, such as different brain regions, organisms, and deep learning models. Popular methods include decoding (perceptual) manifolds and alignment metrics such as Representational Similarity Analysis (RSA) and Dynamic Similarity Analysis (DSA), where similarity in decoding representations is interpreted as evidence for similar computation. This paper demonstrates a fundamental weakness behind this approach: it is misleading to assume that representational geometry is representative of a neuronal population as a whole, when such representations may actually be shaped by a very small subset of neurons. We show that the complementary encoding paradigm addresses this issue directly: it characterizes how neurons are organized globally in terms of their responses to a set of data, providing insight into how the decoding representation is implemented by neurons within a population. We demonstrate across experiments in biological systems and deep learning models that (i) surprisingly, similar decoding behavior and high representational alignment can arise from small, non-representative subpopulations of neurons; and critically, (ii) alignment metrics are insensitive to encoding manifold topology (how function is distributed across neurons), despite this being a key signature of differentiation across biological systems. A controlled MNIST experiment provides causal evidence: decoding metrics remain unchanged even when encoding topology is causally manipulated via the training loss. Overall, similarity in decoding behavior, as measured by classic alignment metrics, does not imply similarity in function or computation, motivating the use of encoding manifolds as a complementary tool for comparing neural systems.
Clinical dietary assessment can generate detailed but high-dimensional nutrient and food-group information that is difficult to translate quickly into counselling priorities. This paper proposes an explainable unsupervised-to-supervised machine learning framework for discovering, reproducing and interpreting dietary patterns using public UK National Diet and Nutrition Survey data. Adult participants aged 19 years and above from NDNS Years 12-15 were represented using 25 energy-adjusted nutrient and food-group features. K-means, Gaussian Mixture Models and Agglomerative Clustering were compared across k = 2-8, with stability and dietetic interpretability used alongside internal validation metrics. The selected K-means k = 4 solution identified four interpretable dietary patterns: high fat/meat and sodium, higher fibre fruit-vegetable micronutrient, high free-sugar snacks and sugary drinks, and dairy/cereal calcium-rich saturated-fat. A supervised surrogate classifier reproduced held-out cluster membership with high test performance (macro-F1 = 0.963), but was interpreted only as an explanatory surrogate rather than as an independent clinical prediction model. SHAP analysis linked predictions to dietetically meaningful drivers, suggesting potential value for dietitian-in-the-loop assessment, counselling prioritisation and follow-up monitoring.
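A compact stand-in for the unsupervised-to-supervised pipeline: CLR transformation of compositional features, cluster-count comparison across k = 2-8, and a surrogate classifier scored by macro-F1 on held-out cluster labels. Synthetic gamma-distributed features replace the 25 NDNS features, silhouette stands in for the fuller validation suite, a random forest stands in for the surrogate model, and SHAP is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.gamma(2.0, 1.0, size=(500, 25))          # stand-in for 25 dietary features

# CLR (centered log-ratio) transform, standard for compositional data
clr = np.log(X) - np.log(X).mean(axis=1, keepdims=True)

# Compare k = 2..8 by an internal validity metric (here: silhouette)
scores = {k: silhouette_score(clr, KMeans(k, n_init=10, random_state=0)
                              .fit_predict(clr)) for k in range(2, 9)}
k_best = max(scores, key=scores.get)
labels = KMeans(k_best, n_init=10, random_state=0).fit_predict(clr)

# Surrogate classifier: reproduce cluster membership on held-out data
Xtr, Xte, ytr, yte = train_test_split(clr, labels, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
print("macro-F1:", f1_score(yte, clf.predict(Xte), average='macro'))
```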
Designing functional protein sequences that satisfy multiple desired properties is a core research focus of protein engineering. Prior methods are often incapable of, or inefficient at, handling numerous and often conflicting properties. We propose Multi-Property Protein Diffusion (MP2D), a unified framework for multi-objective protein sequence optimization that integrates conditional discrete diffusion with constrained Monte Carlo tree search (MCTS) and global iterative refinement. MP2D formulates diffusion denoising as a constrained sequential decision-making process and employs MCTS to explore diverse denoising trajectories guided by Pareto-based rewards. A global iterative refinement strategy further enables repeated remasking and re-optimization of candidate sequences, while a dynamic Pareto constraint prevents candidate bloat and maintains balanced trade-offs across objectives. We evaluate MP2D on two challenging multi-objective protein design tasks: antimicrobial peptide and protein binder optimization, involving four to five conflicting properties. Experimental results demonstrate that MP2D consistently outperforms existing multi-objective baselines, achieving robust and balanced improvements across all objectives without retraining generative models. These results highlight MP2D as a practical and scalable solution for multi-objective functional protein design.
The study of shapes is one of the most fundamental problems in the life sciences. Although numerous methods have been developed for the morphometry of planar biological shapes over the past several decades, most of them focus solely on either the outer silhouettes or the interior features of the shapes without capturing the coupling between them. Moreover, many existing shape mapping techniques are limited to establishing correspondence between planar structures without further allowing for the quantitative analysis or modelling of shape changes. In this work, we introduce FDA-QC, a novel planar morphometry method that combines functional shape data analysis (FDA) techniques and quasi-conformal (QC) mappings, taking both the boundary and interior of the planar shapes into consideration. Specifically, closed planar curves are represented by their square-root velocity functions and registered by elastic matching in the function space. The induced boundary correspondence is then extended to the entire planar domains by a quasi-conformal map, optionally with landmark constraints. Moreover, the proposed FDA-QC method naturally leads to a unified framework for shape morphing and shape variation quantification. We apply the FDA-QC method to various leaf and insect wing datasets, and the experimental results show that the proposed combined approach captures morphological variation more effectively than purely boundary-based or interior-based descriptions. Altogether, our work opens a new avenue for understanding the growth and form of planar biological shapes.
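A small sketch of the square-root velocity function (SRVF) representation used here for the boundary curves, with the standard identity that the integral of the squared SRVF equals the curve's arc length as a sanity check. The discretization and example curve are illustrative.

```python
import numpy as np

def srvf(curve, t):
    """Square-root velocity function q(t) = c'(t) / sqrt(|c'(t)|), curve: (n, 2)."""
    dc = np.gradient(curve, t, axis=0)                  # velocity c'(t)
    speed = np.linalg.norm(dc, axis=1)
    return dc / np.sqrt(np.maximum(speed, 1e-12))[:, None]

# Closed example curve: an ellipse, uniformly parameterized on [0, 2*pi)
t = np.linspace(0, 2 * np.pi, 400, endpoint=False)
curve = np.stack([2 * np.cos(t), np.sin(t)], axis=1)
q = srvf(curve, t)

# Sanity check: the integral of |q|^2 equals the curve's arc length
dt = t[1] - t[0]
speed = np.linalg.norm(np.gradient(curve, t, axis=0), axis=1)
print((q ** 2).sum(axis=1).sum() * dt, "vs length", speed.sum() * dt)
```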
The origin of life is often framed primarily as a chemical problem, yet life's defining feature is evolution. Advances in geochemistry, prebiotic chemistry, and molecular biology have produced diverse scenarios for the emergence of genomes, metabolism, and cellular compartments on the early Earth, but most of these models lack a population-genetics framework. Here, we argue that origin-of-life research must expand from asking simply how life began to exploring how it evolved from pre-biological systems. Synthesizing evidence from comparative genomics, phylogenetics, biochemistry, and geoscience, we emphasize that the last universal common ancestor (LUCA) was already a complex, ecologically adapted population far removed from the starting point of life, implying a deep pre-LUCA evolutionary history. We highlight how population genetics, ecology, and synthetic biology can constrain origin-of-life scenarios by making explicit the roles of selection, drift, mutation, horizontal gene transfer, parasites, and compartmentalization in shaping early communities. Finally, we outline an evolutionary research agenda spanning protometabolic and autocatalytic networks, protocells, the emergence of translation, and the transition to DNA genomes, in which qualitative models can now be buttressed and formalized by evolution-driven hypotheses subject to testing using theory and laboratory experiments, including those with synthetic cells.
We study the geometry of the mean fitness surface of replicator systems and its relationship to evolutionary trajectory dynamics. Using the symmetric--antisymmetric decomposition of the fitness landscape matrix, we derive an explicit formula for the rate of change of mean fitness and establish necessary conditions for its monotonicity along trajectories. In general, replicator trajectories do not reach the maximum of the fitness surface, even in the presence of a unique asymptotically stable equilibrium. We characterise, in terms of the symmetric and antisymmetric parts of the fitness matrix, the precise conditions under which an equilibrium coincides with a local extremum of the fitness surface. Circulant matrices are identified as a natural and nontrivial class satisfying these conditions. We establish a two-way connection between fitness surface maxima and evolutionarily stable states: evolutionary stability implies a local fitness maximum, and the converse holds under the identified structural conditions. When the unique asymptotically stable equilibrium is a local maximum, it is evolutionarily stable and realises the global maximum of the fitness surface; an unstable equilibrium forces the global maximum to the boundary of the simplex. The framework is extended to general Lotka--Volterra systems, where an analogue of mean fitness is shown to share the same extremal properties. Results are illustrated through six examples spanning autocatalytic and hypercyclic replication, a parametric family exhibiting Andronov--Hopf bifurcation and heteroclinic cycles, and the Eigen quasispecies model.
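A numerical illustration of the symmetric-antisymmetric decomposition discussed above: along replicator trajectories, mean fitness $x^\top A x$ is nondecreasing when the payoff matrix is symmetric (Fisher's fundamental theorem), while adding an antisymmetric part can break monotonicity. The random matrices here are illustrative, not one of the paper's six examples.

```python
import numpy as np
from scipy.integrate import solve_ivp

def replicator(A):
    def rhs(t, x):
        f = A @ x
        return x * (f - x @ f)        # xdot_i = x_i[(Ax)_i - x^T A x]
    return rhs

rng = np.random.default_rng(4)
M = rng.normal(size=(4, 4))
S = 0.5 * (M + M.T)                   # symmetric part
K = 0.5 * (M - M.T)                   # antisymmetric part

x0 = np.full(4, 0.25)                 # interior point of the simplex
for name, A in [("symmetric S", S), ("S + antisymmetric K", S + K)]:
    sol = solve_ivp(replicator(A), (0, 30), x0, t_eval=np.linspace(0, 30, 300))
    phi = np.einsum('it,ij,jt->t', sol.y, A, sol.y)   # mean fitness x^T A x
    # For symmetric A the increments should be >= 0 up to numerical error;
    # the antisymmetric part can make mean fitness non-monotone.
    print(name, "min increment of mean fitness:", np.diff(phi).min())
```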
Computational cognitive models discovered using large language models have so far relied solely on behavioral data. However, it is well known that models produced from the behavioral trajectory alone are typically under-determined. In this work, we explore the use of Think Aloud traces as an additional data constraint during automated model discovery. When applied to the domain of risky decision-making, we find that the models discovered with think-aloud traces achieve significantly improved predictive performance on held-out data. Additionally, we find that for the majority of participants (69.4\%) the discovered models belong to different structural classes than those discovered from behavior alone; specifically, the dominant class shifts from explicit-comparator toward integrated-utility models. These results suggest that process-level language data not only improve model fit, but also systematically reshape the structure of the discovered cognitive models, enabling the identification of mechanisms that are not recoverable from behavior alone.
Failing to account for ecological processes such as dispersal and connectivity when modeling distributions can lead to biased inference about environmental drivers and reduced predictive performance. Spatial dynamic occupancy models are promising to study range dynamics while accounting for dispersal and connectivity, but they currently rely on restrictive formulations of the colonization process, and computational constraints prevent their application at large spatial scales. Here, we propose a process-based dynamic occupancy model to study the distribution of range-expanding species while accounting for connectivity and effects of the environment. We introduce a formulation based on dispersal-pressure that provides a flexible and ecologically interpretable representation of the colonization process, and develop a computational approach based on sparse distance matrices that enables its application to national and transnational scales. We conducted a simulation study that showed unbiased parameter estimation across various ecological scenarios. We also applied our model to two range-expanding carnivores offering complementary insights: the grey wolf and the Eurasian otter. Our model revealed contrasting colonization dynamic, with wolves primarily constrained by altitude and forest cover while otters where only marginally affected by the environment, suggesting that their distribution is limited by dispersal history rather than habitat preferences. By explicitly disentangling the influence of dispersal and environment on distributions, our model provides better insight into occupancy-environment relationships under non-equilibrium conditions, and help identifies what limits species distributions. In light of the increasing availability of large-scale biodiversity data, our framework offers opportunities to study range dynamics using mechanistic approaches across entire landscapes.
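A sketch of one dispersal-pressure colonization step built on a sparse distance matrix. The exponential kernel, the distance cutoff, and the complementary log-log-style link from pressure to colonization probability are common modeling choices assumed here, standing in for the paper's exact formulation.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(5)
n = 2000
coords = rng.uniform(0, 100, size=(n, 2))          # site coordinates (km)
z = rng.random(n) < 0.05                           # currently occupied sites

# Sparse dispersal structure: keep only pairs within a cutoff distance
cutoff, alpha = 15.0, 5.0                          # km; alpha = kernel scale
rows, cols, vals = [], [], []
for i in range(n):
    d = np.linalg.norm(coords - coords[i], axis=1)
    near = np.flatnonzero((d < cutoff) & (d > 0))
    rows += [i] * len(near)
    cols += near.tolist()
    vals += np.exp(-d[near] / alpha).tolist()      # exponential dispersal kernel
K = csr_matrix((vals, (rows, cols)), shape=(n, n))

# Dispersal pressure on each site, mapped to a colonization probability
pressure = K @ z.astype(float)
beta = 0.8                                         # pressure-to-risk scaling
p_col = 1.0 - np.exp(-beta * pressure)
z_next = z | (rng.random(n) < p_col)               # occupied sites persist here
print("occupied:", int(z.sum()), "->", int(z_next.sum()))
```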
T-cell receptor (TCR) interactions with antigenic peptides underpin adaptive immunity and are pivotal for personalized immunotherapy and vaccine development. Despite recent progress, computational modeling of TCR-peptide specificity remains challenging due to data scarcity, complex sequence dependencies, and the absence of standardized evaluation frameworks. To systematically address these issues, we introduce TCRTransBench, a comprehensive benchmark for bidirectional TCR-peptide sequence generation tasks. Specifically, we define two sequence-to-sequence (seq2seq) tasks: generating antigenic peptides from TCR sequences (TCR2PEP) and generating TCR sequences from antigenic peptides (PEP2TCR). Our framework provides a rigorously curated, MHC-free dataset comprising tens of thousands of validated TCR-peptide pairs, along with diverse evaluation metrics that integrate computational efficiency, sequence accuracy, and biological plausibility. Extensive benchmarking across representative neural architectures, including recurrent, convolutional, and transformer-based models, reveals key trade-offs among performance metrics, highlighting the effectiveness of transformers in capturing intricate biological interactions and the necessity of biologically informed evaluation criteria. TCRTransBench establishes standardized tasks, datasets, and evaluation protocols, laying a robust foundation for future computational advances in immunological sequence modeling and therapeutic protein design.
Cross-frequency interactions are fundamental brain mechanisms for integrating information across temporal scales. However, accurate identification of these couplings is hindered by complex multi-frequency nonlinearities and by spurious, zero-lag artifacts caused by volume conduction. To our knowledge, conventional metrics lack a robust framework to characterize genuine interactions among multiple time series where a frequency of interest $f_N$ arises from the combination of $N-1$ components such that $f_N = \sum_{i=1}^{N-1} f_i$. We introduce a general family of antisymmetric cross-polyspectral indices designed to quantify these harmonic dependencies while being intrinsically robust to instantaneous mixing. We derive the theoretical properties of these quantities and validate them through simulations of cubic nonlinearities. As a proof of concept, we apply the indices to empirical EEG recordings; the results reveal significant higher-order dependencies that elude standard analytical approaches. We further discuss how these indices can inform novel, personalized multi-site transcranial magnetic stimulation (mTMS) protocols by enabling the selective monitoring and modulation of specific multi-frequency network interactions.
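For concreteness, the familiar $N = 3$ instance of such harmonic dependencies is the cross-bispectrum; one plausible antisymmetrized member of such a family (an assumption here, not the paper's exact definition) subtracts the channel-swapped term so that contributions from instantaneous mixing cancel:

```latex
B_{xyz}(f_1, f_2) = \mathbb{E}\!\left[\hat{x}(f_1)\,\hat{y}(f_2)\,\hat{z}^{*}(f_1 + f_2)\right],
\qquad
B^{\mathrm{A}}_{xyz}(f_1, f_2) = B_{xyz}(f_1, f_2) - B_{yxz}(f_1, f_2)
```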
The genetic code and symbolic language create unstable transitions that make advanced technological societies vanishingly rare.
The Great Filter hypothesis proposes that the emergence of technological societies capable of interstellar travel depends on a small number of exceptionally hard and highly improbable steps. Traditional versions of this hypothesis enumerate such "hard steps" along the trajectory from inanimate matter to complex technological societies, but diverge in their explanations for why these particular steps should be so improbable. The theory of Major Evolutionary Transitions also faces challenges in identifying which steps should be considered universally "hard" across different evolutionary pathways. In contrast, we argue that two deeply structural obstacles dominate the evolutionary landscape: the coding threshold associated with the origin of the genetic code, and the language threshold associated with the emergence of symbolic communication. We examine the developmental precursors of both transitions and analyze the underlying algorithmic bottlenecks: points at which evolving systems separate code from function, while entangling them within information hierarchies. Using a game-theoretic analysis of coupled signaling and coordination dynamics, we then argue that the corresponding multichannel games exhibit unstable equilibria that render the transitions intrinsically difficult. We conjecture that the so-called Great Filter is best understood not as a sequence of isolated improbable events, but as a nested structure of tangled information hierarchies. Under this interpretation, the rarity of advanced societies follows from the difficulty of crossing these coding thresholds in a competitive noisy environment. This perspective reframes the Great Filter as an algorithmic property of evolving systems, highlighting why only a vanishingly small fraction of life may ever traverse the path toward technological societies capable of interstellar travel.
We introduce PhenixCraft, a fully automated pipeline for building atomic models from cryo-EM density maps. By integrating AlphaFold predictions, we enhance the map-segmentation step in Phenix during model building, addressing challenges posed by noise and artifacts that traditionally hinder this step. Our results demonstrate PhenixCraft's superior performance in TM-score and sequence accuracy, overcoming the limitations and inefficiencies of traditional model building with Phenix.
Deep convolutional neural networks (DCNNs) have rivaled humans on many visual tasks, yet they remain vulnerable to near-imperceptible perturbations generated by adversarial attacks. Recent work shows that aligning DCNN representations with human visual cortex activity improves adversarial robustness, but the mechanisms driving this advantage are unclear. One hypothesis suggests that neural alignment confers robustness by biasing models away from brittle high-frequency details and towards the low spatial frequencies (LSF). However, recent work shows that human object recognition critically depends on a narrow, mid-frequency "human channel". Interestingly, this band was partially preserved in prior LSF-focused studies. Here, we investigate whether a spectral bias towards the LSF or the human channel is the primary driver of the adversarial robustness observed in neurally aligned DCNNs. We first show that DCNNs aligned to higher-order regions of the human ventral visual stream systematically increase reliance on both LSF and the human channel. However, directly steering DCNNs towards these bands revealed a clear dissociation. Biasing models towards the human channel, either alone or together with LSF, does not improve robustness and even impairs it. LSF bias produced some robustness gains, but such improvements are modest despite inducing much larger shifts in spatial-frequency reliance than neurally aligned models. Spatial-frequency-biased models overall show little, if any, increase in similarity to human neural representational geometry. Together, our results suggest that altered spatial-frequency reliance is likely an emergent property of learning more human-like representations rather than the primary mechanism by which neural alignment confers adversarial robustness, and motivate the need for future research examining representational properties beyond spatial-frequency profiles.
Spatial environmental variation can either amplify or suppress the fixation of beneficial mutants in structured populations, yet the interplay of ecological factors and spatial structure in determining which outcome occurs remains theoretically unresolved. Here, we develop a unified framework for selection on lattice graphs with environmental heterogeneity, in which mutant and resident fitness depend on the local environmental state. Across three common classes of genotype-environment interactions and a wide range of spatial arrangements of environmental states, we identify two governing principles. Genotype specificity determines the direction of the effect: heterogeneity amplifies selection when it modulates resident fitness, but suppresses selection when it modulates mutant fitness, with genotype-symmetric modulation producing weaker amplification. Spatial arrangement determines the magnitude: intermixed versus clustered environments tune the strength of amplification or suppression without reversing the direction of the effect. Together, these principles reconcile disparate theoretical results and provide predictive criteria for adaptation in heterogeneous landscapes, from microbial communities to somatic evolution and cancer.
Biological systems operate under simultaneous energetic and informational constraints, yet direct evidence that such constraints shape real metabolic networks is limited. The Network-Weighted Action Principle predicts that networks under these constraints should organize toward high modularity. We tested this prediction in marine microbiome metabolic networks reconstructed from Tara Oceans metagenomes using two complementary approaches. Composite metrics of protein-deployment efficiency and functional-repertoire complexity (n=10) failed under causal-inference diagnostics, with apparent structure dominated by shared-component bias. In contrast, network modularity (n=7) was high ($Q \approx 0.987$), but this value was shown to arise from sparsity alone. The biologically meaningful signal is the excess over null models: modularity exceeded configuration-model, label-permutation, and bipartite-incidence nulls by $\Delta Q \approx$ 0.15-0.40 (p < 0.001), with the largest effect under the bipartite-incidence control. Fine-grained communities recovered by the network partition are not arbitrary: 25% recur across samples, and the most consistent modules map to known functional units, including enzyme subunits, biosynthetic sequences, and transporter complexes. Together, these results show that modularity excess - rather than absolute modularity - is the appropriate signature of biological organization, and that such excess is consistent with cost-minimization principles operating at the scale of natural metabolic networks.
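A small demonstration of the modularity-excess logic on a synthetic modular graph, using a degree-preserving (configuration-model-style) rewiring null. The planted-partition stand-in, detection algorithm, and number of null replicates are illustrative choices.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community

# A sparse graph with planted modules, standing in for a metabolic network
G = nx.planted_partition_graph(5, 30, p_in=0.15, p_out=0.005, seed=0)
Q_obs = community.modularity(G, community.greedy_modularity_communities(G))

# Degree-preserving null: rewire edges, then re-detect communities
Q_null = []
for _ in range(20):
    H = G.copy()
    nx.double_edge_swap(H, nswap=5 * H.number_of_edges(), max_tries=int(1e6))
    Q_null.append(community.modularity(H, community.greedy_modularity_communities(H)))

# The biologically meaningful quantity is the excess over the null, not Q itself
print(f"Q_obs = {Q_obs:.3f}, Q_null = {np.mean(Q_null):.3f}, "
      f"excess = {Q_obs - np.mean(Q_null):.3f}")
```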
It beats linear encoding models several-fold on fMRI data and matches decades of lab findings through virtual experiments on vision and language.
Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, which prevents a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses to novel stimuli, tasks and subjects, outperforming traditional linear encoding models with several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent features, TRIBE v2 reveals the fine-grained topography of multisensory integration. These results establish artificial intelligence as a unifying framework for exploring the functional organization of the human brain.
Tests on 853 compounds across 16 viral targets show ML models outperform docking, with fine-tuning lifting correlation to 0.7.
Antivirals are uniquely positioned to be deployed quickly during a new outbreak, especially when repurposed from approved drugs. Yet there are no FDA-approved antivirals for the majority of viral families with pandemic potential. Here we lay out the case for investing in technologies and techniques for antiviral drug discovery and designing antiviral combinations. We present a survey of open source datasets and computational tools for in silico antiviral drug discovery, with a particular focus on the latest AI-based systems and docking tools. We then present our custom dataset of 43,005 viral protein-ligand binding measurements that we curated from BindingDB and other sources. Importantly, we found that 31% of viral protein binding data in BindingDB required polyprotein sequences to be carefully split before the data were suitable for training or testing ML models. Using our custom dataset we fine-tuned the DrugFormDTA binding affinity prediction model (Khokhlov et al. 2025). We then benchmarked 15 open-source binding affinity prediction tools on a custom test set of 853 antiviral compounds spread across 16 different protein targets from 10 virus species. Models tested include Boltz-2, GNINA, FlowDock, Interformer, AutoDock-GPU, and others. We found that Boltz-2 and DrugFormDTA ranked highest overall among ML-based approaches, and GNINA did best among docking approaches, with notable variance across specific viral proteins. Fine-tuning DrugFormDTA on our custom cleaned antiviral dataset boosted performance from $r=0.5$ to $r=0.7$. As part of this work we also compiled a library of approved drugs and a comprehensive list of investigational and approved antiviral drugs that can be viewed at https://antivirals-database.radvac.org. Together, this work provides a foundation for future work towards new tools and platforms for rapid drug repurposing and rapid design of antiviral combinations.
Free energy geometry arises bottom-up in recurrent circuits driven by the world, then fixed by plasticity into autonomous attractors.
The free energy principle casts perception as variational inference, but its biological implementation remains underspecified. In particular, the generalized-coordinate formalism should not be read as a literal claim that neurons compute arbitrary Taylor expansions. This paper argues that generalized synchronization provides the missing bottom-up mechanism. A contractive recurrent circuit driven by structured sensory input can synchronize to the driving dynamics. Under generic embedding conditions developed in the reservoir-computing literature, the resulting synchronization map can embed the low-dimensional sensory manifold into neural state space. Thus, the geometry predicted by the free energy principle need not be imposed from above by an explicitly Bayesian neural calculus; it can arise from ordinary recurrent dynamics driven by the world.
I then propose a developmental extension. Hebbian plasticity acting on the correlations generated by sensory-driven synchronization may crystallize the embedded manifold into recurrent connectivity, yielding an autonomous continuous attractor network when the required fixed point exists. On this view, mature head-direction, grid-cell, and stimulus-driven visual manifolds are not genetically prespecified templates, but developmental products of three interacting processes: dynamical contraction, generalized synchronization, and correlation-based plasticity. The synthesis links the free energy principle, reservoir-computing embedding theorems, and contraction-theoretic models of Hebbian recurrent networks. It also yields testable predictions about dimensional thresholds for topological recovery, developmental sensitivity to plasticity, and the dependence of attractor geometry on input statistics. The central open problem is whether the Hebbian fixed point exists and preserves the embedding quality of the synchronization manifold.
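A minimal numerical check of the driving claim above: two copies of the same contractive recurrent network, started from different states and driven by the same structured input, converge to the same trajectory, which is the operational signature of generalized synchronization. Scaling the spectral radius below one is the usual reservoir-computing heuristic for contraction, not a guarantee.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200
W = rng.normal(size=(N, N)) / np.sqrt(N)
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.8 (heuristic)
w_in = rng.normal(size=N)

def step(r, u):
    """One step of a driven tanh reservoir."""
    return np.tanh(W @ r + w_in * u)

# Structured, low-dimensional sensory drive (a quasi-periodic 1-D signal)
ts = np.arange(3000)
u = np.sin(0.1 * ts) + 0.3 * np.sin(0.023 * ts)

# Two different initial conditions, identical drive
r1, r2 = rng.normal(size=N), rng.normal(size=N)
gap = []
for ut in u:
    r1, r2 = step(r1, ut), step(r2, ut)
    gap.append(np.linalg.norm(r1 - r2))

# Convergence of the gap signals generalized synchronization: the state becomes
# a function of the input history, not of the initial condition.
print("initial gap:", gap[0], "final gap:", gap[-1])
```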
Ancestral sequence reconstruction (ASR) aims to infer extinct protein sequences at internal nodes of a phylogenetic tree. Classical ASR methods are typically based on continuous-time Markov substitution models, but they treat sites largely independently and handle insertions and deletions only weakly or not at all. We introduce a tree-conditioned edit-flow model for variable-length ASR. Given two descendant sequences and their branch distances to a shared ancestor, the model reconstructs the ancestor through paired bidirectional edit trajectories constrained to agree on a common ancestral state. On a benchmark of experimentally evolved sequences with only context-independent substitutions, the model does not match the accuracy of the best classical method, yet still achieves reasonable performance despite being trained on natural sequences that include insertions, deletions, and substitutions. On a benchmark of natural homologous sequences with abundant insertions and deletions, the model most accurately localizes inferred evolutionary change.
Scoring functions remain the principal bottleneck in molecular docking: they routinely fail to rank near-native poses above decoys, and their composite single-score design obscures the physicochemical basis of each ranking error. We present AgenticPosesRanker, an agentic AI framework that combines six deterministic, physically grounded analysis tools (interaction fingerprinting, solvent-accessible burial, conformational strain, steric-clash detection, unsatisfied-polar-atom penalty, and chemical-identity extraction) with large-language-model (GPT-5) chain-of-thought reasoning to evaluate and rank docking poses. On a curated benchmark of ten protein-ligand systems (162 poses) balanced by construction between Smina scoring-function successes and failures, the agent achieved 50.0% best-pose accuracy, matching the design-fixed Smina baseline of 50.0% and significantly exceeding a 7.7% uniformly random baseline (p < 0.001, one-sided exact binomial test). The balanced-benchmark accuracy decomposes symmetrically: the agent retained 80% (4/5) of the Smina-success systems and recovered 20% (1/5) of the Smina-failure systems, so the aggregate 50% reflects one regression offset by one recovery rather than any net improvement over the Smina reference. Decision-attribution analysis showed high alignment between the agent's self-reported tool weights and objective metric separations of the selected pose (median $\rho$ = +0.83), consistent across correct and incorrect outcomes, localising the performance ceiling to tool-suite coverage rather than reasoning inconsistency. These results establish a methodological template for evaluating agentic AI against objective ground truth in the natural sciences and position the framework as an interpretable curation layer for late-stage pose refinement in structure-based drug design.
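Of the six deterministic tools, steric-clash detection is the simplest to make concrete. A minimal pairwise counter under assumed van der Waals radii and tolerance follows; this is a generic sketch, not the framework's actual implementation.

```python
import numpy as np

def steric_clashes(lig_xyz, prot_xyz, lig_r, prot_r, tol=0.4):
    """Count ligand-protein atom pairs closer than the sum of their van der
    Waals radii minus a tolerance (all coordinates and radii in Angstroms)."""
    d = np.linalg.norm(lig_xyz[:, None, :] - prot_xyz[None, :, :], axis=-1)
    return int((d < lig_r[:, None] + prot_r[None, :] - tol).sum())

# Toy usage: a 20-atom ligand near a 500-atom pocket, carbon-like radii
rng = np.random.default_rng(12)
lig, prot = rng.uniform(0, 10, (20, 3)), rng.uniform(0, 10, (500, 3))
print(steric_clashes(lig, prot, np.full(20, 1.7), np.full(500, 1.7)))
```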
Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules and success definitions, and reveal trade-offs between computational efficiency, success rate, and structural diversity under throughput-aware evaluation. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.
Bacterial chemotaxis has long been viewed as operating near the physical limits of sensing, as originally articulated by Berg and Purcell. Recent information-theoretic analyses challenge this view, suggesting that Escherichia coli uses only a small fraction of the information available in ligand arrival statistics to bias its motion. How should such low information efficiency be interpreted at the level of behavior? Here, I argue that chemotactic performance is shaped not only by information transmission and noise, but by the strategy of movement itself. Using simple scaling arguments and minimal models, I show how run-and-tumble chemotaxis can remain robust to noise through symmetry and temporal averaging, even when internal information processing is inefficient. Comparing bacterial and eukaryotic chemotaxis highlights how different sensing strategies convert physical limits into observable behavior. These considerations suggest that low information efficiency need not imply poor performance, but may instead reflect an evolved balance between robustness, simplicity, and function.
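In the spirit of the minimal models invoked above, a short run-and-tumble sketch shows how a crude, noisy modulation of the tumble rate by the temporal concentration change is enough to produce reliable up-gradient drift. The linear attractant profile, gains, and rates are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)
n, steps, dt, v = 500, 4000, 0.05, 1.0        # cells, steps, time step, run speed

x = np.zeros((n, 2))                           # positions; gradient along +x
theta = rng.uniform(0, 2 * np.pi, n)           # current run directions
base_rate = 1.0                                # baseline tumbles per unit time

for _ in range(steps):
    # Temporal comparison: concentration change experienced along the run,
    # for a linear attractant profile c(x) = 0.1 * x
    dcdt = 0.1 * v * np.cos(theta)
    rate = base_rate * np.clip(1.0 - 2.0 * dcdt, 0.1, None)  # run longer uphill
    tumble = rng.random(n) < rate * dt
    theta = np.where(tumble, rng.uniform(0, 2 * np.pi, n), theta)
    x[:, 0] += v * np.cos(theta) * dt
    x[:, 1] += v * np.sin(theta) * dt

print("mean displacement up-gradient:", float(x[:, 0].mean()))  # > 0
```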
We introduce and discuss a kinetic framework describing the time evolution of the statistical distributions of a population divided into the compartments of susceptible, infectious, recovered, and resistant in the presence of a microbial infection driven by susceptible-infectious interactions. Our main objective is to quantify the impact of excessive and inappropriate antimicrobial use, which accelerates the spread of resistance by enabling a fraction of infectious individuals to transition into the resistant compartment. The model consists of a system of Boltzmann-type equations capturing binary interactions between susceptible and infectious individuals, complemented by linear redistribution operators that represent recovery, the development of resistance, and reinfection processes. In the grazing collision limit, we show that this Boltzmann system is well approximated by a system of coupled Fokker-Planck equations. This limiting description allows for a more tractable analysis of the dynamics, including the characterization of the long-time behavior of the population densities. Our analysis highlights how interaction terms drive the system toward a stable equilibrium and quantifies the effects of inappropriate antimicrobial use on the distribution of resistant individuals. Overall, the results offer a multiscale perspective that bridges kinetic theory with classical epidemic modeling.
Boltzmann Machines trained on evolutionary sequence data have emerged as a powerful paradigm for the data-driven design of artificial proteins. However, the relationship between model architecture, specifically parameter density, and experimental performance remains poorly understood. Here, we investigate this relationship using the Chorismate Mutase enzyme family as a model system. We compare standard fully connected Boltzmann Machines for Direct Coupling Analysis (bmDCA) with sparse models generated via progressive edge activation (eaDCA) and edge decimation (edDCA). We identify a maximum-entropy model (meDCA) along the decimation trajectory that represents an optimal balance between constraint satisfaction and the flexibility of the probability distribution. We synthesized and tested artificial sequences from all models using an in vivo complementation assay, finding that all architectures, regardless of sparsity, generate functional enzymes with high success rates, even at significant divergence from natural sequences. Despite this functional equivalence, we demonstrate that the meDCA model samples a viable sequence space that is more than fifteen orders of magnitude larger than its low-entropy counterparts. Furthermore, comparative analyses reveal that high-entropy models systematically minimize overfitting and better capture the local neutral spaces surrounding natural proteins. These findings suggest that while various models satisfying coevolutionary statistics can generate functional sequences, high-entropy Boltzmann Machines provide a superior representation of the underlying evolutionary fitness landscape.
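For reference, the statistical energy that these DCA-style Boltzmann machines assign to a sequence has the standard Potts form, with the model probability proportional to exp(-E). Fields and couplings are random here purely for illustration; in practice they are inferred from the natural sequence alignment.

```python
import numpy as np

rng = np.random.default_rng(9)
L, q = 96, 21                                   # sequence length, alphabet (20 aa + gap)

# Fields h and couplings J of a Potts / Boltzmann-machine model
h = 0.1 * rng.normal(size=(L, q))
J = 0.01 * rng.normal(size=(L, L, q, q))
J = 0.5 * (J + J.transpose(1, 0, 3, 2))         # enforce J_ij(a,b) = J_ji(b,a)

def energy(seq):
    """Statistical energy E(a) = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    e = -h[np.arange(L), seq].sum()
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i, j, seq[i], seq[j]]
    return e

seq = rng.integers(0, q, size=L)
print("E =", float(energy(seq)))                # lower energy = higher model probability
```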
Closed-form distributions enable joint Ne inference and show that selection biases tract lengths upward.
Identity by descent (IBD) tracts and runs of homozygosity (ROH) are related concepts that refer to autozygosity in chromosome segments. However, the formal relationship between their length distributions remains to be established. Here we present a coalescent framework that unifies these two concepts within a single analytical development. Starting from a Wright-Fisher model, we derive closed-form probability density functions for IBD tract lengths and extend these to the observable distribution of ROH lengths. This is achieved by explicitly modelling the displacement of ROH limits from true recombination breakpoints to the nearest heterozygous marker site. Mutation, gene conversion, finite marker density, and variable marker heterozygosity are incorporated as parameters in the theory that link IBD tracts to ROH. We show that the chromosome segment homozygosity (CSH) statistic emerges as a special case. This enables demographic information from IBD tracts and ROHs to be combined into a framework for inferring effective population size. Finally, we incorporate the quantitative genetic theory of background selection into the IBD length distribution to show how selection introduces a systematic upward bias in apparent tract lengths. This demonstrates that no single Ne value can account for the entire IBD length distribution under selection. The application of this theory to the detection of selection signatures in the genome is illustrated using the example of the local selective sweep associated with lactase persistence in human populations.
We present A-CODE, a fully atomic unified one-stage protein co-design model that simultaneously refines discrete atom types and continuous atom coordinates. Unlike predominant two-stage methods that cascade structure design with amino acid-level sequence design, our approach is fully atomic within a unified multimodal diffusion framework, in which residue identities are inferred solely from atom-level predictions. Built upon a powerful all-atom architecture, A-CODE achieves superior designability for unconditional protein generation, outperforming all existing one-stage and two-stage design models. For binder design, A-CODE rivals and even outperforms existing state-of-the-art two-stage design models and, compared with the existing one-stage co-design model, achieves a drastic tenfold improvement in success rate on hard tasks. The inherent flexibility of our atomic formulation enables, for the first time, seamless adaptation to non-canonical amino acid (ncAA) modeling. Our fully atomic framework establishes a new, versatile foundation for all-atom generative modeling that can be naturally extended to complex biomolecular systems.
Donor-level disease classification from single-cell RNA sequencing (scRNA-seq) requires strict donor-aware cross-validation: naive pipelines that split cells randomly conflate training and test donors, inflating reported performance through pseudoreplication. We present a donor-aware benchmark evaluating three feature representations across two independent IBD cohorts: centered log-ratio (CLR) transformed cell-type composition, GatedStructuralCFN dependency embeddings, and scVI variational autoencoder latent embeddings. The cohorts are the SCP259 ulcerative colitis atlas (UC vs. Healthy, n=30 donors, 51 cell types) and the Kong 2023 Crohn's disease atlas (CD vs. Healthy, n=71 donors, 55-68 cell types across three intestinal regions).
Compartment-stratified CLR composition achieves AUROC 0.956 +/- 0.061 on SCP259; GatedStructuralCFN on the same features achieves 0.978 +/- 0.050. In the Kong cohort, CFN achieves its best performance in the colon region (0.960 +/- 0.055 after feature filtering), exceeding linear CLR (0.900 +/- 0.100), while terminal ileum classification is dominated by linear models (CatBoost CLR 0.967 +/- 0.075 vs. CFN 0.811 +/- 0.164). Cross-dataset transfer (CD->UC, four shared cell types) achieves AUC 0.833 with XGBoost CLR; the reverse direction performs at chance. CFN edge stability analysis shows that compartment-wise composition eliminates spurious unit-sum-induced instability present in global composition (Jaccard 0.026 vs. top-20 recurrence 1.0). CFN shows a consistent numerical advantage over linear models in the colon region of CD (AUROC 0.960 vs. 0.900), though no inter-method comparison reached statistical significance at n<=34 donors per region. Compartment-aware feature construction is critical for both classification performance and structural interpretability. Code: https://github.com/Jonathan-321/sfn-scrna-study
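The pseudoreplication point is easy to reproduce on synthetic data: when cell-level features carry a per-donor batch effect but no disease signal at all, a naive cell-wise split yields a badly inflated AUROC, while a donor-aware GroupKFold correctly returns chance. This sketch is not the benchmark's data; donor counts, feature dimensions, and the logistic-regression classifier are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(10)
n_donors, cells_per = 30, 200
donor = np.repeat(np.arange(n_donors), cells_per)
y_donor = np.arange(n_donors) % 2                  # balanced disease labels
y = y_donor[donor]

# Cell features carry a strong per-donor batch effect but NO disease signal
donor_effect = rng.normal(size=(n_donors, 20))
X = donor_effect[donor] + rng.normal(size=(len(donor), 20))

naive = cross_val_score(LogisticRegression(max_iter=500), X, y,
                        cv=KFold(5, shuffle=True, random_state=0),
                        scoring='roc_auc')
aware = cross_val_score(LogisticRegression(max_iter=500), X, y,
                        cv=GroupKFold(5), groups=donor, scoring='roc_auc')
print(f"naive cell-split AUROC: {naive.mean():.2f} (inflated)")
print(f"donor-aware AUROC:      {aware.mean():.2f} (~0.5, as it should be)")
```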
Lazy extraction of metadata and recordings lets the same code move from laptop tests to cluster runs without manual wrangling.
Artificial intelligence (AI) is increasingly central to understanding how the brain processes information. However, the integration of neuroscience and modern AI is bottlenecked by a fragmented software ecosystem. Current tools are siloed by recording modality and optimized for small-scale, in-memory workflows, limiting the use of massive, naturalistic datasets. Here, we introduce NeuralSet, a Python framework that efficiently unifies the processing of diverse neural recordings (including fMRI, M/EEG, and spikes) and complex experimental stimuli (such as text, audio, and video). By decoupling experimental metadata from lazy, memory-efficient data extraction, NeuralSet harmonizes standard neuroscientific preprocessing pipelines with pretrained deep learning embeddings. This approach provides a single PyTorch-ready interface that scales seamlessly from local prototyping to high-performance cluster execution. By eliminating manual data wrangling and ensuring full computational provenance, NeuralSet establishes a scalable, unified infrastructure for the next generation of neuro-AI research.
Usutu virus (USUV) is a flavivirus of the Japanese encephalitis complex transmitted between \textit{Culex} mosquitoes and birds, a transmission pattern similar to that of the West Nile virus (WNV). In Germany, the first case of USUV was detected in 2010 in mosquitoes collected in the town of Weinheim, and by 2018 the virus had spread to almost the entire country. Interestingly, the infection front exhibited a clockwise rotational spread pattern throughout the years, a pattern completely different from that of the WNV. This clockwise progression corresponded closely with the spatial temperature gradient, suggesting that warmer regions probably facilitated faster viral amplification and onward transmission. Understanding the drivers that influence the spreading patterns of arboviruses is important as it guides surveillance and implementation of control strategies. In this study, we develop a reaction-diffusion partial differential equation (PDE) model to investigate the spatial spread of USUV in Germany within an extended domain that includes some neighbouring countries (Belgium, the Netherlands, and Luxembourg), thereby capturing cross-border transmission processes. Mosquito parameters, i.e., extrinsic incubation rate, mortality and biting rates, are temperature-driven, as temperature plays an important role in the activity of mosquitoes. Our model qualitatively reproduced the main spatial trends of USUV in Germany and surrounding countries. The heterogeneous spread pattern arises from the interplay of diffusion and spatially varying temperature, which together may determine regions with higher transmission potential.
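A toy version of the mechanism described above: logistic local amplification whose rate increases with temperature, coupled to Fickian diffusion on a grid carrying a smooth spatial temperature gradient. Periodic boundaries and all parameter values are simplifications, not the paper's calibrated model.

```python
import numpy as np

NX, NY, dx, dt, D = 120, 100, 5.0, 0.1, 2.0       # grid cells, km, days, km^2/day

# Spatial temperature field with a smooth gradient (toy west-east / north-south)
Xg, Yg = np.meshgrid(np.linspace(0, 1, NX), np.linspace(0, 1, NY), indexing='ij')
temp = 14.0 + 6.0 * Xg - 4.0 * Yg                 # deg C

growth = 0.05 * np.clip(temp - 12.0, 0.0, None)   # warmer => faster amplification
I = np.zeros((NX, NY)); I[10, 50] = 1.0           # point introduction

for _ in range(2000):
    # 5-point Laplacian with periodic boundaries (a simplification)
    lap = (np.roll(I, 1, 0) + np.roll(I, -1, 0) +
           np.roll(I, 1, 1) + np.roll(I, -1, 1) - 4.0 * I) / dx ** 2
    I += dt * (D * lap + growth * I * (1.0 - I))  # diffusion + logistic reaction

print("invaded area fraction:", float((I > 0.1).mean()))
```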
Fitness landscapes provide a quantitative framework for understanding how natural selection shapes evolutionary trajectories. A central feature of these landscapes is their number of local optima, which determines whether fitness-increasing evolution can proceed towards a global optimum or become trapped on suboptimal peaks. Although multiple peaks are known to require reciprocal sign epistasis, the quantitative relationship between epistasis and number of peaks remains incompletely understood. Here, we show that for a broad class of unstructured fitness landscapes, i.e. isotropic Gaussian random fields, the expected number of local optima is determined by a single local measure of epistasis: the correlation of fitness effects. This provides a baseline prediction for the number of peaks in typical unstructured landscapes and links peak density directly to the amount of reciprocal sign epistasis. This baseline changes when epistatic interactions are structured. We show that clustering interactions within blocks of loci slightly increases the number of local optima. In contrast, strong heterogeneity between loci, where only a small subset of loci participate in epistatic interactions, causes the number of peaks to collapse. These results show that the number of local optima is governed not only by the overall strength of epistasis, but also by how epistatic interactions are distributed across the genotype space. Our framework therefore reconciles the central role of reciprocal sign epistasis with the observation that landscapes with similar amounts of epistasis can differ substantially in ruggedness, and provides a guide to the range of peak numbers expected in typical landscapes.
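A quick numerical illustration of the peak-counting question on a small genotype space: a Gaussian random field built by mixing additive and pairwise epistatic effects has exactly one local optimum when purely additive, and increasingly many as the epistatic weight grows. This construction is a convenient stand-in for the isotropic Gaussian ensembles studied above, not their exact definition.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(11)
L = 10
genos = np.array(list(product([-1, 1], repeat=L)))     # all 2^L genotypes

def count_peaks(F):
    """Count genotypes fitter than all L single-locus mutational neighbors."""
    peaks = 0
    for g in range(len(genos)):
        nbrs = g ^ (1 << np.arange(L))                 # indices of one-bit flips
        peaks += bool(np.all(F[g] > F[nbrs]))
    return peaks

for w in [0.0, 0.5, 1.0]:                              # weight on pairwise epistasis
    h = rng.normal(size=L)
    J = np.triu(rng.normal(size=(L, L)), 1)
    F = (1 - w) * genos @ h + w * np.einsum('gi,ij,gj->g', genos, J, genos)
    print(f"epistasis weight {w}: {count_peaks(F)} local optima")
```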
In biological systems, neural circuits compute through directed, short-latency interactions whose effects unfold across multiple time scales and behavioral contexts. We address the problem of inferring these local, lag-specific interactions from sampled neural population activity under varying stimuli, without assuming a parametric form for the underlying dynamics. Our approach leverages denoising score models by estimating joint-window scores over consecutive activity snapshots (i.e., brain states) and converting these scores into calibrated, directed edge tests via cross-block score products. The key insight is that these products recover the Jacobian of the transition map between brain states under nonlinear dynamics. To cleanly separate lag-specific effects, we introduce minimal multi-block windows that condition on intermediate time points, avoiding the omitted-lag bias inherent in pairwise analyses. The resulting method, Score--Block Time Graphs (SBTG), identifies lag-specific directed interactions in sampled neuronal population data. We specifically apply SBTG to whole-brain C. elegans calcium imaging data to recover lag-specific circuit structure not resolved by current methods, including improved alignment with independent connectomes, cell-type-specific temporal organization, and neuromodulatory profiles consistent with known receptor kinetics. These findings highlight the potential for SBTG to serve as a practical ``AI for science'' tool by turning high-dimensional neural population recordings into statistically testable circuit hypotheses.
Gene programs co-activate within cells, but existing single-cell methods either treat programs independently or require experimental perturbation data to model their interactions. We introduce ORBIT, a self-supervised transformer that learns asymmetric dependencies among gene programs from observational single-cell RNA-sequencing data alone, quantifying how strongly each program influences every other program. The key mechanism is an intervention-consistent training objective: the model learns each program's directional influence on every other program by predicting how the others change when that program is removed, yielding attention weights that reflect asymmetric influence rather than symmetric co-occurrence. Applied to 191,890 prefrontal cortex nuclei across three pathway vocabularies, ORBIT recovers co-activation structure consistent with established Alzheimer's disease vulnerability signatures, identifies cell-type-specific rewiring invisible to differential expression, and achieves 0.984 macro F1 on cell-type classification from 220 pathway scores, which is within 0.3 points of a state-of-the-art classifier using all 22,088 genes.
Simultaneous reconstruction with sparse change penalty stabilizes normal tissue while keeping lesion signals intact in stroke and MS data.
Quantitative susceptibility mapping (QSM) has been increasingly applied in longitudinal studies of neurodegenerative diseases and aging to assess temporal alterations in brain iron and myelin. The accuracy of such investigations depends on the repeatability and sensitivity of measurements. However, the ill-posed nature of the QSM processing steps makes the reconstruction vulnerable to background field changes, head orientation changes, noise, and imperfect registration, which compromise repeatability and sensitivity and hinder reliable detection of true changes. To address these limitations, we propose Longitudinal QSM, a simultaneous reconstruction framework that jointly estimates susceptibility maps across time points while enforcing spatial sparsity of temporal changes. The method was evaluated through simulations and in-vivo experiments and compared with conventional reconstruction methods. Longitudinal QSM consistently reduced inter-scan variability and accurately recovered simulated lesion changes. Application to stroke patient and multiple sclerosis patient data further demonstrated that the framework stabilizes non-lesion variability while preserving lesion-related temporal changes. This approach offers a promising tool for monitoring subtle temporal changes in brain iron and myelin in various neurodegenerative diseases as well as throughout aging and development.
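As a sketch of what such a joint reconstruction objective can look like (the method's exact data-fidelity weighting and regularizer are not given here, so this is an assumed form): with $\chi_t$ the susceptibility map at time point $t$, $f_t$ the measured local field, $d$ the dipole kernel, $W_t$ a noise weighting, and $\lambda$ controlling spatial sparsity of the temporal changes,

```latex
\min_{\{\chi_t\}} \;\sum_{t=1}^{T} \bigl\| W_t \left( d \ast \chi_t - f_t \right) \bigr\|_2^2
\;+\; \lambda \sum_{t=1}^{T-1} \bigl\| \chi_{t+1} - \chi_t \bigr\|_1
```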