q-bio.GN — Pith

q-bio.GN 2026-05-13 Recognition

Genome embeddings predict microbiome abundances for novel species

Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

Set-aggregated representations from genomic language models generalize better than classical bioinformatics methods on unseen genomes.

Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach against classical bioinformatics approaches and show improved generalization on novel genomes. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and characterize the differences between GLM embedding choices.
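The set-aggregation idea can be illustrated with a minimal sketch. The abstract does not say which aggregator SAGE uses, so mean pooling is an assumption here, and the function name `aggregate_embeddings` and the toy vectors are hypothetical. The point is only that a pooled community vector does not depend on the order in which member genomes are listed.

```python
from typing import Sequence


def aggregate_embeddings(genome_embeddings: Sequence[Sequence[float]]) -> list[float]:
    """Mean-pool per-genome vectors into one community-level vector.

    Mean pooling is an assumed aggregator, chosen because it is
    permutation-invariant; it is not taken from the paper.
    """
    if not genome_embeddings:
        raise ValueError("need at least one genome embedding")
    dim = len(genome_embeddings[0])
    if any(len(vec) != dim for vec in genome_embeddings):
        raise ValueError("all embeddings must share one dimension")
    n = len(genome_embeddings)
    # Average each coordinate across the genomes in the community.
    return [sum(vec[i] for vec in genome_embeddings) / n for i in range(dim)]


# Toy community of two genomes with 2-dimensional embeddings.
community = aggregate_embeddings([[1.0, 0.0], [3.0, 2.0]])  # → [2.0, 1.0]
```

A downstream regressor would then take the pooled vector as input; that part is omitted because the abstract gives no detail on it.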

q-bio.GN 2026-05-11 2 theorems

Protein embeddings classify bacterial operons at 0.71 ROC-AUC

SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification

Siamese MLP on pre-trained models matches top DGEB entries while outperforming physicochemical baselines for scalable microbial genome work.

Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
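The unsupervised baseline described above, scoring a gene pair by the cosine similarity of two independently computed embeddings, can be sketched as follows. The function name and the toy vectors are illustrative assumptions; real embeddings would come from a pre-trained protein language model, which is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for two adjacent genes; a higher score would be
# read as stronger evidence that the pair belongs to the same operon.
score = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

A learned head such as the Siamese MLP replaces this fixed score with a trained classifier over the pair of embeddings.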

q-bio.GN 2026-05-08 2 theorems

Hybrid model lowers error in grapevine trait prediction across years

A Linear-Transformer Hybrid for SNP-Based Genotype-to-Phenotype Prediction in Grapevine

LiT-G2P blends additive genetic effects with Transformer nonlinear interactions to beat baselines on hair density and trichome density in 2-

Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide polymorphism (SNP) data. We evaluated LiT-G2P on a panel of diverse grape accessions, genotyped with SNP markers and measured for phenotypes across two consecutive years. Target phenotypic traits include leaf hair density and trichome density of grapevines. Across both single-year and cross-year testing scenarios, LiT-G2P consistently improves prediction performance compared with baseline models. For hair density, LiT-G2P achieves the lowest error in both single-year and cross-year evaluations, with RMSEs of 0.469 and 0.454, respectively, while maintaining strong tolerance accuracies of 79.2% and 74.6%, respectively. For trichome density, LiT-G2P also presents the best overall G2P performance. In addition, we extract model-prioritized SNPs from attention weights and apply genotype-stratified analysis to provide interpretable candidate markers for downstream validation. These results demonstrate that integrating stable additive effects with learned interaction patterns can enhance cross-year robustness and support practical SNP-based predictive modeling for genomic selection.

q-bio.GN 2026-05-08 Recognition

Multimodal LLM reasons with omics numbers and language together

OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning

OmicsLM matches specialized models on predictions and leads on multi-sample questions from real GEO studies.

Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context. We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.

q-bio.GN 2026-05-05

Transformer learns directional gene-program influences from unperturbed single-cell data

ORBIT: Learning Gene Program Co-Activation Structure for Cell-Type-Stratified Pathway Rewiring Analysis in Single-Cell Transcriptomics

Intervention-consistent training on observational RNA-seq recovers Alzheimer's rewiring and classifies cell types nearly as well as the full

Gene programs co-activate within cells, but existing single-cell methods either treat programs independently or require experimental perturbation data to model their interactions. We introduce ORBIT, a self-supervised transformer that learns asymmetric dependencies among gene programs from observational single-cell RNA-sequencing data alone, quantifying how strongly each program influences every other program. The key mechanism is an intervention-consistent training objective: the model learns each program's directional influence on every other program by predicting how the others change when that program is removed, yielding attention weights that reflect asymmetric influence rather than symmetric co-occurrence. Applied to 191,890 prefrontal cortex nuclei across three pathway vocabularies, ORBIT recovers co-activation structure consistent with established Alzheimer's disease vulnerability signatures, identifies cell-type-specific rewiring invisible to differential expression, and achieves 0.984 macro F1 on cell-type classification from 220 pathway scores, which is within 0.3 points of a state-of-the-art classifier using all 22,088 genes.

q-bio.GN 2026-05-04

Data fusion lifts migraine prediction AUC from 0.644 to 0.688

EFGPP: Exploratory framework for genotype-phenotype prediction

Framework integrates genotype features, covariates and risk scores from migraine and depression GWAS to beat single sources in 733 UK Biobank

Predicting complex human traits from genetic data is challenging because different genetic, clinical, and molecular data sources often contain different parts of the signal. Here, we present EFGPP, a reproducible framework for generating, ranking, and combining multiple types of data for genotype-to-phenotype prediction. We applied EFGPP to migraine prediction using UK Biobank data from 733 individuals. The framework combined genotype-derived features, principal components, clinical and metabolomic covariates, and polygenic risk scores generated from migraine and depression GWAS using PLINK, PRSice-2, AnnoPred, and LDAK-GWAS. The best single data type achieved a test AUC of 0.644, while combining multiple data types improved performance to 0.688 using migraine-focused inputs and 0.663 using cross-trait depression-derived inputs. Genetic features alone did not outperform the covariates-only baseline, but genotype-derived features performed better than PRS alone, and depression-derived PRS showed useful predictive signal. Overall, EFGPP provides a practical proof-of-concept framework for prioritising and integrating heterogeneous genetic data sources for complex phenotype prediction.

q-bio.GN 2026-05-04

Pipeline recovers 98.4% of known phenotype genes from 13 databases

PhenotypeToGeneDownloaderR: automated multi-source retrieval and validation of phenotype-associated genes

It validates 87.6% of symbols and supplies ready gene lists for risk scoring and variant interpretation.

Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models. Here, we present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation. The pipeline is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.
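The two headline percentages follow directly from the counts quoted in the abstract. The short check below reproduces that arithmetic; the variable names are illustrative and not part of the pipeline.

```python
# Counts as reported in the abstract.
validated_symbols = 100_175
combined_symbols = 114_345
recovered_genes = 1_039
gold_standard_genes = 1_056

# Validation rate: share of combined input symbols retained after validation.
validation_rate = validated_symbols / combined_symbols

# Recall: share of gold-standard genes recovered by the pipeline.
recall = recovered_genes / gold_standard_genes

print(f"validation rate: {validation_rate:.1%}")  # prints "validation rate: 87.6%"
print(f"recall: {recall:.1%}")                    # prints "recall: 98.4%"
```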

q-bio.GN 2026-05-01

MCMC steering improves single-cell perturbation predictions

CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation

A Metropolis-Hastings sampler using masked conditionals avoids out-of-distribution artifacts when forecasting gene-knockout responses.

In this work, we introduce CellxPert, a scalable multimodal foundation model that unifies single-cell and spatial multi-omics within a common representation space. CellxPert jointly encodes transcriptomic (scRNA-seq), chromatin-accessibility (ATAC-seq), and surface-proteomic (CITE-seq) measurements, while directly incorporating MERFISH and imaging mass-cytometry data as 2D or 3D spatial-visual layers. CellxPert facilitates four key downstream tasks out of the box: (i) cell-type annotation across a broad ontology of 154 largely overlapping identities -- the largest label space addressed to date and a stringent test of fine-grained discrimination, (ii) efficient fine-tuning using Low Rank Adaptation (LoRA), (iii) genome-wide transcriptomic response prediction to in-silico perturbations (ISP), and (iv) seamless multi-omic integration across various assays and platforms. Unlike current single-cell foundation models, which approximate gene perturbations by deleting or reordering tokenized gene expression ranks, CellxPert employs a Metropolis-Hastings sampler whose proposal kernel uses the model's masked conditional distributions to transition to new transcriptomic states conditioned on the perturbed genes. This Markov-chain procedure mitigates out-of-distribution artifacts introduced by abrupt token manipulation and produces trajectories that are biologically interpretable. Evaluations on PBMC68K, Replogle Perturb-seq, Systema, and BMMC benchmarks show that CellxPert surpasses classical and state-of-the-art baselines in cell-type annotation, perturbation response prediction, and multi-omic integration.
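For readers unfamiliar with the sampler family named above, here is a generic Metropolis-Hastings loop on a toy discrete state space. It is not CellxPert's sampler: the uniform proposal, the `target` weights, and every name in the block are assumptions for illustration, whereas the paper's proposal kernel is built from the model's masked conditional distributions.

```python
import random
from typing import Callable, Hashable, Sequence


def metropolis_hastings(
    target: Callable[[Hashable], float],
    states: Sequence[Hashable],
    start: Hashable,
    steps: int,
    seed: int = 0,
) -> list:
    """Sample from an unnormalised target over a finite set of states.

    Proposals are drawn uniformly from `states`, which makes the proposal
    symmetric, so the acceptance probability reduces to
    min(1, target(proposal) / target(current)).
    `target(start)` must be strictly positive.
    """
    rng = random.Random(seed)
    current = start
    chain = [current]
    for _ in range(steps):
        proposal = rng.choice(states)
        ratio = target(proposal) / target(current)
        if rng.random() < min(1.0, ratio):
            current = proposal  # accept the move
        chain.append(current)   # on rejection the chain repeats the old state
    return chain


# Toy target: state "high" carries nine times the weight of state "low".
weights = {"low": 1.0, "high": 9.0}
chain = metropolis_hastings(weights.__getitem__, ["low", "high"], "low", steps=20_000)
fraction_high = chain.count("high") / len(chain)  # should approach 0.9
```

Because each step moves between states that have non-zero weight under the target, the chain stays inside the region the target supports, which is the general property the abstract appeals to when contrasting sampling with abrupt token manipulation.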

q-bio.GN 2026-05-01

Fused signals and conformal calibration certify zero-miss DNA hazard screening under new-f

CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift

Three public annotation signals combined and threshold-calibrated on leave-one-family-out folds bound expected miss rate at 5 percent with 0

DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order's public annotation: $k$-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies $\mathbb{E}[\mathrm{FNR}] \le \alpha$. Across ten leave-one-taxonomic-family-out folds at $\alpha=0.05$ on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound's finite-sample slack $1/(n_{\mathrm{cal}}+1)$ caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade $\alpha=10^{-3}$ requires an $18\times$ larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: https://github.com/najmulhasan-code/crc-screen
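The calibration-size argument can be made concrete using only the slack term $1/(n_{\mathrm{cal}}+1)$ quoted in the abstract. Requiring that slack to be at most $\alpha$ gives $n_{\mathrm{cal}} \ge 1/\alpha - 1$. The sketch below works through that inequality with exact fractions; the function names are hypothetical, and it does not attempt to reproduce the paper's actual calibration-set size.

```python
import math
from fractions import Fraction


def finite_sample_slack(n_cal: int) -> Fraction:
    """Slack term 1 / (n_cal + 1) for a calibration set of size n_cal."""
    return Fraction(1, n_cal + 1)


def min_calibration_size(alpha: Fraction) -> int:
    """Smallest n_cal with 1 / (n_cal + 1) <= alpha, i.e. n_cal >= 1/alpha - 1."""
    return math.ceil(1 / alpha - 1)


# Procurement-grade target from the abstract: alpha = 10^-3.
n_needed = min_calibration_size(Fraction(1, 1000))  # → 999
```

Exact fractions are used so that the ceiling is not affected by floating-point rounding of `1 / alpha`.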

q-bio.GN 2026-04-29

A generalized gene co-expression method applied to AMD RNA-Seq data identifies stable…

Robust Clustering Analysis of Genes Related to Age-related Macular Degeneration using RNA-Seq

Enhanced clustering with stability checks recovers known AMD genes and surfaces fresh candidates for mechanism and therapy research.

Identifying genes associated with diseases is crucial to understanding disease mechanisms and developing therapies. However, identification of individual genes associated with a disease often needs to be supplemented with clustering analysis to understand the relationships between genes and identify gene modules beyond individual gene-level relationships. Gene co-expression networks are widely used as a graph theoretic approach to the clustering analysis of genes. In our work, we perform robust clustering analysis on RNA-Seq data of Age-related Macular Degeneration (AMD) patients and controls by generalizing one such framework, Multiscale Embedded Gene Co-Expression Network Analysis (MEGENA). We propose a carefully curated set of module quality evaluation metrics to choose appropriate statistical distance-based or information theoretic similarity measures over simple linear correlation to represent the similarities between genes. Furthermore, we design and implement a stability test to ensure the robustness of the detected hub genes in the presence of noise. Finally, we propose differential module eigengene analysis for a deeper understanding of upregulation and downregulation of each module with respect to the disease and control groups for a comprehensive understanding of the clustering analysis. Besides detecting robust hub genes and modules that are supported by prior findings, we also identify previously undiscovered hub genes that can potentially lead to further biomedical research into understanding the AMD disease mechanism and developing new treatments.

q-bio.GN 2026-04-29

Over 1000 TCR sequences flag long COVID vs recovery

T-cell repertoire response in individuals with post-acute sequelae of COVID-19

Repertoire analysis in 120 patients links motifs and clone changes to persistent symptoms after SARS-CoV-2 infection.

T-cells are central to SARS-CoV-2 clearance and immunological memory, yet their contribution to the persistence of post-acute sequelae of COVID-19 (PASC) remains poorly understood. The immunological features that distinguish individuals who develop PASC from those who recover fully are unresolved, in part due to the phenotypic heterogeneity of the condition and the likely multiplicity of its underlying mechanisms. Here, we profiled longitudinal bulk TCR$\beta$ repertoires from 120 individuals in the INCOV cohort--71 with PASC and 49 without--sampled at two to three time points spanning the acute and post-acute phases of infection. Using robust statistical modeling of repertoire composition and clonal dynamics, we found that global statistics such as V and J gene usage and CDR3 length do not differ between groups, but that locally enriched sequence motifs and differentially dynamic clones reveal distinct T-cell signatures associated with PASC status. Clones contracting following the peak of the acute response were significantly enriched for SARS-CoV-2 specificity in both groups. Interestingly, Influenza A-specific TCRs were disproportionately enriched among contracting clones in PASC$^+$ repertoires, implicating viral co-infection as a potential contributor to early disease severity and, possibly, PASC pathogenesis. Rare public TCR clones were markedly enriched for SARS-CoV-2 specificity, with PASC$^+$ individuals harboring a modestly but significantly higher proportion than PASC$^-$ individuals. Together, we identified over 1,000 candidate TCR$\beta$ receptors potentially discriminating PASC$^+$ from PASC$^-$ immune responses, opening a path toward the identification of disease-relevant T-cell specificities and the development of T-cell-based immunological biomarkers for long COVID.

q-bio.GN 2026-04-27

Radiomic features distinguish molecular subtypes in tongue cancer

Imaging Exploration of Molecular Subtypes in Tongue Squamous Cell Carcinoma

Ten wavelet texture measures from preoperative scans align with transcriptomic clusters that differ in immune and differentiation pathways.

Tongue squamous cell carcinoma (TSCC) is an aggressive malignancy with marked biological heterogeneity and variable clinical outcomes. Although molecular profiling has improved understanding of TSCC heterogeneity, its clinical use remains constrained by invasive tissue sampling and limited representation of whole-tumor spatial complexity. Meanwhile, most radiomics studies in TSCC have focused on downstream clinical endpoints, and whether imaging can non-invasively reflect intrinsic molecular subtypes remains unclear. In this study, an integrated transcriptomic-radiomics framework was used to investigate the relationship between preoperative imaging phenotypes and molecular subtypes in TSCC. Transcriptomic data from 60 TSCC cases in The Cancer Genome Atlas were analyzed using unsupervised consensus clustering, followed by differential expression and functional enrichment analyses. Matched preoperative imaging data from The Cancer Imaging Archive were manually annotated for primary tumor regions, and radiomic features were extracted using PyRadiomics; group differences were assessed with the U-test. Two stable molecular subtypes, C1 and C2, were identified. Their biological differences were mainly associated with squamous epithelial differentiation, inflammatory signaling, and lipid metabolism, with C2 showing greater enrichment of immune-related pathways. In addition, 10 radiomic features differed significantly between the two subtypes, mainly wavelet-derived texture features from gray-level size zone, dependence, co-occurrence, and run length matrices (P=0.00202-0.0162). These findings support the potential of radiomics as a non-invasive approach for characterizing molecular heterogeneity in TSCC and provide an initial radiogenomic framework for biologically informed preoperative assessment.

q-bio.GN 2026-04-27

Cathaya genome links defense gene loss to slow growth and symbiosis

The Cathaya argyrophylla Genome Reveals the Evolutionary Trade-offs of a Living Fossil

Contractions in immunity pathways and transport expansions explain the living fossil's vulnerabilities and microbial dependence.

Cathaya argyrophylla is an endangered paleoendemic gymnosperm characterized by restricted ecological adaptability and high pathogen susceptibility. To elucidate its genomic architecture and evolutionary history, a de novo chromosome-level genome assembly was constructed using PacBio High-Fidelity long reads and Hi-C scaffolding. The resulting 22.73 Gb assembly resolves into 12 pseudochromosomes, demonstrating genome gigantism driven primarily by a 72.92 percent repeat sequence content and extensive intron expansion. Phylogenomic analysis using single-copy orthologs identifies C. argyrophylla as a sister lineage to the Pinus clade, with an estimated divergence time of 102.8 million years ago. Analysis of gene family dynamics reveals significant expansions in pathways related to membrane lipid metabolism, transmembrane transport, and translation machinery, indicating specific molecular adaptations for cellular homeostasis in resource-limited environments. Conversely, the genome exhibits massive contractions in endogenous defense networks, including plant-pathogen interactions, brassinosteroid signaling, and DNA repair mechanisms. This distinct genomic reduction correlates directly with the slow growth rate and weak innate immunity observed in the species, while the expanded transmembrane transport networks suggest an obligate physiological reliance on symbiotic microbiomes for survival. Ultimately, this reference genome establishes a critical molecular resource for future conservation and breeding programs.

q-bio.GN 2026-04-24

Supregraphs capture full read information in assembly graphs

Supregraph: Enabling Information-Optimal Assembly Graph Representation of a Read Set

The new graph type avoids data loss and forced breaks that plague existing methods, supporting optimal assemblies under natural assumptions.

The first step in any genome assembly algorithm entails the conversion from the domain of strings and overlaps to the language of graphs and paths, typically using one of the two conventional methods: de Bruijn graphs or overlap graphs. However, both standard approaches are known to have limitations. De Bruijn graphs fail to represent complete information from reads, while the overlap graphs often produce artificial breaks in contigs due to the necessity to discard contained reads as a preliminary step. In this work we present a mathematical model for genome assembly that provides a formal framework to determine what constitutes a correct conversion of a read set into an assembly graph under the assumption of error-free reads. We prove that a correct representation of a read set exists in the form of a new class of assembly graphs, which we call supregraphs. We show that supregraphs can be constructed by iteratively transforming de Bruijn graphs using the multiplexing procedure, previously employed in the genome assemblers LJA and Verkko. Finally, we demonstrate that, under a set of natural assumptions, supregraphs provide a foundation for constructing theoretically optimal genome assemblies.
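The claim that de Bruijn graphs lose read-level information can be seen in a small sketch. The construction below is the textbook one for error-free reads (nodes are (k-1)-mers, edges are k-mers); it is not the supregraph construction or the multiplexing procedure from the paper, and the function name is an illustrative assumption.

```python
from collections import defaultdict
from typing import Iterable


def de_bruijn_graph(reads: Iterable[str], k: int) -> dict[str, list[str]]:
    """Build a de Bruijn graph as an adjacency map from (k-1)-mer to successors."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        # Every k-mer in the read contributes one edge: prefix -> suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# One long read and two shorter reads yield the same graph for k = 3,
# so the graph alone cannot tell which read set produced it.
from_one_read = de_bruijn_graph(["ACGT"], k=3)
from_two_reads = de_bruijn_graph(["ACG", "CGT"], k=3)
```

Both calls return the same adjacency map, which is the kind of information loss that motivates a representation retaining more of the read set.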

q-bio.GN 2026-04-23

Tree-guided diffusion creates cell-specific DNA regulators

Conditional Monte Carlo Tree Diffusion for Designing Cell-Type-Specific and Biologically Faithful Regulatory DNA

The approach beats diffusion, autoregressive, and optimization baselines on specificity and natural sequence fidelity for human cell lines.

Designing regulatory DNA elements with precise cell-type-specific activity is broadly relevant for cell engineering and gene therapy. Deep generative models can generate functional gene-regulatory elements, but existing methods struggle to achieve high specificity against undesired cell types while adhering to the genome's natural regulatory grammar. Here, we introduce DNA-CRAFT, a generative framework that integrates class-conditioned discrete diffusion with Monte Carlo tree search to design cell-type-specific and biologically faithful regulatory elements. We first train a discrete diffusion model on the ENCODE registry of 3.2 million candidate regulatory elements. Second, we condition the model to learn class-specific regulatory grammars of naturally occurring DNA sequences, including enhancers and promoters. Third, we employ conditional Monte Carlo tree guidance, an inference-time alignment algorithm designed to maximize the differential regulatory activity between desired and undesired cell types. By benchmarking DNA-CRAFT on regulatory sequence design tasks for human cell lines and immune cell types, we demonstrate that our model generates sequences with high predicted cell-type-specific activity and biological fidelity, achieving the best trade-offs compared to methods that use diffusion, autoregressive models, and gradient-based optimization.

q-bio.GN 2026-04-20

Quantum classifier finds combined gene set best for lung cancer subtyping

Quantum AI for Cancer Diagnostic Biomarker Discovery

Multi-omic analysis plus quantum machine learning separates LUAD from LUSC more accurately than either gene list alone.

Quantum machine learning (QML) offers a promising new paradigm for computational biology by leveraging quantum mechanical principles to enhance cancer classification, biomarker discovery, and bioinformatics diagnostics. In this study, we apply QML to identify subtype-specific biomarkers for lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the two predominant forms of non-small cell lung cancer. Our methodology involves a two-phase process: in Phase 1, differential expression analysis and methylation analysis between tumor and normal samples allow us to identify LUAD-specific and LUSC-specific genes, revealing potential prognostic biomarkers for cancer subtypes. Phase 2 focuses on developing a quantum classifier capable of distinguishing between LUAD and LUSC tumors, as well as between tumor and normal samples. This classifier not only enhances diagnostic precision but also demonstrates the quantum advantage in processing large-scale multiomic datasets. Our results consistently demonstrated that Sample3, representing the combined gene set, achieved the highest overall predictive performance in all metrics. These results demonstrate that QML provides an effective and scalable approach for biomarker discovery and subtype-specific cancer classification. GO enrichment analysis highlighted the significant involvement of genes in synaptic signaling, ion channel regulation, and neuronal development. In the quantum phase, KEGG analysis further identified enrichment in cancer-associated pathways, including neurotrophin, MAPK, Ras, and PI3K-Akt signaling, with key genes such as NGFR, NTRK2, and NTF3 suggesting a central role in neurotrophin-mediated oncogenic processes. Our findings highlight the growing potential of quantum computing to advance precision oncology and next-generation biomedical analytics.

q-bio.GN 2026-04-15

Documentation grounds LLMs for accurate bioinformatics commands

oxo-call: Documentation-grounded Skill Augmentation for Accurate Bioinformatics Command-line Generation with Large Language Models

oxo-call pairs full version-specific tool help texts with expert skills to convert natural language task descriptions into precise, logged,

Command-line bioinformatics tools remain essential for genomic analysis, yet their diversity in syntax and parameterization presents a persistent barrier to productive research. We present oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model (LLM) with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. oxo-call (v0.10) ships >150 built-in skills covering 44 analytical categories, from variant calling and genome assembly to single-cell transcriptomics, compiled into a single, statically linked binary. Every generated command is logged with provenance metadata to support reproducible research. oxo-call also provides a DAG-based workflow engine, extensibility through user-defined and community skills via the Model Context Protocol, and support for local LLM inference to address data-privacy requirements. oxo-call is freely available for academic use at https://traitome.github.io/oxo-call/.

q-bio.GN 2026-04-09 2 theorems

Evo2 learns local CTCF grammar but misses 3D chromatin structure

Probing 3D Chromatin Structure Awareness in Evo2 DNA Language Model

Tests on TAD boundaries and convergent loops show the model captures sequence patterns but not higher-order 3D organization, so longer spans

DNA language models like Evo2 now fit million-token contexts large enough to cover entire TADs, yet whether they learn 3D chromatin structure, a key regulatory layer acting atop primary sequence, remains untested and questionable, given that Evo2's training data includes prokaryotes lacking this structure. We probed Evo2-7B on TAD boundaries and convergent CTCF loops in 1 Mb windows using two complementary tests: likelihood-based perturbation and sequence generation. Evo2 did not distinguish functional perturbations from matched random controls and failed to reliably generate convergent CTCF loops, recovering TAD boundaries only partially. Together, these results indicate that Evo2 has learned local CTCF grammar but misses higher-order 3D organization, pointing to bidirectional model architectures integrating cell types and 3D contacts, rather than longer contexts, as the path to developing 3D-aware DNA language models.

q-bio.GN 2026-04-09 Recognition

Pipeline fixes circular bias in ecDNA cancer benchmarks

ECLIPSE: A Composable Pipeline for Predicting ecDNA Formation, Evolution, and Therapeutic Vulnerabilities in Cancer

ecDNA-Former reaches AUROC 0.812 from standard features alone while physics models and causal inference improve dynamics and target accuracy

Extrachromosomal DNA (ecDNA) represents one of the most pressing challenges in cancer biology: circular DNA structures that amplify oncogenes, evade targeted therapies, and drive tumor evolution in ~30% of aggressive cancers. Despite its clinical importance, computational ecDNA research has been built on broken foundations. We discover that existing benchmarks suffer from circular reasoning -- models trained on features that already require knowing ecDNA status -- artificially inflating performance from AUROC 0.724 to 0.967. We introduce ECLIPSE, the first methodologically sound framework for ecDNA analysis, comprising three modules that transform how we predict, model, and target these structures. ecDNA-Former achieves AUROC 0.812 using only standard genomic features, demonstrating for the first time that ecDNA status is predictable without specialized sequencing, and that careful feature curation matters more than complex architectures. CircularODE captures ecDNA's unique stochastic dynamics through physics-constrained neural SDEs, achieving r > 0.997 on experimental data via zero-shot transfer. VulnCausal applies causal inference to identify therapeutic vulnerabilities, achieving 80x enrichment over chance and 3.7x higher validation than standard approaches by filtering spurious correlations. Together, these modules establish rigorous baselines for an emerging application area and reveal a broader lesson: in high-stakes biomedical ML, methodological rigor -- eliminating leakage, encoding domain physics, addressing confounding -- outweighs architectural innovation. ECLIPSE provides both the tools and the template for principled computational oncology.

q-bio.GN 2026-04-09 Recognition

Genomic language models ignore where regulatory DNA sits

The Mechanistic Invariance Test: Genomic Language Models Fail to Learn Positional Regulatory Logic

They score wrong positions higher than correct ones and are driven only by AT base content, while a 100-parameter model succeeds

Genomic language models (gLMs) have transformed computational biology, achieving state-of-the-art performance across genomic tasks. Yet a fundamental question threatens the foundation of this success: do these models learn the mechanistic principles governing gene regulation, or do they merely exploit statistical shortcuts? We introduce the Mechanistic Invariance Test (MIT), a rigorous 650-sequence benchmark across 8 classes with scrambled controls that enables clean discrimination between compositional sensitivity and genuine positional understanding. We evaluate five gLMs spanning all major architectural paradigms (autoregressive, masked, and bidirectional state-space models) and uncover a universal failure mode. Through systematic mechanistic probing via AT titration, positional ablation, spacing perturbation, and strand orientation tests, we demonstrate that apparent compensation sensitivity is driven entirely by AT content correlation (r=0.78-0.96 across architectures), not positional regulatory logic. The failures are striking: Evo2-1B and Caduceus score regulatory elements at incorrect positions higher than correct positions, inverting biological reality. All models are strand-blind. Compositional effects dominate positional effects by 46-fold. Perhaps most revealing, a simple 100-parameter position-aware PWM achieves perfect performance (CSS=1.00, SCR=0.98), exposing that billion-parameter gLMs fail not from insufficient capacity but from fundamentally misaligned inductive biases. Larger models show stronger compositional bias, demonstrating that scale amplifies rather than corrects this limitation. These findings reveal that current gLMs capture surface statistics while missing the positional grammar essential for gene regulation, demanding architectural innovation before deployment in synthetic biology, gene therapy, and clinical variant interpretation.
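The contrast the abstract draws between compositional and positional sensitivity can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the motif, the expected position, the match probability, and the helper names. The sketch builds two sequences with identical AT content, one with the motif at the expected position and one with it shifted, and shows that a position-aware PWM score separates them while AT content cannot.

```python
import math

BASES = "ACGT"
MOTIF = "TATAAT"      # toy motif, chosen only for illustration
EXPECTED_POS = 10     # hypothetical "correct" start position of the motif


def build_pwm(motif, match_prob=0.85):
    """Log-odds PWM against a uniform background (0.25 per base)."""
    other = (1.0 - match_prob) / 3.0
    return [
        {b: math.log2((match_prob if b == base else other) / 0.25) for b in BASES}
        for base in motif
    ]


def pwm_score_at(seq, pwm, pos):
    """Score the window starting at a fixed position (position-aware)."""
    return sum(col[seq[pos + i]] for i, col in enumerate(pwm))


def at_content(seq):
    """Fraction of A/T bases (purely compositional feature)."""
    return sum(b in "AT" for b in seq) / len(seq)


def place(motif, pos, length=30, filler="G"):
    """Embed the motif at `pos` inside a filler sequence of fixed length."""
    return filler * pos + motif + filler * (length - pos - len(motif))


pwm = build_pwm(MOTIF)
correct = place(MOTIF, EXPECTED_POS)   # motif at the expected position
shifted = place(MOTIF, 2)              # same motif, wrong position

print("AT content:", at_content(correct), at_content(shifted))
print("PWM score at expected position:",
      pwm_score_at(correct, pwm, EXPECTED_POS),
      pwm_score_at(shifted, pwm, EXPECTED_POS))
```

A scorer that depends only on base composition assigns both sequences the same value, so it cannot prefer the correct placement; the fixed-position PWM score is what distinguishes them.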

q-bio.GN 2026-04-08 Recognition

LLMs detect local DNA signals but weaken on multi-step genome tasks

GenomeQA: Benchmarking General Large Language Models for Genome Sequence Understanding

GenomeQA tests 5,200 raw sequences across six task families and shows frontier models beat random baselines only when short motifs or GC content

Large Language Models (LLMs) are increasingly adopted as conversational assistants in genomics, where they are mainly used to reason over biological knowledge, annotations, and analysis outputs through natural language interfaces. However, existing benchmarks either focus on specialized DNA models trained for sequence prediction or evaluate biological knowledge using text-only questions, leaving the behavior of general-purpose LLMs when directly exposed to raw genome sequences underexplored. We introduce GenomeQA, a benchmark designed to provide a controlled evaluation setting for general-purpose LLMs on sequence-based genome inference tasks. GenomeQA comprises 5,200 samples drawn from multiple biological databases, with sequence lengths ranging from 6 to 1,000 base pairs (bp), spanning six task families: Enhancer and Promoter Identification, Splice Site Identification, Taxonomic Classification, Histone Mark Prediction, Transcription Factor Binding Site Prediction, and TF Motif Prediction. Across six frontier LLMs, we find that models consistently outperform random baselines and can exploit local sequence signals such as GC content and short motifs, while performance degrades on tasks that require more indirect or multi-step inference over sequence patterns. GenomeQA establishes a diagnostic benchmark for studying and improving the use of general-purpose LLMs on raw genomic sequences.

q-bio.GN 2026-04-08 2 theorems

Transcriptomic models for ICI response generalize poorly across cohorts

Transcriptomic Models for Immunotherapy Response Prediction Show Limited Cross-cohort Generalisability

Benchmarking on independent datasets finds near-chance accuracy and inconsistent biomarkers, pointing to needed improvements in adaptation.

Immune checkpoint inhibitors (ICIs) have transformed cancer therapy, yet a substantial proportion of patients exhibit intrinsic or acquired resistance, making accurate pre-treatment response prediction a critical unmet need. Transcriptomics-based biomarkers derived from bulk and single-cell RNA sequencing (scRNA-seq) offer a promising avenue for capturing tumour-immune interactions, yet the cross-cohort generalisability of existing prediction models remains unclear. We systematically benchmark nine state-of-the-art transcriptomic ICI response predictors, five bulk RNA-seq-based models (COMPASS, IRNet, NetBio, IKCScore, and TNBC-ICI) and four scRNA-seq-based models (PRECISE, DeepGeneX, Tres, and scCURE), using publicly available independent datasets unseen during model development. Overall, predictive performance was modest: bulk RNA-seq models performed at or near chance level across most cohorts, while scRNA-seq models showed only marginal improvements. Pathway-level analyses revealed sparse and inconsistent biomarker signals across models. Although scRNA-seq-based predictors converged on immune-related programs such as allograft rejection, bulk RNA-seq-based models exhibited little reproducible overlap. PRECISE and NetBio identified the most coherent immune-related themes, whereas IRNet predominantly captured metabolic pathways weakly aligned with ICI biology. Together, these findings demonstrate the limited cross-cohort robustness and biological consistency of current transcriptomic ICI prediction models, underscoring the need for improved domain adaptation, standardised preprocessing, and biologically grounded model design.

q-bio.GN 2026-04-06

ML models recover NGS quality labels from QC and blocklist features

An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing

37,491 samples carry both 34 fixed metrics and adjustable read-count vectors plus expert labels for testing automated quality control

Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers require datasets with features that capture the characteristics of quality problems. Existing NGS repositories, however, offer only a limited number of quality-related features. To address this gap, we propose a dataset derived from 37,491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type has a variable number of features ranging from eight to 1,183. These features were derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same human and mouse samples from five genomic assays, allowing direct comparison of feature representations. The proposed dataset includes a binary quality label, derived from automated quality control and domain experts. Among all samples, $3.2\%$ are of low quality. Supervised machine learning algorithms accurately predicted quality labels from the features, confirming the relevance of the provided feature representations. The proposed feature representations enable researchers to study how different feature types (QC-34 vs. BL features) and granularities (varying number of BL features) affect the detection of quality problems.
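The class imbalance stated above (3.2% low-quality among 37,491 samples) has a practical consequence for how "accurate" predictions should be read. The worked arithmetic below is a sketch, not part of the dataset paper: the low-quality count is approximated from the rounded percentage, and the "always predict high quality" baseline is a hypothetical reference point.

```python
# Figures from the abstract; the low-quality count is an approximation
# because the 3.2% rate is itself rounded.
n_samples = 37_491
low_quality_rate = 0.032

n_low = round(n_samples * low_quality_rate)
n_high = n_samples - n_low

# Hypothetical trivial baseline: label every sample as high quality.
accuracy = n_high / n_samples      # fraction of correct predictions
recall_low = 0 / n_low             # no low-quality sample is ever flagged
recall_high = n_high / n_high      # every high-quality sample is correct
balanced_accuracy = (recall_low + recall_high) / 2

print(f"approx. low-quality samples: {n_low}")
print(f"baseline accuracy:           {accuracy:.3f}")
print(f"baseline balanced accuracy:  {balanced_accuracy:.3f}")
```

Because the trivial baseline already reaches high plain accuracy while detecting no low-quality samples, imbalance-aware metrics such as balanced accuracy or per-class recall are the more informative way to compare the QC-34 and BL feature representations.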

q-bio.GN 2026-04-03 2 theorems

Heritability estimates swing but PRS accuracy barely moves

Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Benchmark of 86 configurations on 10 UK Biobank traits finds near-zero correlation between h² and test AUC in two common PRS methods.

Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, and SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) being negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h^2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2, with both being non-significant. Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.

q-bio.GN 2026-04-02 2 theorems

Varifold distances on RNA velocity curves recover cell trees

VeloTree: Inferring single-cell trajectories from RNA velocity fields with varifold distances

A dissimilarity measure turns integral curves of the velocity field into robust estimates of path distances on differentiation trees.

Trajectory inference is a critical problem in single-cell transcriptomics, which aims to reconstruct the dynamic process underlying a population of cells from sequencing data. Of particular interest is the reconstruction of differentiation trees. One way of doing this is by estimating the path distance between nodes -- labeled by cells -- based on cell similarities observed in the sequencing data. Recent sequencing techniques make it possible to measure two types of data: gene expression levels, and RNA velocity, a vector that quantifies variation in gene expression. The sequencing data then consist of a discrete vector field whose dimension is the number of genes of interest. In this article, we present a novel method for inferring differentiation trees from RNA velocity fields using a distance-based approach. In particular, we introduce a cell dissimilarity measure defined as the squared varifold distance between the integral curves of the RNA velocity field, which we show is a robust estimate of the path distance on the target differentiation tree. Upstream of the dissimilarity measure calculation, we also implement comprehensive routines for the preprocessing and integration of the RNA velocity field. Finally, we illustrate the ability of our method to recover differentiation trees with high accuracy on several simulated and real datasets, and compare these results with the state of the art.
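To make the "squared varifold distance between curves" concrete, the following is a minimal sketch of a discrete kernel varifold distance between two polylines. It is not the VeloTree implementation: the Gaussian position kernel, the squared-dot-product direction kernel, the bandwidth `sigma`, and all function names are assumptions chosen for illustration. Each curve is split into segments, and the inner product sums kernel values over all pairs of segments weighted by their lengths.

```python
import math


def segments(curve):
    """Split a polyline into (center, unit tangent, length) triples."""
    out = []
    for p, q in zip(curve, curve[1:]):
        d = [b - a for a, b in zip(p, q)]
        length = math.sqrt(sum(x * x for x in d))
        if length == 0.0:
            continue  # skip degenerate segments
        center = [(a + b) / 2 for a, b in zip(p, q)]
        out.append((center, [x / length for x in d], length))
    return out


def varifold_inner(c1, c2, sigma=1.0):
    """Kernel inner product between two discretized curves."""
    total = 0.0
    for x, u, len_u in segments(c1):
        for y, v, len_v in segments(c2):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            k_pos = math.exp(-sq_dist / sigma ** 2)          # Gaussian on centers
            k_dir = sum(a * b for a, b in zip(u, v)) ** 2    # orientation-invariant
            total += k_pos * k_dir * len_u * len_v
    return total


def varifold_sq_dist(c1, c2, sigma=1.0):
    """Squared distance induced by the kernel inner product."""
    return (varifold_inner(c1, c1, sigma)
            - 2.0 * varifold_inner(c1, c2, sigma)
            + varifold_inner(c2, c2, sigma))


# Toy 2-D curves; in the setting above, curves would live in gene-expression space.
base = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
near = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
far = [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]

print(varifold_sq_dist(base, near), varifold_sq_dist(base, far))
```

In this sketch, curves that are close in both position and direction produce a small value, and the bandwidth `sigma` controls the spatial scale at which two curves are treated as similar.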
