arxiv: 2605.08815 · v1 · submitted 2026-05-09 · 💻 cs.LG · q-bio.BM· q-bio.GN· q-bio.QM

Recognition: 2 theorem links

· Lean Theorem

MicroFuse: Protein-to-Genome Expert Fusion for Microbial Operon Reasoning

Seungik Cho

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:41 UTC · model grok-4.3

classification 💻 cs.LG q-bio.BMq-bio.GNq-bio.QM

keywords microbial operon predictionprotein-genome fusionmixture of expertscross-modal alignmentgenome contextoperon co-membershipsequence conflict resolution

0 comments

The pith

MicroFuse fuses protein identity with genomic layout through expert modules to predict microbial operon co-membership more accurately than single-modality models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a fusion framework that combines protein-scale molecular identity with genome-context organization to determine whether adjacent microbial genes belong to the same operon. It routes representations through four specialized experts—one for protein signals, one for genome signals, one for agreement cases, and one for conflict cases—using a learned soft router. Training incorporates cross-modal alignment and disagreement-weighted contrastive shaping to emphasize biologically meaningful differences. This setup matters because naive concatenation or single-view models miss cases where sequence similarity misleads while layout correctly indicates independent regulation. The approach delivers stronger benchmark results precisely on those ambiguous gene pairs.

Core claim

MicroFuse integrates structure-aware protein representations with genome-context representations through a four-expert Mixture-of-Experts module consisting of protein, genome-context, agreement, and conflict experts together with a learned soft router. Training combines binary cross-entropy loss with symmetric cross-modal InfoNCE alignment and disagreement-weighted supervised contrastive shaping. On the constructed 100,000-pair scaffold-level benchmark, the model records the highest AUROC, AUPRC, mAP, and mAR among protein-only, genome-only, and simple concatenation baselines, with the largest gains occurring on a hard subset of pairs where sequence similarity conflicts with genomic layout.

What carries the argument

A four-expert mixture-of-experts module with a soft router that selects among protein, genome-context, agreement, and conflict specialists while cross-modal contrastive alignment shapes the representations.

Load-bearing premise

The learned soft router can reliably detect cases where protein identity and genomic layout give conflicting signals and exploit that distinction without simply memorizing dataset-specific patterns.

What would settle it

A new collection of gene pairs labeled by independent experimental evidence in which protein sequence similarity is high yet genomic neighborhood shows separate regulation, with MicroFuse showing no advantage over a protein-only baseline.

Figures

Figures reproduced from arXiv: 2605.08815 by Seungik Cho.

**Figure 1.** Figure 1: Overview of MicroFuse. Given a microbial gene pair, MicroFuse extracts a frozen ProstT5 protein-pair embedding and a frozen Bacformer genome-context embedding, projects both modalities into a shared latent space, and fuses them through protein, genome-context, agreement, and conflict experts. The fused representation is trained with binary task supervision, cross-modal alignment, and disagreement-weighted … view at source ↗

**Figure 2.** Figure 2: Threshold sensitivity analysis. MicroFuse reaches comparable macro-F1 and slightly higher accuracy after threshold selection, suggesting that its weaker default-threshold macro-F1 is partly due to thresholding rather than representation quality. C. Router Diagnostics [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

Predicting microbial operon co-membership requires integrating two complementary biological signals: protein-scale molecular identity and genome-context organization. While recent biological foundation models provide powerful representations of each view independently, naive concatenation of these modalities ignores a key biological property -- protein identity and genomic context may agree when adjacent genes form a coherent functional module, or conflict when sequence similarity is misleading but genomic layout indicates independent regulation. We present MicroFuse, a protein-to-genome expert fusion framework that integrates structure-aware protein representations from ProstT5 with genome-context representations from Bacformer through a four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) with a learned soft router. Training combines binary cross-entropy with symmetric cross-modal InfoNCE alignment and disagreement-weighted supervised contrastive shaping. We further construct OG-Operon100K, a 100,000-pair scaffold-level benchmark from the OMG metagenomic corpus with biologically grounded positive and negative criteria. On OG-Operon100K, MicroFuse achieves the strongest AUROC, AUPRC, mAP, and mAR among ProstT5-only, Bacformer-only, and Concat MLP baselines. Ablations identify cross-modal contrastive alignment as the dominant component, and a hard sequence-conflict subset reveals MicroFuse's largest gains precisely in biologically ambiguous cases where protein identity alone is misleading.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MicroFuse adds a four-expert router and a new 100k scaffold benchmark for operon prediction, but the router's ability to detect real biological conflict versus labeling patterns remains unproven.

read the letter

The paper's main contribution is a mixture-of-experts fusion that adds dedicated agreement and conflict experts on top of ProstT5 protein and Bacformer genome encoders, trained with BCE plus symmetric InfoNCE and disagreement-weighted contrastive losses. They also release OG-Operon100K, a 100,000-pair benchmark drawn from the OMG metagenomic corpus with positive and negative labels based on scaffold-level criteria. On this set the model beats the single-encoder and simple-concat baselines, with the largest reported lift on a hard subset where sequence similarity alone is misleading. Ablations flag the contrastive term as the biggest driver. That combination is new for this specific task even if the base encoders are not. The benchmark construction itself looks like a practical addition for the field. The soft spots are straightforward. The abstract supplies no error bars, no statistical tests, no hyperparameter schedule, and no training protocol, so the performance numbers cannot be assessed for robustness. The central claim that the soft router reliably routes on true signal disagreement rather than dataset-specific labeling rules is not yet backed by controls that would rule out the stress-test concern. If the conflict expert mainly activates on patterns that co-occur with the negative-labeling heuristic, the hard-subset gains lose their intended biological meaning. This work is aimed at computational biologists already working with protein and genome foundation models who need better fusion for operon or similar co-membership tasks. The benchmark could be reused even if the model needs tightening. I would send it to peer review. The architecture and data are coherent enough to merit referee time, provided the authors supply full methods, code, and additional router diagnostics.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces MicroFuse, a protein-to-genome expert fusion framework that integrates ProstT5 protein representations with Bacformer genome-context representations via a four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) controlled by a learned soft router. Training combines binary cross-entropy with symmetric InfoNCE cross-modal alignment and disagreement-weighted supervised contrastive loss. The authors construct the OG-Operon100K scaffold-level benchmark from the OMG metagenomic corpus using biologically grounded positive/negative criteria and report that MicroFuse outperforms ProstT5-only, Bacformer-only, and Concat MLP baselines on AUROC, AUPRC, mAP, and mAR, with ablations attributing gains primarily to the contrastive component and largest improvements on a hard sequence-conflict subset.

Significance. If the central claims hold under rigorous validation, the explicit modeling of agreement versus conflict signals via specialized experts addresses a genuine biological challenge in operon co-membership prediction where protein identity and genomic layout can provide contradictory cues. The new OG-Operon100K benchmark grounded in the OMG corpus is a useful contribution for the community. The ablation isolating cross-modal contrastive alignment as dominant is a strength, but overall significance is tempered by the absence of statistical testing and external validation of the router's behavior.

major comments (2)

[Abstract and results section] Abstract and results section: the headline performance claims (strongest AUROC/AUPRC/mAP/mAR on OG-Operon100K) and the interpretation of largest gains on the hard sequence-conflict subset rest on single-run summary statistics without statistical significance tests, error bars, or details on hyperparameter selection and full training protocol. This makes it impossible to determine whether the reported margins over the three baselines are robust or could arise from random variation.
[Router and expert module description] Router and expert module description: the claim that the four-expert soft router (protein, genome-context, agreement, conflict) learns to detect and exploit true biological disagreement rather than benchmark-specific labeling artifacts is load-bearing for both the ablation conclusions and the hard-subset gains. No independent probe of router decisions (e.g., on held-out external operon datasets or synthetic conflict cases) is provided, leaving open the possibility that the conflict expert activates on patterns correlated with the OMG-derived negative-labeling rule rather than regulatory independence.

minor comments (3)

[Methods] The manuscript would benefit from explicit equations for the disagreement-weighted contrastive loss term and the soft-router gating function to allow exact reproduction.
[Ablation studies] Ablation tables lack multiple random seeds or variance estimates, which would strengthen the claim that contrastive alignment is the dominant component.
[Throughout] Minor notation inconsistency: the abstract refers to 'mAR' while the main text occasionally uses 'mean average recall'; standardize terminology.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful review and for highlighting areas where additional rigor would strengthen our work. We agree with both major comments and will revise the manuscript accordingly to include statistical analyses and a more cautious interpretation of the router's behavior, supported by additional qualitative probes.

read point-by-point responses

Referee: [Abstract and results section] Abstract and results section: the headline performance claims (strongest AUROC/AUPRC/mAP/mAR on OG-Operon100K) and the interpretation of largest gains on the hard sequence-conflict subset rest on single-run summary statistics without statistical significance tests, error bars, or details on hyperparameter selection and full training protocol. This makes it impossible to determine whether the reported margins over the three baselines are robust or could arise from random variation.

Authors: We fully agree that single-run results limit the ability to assess robustness. In the revised manuscript, we will conduct experiments with at least five different random seeds, reporting mean performance metrics with standard deviations. We will also perform statistical significance tests (e.g., Wilcoxon signed-rank test) to compare MicroFuse against baselines and include p-values. Additionally, we will expand the methods section to detail the hyperparameter search procedure and full training protocol, including optimizer settings, learning rates, and early stopping criteria. These changes will provide evidence that the observed improvements are not due to random variation. revision: yes
Referee: [Router and expert module description] Router and expert module description: the claim that the four-expert soft router (protein, genome-context, agreement, conflict) learns to detect and exploit true biological disagreement rather than benchmark-specific labeling artifacts is load-bearing for both the ablation conclusions and the hard-subset gains. No independent probe of router decisions (e.g., on held-out external operon datasets or synthetic conflict cases) is provided, leaving open the possibility that the conflict expert activates on patterns correlated with the OMG-derived negative-labeling rule rather than regulatory independence.

Authors: We acknowledge the importance of validating that the router captures genuine biological signals. While we do not have immediate access to additional external operon datasets for independent probing, we will add a new analysis section examining the router's activation patterns on the hard sequence-conflict subset. This will include examples where the conflict expert is activated and how it correlates with biological features like gene distance or functional annotations. We will also revise the manuscript text to moderate the claim, stating that the router 'appears to prioritize conflict signals in ambiguous cases' based on the ablation results, rather than asserting it definitively 'learns to detect true biological disagreement'. The disagreement-weighted contrastive loss and the performance gains on the hard subset provide supporting evidence, but we agree an external probe would be ideal for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline is self-contained

full rationale

The paper constructs OG-Operon100K independently from the OMG corpus using explicit biologically grounded positive/negative labeling rules, then trains a standard MoE architecture with BCE + InfoNCE + contrastive losses and evaluates AUROC/AUPRC/mAP/mAR on held-out splits against explicit baselines. No equations, router definitions, or loss terms reduce the reported metrics to quantities defined solely by the fitted parameters or by re-expressing the benchmark construction itself. Ablations and hard-subset analyses compare independent model variants on the same fixed benchmark, without any self-definitional, fitted-input-renamed-as-prediction, or self-citation-load-bearing steps. The derivation from architecture to performance claims remains externally falsifiable and does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that protein identity and genome context supply complementary and sometimes conflicting signals for operon membership, plus standard machine-learning assumptions that contrastive alignment improves multimodal fusion. The model introduces two new expert modules whose utility is demonstrated only internally.

free parameters (1)

balancing weights among BCE, InfoNCE, and disagreement-weighted contrastive losses
The abstract states that training combines the three losses but does not specify how the relative weights are chosen or tuned.

axioms (1)

domain assumption Protein-scale molecular identity and genome-context organization provide complementary signals that can agree or conflict for operon co-membership
Explicitly stated in the abstract as the key biological property motivating the expert design.

invented entities (1)

agreement expert and conflict expert no independent evidence
purpose: To specialize in cases where the two modalities agree or disagree on operon membership
New modules introduced in the MoE layer without external biological validation or prior literature reference.

pith-pipeline@v0.9.0 · 5552 in / 1543 out tokens · 39763 ms · 2026-05-12T01:41:38.189071+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) with a learned soft router... disagreement-weighted supervised contrastive shaping
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean bare_distinguishability_of_absolute_floor echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

hconf_ij = E_conf [|zp_ij − zb_ij|; zp_ij − zb_ij]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Genetic regulatory mechanisms in the synthesis of proteins , journal =

Jacob, Fran. Genetic regulatory mechanisms in the synthesis of proteins , journal =. 1961 , doi =

work page 1961
[2]

and Maltsev, Natalia , title =

Overbeek, Ross and Fonstein, Michael and D'Souza, Michael and Pusch, Gordon D. and Maltsev, Natalia , title =. Proceedings of the National Academy of Sciences , volume =. 1999 , doi =

work page 1999
[3]

and Huang, Katherine H

Price, Morgan N. and Huang, Katherine H. and Arkin, Adam P. and Alm, Eric J. , title =. Molecular Systems Biology , volume =. 2005 , doi =

work page 2005
[4]

Nucleic Acids Research , volume =

Mao, Fenglou and Dam, Phuongan and Chou, Janet and Olman, Victor and Xu, Ying , title =. Nucleic Acids Research , volume =. 2009 , doi =

work page 2009
[5]

Briefings in Functional Genomics , volume =

Naville, Magali and Ghuillot-Gaudeffroy, Ange and Marchais, Antonin and Gautheret, Daniel , title =. Briefings in Functional Genomics , volume =. 2015 , doi =

work page 2015
[6]

Nucleic Acids Research , volume =

Okuda, Shujiro and Katayama, Toshiaki and Kawashima, Shuichi and Goto, Susumu and Kanehisa, Minoru , title =. Nucleic Acids Research , volume =. 2006 , doi =

work page 2006
[7]

and Kellogg, Elizabeth and Ovchinnikov, Sergey and Girguis, Peter , title =

Hwang, Yunha and Cornman, Andre L. and Kellogg, Elizabeth and Ovchinnikov, Sergey and Girguis, Peter , title =. Nature Communications , volume =. 2024 , doi =

work page 2024
[8]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and others , title =. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =. 2021 , doi =

work page 2021
[9]

Science , volume =

Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and others , title =. Science , volume =. 2023 , doi =

work page 2023
[10]

NAR Genomics and Bioinformatics , volume =

Heinzinger, Michael and Weissenow, Konstantin and Sanchez, Jose and Henkel, Adrian and Mirdita, Milot and Steinegger, Martin and Rost, Burkhard , title =. NAR Genomics and Bioinformatics , volume =. 2024 , doi =

work page 2024
[11]

and Tumescheit, Charlotte and Mirdita, Milot and Lee, Jeonghun and Gilchrist, Cameron L

van Kempen, Michel and Kim, Stephanie S. and Tumescheit, Charlotte and Mirdita, Milot and Lee, Jeonghun and Gilchrist, Cameron L. M. and S. Fast and accurate protein structure search with. Nature Biotechnology , volume =. 2024 , doi =

work page 2024
[12]

bioRxiv , year =

Wiatrak, Maciej and others , title =. bioRxiv , year =

work page
[13]

and West-Roberts, Jillian and Camargo, Antonio Pedro and Roux, Simon and Beracochea, Martin and Mirdita, Milot and Ovchinnikov, Sergey and Hwang, Yunha , title =

Cornman, Andre L. and West-Roberts, Jillian and Camargo, Antonio Pedro and Roux, Simon and Beracochea, Martin and Mirdita, Milot and Ovchinnikov, Sergey and Hwang, Yunha , title =. International Conference on Learning Representations , year =

work page
[14]

van den Oord, Aaron and Li, Yazhe and Vinyals, Oriol , title =

work page
[15]

International Conference on Machine Learning , pages =

Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey , title =. International Conference on Machine Learning , pages =

work page
[16]

International Conference on Learning Representations , year =

Loshchilov, Ilya and Hutter, Frank , title =. International Conference on Learning Representations , year =

work page