Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing
Pith reviewed 2026-05-15 03:18 UTC · model grok-4.3
The pith
Shodh-MoE uses Top-1 routing on compressed latents from a divergence-free autoencoder to let distinct physics regimes train separate experts, yielding low MSE and autonomous domain separation in mixed pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
Load-bearing premise
That a Top-1 soft-semantic router will reliably produce autonomous domain bifurcation and specialized parameter paths for incompatible PDE regimes without losing shared symmetries or requiring extensive hyperparameter tuning.
Original abstract
Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Shodh-MoE, a sparse-activated latent transformer for multi-physics SciML that combines a physics-informed autoencoder (producing 16^3 divergence-free latents) with a Top-1 soft-semantic router. It reports that in a 20,000-step pretraining run on mixed 3D open-channel and porous-media tensors, the router produces autonomous domain bifurcation (open-channel tokens to Expert 0, porous-media to Expert 1), simultaneous convergence, latent validation MSEs of 2.46e-5 and 9.76e-6, decoded physical MSEs of 2.48e-6 and 1.76e-6, and post-hoc FP64 velocity divergence of ~2.8e-10, claiming this architecture eradicates negative transfer.
Significance. If the central claim is substantiated with proper controls, the result would demonstrate a practical mechanism for mitigating gradient conflict and plasticity loss when co-training incompatible PDE regimes inside a single neural operator, advancing scalable universal foundation models in scientific machine learning. The physics-informed autoencoder's exact mass conservation is a concrete strength that could transfer to other operator architectures.
major comments (1)
- [Abstract / Results] Abstract and experimental results: the claim that Top-1 sparse routing eradicates negative transfer is not supported because no dense latent transformer baseline is trained on the identical mixed 3D tensor dataset. Without this control, the observed bifurcation and joint convergence cannot be causally attributed to the router rather than domain dissimilarity alone; the magnitude of any mitigation remains unquantified.
minor comments (1)
- [Method] The precise formulation of the 'soft-semantic' router (gating function, temperature, and how shared experts are preserved) is not fully specified in the provided description, making reproducibility of the routing telemetry difficult.
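Since the paper leaves the 'soft-semantic' router under-specified, the following is only one plausible reading: a softmax gate with hard Top-1 dispatch, scaled by the gate probability (so the router remains differentiable), plus an always-on shared expert. All names and shapes here are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def top1_route(tokens, gate_w, experts, shared):
    """One plausible Top-1 gating scheme: softmax over expert logits,
    hard argmax dispatch, gate-probability scaling, plus a shared expert
    applied to every token (a guess at 'shared experts for universal
    symmetries'). Returns outputs and the per-token expert choice."""
    logits = tokens @ gate_w                        # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)           # softmax gate
    choice = probs.argmax(-1)                       # hard Top-1 dispatch
    out = np.empty_like(tokens)
    for e, f in enumerate(experts):
        mask = choice == e
        # gate-probability scaling keeps a gradient path to the router
        out[mask] = probs[mask, e:e + 1] * f(tokens[mask])
    return out + shared(tokens), choice

d, n_experts = 8, 2
tokens = rng.normal(size=(16, d))
gate_w = rng.normal(size=(d, n_experts))
# Toy linear experts; each lambda binds its own weight matrix via default arg.
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
shared = lambda x: 0.1 * x                          # stand-in shared expert
out, choice = top1_route(tokens, gate_w, experts, shared)
```

The `choice` vector is exactly the routing telemetry the abstract refers to: perfect domain bifurcation would mean `choice` is constant within each physics domain.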
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The major comment raises a valid point about experimental controls, which we address below. We will revise the manuscript to strengthen the claims.
Point-by-point responses
Referee: [Abstract / Results] Abstract and experimental results: the claim that Top-1 sparse routing eradicates negative transfer is not supported because no dense latent transformer baseline is trained on the identical mixed 3D tensor dataset. Without this control, the observed bifurcation and joint convergence cannot be causally attributed to the router rather than domain dissimilarity alone; the magnitude of any mitigation remains unquantified.
Authors: We agree that the current experiments lack a direct dense latent transformer baseline trained on the identical mixed 3D tensor dataset, which prevents a fully quantified causal attribution of negative transfer mitigation to the Top-1 router. The reported autonomous domain bifurcation and simultaneous convergence across regimes are consistent with reduced interference due to sparse routing, but without the baseline the magnitude of any benefit cannot be measured. In the revised manuscript we will add a controlled comparison: a dense latent transformer (identical architecture and hyperparameters except for the router) trained on the same mixed dataset, reporting side-by-side metrics on convergence stability, validation MSE, and any signs of gradient conflict or plasticity loss.
Revision: yes
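The gradient-conflict diagnostic the rebuttal promises could be as simple as the cosine similarity between per-domain loss gradients: negative similarity means the two regimes push the shared parameters in opposing directions. This is a generic sketch of that diagnostic, not the authors' protocol; the toy gradients below are invented for illustration.

```python
import numpy as np

def gradient_conflict(grad_a, grad_b):
    """Cosine similarity between two flattened per-domain gradients.
    Values near -1 indicate strongly conflicting update directions,
    one standard symptom of negative transfer in dense co-training."""
    ga, gb = np.asarray(grad_a).ravel(), np.asarray(grad_b).ravel()
    return float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb) + 1e-12))

# Toy illustration: opposing gradients conflict, aligned gradients do not.
g_open_channel = np.array([1.0, -2.0, 0.5])
g_porous = -g_open_channel
conflict = gradient_conflict(g_open_channel, g_porous)       # near -1
agreement = gradient_conflict(g_open_channel, g_open_channel)  # near +1
```

For the promised dense-vs-sparse comparison, tracking this statistic over training for the dense baseline and for each expert's parameters in Shodh-MoE would directly quantify how much interference the router removes.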
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: velocity fields must lie on divergence-free manifolds to guarantee exact mass conservation.
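The paper's exact Helmholtz-style parameterization is not given, but the axiom can be illustrated with the standard construction: decode a vector potential A and output u = curl(A). With matching central differences on a periodic grid, the discrete divergence of the discrete curl vanishes to rounding error, which is why divergence on the order of 1e-10 is achievable by construction rather than by penalty. A minimal sketch, assuming periodic boundaries:

```python
import numpy as np

def _cdiff(f, ax, h=1.0):
    """Central difference along axis ax on a periodic grid."""
    return (np.roll(f, -1, axis=ax) - np.roll(f, 1, axis=ax)) / (2 * h)

def curl(A, h=1.0):
    """u = curl(A): divergence-free by construction, because the
    discrete difference operators along distinct axes commute."""
    Ax, Ay, Az = A
    return np.stack([_cdiff(Az, 1, h) - _cdiff(Ay, 2, h),
                     _cdiff(Ax, 2, h) - _cdiff(Az, 0, h),
                     _cdiff(Ay, 0, h) - _cdiff(Ax, 1, h)])

def divergence(u, h=1.0):
    return _cdiff(u[0], 0, h) + _cdiff(u[1], 1, h) + _cdiff(u[2], 2, h)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 16, 16, 16))   # arbitrary decoded vector potential
u = curl(A)
max_div = np.abs(divergence(u)).max()  # machine-epsilon scale, not trained to zero
```

The key design point: any decoder output A whatsoever yields a conserving velocity field, so mass conservation holds independently of training quality.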
invented entities (1)
- Shodh-MoE sparse-activated latent transformer: no independent evidence.