Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Andy Davis, Azalia Mirhoseini, Geoffrey Hinton, Krzysztof Maziarz, Noam Shazeer, Quoc Le · 2017 · cs.LG · arXiv 1701.06538

Canonical reference. 75% of citing Pith papers cite this work as background.

296 Pith papers citing it

abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
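The gating idea in the abstract (a trainable gate picks a sparse combination of experts per example) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes plain top-k softmax gating with no noise term or load-balancing loss, one-layer ReLU experts, and pure-Python lists. The names `top_k_gate`, `moe_forward`, `make_expert`, and `matvec` are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax over them.

    Returns {expert_index: weight}; every other expert implicitly gets weight 0.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def make_expert(W):
    """Hypothetical one-layer ReLU feed-forward expert (an assumption)."""
    return lambda x: [max(0.0, v) for v in matvec(W, x)]

def moe_forward(x, gate_W, experts, k=2):
    """Evaluate only the k selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(matvec(gate_W, x), k)
    out = [0.0] * len(x)  # assumes each expert maps dim -> dim
    for i, g in gates.items():
        y = experts[i](x)  # unselected experts are never evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

# Toy setup: 8 experts over 4-dimensional inputs, 2 active per example.
rng = random.Random(0)
dim, n_experts = 4, 8
gate_W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [
    make_expert([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)])
    for _ in range(n_experts)
]
x = [0.5, -1.0, 0.25, 2.0]
y, gates = moe_forward(x, gate_W, experts, k=2)
print(sorted(gates), sum(gates.values()))
```

Only `k` of the `n_experts` sub-networks run for a given input, so the parameter count grows with the number of experts while per-example compute stays roughly fixed. That is the capacity-versus-computation trade-off the abstract describes.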

citation-role summary

background: 39 · method: 10 · baseline: 2 · dataset: 1

citation-polarity summary

background: 39 · use method: 9 · baseline: 2 · support: 1 · use dataset: 1

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.

When Does Synthetic CT Transfer? A Label-Free Donor/Host Diagnostic for Medical Vision-Language Model Routing on Real Lung CT

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

cs.DC · 2026-06-10 · unverdicted · novelty 7.0

ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.

PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer

cs.DC · 2026-06-05 · unverdicted · novelty 7.0

PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.

Less is MoE: Trimming Experts in Domain-Specialist Language Models

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.

Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

cs.IR · 2026-06-03 · unverdicted · novelty 7.0

Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

cs.DC · 2026-05-30 · unverdicted · novelty 7.0

ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

math.DS · 2026-05-27 · unverdicted · novelty 7.0

A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

cs.IR · 2026-05-26 · unverdicted · novelty 7.0

L2Rec introduces dual-view personalized low-rank perturbations via DPMoE to let one LLM backbone produce complementary behavioral and semantic adaptations, with cross-view fusion, outperforming baselines on four datasets and in industrial A/B tests.

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.

Dynamic Chunking for Diffusion Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

MuteBench evaluates multimodal fusion robustness to modality missing and within-modality missing on 125000 samples from 9 clinical datasets, finding architecture family predicts tolerance better than parameter count.

Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction

cs.LG · 2026-05-13 · conditional · novelty 7.0

PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

SDG-MoE: Signed Debate Graph Mixture-of-Experts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

Approximation-Free Differentiable Oblique Decision Trees

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.

citing papers explorer

Showing 50 of 296 citing papers.

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

cs.CL · 2020-06-30 · unverdicted · none · ref 16 · internal anchor

GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.

Review Residuals: Update-Conditioned Residual Gating for Transformers

cs.LG · 2026-06-30 · unverdicted · none · ref 14 · internal anchor

Review Residuals add an update-conditioned gate to transformer residual connections, yielding depth-stable training and performance gains that emerge and grow with model size from 590M parameters upward.

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering

cs.CL · 2026-06-30 · unverdicted · none · ref 23 · internal anchor

BiRG-LoRA achieves 69.31% macro-average accuracy across CMB, CMExam, MedQA, and MedMCQA, outperforming MoELoRA by 0.89 points with 28.1% fewer trainable parameters under a matched Qwen3-8B protocol.

Mixture-of-Control: State-Aware Fine-Tuning for Transformer-based Models

cs.LG · 2026-06-30 · unverdicted · none · ref 10 · internal anchor

Mixture-of-Control adaptively combines local and global control states in transformer fine-tuning by treating per-block states as experts in a sparse MoE setup to improve cross-block communication while keeping memory and compute costs comparable to prior state-based methods.

Language-Assisted Super-Resolution from Real-World Low-Resolution Patches

cs.CV · 2026-06-30 · unverdicted · none · ref 46 · 2 links · internal anchor

LA-SR extracts real LR patches from depth-varying regions in single images and uses vision-language models with linguistic content and quality losses for unpaired super-resolution.

Orthogonal Representation Editing: Decoupling Semantic Entanglement in Batch Knowledge Editing of LLMs

cs.CL · 2026-06-21 · unverdicted · none · ref 17 · internal anchor

ORE decouples semantic entanglement in LLM hidden states via orthogonal edit vectors and a gated non-linear head, improving batch knowledge editing performance including cross-lingual cases.

Sakana Fugu Technical Report

cs.LG · 2026-06-19 · unverdicted · none · ref 213 · internal anchor

Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.

Factor-Aware Mixture-of-Experts with Pretrained Encoder for Combinatorial Generalization

cs.RO · 2026-06-19 · unverdicted · none · ref 21 · internal anchor

FAME combines a factor-aware MoE with frozen pretrained encoders via staged adapter training and joint fine-tuning, reporting 34% gains on Meta-World and 35% in real-world pick-and-place under environmental changes.

Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology

cs.CV · 2026-06-16 · unverdicted · none · ref 15 · internal anchor

MixTIME uses a learnable-router MoE to fuse three pathology foundation models for pixel- and slide-level prediction of 17 mIF protein markers from H&E images, improving spatial domain ID, survival prediction, and pathologist-validated report generation.

Profy: Interpretable Visualization of Expertise-Dependent Motor Skills Toward Supporting Piano Practice

cs.HC · 2026-06-09 · unverdicted · none · ref 60 · internal anchor

Profy uses take-level expert-amateur labels on 1083 piano recordings to produce time-aligned highlight scores that correlate with expert review points (r=0.61) on held-out amateur clips.

Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

cs.LG · 2026-06-08 · unverdicted · none · ref 15 · internal anchor

Loss-guided adaptive scale refinement on NaCl aqueous system reduces overall force MAE from 399.65 to 381.23 by discovering intermediate scales from initial anchors.

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

cs.LG · 2026-06-05 · unverdicted · none · ref 42 · internal anchor

SETA decomposes parameters into task-specific and shared sparse experts with adaptive anchoring and routing regularization to improve retention and backward transfer in LLM continual learning.

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

cs.RO · 2026-06-04 · unverdicted · none · ref 15 · internal anchor

HANDOFF is a distilled mixture-of-experts humanoid whole-body controller that follows a compact task-space interface, matches SOTA velocity tracking, provides large manipulation workspace on Unitree G1, and supports VLM-driven agentic planning with no task-specific data.

Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation

cs.LG · 2026-06-02 · unverdicted · none · ref 31 · internal anchor

Hyper-Connections models show stream collapse to a dominant stream with near-identity residual mixing after seeding; symmetry-breaking initialization mitigates dominance and raises performance.

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

cs.CV · 2026-06-01 · unverdicted · none · ref 11 · internal anchor

MASER trains modality adapters on a shared VLM and routes questions via a learned MLP to the best adapter, reaching 51.3% oracle agreement on Open3D-VQA while using one adapter call per question.

Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition

eess.SP · 2026-06-01 · unverdicted · none · ref 13 · internal anchor

Introduces a lightweight gravity-aware routing head that improves macro-F1 on static classes in compressed SensorLLM for human activity recognition on the MHealth dataset.

When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models

cs.CL · 2026-06-01 · unverdicted · none · ref 115 · internal anchor

HybridMoE with controlled hybridization and idiomatic property signals yields 5-6% gains in figurative language representation for multilingual vision-language models.

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

cs.AI · 2026-05-31 · unverdicted · none · ref 20 · internal anchor

DAG-MoE uses a lightweight module to learn DAG-based structural aggregation of selected experts, expanding combination space and enabling intra-layer multi-step reasoning compared to standard weighted-sum MoE.

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

cs.LG · 2026-05-30 · unverdicted · none · ref 46 · internal anchor

GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.

Can BEV Perception Gracefully Degrade under Sensor Failures?

cs.CV · 2026-05-29 · unverdicted · none · ref 36 · internal anchor

Grace-BEV enables graceful degradation in BEV perception under sensor failures by using a TrustGate Router for modality trustworthiness and FailSafe Fusion Block for dynamic integration, with modality dropout training, restoring performance to 34.7% mAP under LiDAR failure on nuScenes variants.

Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting

cs.LG · 2026-05-28 · unverdicted · none · ref 9 · internal anchor

GC-MoE improves MAE on four traffic forecasting benchmarks by routing nodes to combinations of frozen spatio-temporal GNN experts via a graph-conditioned lightweight router, training only ~17K parameters atop 1.5M frozen weights.

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

cs.LG · 2026-05-26 · unverdicted · none · ref 12 · internal anchor

Dense2MoE unifies pruning of attention modules with upcycling of MLPs into MoE experts to produce on-device LLMs that improve the latency-accuracy Pareto frontier.

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

cs.CV · 2026-05-25 · unverdicted · none · ref 13 · internal anchor

BioFact-MoE applies a biologically factorized MoE architecture to multimodal MRI-report data and reports improved 12-24 month survival AUCs plus selective embedding associations in an N=588 HCC cohort.

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

cs.AI · 2026-05-22 · unverdicted · none · ref 3 · internal anchor

Empirical routing analysis of Mixtral shows safety-relevant signals are distributed and depth-dependent rather than localized to fixed experts.

Asymmetric Scaling Laws from Sparse Features

stat.ML · 2026-05-22 · unverdicted · none · ref 35 · internal anchor

A sparse-activation model predicts double-descent loss with distinct under- and over-parameterized scaling exponents set by sparsity, plus a compute-optimal frontier favoring dataset growth.

EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

cs.CV · 2026-05-21 · unverdicted · none · ref 45 · internal anchor

EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.

Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution

q-fin.CP · 2026-05-19 · unverdicted · none · ref 56 · internal anchor

In a two-agent Almgren-Chriss liquidation game, deep RL agents given intra-episode history of prices and own actions achieve supra-competitive outcomes more frequently and persistently than agents without such memory.

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

cs.SD · 2026-05-18 · unverdicted · none · ref 33 · internal anchor

GST uses gradient-based affinity metrics to form dataset groups and applies progressive scheduling, achieving 30-40% faster convergence than uniform mixture training on 14 AudioQA datasets while matching or exceeding performance.

Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training

cs.DC · 2026-05-18 · unverdicted · none · ref 9 · internal anchor

Guard combines online performance monitoring and offline node qualification to detect stragglers and fail-slow behaviors in large-scale training, reporting up to 1.7x higher mean FLOPs utilization and reduction of step-time variance from 20% to 1%.

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

cs.RO · 2026-05-17 · unverdicted · none · ref 122 · internal anchor

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

NGM: A Plug-and-Play Training-Free Memory Module for LLMs

cs.AI · 2026-05-16 · unverdicted · none · ref 36 · internal anchor

NGM is a plug-and-play n-gram memory module that encodes n-grams from pretrained embeddings and gates their injection to improve LLM performance by 0.5-1.2 points on average across eight benchmarks.

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

cs.CL · 2026-05-13 · accept · none · ref 1 · internal anchor

At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.

Probing Routing-Conditional Calibration in Attention-Residual Transformers

cs.CV · 2026-05-11 · unverdicted · none · ref 7 · internal anchor

Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.

Improving Generalization by Permutation Routing Across Model Copies

cs.LG · 2026-05-10 · unverdicted · none · ref 9 · internal anchor

Replicating models and routing their local losses via permutations from a mixing kernel Q enables structured message sharing that improves generalization.

Sparse Layers are Critical to Scaling Looped Language Models

cs.LG · 2026-05-09 · unverdicted · none · ref 5 · 2 links · internal anchor

Looped-MoE models scale better than dense looped or standard transformers because routing changes across loops, and they enable stronger compute-quality trade-offs via early exits at loop boundaries.

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

cs.CL · 2026-05-08 · conditional · none · ref 19 · 2 links · internal anchor

EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

cs.LG · 2026-05-07 · unverdicted · none · ref 34 · internal anchor

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.

TIDE: Every Layer Knows the Token Beneath the Context

cs.CL · 2026-05-07 · unverdicted · none · ref 109 · internal anchor

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

Complexity Horizons of Compressed Models in Analog Circuit Analysis

cs.AI · 2026-05-04 · unverdicted · none · ref 24 · internal anchor

Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.

ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

cs.CL · 2026-05-04 · unverdicted · none · ref 66 · internal anchor

ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.

DeepPropNet: an operator learning-based predictor for thermal plasma properties

physics.plasm-ph · 2026-04-30 · unverdicted · none · ref 20 · internal anchor

DeepPropNet predicts thermal plasma properties with relative L2 errors of 10^{-3} to 10^{-2} for SF6-N2 and C4F7N-CO2-O2 mixtures using single-property and mixture-of-experts architectures trained on high-fidelity data.

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

cs.DC · 2026-04-29 · unverdicted · none · ref 22 · internal anchor

FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.

Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping

cs.CV · 2026-04-21 · unverdicted · none · ref 37 · internal anchor

SSDM decouples global geospatial embeddings into structural modulation and semantic injection pathways to improve accuracy and consistency in high-resolution remote sensing land cover mapping.

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

cs.LG · 2026-04-21 · unverdicted · none · ref 18 · internal anchor

Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.

STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation

cs.IR · 2026-04-21 · unverdicted · none · ref 86 · internal anchor

STK-Adapter adds Spatial-Temporal MoE, Event-Aware MoE, and Cross-Modality Alignment MoE to integrate evolving TKG graphs and event chains into LLMs, reducing information loss and improving extrapolation performance over prior methods.

Domain-Specialized Object Detection via Model-Level Mixtures of Experts

cs.CV · 2026-04-20 · unverdicted · none · ref 33 · internal anchor

Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.

WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms

cs.CV · 2026-04-16 · unverdicted · none · ref 40 · internal anchor

WILD-SAM is a fine-tuned SAM variant using phase-aware MoE adapters and wavelet subband enhancement that achieves state-of-the-art landslide detection on wrapped InSAR data.

SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation

cs.LG · 2026-04-14 · unverdicted · none · ref 48 · internal anchor

SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.

Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

cs.SE · 2026-03-27 · unverdicted · none · ref 54 · internal anchor

Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.

Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment

cs.NI · 2026-03-26 · unverdicted · none · ref 3 · internal anchor

Qwen-2.5-3B achieves 0.793 accuracy and 988 ms median latency on six-class task routing but misses the pre-registered viability bar of 0.85 accuracy and 2000 ms P95 latency.
