{"total":16,"items":[{"citing_arxiv_id":"2606.29256","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generalization Analysis of Transformers in Distribution Regression","primary_cat":"stat.ML","submitted_at":"2026-06-28T07:54:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces a Transformer framework for distribution regression with a new attention operator enabling lossless compression, proves stronger functional learning than CNNs/FCNs, and provides a generalization bound with applications to LLM fine-tuning and scaling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01822","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hierarchically Decoupled Mixture-of-Experts for Robust Traffic Sign Recognition in Complex Driving Scenarios","primary_cat":"cs.CV","submitted_at":"2026-06-01T07:39:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A hierarchically decoupled heterogeneous MoE framework with YOLO experts and lightweight gating network reports 76.8% mAP50-95 on a composite traffic sign dataset, a 2.3% gain over baseline with 39.4% lower compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25952","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-25T15:28:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VEN-VL introduces an enrich-then-compact visual ensemble MoE approach claiming superior performance-efficiency trade-off in 
multimodal tasks using fewer condensed visual tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20908","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SynCB adds a dynamic routing module and joint training to a hybrid concept-plus-neural architecture, reporting up to 3.9 pp higher accuracy than a full neural baseline and up to 6.43 pp better intervention responsiveness than prior hybrids across five datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20891","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:31:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HDMoE uses hierarchical MoE and RFR modules to address redundant information and fine-grained intra/inter-modality relationships in multimodal cancer survival prediction, with positive results on private liver cancer and TCGA datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17743","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time 
Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-18T01:52:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoASE++ combines activation sparsity experts with domain-adaptive on-policy distillation to achieve state-of-the-art continual test-time adaptation on image classification and segmentation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14200","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:32:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13761","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-15T11:47:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Patch-wise sparse MoE layers in CNNs for semantic segmentation yield architecture-dependent gains up to 3.9 mIoU on Cityscapes and BDD100K with low overhead, but show strong design 
sensitivity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05564","ref_index":139,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TabICL: A Tabular Foundation Model for In-Context Learning on Large Data","primary_cat":"cs.LG","submitted_at":"2025-02-08T13:25:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.14660","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mean-field limit from general mixtures of experts to quantum neural networks","primary_cat":"math-ph","submitted_at":"2025-01-24T17:29:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proves mean-field limit and propagation of chaos for gradient-flow trained mixtures of experts with explicit rate depending only on expert count, applied to quantum neural networks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.12031","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RouterBench: A Benchmark for Multi-LLM Routing System","primary_cat":"cs.LG","submitted_at":"2024-03-18T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.15947","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoE-LLaVA: Mixture of Experts for Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2024-01-29T08:13:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.01335","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models","primary_cat":"cs.LG","submitted_at":"2024-01-02T18:53:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2202.08906","ref_index":145,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models","primary_cat":"cs.CL","submitted_at":"2022-02-17T21:39:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, 
summarization, and QA tasks at the compute cost of a 32B dense model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2101.03961","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","primary_cat":"cs.LG","submitted_at":"2021-01-11T16:11:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1701.06538","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","primary_cat":"cs.LG","submitted_at":"2017-01-23T18:10:00+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Sparse Gating (alternate formulation): To obtain a sparse gating vector, we multiply Gσ(x) component-wise with a sparse mask M(Gσ(x)) and normalize the output. 
The mask itself is a function of Gσ(x) and specifies which experts are assigned to each input example: G(x)_i = [Gσ(x)_i M(Gσ(x))_i] / [Σ_{j=1}^{n} Gσ(x)_j M(Gσ(x))_j] (16) Top-K Mask: To implement top-k gating in this formulation, we would let M(v) = TopK(v, k), where TopK(v, k)_i = 1 if v_i is in the top k elements of v, and 0 otherwise. (17) Batchwise Mask: To force each expert to receive the exact same number of examples, we introduce an alternative mask function, M_batchwise(X, m), which operates over batches of input vectors."}],"limit":50,"offset":0}