pith. machine review for the scientific record.

arxiv: 2605.08842 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: no theorem link

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords Mixture-of-Experts · Expert Knowledge Reuse · Language Model Training · Tensor Decomposition · MoE LLMs · Generalizable Knowledge · Knowledge Transfer

The pith

Reusing cross-domain expert knowledge from MoE LLMs improves training outcomes for language models of different scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In Mixture-of-Experts language models, analyzing expert activation patterns reveals that a subset of experts activates consistently across diverse knowledge domains. These common experts appear to encode cross-domain knowledge closely tied to the model's ability to generalize. The XPERT framework identifies these experts using inference alone, refines their representations with tensor decomposition, and adapts the extracted knowledge for reuse in training other models. Experiments show stronger results on language understanding and dialogue generation tasks, along with quicker convergence during training. The approach treats pre-trained MoE models as sources of reusable, structured knowledge rather than just final products.

Core claim

XPERT identifies a subset of consistently activated experts in pre-trained MoE LLMs that encode cross-domain generalizable knowledge. It refines their representations through tensor decomposition and adapts the extracted knowledge for reuse in the training of language models across scales. This results in models that achieve stronger performance and faster convergence on language understanding and dialogue generation benchmarks compared to strong baselines.

What carries the argument

XPERT, the framework that extracts cross-domain experts from MoE LLMs via inference-only analysis, refines them using tensor decomposition, and adapts them for reuse in downstream model training.
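
To make the inference-only step concrete, here is a minimal sketch, assuming router top-k expert ids have already been logged per domain. The min-over-domains frequency criterion and the threshold are illustrative guesses; the abstract does not pin down the exact consistency metric.

```python
# Minimal sketch of the inference-only identification step, assuming
# router top-k expert ids have already been logged per domain. The
# min-over-domains criterion and threshold are illustrative, not the
# paper's stated metric.
import numpy as np

def activation_frequencies(routing_logs, n_experts):
    """routing_logs: domain -> (n_tokens, top_k) array of expert ids.
    Returns domain -> expected activations per token for each expert."""
    return {
        domain: np.bincount(ids.ravel(), minlength=n_experts) / ids.shape[0]
        for domain, ids in routing_logs.items()
    }

def select_common_experts(freqs, threshold):
    """Keep experts whose activation frequency clears the threshold in
    every domain (a hypothetical consistency criterion)."""
    stacked = np.stack(list(freqs.values()))  # (n_domains, n_experts)
    return np.flatnonzero(stacked.min(axis=0) >= threshold)

# Toy usage: 64 experts, top-8 routing, three hypothetical domains.
rng = np.random.default_rng(0)
logs = {d: rng.integers(0, 64, size=(10_000, 8))
        for d in ("wikipedia", "github", "arxiv")}
common = select_common_experts(activation_frequencies(logs, 64), 0.12)
print(common)
```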

If this is right

  • Reused expert knowledge leads to consistently stronger performance on language understanding benchmarks.
  • Models trained with XPERT converge faster than those using standard methods.
  • The benefits apply to language models at various scales.
  • MoE LLMs function as structured and reusable knowledge sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This method could enable more efficient knowledge transfer from large MoE models to smaller ones without full retraining.
  • Similar inference-based analysis might uncover reusable components in other types of modular models.
  • Refining expert knowledge this way may offer a path to reduce computational costs in developing new language models.

Load-bearing premise

The subset of experts identified as consistently activated truly captures cross-domain generalizable knowledge that can be refined by tensor decomposition and transferred to improve training without introducing biases or losing important details.

What would settle it

If language models trained using the expert knowledge extracted by XPERT do not show improved performance or faster convergence on language understanding and dialogue generation benchmarks relative to baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.08842 by Boyu Shi, Chang Liu, Xin Geng, Xu Yang.

Figure 1. Expert activation frequencies of layer 15 in OLMoE-7B across different domains. Additional examples are provided in Appendix A.
Figure 2. The framework of XPERT. Z is the tensor formed by stacking the selected experts' parameter matrices, and (G, U) represents the refined knowledge corresponding to a specific parameter matrix. After parameter-scale adaptation in Step 3, the extracted expert knowledge is used to initialize the FFN layers of language models with different scales.
Figure 3. Comparison of fine-tuning performance between 16-layer XPERT-OLMoE and Scratch under varying pre-training data budgets (2B, 5B, and 10B tokens).
Figure 4. Comparison of Rouge-L and loss curves between Scratch and XPERT-initialized models on the DollyEval dataset.
Figure 5. Effect of domain diversity in expert selection on downstream SFT performance of XPERT-initialized models. Mixed-Domain uses a diverse corpus spanning multiple knowledge domains, including (but not limited to) Wikipedia, GitHub, and arXiv.
Figure 6. Expert activation frequencies of layer 0 (left) and layer 7 (right) in OLMoE-7B across different domains.
Figure 7. Comparison of fine-tuning performance between 16-layer GeneLLM-OLMoE and Scratch under varying pre-training token budgets.
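
The Figure 2 and Figure 7 captions describe the consolidation step: stack the selected experts' parameter matrices X_i into a third-order tensor Z with Z(:, :, i) = X_i and apply Tucker decomposition to obtain the core-plus-factors pair (G, U). A minimal sketch of that step, assuming tensorly's Tucker implementation; the shapes and ranks are toy values, and the exact Step 3 adaptation is not specified by the captions.

```python
# Sketch of the consolidation step from the Figure 2 / Figure 7 captions:
# stack selected expert matrices into a third-order tensor and apply
# Tucker decomposition. Shapes and ranks are toy values, not the paper's.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

d1, d2, n = 64, 32, 8                       # toy expert shape and count
rng = np.random.default_rng(0)
experts = [rng.standard_normal((d1, d2)) for _ in range(n)]

Z = tl.tensor(np.stack(experts, axis=-1))   # Z[:, :, i] = expert i

# (G, U) in the Figure 2 caption: core tensor G and per-mode factor
# matrices U = (U1, U2, U3) holding the refined knowledge.
G, (U1, U2, U3) = tucker(Z, rank=[16, 8, 4])

# Reconstruct the compact surrogate; Step 3 (parameter-scale adaptation)
# would map these factors onto the target model's FFN width, a detail
# the captions leave open.
Z_hat = tl.tucker_to_tensor((G, [U1, U2, U3]))
print(Z.shape, Z_hat.shape, G.shape)
```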
original abstract

Mixture-of-Experts (MoE) language models organize knowledge into explicitly routed expert modules, making expert-level representations traceable and analyzable. By analyzing expert activation patterns in MoE large language models (LLMs), we find that a subset of experts is consistently activated across diverse knowledge domains. These common experts encode cross-domain, generalizable knowledge that is closely related to model generalization, naturally raising the question of how such identifiable expert knowledge can be practically reused. Motivated by this observation, we propose XPERT, a framework that extracts, consolidates, and reuses expert knowledge from pre-trained MoE LLMs to support more effective training of language models across different model scales. XPERT identifies cross-domain experts via inference-only analysis, refines their representations through tensor decomposition, and adapts the extracted knowledge to reuse in downstream models. Experiments on language understanding and dialogue generation benchmarks show that models benefiting from reused expert knowledge achieve consistently stronger performance and faster convergence compared to strong baselines. These results highlight MoE LLMs as structured and reusable knowledge sources, and demonstrate the value of expert-level knowledge reuse for improving model training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes XPERT, a framework that extracts a subset of consistently activated experts from pre-trained MoE LLMs via inference-only analysis across diverse domains, refines their representations through tensor decomposition, and adapts the resulting factors for reuse during training of language models at different scales. Experiments on language understanding and dialogue generation benchmarks are reported to show consistent performance gains and faster convergence relative to strong baselines.

Significance. If the results are robust, the work is significant for demonstrating that MoE architectures can function as structured, reusable knowledge sources rather than black-box models. The inference-only identification of cross-domain experts combined with tensor decomposition offers a concrete, scalable approach to knowledge consolidation that could improve training efficiency. Credit is given for the empirical demonstration of gains across multiple benchmarks and for focusing on practical transfer rather than purely theoretical analysis.

major comments (2)
  1. [§3 (XPERT Framework)] The central premise that high activation frequency identifies causally generalizable cross-domain knowledge (rather than routing artifacts, initialization effects, or co-activation patterns) is load-bearing for the transfer claim, yet the manuscript provides no causal interventions, ablations against random or low-frequency expert subsets, or controls that isolate the contribution of the selected experts from the added parameters and adaptation schedule. This directly affects whether the reported gains can be attributed to the extracted knowledge.
  2. [§4 (Experiments)] The abstract and results claim 'consistently stronger performance and faster convergence', but the description does not specify the exact baselines, number of random seeds, statistical tests, or ablations on the tensor decomposition step (e.g., full experts vs. decomposed factors). Without these, it is impossible to verify that the improvements exceed what would be obtained by simply increasing model capacity or altering the training schedule.
minor comments (2)
  1. [Abstract] The abstract states that the common experts 'encode cross-domain, generalizable knowledge' but does not define the precise activation threshold or consistency metric used to select them; this criterion should be formalized in §3 with an equation (one candidate formalization is sketched just after this list).
  2. [§4 (Experiments)] Figure captions and tables in the experimental section would benefit from explicit reporting of standard deviations or confidence intervals alongside mean performance numbers to support the 'consistent' claim.
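
For the formalization requested in minor comment 1, one candidate criterion, offered as an assumption rather than the authors' definition: compute each expert's per-domain activation frequency under the router's top-k selection, and keep the experts that clear a threshold τ in every domain.

```latex
% A candidate formalization (an assumption, not the paper's equation).
% T_d: token set of domain d; g(x_t): router logits; \tau: threshold.
\[
  f_e^{(d)} = \frac{1}{|T_d|} \sum_{t \in T_d}
    \mathbf{1}\!\left[ e \in \operatorname{TopK}\big(g(x_t)\big) \right],
  \qquad
  \mathcal{E}_{\mathrm{common}} =
    \left\{ e \;\middle|\; \min_{d \in \mathcal{D}} f_e^{(d)} \ge \tau \right\}.
\]
```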

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to improve the manuscript. We address the two major comments point by point below, agreeing where revisions are needed to strengthen the claims and experimental rigor.

point-by-point responses
  1. Referee: [§3 (XPERT Framework)] The central premise that high activation frequency identifies causally generalizable cross-domain knowledge (rather than routing artifacts, initialization effects, or co-activation patterns) is load-bearing for the transfer claim, yet the manuscript provides no causal interventions, ablations against random or low-frequency expert subsets, or controls that isolate the contribution of the selected experts from the added parameters and adaptation schedule. This directly affects whether the reported gains can be attributed to the extracted knowledge.

    Authors: We agree that the manuscript would benefit from stronger evidence isolating the role of high-frequency experts. Our selection is grounded in the empirical observation that these experts show consistent activation across diverse domains in the source MoE model, which we link to generalization. To directly address the concern, the revised manuscript will include new ablations comparing the selected experts against (i) randomly chosen expert subsets and (ii) low-frequency experts, while matching parameter count and training schedule. These controls will help demonstrate that gains are attributable to the cross-domain knowledge rather than capacity or schedule effects alone. revision: yes

  2. Referee: [§4 (Experiments)] The abstract and results claim 'consistently stronger performance and faster convergence', but the description does not specify the exact baselines, number of random seeds, statistical tests, or ablations on the tensor decomposition step (e.g., full experts vs. decomposed factors). Without these, it is impossible to verify that the improvements exceed what would be obtained by simply increasing model capacity or altering the training schedule.

    Authors: We acknowledge that the experimental details require greater precision. In the revision we will: (1) explicitly list all baselines (standard fine-tuning, random expert initialization, and capacity-matched models without knowledge transfer); (2) report results over multiple random seeds with standard deviations; (3) include statistical significance tests (e.g., paired t-tests); and (4) add an ablation directly comparing reuse of full expert weights versus the tensor-decomposed factors. These changes will confirm that observed gains exceed those from capacity increases or schedule variations. revision: yes
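
As a sketch of the seed-level protocol promised in response 2: a paired t-test across random seeds, using scipy's ttest_rel. All scores below are placeholders, not numbers from the paper.

```python
# Hypothetical seed-level significance check for response 2: paired
# t-test of XPERT-initialized vs. Scratch benchmark scores across seeds.
# All numbers below are placeholders, not results from the paper.
import numpy as np
from scipy.stats import ttest_rel

xpert   = np.array([62.1, 61.8, 62.5, 61.9, 62.3])  # one score per seed
scratch = np.array([60.4, 60.9, 60.2, 60.7, 60.5])

t_stat, p_value = ttest_rel(xpert, scratch)          # paired across seeds
print(f"mean gain = {(xpert - scratch).mean():.2f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```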

Circularity Check

0 steps flagged

No circularity: empirical selection and transfer validated externally

full rationale

The paper identifies consistently activated experts via inference-only analysis on diverse domains, applies tensor decomposition to refine representations, and transfers the factors into target models. Performance gains are measured on separate language understanding and dialogue benchmarks against strong baselines. No equations or steps reduce the claimed improvement to a parameter fitted on the same data used for evaluation. The selection criterion (activation frequency) is an observable input, not defined in terms of the downstream gains. Self-citations, if present, are not load-bearing for the core transfer mechanism, which remains falsifiable by the reported experiments. This is a standard empirical pipeline with no self-definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that consistently activated experts in MoE models hold transferable general knowledge; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: A subset of experts is consistently activated across diverse knowledge domains and encodes cross-domain generalizable knowledge related to model generalization.
    Stated in the abstract as an observation from analyzing expert activation patterns.

pith-pipeline@v0.9.0 · 5494 in / 1258 out tokens · 46411 ms · 2026-05-12T02:37:23.226350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 11 internal anchors
