arxiv: 2605.25676 · v1 · pith:JVQU7ZU3new · submitted 2026-05-25 · 💻 cs.CL

Llamion Technical Report

Pith reviewed 2026-06-29 22:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords model architecture conversion · knowledge distillation · parameter mapping · Llama family · Orion model · language model adaptation

The pith

Llamion converts Orion-14B to Llama architecture via KEPT and recovers its benchmark behavior with 123M tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Orion-14B can be transformed into the Llama-family architecture using the KEPT recipe, which maps parameters directly where possible, initializes normalization layers with a training-free method, and applies distillation to align outputs. This process restores the original model's scores on H6, MT-Bench, and KoMMLU after only four days of training on a single A100 with roughly 123 million tokens. The resulting Llamion-Base model reaches 66.87 percent on KoMMLU, beating the prior leaderboard leader by more than seven points. Capabilities never seen during the transfer step, such as Python programming and 200K-token context, remain intact after the change.

Core claim

KEPT performs the Orion-to-Llama conversion by applying Normal Parameter Mapping to matching modules, Optimized Parameter Mapping to convert LayerNorm to RMSNorm, and Cross-architecture Knowledge Distillation from a frozen equal-size teacher, allowing the converted 14B model to match source behavior on standard benchmarks while preserving out-of-distribution skills.
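Normal Parameter Mapping, as described here, carries parameters over unchanged wherever source and target modules match. A minimal Python sketch of that idea follows, using NumPy arrays in place of real checkpoints. The key names and the `RENAMES` table are hypothetical illustrations and are not taken from the paper or from either model's actual state dict.

```python
import numpy as np

# Hypothetical source-to-target key renames; real checkpoints will differ.
RENAMES = {
    "layers.0.attn.q.weight": "model.layers.0.self_attn.q_proj.weight",
    "layers.0.mlp.up.weight": "model.layers.0.mlp.up_proj.weight",
}

def normal_parameter_mapping(src_state, renames):
    """Copy tensors whose modules are unchanged, renaming keys only."""
    tgt_state = {}
    unmapped = []
    for src_key, tensor in src_state.items():
        if src_key in renames:
            tgt_state[renames[src_key]] = tensor.copy()
        else:
            # Left for another treatment, e.g. normalization layers.
            unmapped.append(src_key)
    return tgt_state, unmapped

rng = np.random.default_rng(0)
src = {
    "layers.0.attn.q.weight": rng.normal(size=(4, 4)),
    "layers.0.mlp.up.weight": rng.normal(size=(8, 4)),
    "layers.0.norm.bias": np.zeros(4),
}
tgt, unmapped = normal_parameter_mapping(src, RENAMES)
```

Returning the unmapped keys makes it explicit which parameters, such as normalization weights, still need a separate conversion step.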

What carries the argument

The KEPT (Efficient Knowledge Preservation for Transformation) recipe carries the argument. It combines Normal Parameter Mapping, Optimized Parameter Mapping for LayerNorm-to-RMSNorm initialization, and Cross-architecture Knowledge Distillation.
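The LayerNorm-to-RMSNorm step rests on a simple observation: when an activation vector has zero mean, its variance equals its mean square, so LayerNorm's scaling coincides with RMSNorm's. The Python sketch below checks this numerically with NumPy. It reuses the LayerNorm gain as the RMSNorm gain and omits the LayerNorm bias, both of which are assumptions made for illustration. The paper's actual OPM formula is not visible in this summary.

```python
import numpy as np

EPS = 1e-6

def layer_norm(x, gain):
    # LayerNorm without bias: center, divide by standard deviation, scale.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + EPS)

def rms_norm(x, gain):
    # RMSNorm: divide by root mean square, scale. No centering.
    ms = (x ** 2).mean(axis=-1, keepdims=True)
    return gain * x / np.sqrt(ms + EPS)

def max_gap(x, gain):
    # Largest elementwise difference between the two normalizations.
    return float(np.abs(layer_norm(x, gain) - rms_norm(x, gain)).max())

rng = np.random.default_rng(0)
gain = rng.uniform(0.5, 1.5, size=64)
x = rng.normal(size=(16, 64))
centered = x - x.mean(axis=-1, keepdims=True)  # rows with zero mean
shifted = centered + 0.5                        # rows with mean 0.5

gap_centered = max_gap(centered, gain)  # the two norms agree
gap_shifted = max_gap(shifted, gain)    # the two norms diverge
print(gap_centered, gap_shifted)
```

The gap on the shifted input shows why the zero-mean condition matters: the further activations drift from zero mean, the less a direct gain copy reproduces LayerNorm.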

If this is right

  • Llamion-Base exceeds the next-best Open Ko LLM Leaderboard entry by more than 7 absolute points on KoMMLU.
  • Capabilities entirely absent from the transfer corpus survive the architectural transition.
  • The three released checkpoints load in Hugging Face Transformers with trust_remote_code=False.
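The last point means the checkpoints should load through the architecture classes built into Transformers, with no custom modelling code fetched from the model repository. A hedged Python sketch follows. `REPO_ID` is a placeholder, since this page does not give the actual repository names, and the function is only defined here, not called, because calling it would download a 14B-parameter checkpoint.

```python
# Placeholder: substitute the real Llamion repository name.
REPO_ID = "your-org/llamion-14b-base"

# trust_remote_code=False forbids executing Python shipped in the model repo,
# so loading succeeds only if the architecture is built into Transformers.
LOAD_KWARGS = {"trust_remote_code": False}

def load_llamion(repo_id=REPO_ID):
    # Imported lazily so this sketch can be imported without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, **LOAD_KWARGS)
    model = AutoModelForCausalLM.from_pretrained(repo_id, **LOAD_KWARGS)
    return tokenizer, model
```

`False` is already the default for `trust_remote_code`; passing it explicitly documents the intent and guards against a wrapper changing the default.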

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mapping-plus-distillation steps could be tested on other source-target architecture pairs to check whether the compute savings generalize.
  • If OPM initialization proves robust across different weight-decay strengths, it may reduce the need for full retraining when swapping normalization layers.
  • Long-context handling that survives the conversion suggests the method preserves positional or attention patterns beyond what the small distillation set explicitly teaches.

Load-bearing premise

That Optimized Parameter Mapping is optimal under the near-zero-mean activation regime induced by weight decay and that cross-architecture distillation will align outputs on any reasonable input distribution without degrading unseen capabilities.
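The link between the near-zero-mean regime and a training-free initialization can be made concrete with the standard definitions of the two normalizations. The derivation below is a sketch under those textbook definitions, not the paper's own proof. The symbols $\gamma$, $\beta$, $g$, $\mu$, $\sigma^2$, $\epsilon$ and $d$ are introduced here for illustration.

```latex
\begin{aligned}
\mathrm{LN}(x) &= \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
\qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i,
\quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \\
\mathrm{RMSNorm}(x) &= g \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}},
\qquad \frac{1}{d}\sum_{i=1}^{d} x_i^2 = \sigma^2 + \mu^2 .
\end{aligned}
```

With $\mu = 0$ the mean square equals $\sigma^2$, so $\mathrm{LN}(x) - \beta = \mathrm{RMSNorm}(x)$ when $g = \gamma$. For small non-zero $\mu$ the two differ by terms that vanish as $\mu \to 0$. The bias $\beta$ has no counterpart in RMSNorm and would need separate handling.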

What would settle it

The claim would be refuted if the converted model failed to recover Orion's scores on H6, MT-Bench, and KoMMLU after the KEPT procedure, or if Python programming and 200K-token context handling degraded despite never appearing in the transfer corpus.

Figures

Figures reproduced from arXiv: 2605.25676 by Heuiseok Lim, Hwanseok Jang, Hyeonseok Moon, Hyungjin Lee, Jeseung Lee, Juhyoung Park, Kisu Yang, Taewoo Lee, Yoonna Jang.

Figure 1: XKD. The frozen M_src (Orion) acts as an equal-size teacher; M_tgt (Llamion), initialized by NPM/OPM, is trained to reproduce the teacher's hidden states (top) or logits (bottom).
Figure 2: Average H6 score versus the number of XKD …
Figure 3: KoMMLU few-shot accuracy versus model size on the Open Ko LLM Leaderboard. Llamion-Base leads the sub-15B band by >7 absolute points over the next-best entry at submission time.
… sensitive to perturbations of the next-token distribution. This is why the UT-vs-XKD gap appears modest on H6 but is dramatic on MT-Bench, and is why future architectural-transformation work should report generative metrics alon…
Figure 4: Perplexity versus input length on English Wikipedia (left) and Korean Wikipedia (right). Llamion …
Figure 5: Fine-tuning performance on four Korean generation tasks (GPT-4-graded G-Eval). Fine-tuned Llamion …
Figure 6: Representative interactions with Llamion-14B-Chat: coherent multi-turn dialogue and competent Python …
Original abstract

We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by >7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.
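Figure 1 describes two XKD targets: the teacher's hidden states or its logits. The Python sketch below shows one common way to write each objective with NumPy, a mean-squared error on hidden states and a KL divergence on softmax outputs. Both loss forms are assumptions for illustration, since this summary does not state which losses, temperatures, or weights the paper uses.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def hidden_state_loss(student_h, teacher_h):
    # Mean-squared error between student and teacher hidden states.
    return float(((student_h - teacher_h) ** 2).mean())

def logit_kl_loss(student_logits, teacher_logits):
    # KL(teacher || student), averaged over token positions.
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(2, 5, 100))  # (batch, tokens, vocab)
student_logits = teacher_logits + 0.1 * rng.normal(size=(2, 5, 100))

loss_same = logit_kl_loss(teacher_logits, teacher_logits)  # identical outputs
loss_diff = logit_kl_loss(student_logits, teacher_logits)  # perturbed student
```

In an actual training loop the teacher would be run without gradient tracking, matching the "frozen" description, and only the student's parameters would be updated.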

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Llamion, a 14B-parameter model family obtained by converting Orion-14B to Llama-family architecture via the KEPT recipe. KEPT combines Normal Parameter Mapping (NPM) for unchanged modules, Optimized Parameter Mapping (OPM) claimed as a training-free optimal LayerNorm-to-RMSNorm initialization under a near-zero-mean activation regime from weight decay, and Cross-architecture Knowledge Distillation (XKD) that aligns outputs with a frozen teacher on any reasonable distribution. The work reports recovering Orion performance on H6, MT-Bench, and KoMMLU using only ~123M tokens on one A100 over four days, with Llamion-Base reaching 66.87% on KoMMLU (exceeding the prior Open Ko LLM Leaderboard best by >7 points at submission), and claims that capabilities absent from the transfer corpus (Python programming, 200k context) survive intact. Three checkpoints are released that load without trust_remote_code in Hugging Face Transformers.

Significance. If the central claims hold after verification, the result would demonstrate a low-compute architecture-conversion technique that preserves benchmark behavior and unseen capabilities, which could be useful for adapting existing models to standardized formats. The open release of compatible checkpoints and the reported efficiency (123M tokens, single A100) are concrete strengths. The leaderboard margin on KoMMLU would be notable if supported by error bars and ablations.

major comments (3)
  1. [Abstract / OPM description] Abstract and OPM section: the claim that OPM is proved optimal under the near-zero-mean activation regime induced by weight decay is load-bearing for the training-free conversion assertion, yet the text provides neither the derivation/equations establishing optimality nor empirical confirmation that Orion-14B activations satisfy the near-zero-mean condition.
  2. [XKD description / capability preservation results] XKD and results sections: the assertion that XKD aligns outputs on any reasonable input distribution without degrading capabilities absent from the 123M-token corpus (Python, 200k context) is central to the 'intact survival' claim, but no control experiments, ablations, or before/after comparisons on those held-out tasks are described.
  3. [KoMMLU results] Benchmark results: the reported 66.87% KoMMLU score and >7-point lead lack error bars, multiple runs, or statistical tests; combined with the absence of ablation results noted in the abstract, this weakens confidence in the leaderboard claim and the overall performance-recovery narrative.
minor comments (2)
  1. [Abstract] The abstract states performance recovery on H6 and MT-Bench but does not specify the exact metrics or baselines used for 'recovers Orion's behaviour'.
  2. [Experimental setup] No details are given on the exact composition of the 123M-token corpus or the training hyperparameters for XKD.
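To make the request for error bars concrete: a single accuracy figure measured on n independent questions carries a binomial standard error of sqrt(p(1-p)/n). The Python sketch below computes it for the reported 66.87%. The question count `n_questions` is a hypothetical placeholder, since this page does not state the size of the KoMMLU evaluation set, and the formula ignores variance from few-shot prompt selection and seeds.

```python
import math

def binomial_standard_error(accuracy, n_questions):
    # Standard error of a proportion estimated from n independent trials.
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

accuracy = 0.6687      # reported KoMMLU accuracy
n_questions = 10_000   # hypothetical; replace with the real test-set size

se = binomial_standard_error(accuracy, n_questions)
half_width_95 = 1.96 * se  # normal-approximation 95% interval half-width
print(f"{100 * accuracy:.2f}% +/- {100 * half_width_95:.2f} points")
```

An interval of this kind addresses only sampling noise in the test set. Run-to-run variation would still require repeated evaluations.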

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our technical report. We address each major comment point-by-point below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / OPM description] Abstract and OPM section: the claim that OPM is proved optimal under the near-zero-mean activation regime induced by weight decay is load-bearing for the training-free conversion assertion, yet the text provides neither the derivation/equations establishing optimality nor empirical confirmation that Orion-14B activations satisfy the near-zero-mean condition.

    Authors: We acknowledge that the submitted manuscript states the optimality claim for OPM but does not include the explicit derivation or equations. The proof proceeds from the near-zero-mean activation assumption (induced by weight decay) to show that the RMSNorm initialization derived from LayerNorm parameters minimizes the output discrepancy in closed form. We will add a dedicated subsection in the revised OPM description containing the full mathematical derivation, supporting equations, and an empirical check of activation statistics on Orion-14B samples to confirm the regime holds. revision: yes

  2. Referee: [XKD description / capability preservation results] XKD and results sections: the assertion that XKD aligns outputs on any reasonable input distribution without degrading capabilities absent from the 123M-token corpus (Python, 200k context) is central to the 'intact survival' claim, but no control experiments, ablations, or before/after comparisons on those held-out tasks are described.

    Authors: The manuscript bases the 'any reasonable distribution' alignment claim on the equal-size frozen-teacher XKD objective and reports post-conversion performance on held-out capabilities. We agree that explicit ablations isolating XKD's contribution and direct before/after comparisons on Python and long-context tasks are absent. We will expand the results section to include a limitations paragraph noting the lack of such controls and will add any feasible additional analysis of capability preservation that can be performed within the revision timeline. revision: partial

  3. Referee: [KoMMLU results] Benchmark results: the reported 66.87% KoMMLU score and >7-point lead lack error bars, multiple runs, or statistical tests; combined with the absence of ablation results noted in the abstract, this weakens confidence in the leaderboard claim and the overall performance-recovery narrative.

    Authors: The reported KoMMLU score reflects a single evaluation run, consistent with common practice for leaderboard submissions at the time. We agree that error bars, multiple runs, or statistical tests would strengthen the claim. In revision we will explicitly state the single-run nature as a limitation, clarify the leaderboard context, and note that the >7-point margin is reported as observed at submission time without additional variance estimates. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain remains self-contained

full rationale

The paper presents OPM as a training-free mapping proved optimal under an explicitly stated near-zero-mean activation regime, and XKD as an alignment procedure whose preservation of absent capabilities is asserted as an empirical outcome after limited-token training. No equations, fitted parameters, or self-citations are shown that would make the optimality claim, alignment guarantee, or benchmark recoveries reduce by construction to the method's own inputs or prior author work. The reported KoMMLU gains and capability survival are therefore independent empirical results rather than tautological restatements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or invented entities are visible.

pith-pipeline@v0.9.1-grok · 5791 in / 1034 out tokens · 23708 ms · 2026-06-29T22:01:03.394938+00:00 · methodology

