pith. machine review for the scientific record.
Pith Number

pith:AXFU3DUT

pith:2024:AXFU3DUTR7SPUGLPK667UTVIY5
not attested · not anchored · not stored · refs resolved

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Adam Santoro, Blake Richards, David Raposo, Peter Conway Humphreys, Sam Ritter, Timothy Lillicrap

Transformer language models can learn to dynamically allocate compute to select tokens at each layer.

arxiv:2404.02258 v1 · 2024-04-02 · cs.LG · cs.CL

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.

C2 · weakest assumption

The assumption that a learned top-k router can reliably identify which tokens merit full processing at each layer without degrading overall model capacity or introducing training instabilities, and that this holds across model scales and tasks.

C3 · one-line summary

Mixture-of-Depths enables transformers to dynamically allocate compute by routing only the top-k tokens through each layer's full computations, matching baseline performance with a fraction of the FLOPs per forward pass and up to 50% faster sampling.
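
The routing idea in the summary above can be illustrated with a minimal sketch. It assumes a linear router that scores each token with a dot product, a per-token `block` function standing in for a layer's full computations, and a routed output that is scaled by the router score before being added back to the residual. The names `route_top_k`, `router_weights`, and `block` are hypothetical and are not taken from the paper's code.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def route_top_k(
    tokens: Sequence[Vector],
    router_weights: Sequence[float],
    block: Callable[[Vector], Vector],
    k: int,
) -> List[Vector]:
    """Apply `block` only to the k tokens with the highest router scores.

    Tokens outside the top-k are copied through unchanged (residual path only).
    """
    # Assumed router: one scalar score per token via a dot product.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]

    # Indices of the k highest-scoring tokens.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])

    out: List[Vector] = []
    for i, tok in enumerate(tokens):
        if i in selected:
            # Assumed update rule: residual plus score-weighted block output.
            update = block(tok)
            out.append([x + scores[i] * u for x, u in zip(tok, update)])
        else:
            # Skipped token: no block compute is spent on it.
            out.append(list(tok))
    return out


# Toy usage: four 2-d tokens, only the top two are processed.
tokens = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [0.5, 0.0]]
routed = route_top_k(
    tokens,
    router_weights=[1.0, 1.0],
    block=lambda v: [1.0 for _ in v],  # placeholder for the layer's computation
    k=2,
)
print(routed)
```

Tokens outside the top-k skip `block` entirely, which is where the reduction in FLOPs per forward pass would come from.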

References

11 extracted · 11 resolved · 5 Pith anchors

[2] Controlling computation versus quality for neural sequence models 2002
[3] Universal Transformers · arXiv:1807.03819
[5] Depth-adaptive transformer 1910
[7] Adaptive Computation Time for Recurrent Neural Networks · arXiv:1603.08983
[8] Towards a unified view of parameter-efficient transfer learning

Formal links

1 machine-checked theorem link

Cited by

19 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:15.410613Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

05cb4d8e938fe4fa196f57bdfa4ea8c7598877576765f04649a8f2cca081e4fa

Aliases

arxiv: 2404.02258 · arxiv_version: 2404.02258v1 · doi: 10.48550/arxiv.2404.02258 · pith_short_12: AXFU3DUTR7SP · pith_short_16: AXFU3DUTR7SPUGLP · pith_short_8: AXFU3DUT
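
The Pith identifiers on this page appear to be prefixes of the Base32 encoding of the canonical hash. That relationship is inferred from the values shown here, not from a documented specification, so the sketch below should be read as an observation about this record only. The helper name `pith_number_from_hash` and the default length of 26 characters are assumptions.

```python
import base64

# Canonical hash as shown on this page.
canonical_hash = "05cb4d8e938fe4fa196f57bdfa4ea8c7598877576765f04649a8f2cca081e4fa"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes (RFC 4648 alphabet) and truncate."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(canonical_hash)

# The short aliases are plain prefixes of the full identifier.
aliases = {n: full[:n] for n in (8, 12, 16)}

print(full)         # AXFU3DUTR7SPUGLPK667UTVIY5
print(aliases[8])   # AXFU3DUT
print(aliases[12])  # AXFU3DUTR7SP
print(aliases[16])  # AXFU3DUTR7SPUGLP
```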
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/AXFU3DUTR7SPUGLPK667UTVIY5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 05cb4d8e938fe4fa196f57bdfa4ea8c7598877576765f04649a8f2cca081e4fa
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "20196f2ef34c9194cb315bc7bc2d6b7e36cc76b23f47901f1dbbbc05b493a687",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-04-02T19:28:11Z",
    "title_canon_sha256": "ec4a74a99a0bfb9a2c6146a1ee7e2a6ddcd656c28232813ecf9f44bf5d9f3bf9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2404.02258",
    "kind": "arxiv",
    "version": 1
  }
}
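
The hashing step of the curl command above can also be run offline against a saved copy of the canonical record. The sketch below uses the same serialization settings as the one-liner (sorted keys, compact separators, `ensure_ascii=False`) followed by SHA-256. The function name `canonical_sha256` is hypothetical, and the record literal is copied from the JSON above.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as shown above.
record = {
    "metadata": {
        "abstract_canon_sha256": "20196f2ef34c9194cb315bc7bc2d6b7e36cc76b23f47901f1dbbbc05b493a687",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-04-02T19:28:11Z",
        "title_canon_sha256": "ec4a74a99a0bfb9a2c6146a1ee7e2a6ddcd656c28232813ecf9f44bf5d9f3bf9",
    },
    "schema_version": "1.0",
    "source": {"id": "2404.02258", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because the keys are sorted before hashing, the digest does not depend on the key order in the saved file, so a mirror that reorders fields should still compute the same value.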