Pith Number

pith:AXFU3DUT

pith:2024:AXFU3DUTR7SPUGLPK667UTVIY5

not attested · not anchored · not stored · refs resolved

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Adam Santoro, Blake Richards, David Raposo, Peter Conway Humphreys, Sam Ritter, Timothy Lillicrap

Transformer language models can learn to dynamically allocate compute to a selected subset of tokens at each layer.

arxiv:2404.02258 v1 · 2024-04-02 · cs.LG · cs.CL

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.

C2 · weakest assumption

The assumption that a learned top-k router can reliably identify which tokens merit full processing at each layer without degrading overall model capacity or introducing training instabilities, and that this holds across model scales and tasks.

C3 · one-line summary

Mixture-of-Depths enables transformers to dynamically allocate compute by routing only the top-k tokens through each layer's full computations, matching baseline performance with a fraction of the FLOPs per forward pass and up to 50% faster sampling.
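
As a rough illustration of the routing idea in the summary above, the following Python sketch sends only the k highest-scoring tokens through a block and lets the rest pass through unchanged. It is not the authors' implementation: the function name `route_top_k`, the scalar token values, the toy `block`, and the choice to scale the block output by the router score are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def route_top_k(
    tokens: Sequence[float],
    scores: Sequence[float],
    k: int,
    block: Callable[[float], float],
) -> List[float]:
    """Apply `block` only to the k tokens with the highest router scores.

    Hypothetical helper: tokens are scalars here to keep the sketch small;
    a real model would route vectors through attention and MLP layers.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")

    # Indices of the k highest-scoring tokens (ties broken by position).
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])

    out = []
    for i, x in enumerate(tokens):
        if i in chosen:
            # Assumption: residual update scaled by the router score.
            out.append(x + scores[i] * block(x))
        else:
            # Skipped tokens ride the residual path untouched.
            out.append(x)
    return out


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.125, 1.0, 0.5, 0.25]   # made-up router outputs
routed = route_top_k(tokens, scores, k=2, block=lambda x: 2.0 * x)
print(routed)
```

With k fixed in advance, the number of tokens that pay for the block is known before the forward pass, which is what makes the per-layer cost predictable in this kind of scheme.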

References

11 extracted · 11 resolved · 5 Pith anchors

[2] Controlling computation versus quality for neural sequence models 2002

[3] Universal Transformers · arXiv:1807.03819

[5] Depth-adaptive transformer 1910

[7] Adaptive Computation Time for Recurrent Neural Networks · arXiv:1603.08983

[8] Towards a unified view of parameter-efficient transfer learning

Formal links

1 machine-checked theorem link

Cited by

19 papers in Pith

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

MIDUS: Memory-Infused Depth Up-Scaling

When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models

Path-Constrained Mixture-of-Experts

N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

Receipt and verification

First computed	2026-05-17T23:38:15.410613Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

05cb4d8e938fe4fa196f57bdfa4ea8c7598877576765f04649a8f2cca081e4fa

Aliases

arxiv: 2404.02258 · arxiv_version: 2404.02258v1 · doi: 10.48550/arxiv.2404.02258 · pith_short_12: AXFU3DUTR7SP · pith_short_16: AXFU3DUTR7SPUGLP · pith_short_8: AXFU3DUT
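
In this record, the `pith_short_*` aliases and the 26-character identifier at the top of the page appear to be prefixes of the RFC 4648 base32 encoding of the canonical hash. That relationship is an observation from the values on this page, not a documented rule, and the helper name `base32_prefix` is hypothetical. A minimal Python sketch to check it:

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "05cb4d8e938fe4fa196f57bdfa4ea8c7598877576765f04649a8f2cca081e4fa"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Return the first `length` characters of the base32 form of a hex digest.

    Hypothetical helper; assumes standard RFC 4648 base32 (A-Z, 2-7).
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Prefix lengths taken from the alias names and the full identifier above.
aliases = {n: base32_prefix(CANONICAL_HASH, n) for n in (8, 12, 16, 26)}
for length, value in aliases.items():
    print(length, value)
```

If the observation holds for other records, any alias could be recomputed from the hash alone, without contacting the resolver.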

Verify this Pith Number yourself

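# Fetch the record, extract canonical_record, re-serialize it with sorted keys
# and compact separators, then print the SHA-256 of the UTF-8 bytes.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/AXFU3DUTR7SPUGLPK667UTVIY5 \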
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 05cb4d8e938fe4fa196f57bdfa4ea8c7598877576765f04649a8f2cca081e4fa

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "20196f2ef34c9194cb315bc7bc2d6b7e36cc76b23f47901f1dbbbc05b493a687",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-04-02T19:28:11Z",
    "title_canon_sha256": "ec4a74a99a0bfb9a2c6146a1ee7e2a6ddcd656c28232813ecf9f44bf5d9f3bf9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2404.02258",
    "kind": "arxiv",
    "version": 1
  }
}
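
The shell check above can also be run offline against the record printed here. The sketch below mirrors the same serialization settings (sorted keys, compact separators, no ASCII escaping, UTF-8) in plain Python. The function name `canonical_sha256` is hypothetical, and whether the digest matches the canonical hash depends on the record above being byte-for-byte what the service hashes, which this page does not confirm.

```python
import hashlib
import json

# Canonical record copied from the JSON shown on this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "20196f2ef34c9194cb315bc7bc2d6b7e36cc76b23f47901f1dbbbc05b493a687",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-04-02T19:28:11Z",
        "title_canon_sha256": "ec4a74a99a0bfb9a2c6146a1ee7e2a6ddcd656c28232813ecf9f44bf5d9f3bf9",
    },
    "schema_version": "1.0",
    "source": {"id": "2404.02258", "kind": "arxiv", "version": 1},
}

EXPECTED = "05cb4d8e938fe4fa196f57bdfa4ea8c7598877576765f04649a8f2cca081e4fa"


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same settings as the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,            # key order must not affect the digest
        separators=(",", ":"),     # no whitespace between tokens
        ensure_ascii=False,        # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(CANONICAL_RECORD)
print("computed:", digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, two mirrors that store the record with different key order should still compute the same digest.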