Grouped Query Experts applies per-group MoE routing to query heads in GQA, matching baseline accuracy while activating half the query heads on 250M models trained for 30B tokens.
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
3 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background 1, method 1
Fields: cs.LG 3
Years: 2026 3
Verdicts: UNVERDICTED 3
Representative citing papers
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
A Transformer-based self-prior in active inference enables a simulated agent to spontaneously recognize and remove a mark on its face in a mirror by detecting discrepancies in learned visual-proprioceptive experiences.
citing papers explorer
- Active Inference with a Self-Prior in the Mirror-Mark Task