BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems

Joss Armstrong

arxiv: 2605.22866 · v1 · pith:C5POULQLnew · submitted 2026-05-19 · 💻 cs.AI · cs.LG

BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems

Joss Armstrong This is my paper

Pith reviewed 2026-05-25 05:50 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords hierarchical attributionrouting weightscompound AI systemszero-cost attributionmulti-resolution decompositionShapley comparisonagentic systemsorchestrator routing

0 comments

The pith

BOHM derives multi-resolution attributions for compound AI systems directly from the routing weights the orchestrator already maintains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Compound AI systems route tasks through hierarchies of specialised components, yet Shapley-based attribution requires evaluating arbitrary coalitions of those components, which is often impossible for third-party APIs or when routers concentrate on a few tools. BOHM instead extracts a full attribution tree from the existing routing weights: each leaf receives the product of weights along its root-to-leaf path, and each level-k node receives the induced distribution obtained by summing the path products that reach it. The construction runs at zero marginal cost, needs no internal access to components, and supplies attributions at every depth simultaneously. On LLM hierarchies, agentic driver studies, and a census data tree the resulting rankings correlate strongly with SHAP when the router selects near-optimal tools and recover ground-truth orderings at multiple resolutions. BOHM satisfies efficiency, monotonicity, symmetry and weak suppression but not additivity, positioning it as a complementary primitive rather than a replacement for coalition methods.

Core claim

BOHM defines leaf attribution as the product of routing weights along the path from root to leaf and level-k attribution as the probability distribution over depth-k nodes induced by those same path products. The tree is obtained directly from the routing state already kept by the orchestrator, requires no additional system evaluations or component introspection, and yields simultaneous attributions at every level of the hierarchy.

What carries the argument

The hierarchical attribution tree whose leaves are path products of routing weights and whose internal nodes are the summed products reaching each depth-k node.

If this is right

Attributions become available at every depth of the hierarchy in a single pass without extra coalition evaluations.
Opaque or third-party components receive attributions because the method never inspects their internals.
Disagreement between BOHM and SHAP on a given cell indicates that the deployed router is not selecting the empirically strongest tool.
The method satisfies efficiency, monotonicity, symmetry and weak suppression while remaining computable wherever routing state exists.
Multi-resolution attribution is obtained wherever a router already maintains path probabilities, regardless of the evaluation budget available for flat methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

BOHM could function as an immediate, zero-cost diagnostic for whether an orchestrator's routing policy is near-optimal on the tasks it actually encounters.
The same path-product construction might apply to any tree-structured decision process that records routing probabilities, including non-AI planning or decision trees.
When BOHM and SHAP diverge, the difference itself supplies information about the gap between the router's policy and the true contribution surface.
Extending the method to routers whose weights change over time would require checking whether the path products remain stable or need re-aggregation.

Load-bearing premise

That the routing weights maintained by the orchestrator can be interpreted as multiplicative factors whose path products yield meaningful per-component attributions.

What would settle it

An experiment that independently measures each component's true marginal contribution on a held-out set of tasks, then checks whether the ordering produced by path-product attributions matches the measured ordering when the router is forced to use a demonstrably suboptimal tool.

Figures

Figures reproduced from arXiv: 2605.22866 by Joss Armstrong.

**Figure 2.** Figure 2: BOHM vs SHAP attribution on 18 LLMs over 880 LiveCodeBench problems. Left: [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗

**Figure 3.** Figure 3: Structure sensitivity: natural (quality-based) vs random grouping. Left: Kendall [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗

**Figure 4.** Figure 4: Opaque-component attribution. 9 components with hidden quality parameters. BOHM [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Per-level attribution trajectories. The root’s weights reflect group-level attribution; each [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Attribution quality as a function of hierarchy depth (branching factor 3) and number [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Attribution quality vs outcome noise (probability of flipped binary outcome). Graceful [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: Context-dependent attribution. Different contexts (text, code, math) activate different [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

**Figure 9.** Figure 9: Domain-conditioned attribution across five coding benchmarks. Top row: per-model [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Fixed (LCB-based) vs domain-specific tiering. The HumanEval delta ( [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

read the original abstract

Compound AI systems route tasks through hierarchies of specialised components. Attribution is dominated by Shapley-based methods (SHAP), which decompose a coalition value function into per-component marginal contributions and require evaluation of the system on arbitrary component subsets. That requirement fails for third-party APIs, opaque endpoints, and agentic orchestrators that concentrate routing on a few tools, leaving most coalitions un-evaluable from the deployed orchestrator. We introduce BOHM, which extracts a hierarchical attribution tree directly from the routing weights such systems already maintain: leaf attribution is the path product of root-to-leaf routing weights; level-k attribution is the induced distribution over depth-k nodes. The method has zero marginal cost, requires no access to component internals, and provides multi-resolution attribution at every level simultaneously, which flat methods cannot offer at any evaluation budget. BOHM and SHAP answer different questions and converge when the deployed router routes near-optimally. On 18 LLMs in a 3-level hierarchy over 880 LiveCodeBench problems, BOHM yields Kendall tau=0.928; SHAP reaches tau=0.980 at 9,000x more coalition evaluations per seed. On a 5-driver, 7-benchmark agentic study (35 cells, complete coverage), drivers concentrate routing on a single tool (top-share median 0.65), and cell-level tau(BOHM,SHAP) is predicted by whether the driver's top pick is the empirically best tool (mean +0.22 vs ~+0.01). On a US Census hierarchy (475 leaves, 4 levels), BOHM recovers ground-truth rankings at every level (tau up to 0.722). BOHM satisfies efficiency, monotonicity, symmetry, and weak suppression but not Shapley's additivity. It is best understood as a complementary primitive: a multi-resolution decomposition computable wherever routing state exists, whose disagreement with Shapley is itself diagnostic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BOHM extracts hierarchical attributions from routing weights at zero cost and shows solid empirical overlap with SHAP when the router is already good, but the core link to marginal contributions has no derivation or bounds.

read the letter

The new piece is a tree of attributions built straight from the routing weights the orchestrator already keeps: each leaf gets the product of weights on its root-to-leaf path, and each level-k node gets the sum of the leaves below it. This is distinct from coalition-based Shapley work and runs with no extra queries, which matters for opaque third-party components. The experiments back the practical claim: Kendall tau of 0.928 against SHAP on the LLM hierarchy, lower but still positive on the census data, and a clear signal that agreement drops when the router picks the wrong tool. That diagnostic angle is useful on its own. The soft spot is exactly where the stress-test note flags it. The paper defines the quantities from the weights and then checks correlation with SHAP, but supplies no first-principles argument that the path products equal or approximate marginal effects on the final output. The agreement holds mainly in the near-optimal router regime, and there are no general error bounds or counter-example families. It satisfies efficiency, monotonicity, symmetry and weak suppression but drops additivity, which the authors treat as acceptable for a complementary primitive. The work is aimed at people who run or audit deployed compound systems and need something that scales without extra API calls. It is coherent on its own terms and addresses a real gap, so it deserves a serious referee even though the theoretical grounding is thinner than the empirical results.

Referee Report

2 major / 1 minor

Summary. The paper introduces BOHM, a zero-cost hierarchical attribution method for compound AI systems. It defines leaf attribution as the product of routing weights along each root-to-leaf path and level-k attribution as the induced marginal distribution over depth-k nodes. The method requires no component internals or additional evaluations, provides multi-resolution output simultaneously, satisfies efficiency/monotonicity/symmetry/weak suppression (but not additivity), and reports Kendall-tau agreement with SHAP of 0.928 on 18 LLMs over LiveCodeBench, up to 0.722 on Census data, and a link between agreement and routing quality in an agentic study.

Significance. If the reported correlations and diagnostic property hold, BOHM supplies a practical complementary primitive for attribution in opaque, third-party, or agentic systems where coalition-based methods like SHAP cannot be applied at scale. The zero marginal cost, hierarchical structure, and explicit axiom verification are clear strengths that flat attribution methods lack at any budget.

major comments (2)

[Abstract] Abstract: the reported Kendall-tau agreement (0.928 on LLMs, 0.722 on Census) is presented without error bars, number of seeds, or explicit data-exclusion rules; these omissions are load-bearing for the claim that agreement is high and diagnostic of routing quality.
[Abstract] Method definition (Abstract and results): leaf attribution is defined directly as the path product of routing weights, yet no derivation, approximation guarantee, or error bound is supplied showing why this quantity equals or approximates marginal contribution to system output—the quantity targeted by the SHAP comparisons; support remains conditional on near-optimal routing regimes without a general test or counter-example family.

minor comments (1)

[Abstract] Abstract: the phrase 'SHAP reaches tau=0.980 at 9,000x more coalition evaluations per seed' would benefit from a brief parenthetical on how the 9,000x factor was computed (e.g., exact coalition count per instance).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the reported Kendall-tau agreement (0.928 on LLMs, 0.722 on Census) is presented without error bars, number of seeds, or explicit data-exclusion rules; these omissions are load-bearing for the claim that agreement is high and diagnostic of routing quality.

Authors: We agree that the abstract would benefit from additional statistical context. The full experimental section details the 880 LiveCodeBench problems across 18 LLMs and the 475-leaf Census hierarchy, along with the agentic study's 35 cells. We will revise the abstract to reference the number of seeds (where multiple runs were performed) and add a parenthetical note on variability or direct the reader to the experimental details. Data-exclusion rules (e.g., complete coverage in the agentic study) will be stated explicitly in a revised abstract or methods footnote. revision: yes
Referee: [Abstract] Method definition (Abstract and results): leaf attribution is defined directly as the path product of routing weights, yet no derivation, approximation guarantee, or error bound is supplied showing why this quantity equals or approximates marginal contribution to system output—the quantity targeted by the SHAP comparisons; support remains conditional on near-optimal routing regimes without a general test or counter-example family.

Authors: The manuscript positions BOHM as extracting attributions directly from existing routing weights rather than approximating marginal contributions in general; it explicitly states that the two methods answer different questions and converge when routing is near-optimal. The agentic study (35 cells) provides an empirical test linking agreement to routing quality (mean +0.22 when top pick matches best tool). We will add a short methods paragraph deriving the path-product definition from the routing probability interpretation and clarifying the conditional nature of agreement, without claiming a general approximation guarantee. A counter-example family is not required because disagreement is presented as a diagnostic feature rather than a flaw. revision: yes

Circularity Check

0 steps flagged

No significant circularity; BOHM is a direct definition from routing weights with external empirical checks

full rationale

The paper defines leaf attribution explicitly as the path product of routing weights and level-k attribution as the induced marginal distribution; these are introduced as a construction rather than derived from first principles as equivalent to marginal contributions. It states that BOHM and SHAP answer different questions and converge only when the router is near-optimal, with Kendall-tau agreement reported as an external measurement on benchmarks. No self-citations appear in the load-bearing steps, no parameters are fitted to target attributions, and the axiom checks (efficiency, monotonicity, etc.) are performed directly on the defined quantities. The derivation chain is therefore self-contained as a zero-cost extraction method whose relation to Shapley values is measured rather than assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method introduces no free parameters or invented entities. It rests on one domain assumption about the semantic meaning of routing weights.

axioms (1)

domain assumption Routing weights maintained by the system can be interpreted as selection factors whose path products yield meaningful attributions.
Invoked when defining leaf attribution as the path product and level-k distributions.

pith-pipeline@v0.9.0 · 5878 in / 1266 out tokens · 42366 ms · 2026-05-25T05:50:40.726358+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Leaf attribution is the path product of root-to-leaf routing weights; level-k attribution is the induced distribution over depth-k nodes.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

BOHM satisfies efficiency, monotonicity, symmetry, and weak suppression but not Shapley's additivity.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 5 internal anchors

[1]

Agentic AI: a comprehensive survey of architectures, applications, and future directions.Artificial Intelligence Review, 59 (1):11, 2026

Mohamad Abou Ali, Fadi Dornaika, and Jinan Charafeddine. Agentic AI: a comprehensive survey of architectures, applications, and future directions.Artificial Intelligence Review, 59 (1):11, 2026. doi: 10.1007/s10462-025-11422-4

work page doi:10.1007/s10462-025-11422-4 2026
[2]

Implicit Evaluation Under Minimal Information: Price Formation in Hierarchical Component Selection

Joss Armstrong. Implicit evaluation under minimal information: Price formation in hierarchical component selection, 2026. URLhttps://doi.org/10.48550/arXiv.2605.00921. Preprint

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.00921 2026
[3]

The multiplicative weights update method: A meta-algorithm and applications.Theory of Computing, 8(1):121–164, 2012

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications.Theory of Computing, 8(1):121–164, 2012

work page 2012
[4]

The nonstochastic multiarmed bandit problem.SIAM Journal on Computing, 32(1):48–77, 2002

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem.SIAM Journal on Computing, 32(1):48–77, 2002

work page 2002
[5]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Algorithms to estimate Shapley value feature attributions.arXiv preprint arXiv:2207.07605, 2022

Hugh Chen, Ian C Covert, Scott M Lundberg, and Su-In Lee. Algorithms to estimate Shapley value feature attributions.arXiv preprint arXiv:2207.07605, 2022. 11

work page arXiv 2022
[7]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Feudal reinforcement learning.Advances in Neural Information Processing Systems, 5, 1993

Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning.Advances in Neural Information Processing Systems, 5, 1993

work page 1993
[9]

DeepSeek-V3 technical report, 2024

DeepSeek-AI. DeepSeek-V3 technical report, 2024. 671B-parameter Mixture-of-Experts model with 37B activated parameters per token. Architecture inherited by DeepSeek-V3.2 (685B) used in this work

work page 2024
[10]

Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

work page 2022
[11]

A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55(1):119–139, 1997

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55(1):119–139, 1997

work page 1997
[12]

Data Shapley: Equitable valuation of data for machine learning

Amirata Ghorbani and James Zou. Data Shapley: Equitable valuation of data for machine learning. InInternational Conference on Machine Learning, 2019

work page 2019
[13]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021
[14]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

work page 2021
[15]

Adaptive mixtures of local experts.Neural Computation, 3(1):79–87, 1991

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts.Neural Computation, 3(1):79–87, 1991

work page 1991
[16]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Ion Stoica, and Koushik Sen. LiveCodeBench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Attention is not explanation

Sarthak Jain and Byron C Wallace. Attention is not explanation. InProceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019

work page 2019
[18]

GShard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. InInternational Conference on Learning Representations, 2021

work page 2021
[19]

Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

work page 2022
[20]

Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, 2023. 12

work page 2023
[21]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, volume 30, 2017

work page 2017
[22]

Devstral-Small-2-24B-2512: Agentic software-engineering language model, 2025

Mistral AI and All Hands AI. Devstral-Small-2-24B-2512: Agentic software-engineering language model, 2025. 24B-parameter dense model finetuned from Mistral-Small-3.1-24B. Available at https://huggingface.co/mistralai/Devstral-Small-2-24B-2512

work page 2025
[23]

PISA 2022 database, 2022

OECD. PISA 2022 database, 2022. Programme for International Student Assessment. Available athttps://www.oecd.org/pisa/data/2022database/

work page 2022
[24]

GPT-OSS-120B: Open-weight reasoning model, 2025

OpenAI. GPT-OSS-120B: Open-weight reasoning model, 2025. Available at https:// huggingface.co/openai/gpt-oss-120b

work page 2025
[25]

Why should I trust you?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. InProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016

work page 2016
[26]

Inter-module credit assignment in modular reinforcement learning.Neural Networks, 16(7):985–994, 2003

Kazuyuki Samejima, Kenji Doya, and Mitsuo Kawato. Inter-module credit assignment in modular reinforcement learning.Neural Networks, 16(7):985–994, 2003

work page 2003
[27]

A value forn-person games

Lloyd S Shapley. A value forn-person games. InContributions to the Theory of Games, volume 2, pages 307–317. Princeton University Press, 1953

work page 1953
[28]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017

work page 2017
[29]

S&P 500 index constituents and GICS classification, 2024

S&P Dow Jones Indices. S&P 500 index constituents and GICS classification, 2024. Global Industry Classification Standard. Market data from Yahoo Finance

work page 2024
[30]

Axiomatic attribution for deep networks

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InInternational Conference on Machine Learning, pages 3319–3328, 2017

work page 2017
[31]

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1–2): 181–211, 1999

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1–2): 181–211, 1999

work page 1999
[32]

American community survey 2022 public use microdata sample (PUMS),

US Census Bureau. American community survey 2022 public use microdata sample (PUMS),

work page 2022
[33]

Available athttps://www.census.gov/programs-surveys/acs/microdata.html

work page
[34]

Forest inventory and analysis national program: FIA database, 2024

USDA Forest Service. Forest inventory and analysis national program: FIA database, 2024. Available athttps://www.fia.fs.usda.gov/tools-data/

work page 2024
[35]

FeUdal networks for hierarchical reinforcement learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. InInternational Conference on Machine Learning, 2017

work page 2017
[36]

Attention is not not explanation

Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2019

work page 2019
[37]

The shift from models to compound AI systems.Berkeley AI Research Blog, 2024

Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound AI systems.Berkeley AI Research Blog, 2024. Title: The Shift from Models to Compound AI Systems. 13

work page 2024
[38]

GLM-5.1-FP8: A capable open-weight large language model, 2025

Z.ai (Zhipu AI). GLM-5.1-FP8: A capable open-weight large language model, 2025. Available at https://huggingface.co/zai-org/GLM-5.1-FP8. Used as orchestrator for agentic routing experiments

work page 2025
[39]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Bench- marking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877, 2024. A Additional experiments A.1 S&P 500 institutional hierarchy As a seco...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

the routing substrate (Algorithm 1) has converged to the single-selector equilibrium of Lemma 1 at every router on the root-to-leaf paths

work page
[41]

the SHAP coalition value functionv(S)is non-decreasing in component qualities, in the sense that the marginal contributionϕi =E S[v(S∪{i})−v(S)]is monotone inpi; and

work page
[42]

Then BOHM leaf attribution and SHAP marginal contributions induce the same total order over the components in that sibling set

the comparison is restricted to a sibling set (leaves sharing a common parent router), or, equivalently, the hierarchy is flat (a single router over allNcomponents). Then BOHM leaf attribution and SHAP marginal contributions induce the same total order over the components in that sibling set. Under condition (1), Lemma 1 givesw∗ v,i monotone inpi at each ...

work page 2022

[1] [1]

Agentic AI: a comprehensive survey of architectures, applications, and future directions.Artificial Intelligence Review, 59 (1):11, 2026

Mohamad Abou Ali, Fadi Dornaika, and Jinan Charafeddine. Agentic AI: a comprehensive survey of architectures, applications, and future directions.Artificial Intelligence Review, 59 (1):11, 2026. doi: 10.1007/s10462-025-11422-4

work page doi:10.1007/s10462-025-11422-4 2026

[2] [2]

Implicit Evaluation Under Minimal Information: Price Formation in Hierarchical Component Selection

Joss Armstrong. Implicit evaluation under minimal information: Price formation in hierarchical component selection, 2026. URLhttps://doi.org/10.48550/arXiv.2605.00921. Preprint

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.00921 2026

[3] [3]

The multiplicative weights update method: A meta-algorithm and applications.Theory of Computing, 8(1):121–164, 2012

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications.Theory of Computing, 8(1):121–164, 2012

work page 2012

[4] [4]

The nonstochastic multiarmed bandit problem.SIAM Journal on Computing, 32(1):48–77, 2002

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem.SIAM Journal on Computing, 32(1):48–77, 2002

work page 2002

[5] [5]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Algorithms to estimate Shapley value feature attributions.arXiv preprint arXiv:2207.07605, 2022

Hugh Chen, Ian C Covert, Scott M Lundberg, and Su-In Lee. Algorithms to estimate Shapley value feature attributions.arXiv preprint arXiv:2207.07605, 2022. 11

work page arXiv 2022

[7] [7]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

Feudal reinforcement learning.Advances in Neural Information Processing Systems, 5, 1993

Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning.Advances in Neural Information Processing Systems, 5, 1993

work page 1993

[9] [9]

DeepSeek-V3 technical report, 2024

DeepSeek-AI. DeepSeek-V3 technical report, 2024. 671B-parameter Mixture-of-Experts model with 37B activated parameters per token. Architecture inherited by DeepSeek-V3.2 (685B) used in this work

work page 2024

[10] [10]

Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

work page 2022

[11] [11]

A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55(1):119–139, 1997

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55(1):119–139, 1997

work page 1997

[12] [12]

Data Shapley: Equitable valuation of data for machine learning

Amirata Ghorbani and James Zou. Data Shapley: Equitable valuation of data for machine learning. InInternational Conference on Machine Learning, 2019

work page 2019

[13] [13]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021

[14] [14]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

work page 2021

[15] [15]

Adaptive mixtures of local experts.Neural Computation, 3(1):79–87, 1991

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts.Neural Computation, 3(1):79–87, 1991

work page 1991

[16] [16]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Ion Stoica, and Koushik Sen. LiveCodeBench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Attention is not explanation

Sarthak Jain and Byron C Wallace. Attention is not explanation. InProceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019

work page 2019

[18] [18]

GShard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. InInternational Conference on Learning Representations, 2021

work page 2021

[19] [19]

Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

work page 2022

[20] [20]

Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, 2023. 12

work page 2023

[21] [21]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, volume 30, 2017

work page 2017

[22] [22]

Devstral-Small-2-24B-2512: Agentic software-engineering language model, 2025

Mistral AI and All Hands AI. Devstral-Small-2-24B-2512: Agentic software-engineering language model, 2025. 24B-parameter dense model finetuned from Mistral-Small-3.1-24B. Available at https://huggingface.co/mistralai/Devstral-Small-2-24B-2512

work page 2025

[23] [23]

PISA 2022 database, 2022

OECD. PISA 2022 database, 2022. Programme for International Student Assessment. Available athttps://www.oecd.org/pisa/data/2022database/

work page 2022

[24] [24]

GPT-OSS-120B: Open-weight reasoning model, 2025

OpenAI. GPT-OSS-120B: Open-weight reasoning model, 2025. Available at https:// huggingface.co/openai/gpt-oss-120b

work page 2025

[25] [25]

Why should I trust you?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. InProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016

work page 2016

[26] [26]

Inter-module credit assignment in modular reinforcement learning.Neural Networks, 16(7):985–994, 2003

Kazuyuki Samejima, Kenji Doya, and Mitsuo Kawato. Inter-module credit assignment in modular reinforcement learning.Neural Networks, 16(7):985–994, 2003

work page 2003

[27] [27]

A value forn-person games

Lloyd S Shapley. A value forn-person games. InContributions to the Theory of Games, volume 2, pages 307–317. Princeton University Press, 1953

work page 1953

[28] [28]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017

work page 2017

[29] [29]

S&P 500 index constituents and GICS classification, 2024

S&P Dow Jones Indices. S&P 500 index constituents and GICS classification, 2024. Global Industry Classification Standard. Market data from Yahoo Finance

work page 2024

[30] [30]

Axiomatic attribution for deep networks

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InInternational Conference on Machine Learning, pages 3319–3328, 2017

work page 2017

[31] [31]

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1–2): 181–211, 1999

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1–2): 181–211, 1999

work page 1999

[32] [32]

American community survey 2022 public use microdata sample (PUMS),

US Census Bureau. American community survey 2022 public use microdata sample (PUMS),

work page 2022

[33] [33]

Available athttps://www.census.gov/programs-surveys/acs/microdata.html

work page

[34] [34]

Forest inventory and analysis national program: FIA database, 2024

USDA Forest Service. Forest inventory and analysis national program: FIA database, 2024. Available athttps://www.fia.fs.usda.gov/tools-data/

work page 2024

[35] [35]

FeUdal networks for hierarchical reinforcement learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. InInternational Conference on Machine Learning, 2017

work page 2017

[36] [36]

Attention is not not explanation

Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2019

work page 2019

[37] [37]

The shift from models to compound AI systems.Berkeley AI Research Blog, 2024

Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound AI systems.Berkeley AI Research Blog, 2024. Title: The Shift from Models to Compound AI Systems. 13

work page 2024

[38] [38]

GLM-5.1-FP8: A capable open-weight large language model, 2025

Z.ai (Zhipu AI). GLM-5.1-FP8: A capable open-weight large language model, 2025. Available at https://huggingface.co/zai-org/GLM-5.1-FP8. Used as orchestrator for agentic routing experiments

work page 2025

[39] [39]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Bench- marking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877, 2024. A Additional experiments A.1 S&P 500 institutional hierarchy As a seco...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

the routing substrate (Algorithm 1) has converged to the single-selector equilibrium of Lemma 1 at every router on the root-to-leaf paths

work page

[41] [41]

the SHAP coalition value functionv(S)is non-decreasing in component qualities, in the sense that the marginal contributionϕi =E S[v(S∪{i})−v(S)]is monotone inpi; and

work page

[42] [42]

Then BOHM leaf attribution and SHAP marginal contributions induce the same total order over the components in that sibling set

the comparison is restricted to a sibling set (leaves sharing a common parent router), or, equivalently, the hierarchy is flat (a single router over allNcomponents). Then BOHM leaf attribution and SHAP marginal contributions induce the same total order over the components in that sibling set. Under condition (1), Lemma 1 givesw∗ v,i monotone inpi at each ...

work page 2022