Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals
Pith reviewed 2026-05-10 17:11 UTC · model grok-4.3
The pith
Code health metrics can route software engineering tasks to the cheapest LLM tier that still passes verification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Triage defines three LLM capability tiers and routes tasks based on pre-computed code health sub-factors and task metadata so that each task reaches the cheapest tier whose output passes the identical verification gate as the frontier model. The paper analytically derives two falsifiable conditions for cost-effectiveness: the light-tier pass rate on healthy code must exceed the inter-tier cost ratio, and code health must discriminate the needed tier with at least a small effect size. Evaluation on SWE-bench Lite compares a heuristic-threshold policy, a trained ML classifier, and a perfect-hindsight oracle to quantify the cost-quality trade-off and identify which health sub-factors drive the routing decisions.
What carries the argument
Triage, the routing framework that converts code health sub-factors into tier assignments while enforcing a common verification gate across tiers.
If this is right
- Heuristic thresholds on code health sub-factors can already produce measurable cost reductions on routine tasks.
- A trained classifier can learn to route more accurately than fixed thresholds.
- Certain code health sub-factors will turn out to be stronger predictors of required tier than others.
- The same verification gate across tiers guarantees that quality does not degrade when cheaper models are used.
- The evaluation protocol can be reused to test new health metrics or new tier definitions.
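The heuristic-threshold policy listed above can be sketched in a few lines. The sub-factor names, the score scale, and the threshold values below are illustrative assumptions; the paper does not specify them.

```python
# Hypothetical sketch of Triage-style heuristic routing. Sub-factor
# names, the 1-10 health scale, and the thresholds are assumptions
# for illustration, not details taken from the paper.
TIERS = ["light", "standard", "heavy"]  # mirroring, e.g., Haiku/Sonnet/Opus

def route(health: dict) -> str:
    """Return the cheapest tier judged sufficient for a task."""
    score = min(health.values())  # let the weakest sub-factor dominate
    if score >= 8.0:              # healthy code: try the light tier first
        return "light"
    if score >= 5.0:              # moderate health: standard tier
        return "standard"
    return "heavy"                # unhealthy code: frontier model

task = {"complexity": 9.1, "duplication": 8.4, "cohesion": 8.7}
print(route(task))  # -> light
```

Taking the minimum over sub-factors is one conservative aggregation choice; a weighted average or per-factor rules would be equally consistent with the framework as described.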
Where Pith is reading between the lines
- Adopting the approach would let coding agents run at lower average inference cost while keeping the same final correctness check.
- Teams that keep their codebases clean would gain a direct economic benefit through cheaper AI assistance.
- The same routing logic could be tested on non-coding agent tasks where analogous quality or complexity signals exist.
Load-bearing premise
Code health metrics must discriminate tasks that a light-tier model can solve from those that require a heavier model with enough accuracy to offset the cost difference.
What would settle it
Measure the light-tier model's pass rate on a set of tasks whose code health scores are above a chosen threshold and check whether that rate falls below the cost ratio between the light tier and the next tier.
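That settling measurement reduces to one comparison. A minimal sketch, assuming failed light-tier attempts escalate to the heavy tier (an escalation policy the abstract implies but does not state explicitly); the numbers in the example are invented for illustration:

```python
def routing_is_cost_effective(passes: int, trials: int,
                              light_cost: float, heavy_cost: float) -> bool:
    """Check the paper's first condition: the light-tier pass rate on
    healthy code must exceed the inter-tier cost ratio.

    Assuming failed light-tier attempts escalate to the heavy tier,
    routing pays light_cost + (1 - p) * heavy_cost per task on average,
    which beats heavy_cost alone exactly when p > light_cost / heavy_cost.
    """
    pass_rate = passes / trials
    cost_ratio = light_cost / heavy_cost
    return pass_rate > cost_ratio

# Illustrative numbers (not from the paper): 186/300 tasks pass on the
# light tier, which costs 1/12 as much as the heavy tier per task.
print(routing_is_cost_effective(186, 300, 1.0, 12.0))  # -> True
```

A full falsification test would add a one-sided binomial test on the pass count against the cost ratio, so that sampling noise on 300 tasks cannot masquerade as cost-effectiveness.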
Original abstract
Context: AI coding agents route every task to a single frontier large language model (LLM), paying premium inference cost even when many tasks are routine. Objectives: We propose Triage, a framework that uses code health metrics -- indicators of software maintainability -- as a routing signal to assign each task to the cheapest model tier whose output passes the same verification gate as the expensive model. Methods: Triage defines three capability tiers (light, standard, heavy -- mirroring, e.g., Haiku, Sonnet, Opus) and routes tasks based on pre-computed code health sub-factors and task metadata. We design an evaluation comparing three routing policies on SWE-bench Lite (300 tasks across three model tiers): heuristic thresholds, a trained ML classifier, and a perfect-hindsight oracle. Results: We analytically derived two falsifiable conditions under which the tier-dependent asymmetry (medium LLMs benefit from clean code while frontier models do not) yields cost-effective routing: the light-tier pass rate on healthy code must exceed the inter-tier cost ratio, and code health must discriminate the required model tier with at least a small effect size ($\hat{p} \geq 0.56$). Conclusion: Triage transforms a diagnostic code quality metric into an actionable model-selection signal. We present a rigorous evaluation protocol to test the cost--quality trade-off and identify which code health sub-factors drive routing decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Triage, a routing framework that uses pre-computed code health metrics (indicators of software maintainability) to assign software engineering tasks to one of three LLM tiers (light, standard, heavy) such that the cheapest tier whose output still passes the same verification gate as a frontier model is selected. It analytically derives two falsifiable conditions for cost-effective routing and outlines an evaluation protocol that compares heuristic thresholds, a trained ML classifier, and a perfect-hindsight oracle on SWE-bench Lite.
Significance. If the derived conditions prove valid and the protocol yields measurable cost savings without loss of verification pass rate, the work could meaningfully reduce inference expenditure for AI coding agents on routine tasks. The emphasis on analytically derived, falsifiable conditions rather than fitted heuristics, together with an explicit three-policy comparison protocol, supplies a clear path for reproducible follow-up experiments.
major comments (2)
- [Results] Results section: the two conditions (light-tier pass rate on healthy code exceeding the inter-tier cost ratio; code-health discrimination with p-hat >= 0.56) are presented as analytically derived, yet no equations, derivation steps, or explicit assumptions appear in the manuscript, preventing verification that the conditions are parameter-free or independent of the evaluation data.
- [Methods] Methods / Evaluation protocol: the protocol is described at a high level (heuristic thresholds, ML classifier, oracle on SWE-bench Lite) but supplies no concrete definitions of the code-health sub-factors, the feature set for the ML policy, the exact cost and quality metrics, or the statistical test for the p-hat threshold, all of which are load-bearing for the claim that the protocol can identify which sub-factors drive routing decisions.
minor comments (2)
- [Abstract] Abstract: the tier examples (Haiku, Sonnet, Opus) are given but the manuscript never states the precise model identifiers or context-length/cost values used for the three tiers in the SWE-bench Lite experiments.
- [Abstract] The abstract claims the conditions are 'falsifiable' but does not indicate the exact statistical procedure or data split that would constitute a falsification test.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's potential impact and for the constructive major comments. We address each point below and commit to expanding the manuscript with the requested analytical and methodological details.
Point-by-point responses
-
Referee: [Results] Results section: the two conditions (light-tier pass rate on healthy code exceeding the inter-tier cost ratio; code-health discrimination with p-hat >= 0.56) are presented as analytically derived, yet no equations, derivation steps, or explicit assumptions appear in the manuscript, preventing verification that the conditions are parameter-free or independent of the evaluation data.
Authors: We agree that the derivation steps and assumptions were omitted, which prevents independent verification. The two conditions follow directly from the cost-effectiveness inequality under the assumption of a fixed verification gate: cost savings occur when the light-tier pass rate on healthy code exceeds the inter-tier cost ratio, and the discrimination power must satisfy a minimum effect-size threshold (derived via power analysis for the chosen sample size). We will add a new subsection to the Results section containing the full derivation, the complete list of assumptions (identical verification across tiers, fixed per-token pricing, and no dependence on task-specific distributions), and a proof that the conditions are parameter-free and independent of the SWE-bench Lite data. This will make the claims fully verifiable. revision: yes
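Under the escalation reading of the response (failed light-tier attempts fall back to the heavy tier, with the verification gate fixed), the first condition can be written out as a short derivation. This is a reconstruction from the stated assumptions, not the authors' own equations:

```latex
% Sketch under assumed escalation semantics. Let c_l and c_h be the
% per-task costs of the light and heavy tiers, and p the light-tier
% pass rate on healthy code.
\mathbb{E}[C_{\text{route}}] = c_\ell + (1 - p)\, c_h
% Routing beats always using the heavy tier when
c_\ell + (1 - p)\, c_h < c_h \iff p > \frac{c_\ell}{c_h}
% i.e. the light-tier pass rate must exceed the inter-tier cost ratio,
% matching the paper's first falsifiable condition.
```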
-
Referee: [Methods] Methods / Evaluation protocol: the protocol is described at a high level (heuristic thresholds, ML classifier, oracle on SWE-bench Lite) but supplies no concrete definitions of the code-health sub-factors, the feature set for the ML policy, the exact cost and quality metrics, or the statistical test for the p-hat threshold, all of which are load-bearing for the claim that the protocol can identify which sub-factors drive routing decisions.
Authors: We acknowledge that the evaluation protocol was presented at a high level. In the revised manuscript we will expand the Methods section to include: explicit definitions of each code-health sub-factor (cyclomatic complexity, maintainability index, duplication ratio, and test coverage as computed by the static-analysis pipeline); the complete feature vector for the ML policy (the sub-factors plus task metadata such as file size and dependency count); precise cost metrics (current API token prices for the three tiers); quality metrics (binary pass/fail on the verification suite); and the exact statistical procedure for the p-hat threshold (one-sided binomial test with effect-size justification). These additions will render the protocol fully reproducible and allow direct analysis of which sub-factors drive routing decisions. revision: yes
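The feature vector promised in this response can be sketched directly. The sub-factor and metadata names follow the rebuttal's list, but their order and encoding here are assumptions about the eventual implementation:

```python
# Hypothetical feature vector for the ML routing policy. Names follow
# the rebuttal's list; the fixed ordering and float encoding are
# illustrative assumptions, not confirmed implementation details.
FEATURES = [
    "cyclomatic_complexity", "maintainability_index",
    "duplication_ratio", "test_coverage",   # code health sub-factors
    "file_size", "dependency_count",        # task metadata
]

def make_feature_vector(task: dict) -> list[float]:
    """Assemble the classifier input in a fixed feature order."""
    return [float(task[name]) for name in FEATURES]

task = {
    "cyclomatic_complexity": 4, "maintainability_index": 78,
    "duplication_ratio": 0.02, "test_coverage": 0.91,
    "file_size": 312, "dependency_count": 5,
}
print(make_feature_vector(task))  # -> [4.0, 78.0, 0.02, 0.91, 312.0, 5.0]
```

Training labels would come from the perfect-hindsight oracle: for each task, the cheapest tier whose output passed the verification gate.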
Circularity Check
No significant circularity identified
Full rationale
The paper analytically derives two falsifiable conditions (light-tier pass rate exceeding inter-tier cost ratio; code health discrimination with p-hat >= 0.56) as mathematical requirements for cost-effective routing under the stated tier asymmetry, without reducing them to fitted parameters, self-definitions, or self-citations. The evaluation protocol compares heuristic, ML, and oracle policies on the external SWE-bench Lite benchmark, treating code health metrics as independent inputs transformed into routing signals. No load-bearing step equates a prediction to its own inputs by construction, and the framework remains self-contained against external benchmarks with no imported uniqueness theorems or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Code health metrics can reliably indicate when a lighter LLM tier will produce output that passes the same verification gate as a heavier tier.
Reference graph
Works this paper leans on
- [1] Markus Borg, Nadim Hagatulah, Adam Tornhill, and Emma Söderberg. Code for machines, not just humans: Quantifying AI-friendliness with code health metrics, 2026. URL https://arxiv.org/abs/2601.02200
- [2] Yi Chen, JiaHao Zhao, and HaoHao Han. A survey on collaborative mechanisms between large and small language models, 2025. URL https://arxiv.org/abs/2505.07460
- [3]
- [4] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of the International Conference on Learning Representations (ICLR), 2024
- [5] Barbara Kitchenham and Lech Madeyski. Recommendations for analysing and meta-analysing small sample size software engineering experiments. Empirical Software Engineering, 29(6):137, 2024
- [6] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), pages 4765--4774, 2017
- [7] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data, 2024. URL https://arxiv.org/abs/2406.18665
- [8] Adam Tornhill and Markus Borg. Code red: The business impact of code quality -- a quantitative study of 39 proprietary production codebases. In Proceedings of the International Conference on Technical Debt, TechDebt '22, pages 11--20. ACM, 2022. doi:10.1145/3524843.3528091. URL https://doi.org/10.1145/3524843.3528091
- [9] Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, and Zhuokai Zhao. Token-level LLM collaboration via FusionRoute, 2026. URL https://arxiv.org/abs/2601.05106