Code Broker: A Multi-Agent System for Automated Code Quality Assessment
Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3
The pith
Parallel specialized agents in a multi-agent system generate readable code quality feedback that complements traditional linting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Code Broker realizes a hierarchical five-agent architecture in which a root orchestrator coordinates a sequential pipeline agent that dispatches three specialized agents concurrently—a Correctness Assessor, a Style Assessor, and a Description Generator—before an Improvement Recommender synthesizes their outputs. The system quantifies four quality dimensions (correctness, security, style, and maintainability) on a normalized scale, fuses LLM-based semantic reasoning with Pylint static analysis signals, and renders reports in Markdown and HTML formats suitable for integration into developer workflows.
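The normalized scoring can be illustrated with Pylint's documented default score formula (10.0 minus ten times the weighted message count per statement). The clamping, rescaling to [0, 1], and the convex fusion with an LLM-derived signal below are assumptions for illustration; the paper does not specify its fusion rule.

```python
def pylint_score(error: int, warning: int, refactor: int, convention: int,
                 statements: int) -> float:
    # Pylint's default evaluation:
    #   10.0 - 10 * (5*error + warning + refactor + convention) / statements
    raw = 10.0 - 10.0 * (5 * error + warning + refactor + convention) / max(statements, 1)
    # Clamp negative scores and rescale to a [0, 1] signal (our convention).
    return max(raw, 0.0) / 10.0

def fuse(static_signal: float, llm_signal: float, w_static: float = 0.5) -> float:
    # Hypothetical fusion: simple convex combination of the deterministic
    # Pylint signal and an LLM judgement on the same dimension.
    return w_static * static_signal + (1 - w_static) * llm_signal

style = pylint_score(error=0, warning=2, refactor=1, convention=3, statements=60)
fused = fuse(style, llm_signal=0.8)
```

With the counts above, the Pylint signal is 0.9 and the fused score 0.85; a real deployment would extract the counts from Pylint's JSON message output per analyzed file.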
What carries the argument
The hierarchical five-agent architecture that coordinates a sequential pipeline with concurrent specialized agents for correctness, style, and description generation, followed by synthesis in an improvement recommender, fusing LLM semantic reasoning with deterministic static analysis.
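The described orchestration pattern (concurrent specialists, then a synthesis step) can be sketched with plain `asyncio`; the agent names are taken from the paper, but the function signatures and report shape are illustrative stand-ins, not the Agent Development Kit API.

```python
import asyncio

# Hypothetical stand-ins for the three specialised agents; in the real
# system each would wrap an LLM call and, where relevant, Pylint output.
async def correctness_assessor(code: str) -> dict:
    return {"dimension": "correctness", "notes": ["no obvious defects found"]}

async def style_assessor(code: str) -> dict:
    return {"dimension": "style", "notes": ["line lengths within limits"]}

async def description_generator(code: str) -> dict:
    return {"dimension": "description", "notes": ["one public function"]}

def improvement_recommender(findings: list) -> dict:
    # Synthesis step: merge the concurrent agents' findings into one report.
    return {"report": {f["dimension"]: f["notes"] for f in findings}}

async def pipeline(code: str) -> dict:
    # Sequential pipeline agent: dispatch the three specialists
    # concurrently, then hand their combined output to the recommender.
    findings = await asyncio.gather(
        correctness_assessor(code),
        style_assessor(code),
        description_generator(code),
    )
    return improvement_recommender(list(findings))

report = asyncio.run(pipeline("def add(a, b):\n    return a + b\n"))
```

The root orchestrator would sit one level above `pipeline`, handling input discovery (files, directories, GitHub repositories) and report rendering.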
Load-bearing premise
The LLM-based specialized agents produce reliable, non-hallucinated assessments of code correctness and maintainability, even though the paper offers no quantitative validation against ground truth.
What would settle it
A quantitative study measuring agreement rates between the system's reports and independent human expert reviews on a fixed set of Python code samples containing known issues.
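Such a study reduces to comparing per-sample labels from the system against expert labels. A minimal sketch, computing raw agreement and Cohen's kappa for binary "issue present" labels (the label vectors below are invented placeholders, not data from the paper):

```python
def agreement_and_kappa(system, expert):
    # Raw agreement: fraction of samples where the two raters match.
    n = len(system)
    agree = sum(s == e for s, e in zip(system, expert)) / n
    # Chance agreement for two independent binary raters.
    p_sys = sum(system) / n
    p_exp = sum(expert) / n
    p_chance = p_sys * p_exp + (1 - p_sys) * (1 - p_exp)
    # Cohen's kappa corrects raw agreement for chance.
    kappa = (agree - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return agree, kappa

# Placeholder labels: system flags vs. independent expert review.
agree, kappa = agreement_and_kappa([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1])
```

Here raw agreement is 4/6 and kappa 1/3; per-dimension kappas (correctness, security, style, maintainability) would make the reliability claim testable.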
Original abstract
We present Code Broker, a multi-agent system built on Google's Agent Development Kit (ADK) that analyses Python source code from individual files, local directory trees, or remote GitHub repositories and generates structured, actionable quality assessment reports. The system realises a hierarchical five-agent architecture in which a root orchestrator coordinates a sequential pipeline agent that, in turn, dispatches three specialised agents concurrently (a Correctness Assessor, a Style Assessor, and a Description Generator) before synthesising their findings through an Improvement Recommender. Reports quantify four quality dimensions (correctness, security, style, and maintainability) on a normalised scale and are rendered in both Markdown and HTML for integration into diverse developer workflows. Code Broker fuses LLM-based semantic reasoning with deterministic static analysis signals from Pylint, employs asynchronous execution with exponential-backoff retry logic to improve robustness under transient API failures, and explores lightweight session memory for retaining and querying prior assessment context across runs. We frame this paper as a technical report on system design, prompt engineering, and tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases of varying scale. The results indicate that parallel specialised agents produce readable, developer-oriented feedback that complements traditional linting, while also foregrounding current limitations in evaluation depth, security tooling, large-repository handling, and the exclusive reliance on in-memory persistence. All code and reproducibility materials are publicly available: https://github.com/Samir-atra/agents_intensive_dev.
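The retry strategy the abstract describes can be sketched as asynchronous exponential backoff with jitter; the attempt count, base delay, cap, and jitter scheme below are illustrative defaults, not values taken from the paper.

```python
import asyncio
import random

async def with_retries(call, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    # Retry an async call on any exception, sleeping with capped
    # exponential backoff and full jitter between attempts.
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted the retry budget
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            await asyncio.sleep(delay)

# Usage: a simulated flaky API call that fails twice before succeeding.
state = {"calls": 0}

async def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient API failure")
    return "ok"

result = asyncio.run(with_retries(flaky, base=0.01))
```

In practice the exception filter would be narrowed to the transient error types raised by the model API client rather than catching every `Exception`.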
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Code Broker, a hierarchical five-agent multi-agent system built on Google's Agent Development Kit for automated Python code quality assessment. A root orchestrator coordinates a pipeline agent that dispatches three concurrent specialized agents (Correctness Assessor, Style Assessor, Description Generator) whose outputs are synthesized by an Improvement Recommender. The system fuses LLM-based semantic reasoning with deterministic Pylint signals to quantify correctness, security, style, and maintainability, producing Markdown and HTML reports. It describes prompt engineering, asynchronous execution with retry logic, and lightweight session memory, and includes a preliminary qualitative evaluation on representative Python codebases claiming that parallel agents yield readable, developer-oriented feedback that complements traditional linting. All code is publicly released.
Significance. If the positive indications from the evaluation hold under rigorous testing, the work would demonstrate a practical way to integrate semantic LLM reasoning with static analysis for more actionable code reviews, advancing automated software engineering tools. The public release of code and reproducibility materials is a clear strength supporting further research. However, the current preliminary qualitative framing limits immediate impact, as the central claims about feedback quality and complementarity rest on unquantified observations.
major comments (1)
- [preliminary qualitative evaluation] The manuscript states only that a 'preliminary qualitative evaluation' was performed on 'representative Python codebases of varying scale' (abstract and evaluation section) and that the results 'indicate' readable, developer-oriented feedback that complements linting. No sample count, selection method, rubric for assessing 'readable' or 'complements', inter-rater scores, comparison to ground truth (e.g., test-suite outcomes or expert annotations), or quantitative metrics are supplied. This directly undermines the load-bearing claim that the parallel specialized agents (Correctness Assessor and Description Generator) produce reliable, non-hallucinated assessments.
minor comments (2)
- [Abstract] Abstract: 'Google s Agent Development Kit' is missing the possessive apostrophe and should read 'Google's Agent Development Kit'.
- [limitations discussion] The description of limitations (large repository handling, security tooling, in-memory persistence) is appropriately self-critical but could be expanded with concrete examples of failure cases observed during the qualitative runs to aid readers.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive criticism of our work. We accept that the description of the preliminary qualitative evaluation is insufficiently detailed and will revise the manuscript to provide greater transparency and context for our claims.
Point-by-point responses
- Referee: [preliminary qualitative evaluation] The manuscript states only that a 'preliminary qualitative evaluation' was performed on 'representative Python codebases of varying scale' (abstract and evaluation section) and that the results 'indicate' readable, developer-oriented feedback that complements linting. No sample count, selection method, rubric for assessing 'readable' or 'complements', inter-rater scores, comparison to ground truth (e.g., test-suite outcomes or expert annotations), or quantitative metrics are supplied. This directly undermines the load-bearing claim that the parallel specialized agents (Correctness Assessor and Description Generator) produce reliable, non-hallucinated assessments.
Authors: We agree with the referee that the current presentation of the evaluation does not provide enough information to fully assess the reliability of the agents' outputs. As the manuscript is positioned as a technical report on the design and implementation of the multi-agent system, the evaluation was kept preliminary to focus on the novel aspects of agent hierarchy and tool integration. To strengthen the paper, we will revise the evaluation section to include the number of codebases evaluated, how they were selected, the criteria used to judge the feedback as readable and complementary (such as manual comparison to Pylint outputs and assessment of semantic depth), and illustrative examples. We will also add an explicit discussion of limitations, covering the absence of quantitative metrics, inter-rater reliability, and ground-truth comparisons, and how this affects the strength of our claims regarding non-hallucinated outputs. We will highlight that the system's use of Pylint for correctness and style provides a deterministic baseline that the LLM agents build upon, reducing the risk of pure hallucinations. These changes will be incorporated in the next version of the manuscript. Revision: yes
Circularity Check
No circularity: descriptive system report with no derivation chain
Full rationale
The manuscript is framed explicitly as a technical report on system architecture, prompt engineering, and a preliminary qualitative evaluation of a multi-agent code assessment tool. It contains no equations, no fitted parameters, no predictions derived from data, and no first-principles claims that could be reduced to their own inputs. The evaluation section describes results only in qualitative terms on representative codebases without quantitative metrics, ground-truth comparisons, or self-referential fitting. Because no load-bearing derivation or prediction step exists, none of the enumerated circularity patterns apply.
Axiom & Free-Parameter Ledger
free parameters (1)
- Agent prompts and orchestration logic
axioms (1)
- domain assumption: LLM semantic reasoning on source code yields actionable quality signals that complement static analysis
invented entities (1)
- Hierarchical five-agent architecture with root orchestrator, pipeline agent, three concurrent assessors, and recommender (no independent evidence)
Reference graph
Works this paper leans on
- [1] Google & Kaggle. (2026). Inside Kaggle's AI Agents Intensive Course with Google. Google Blog. https://blog.google/innovation-and-ai/technology/developers-tools/ai-agents-intensive-recap/
- [2] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR).
- [3] Weng, L. (2023). LLM-powered Autonomous Agents. Lil'Log. https://lilianweng.github.io/posts/2023-06-23-agent/
- [4] Vassallo, C., Panichella, S., Palomba, F., Proksch, S., Zaidman, A., & Gall, H. C. (2019). How developers engage with static analysis tools in different contexts. Empirical Software Engineering, 24(2), 1419–1457.
- [5] Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., et al. (2021). CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. NeurIPS Datasets and Benchmarks Track.
- [6] Hong, S., Zheng, X., Chen, J., Cheng, Y., Wang, J., Zhang, C., et al. (2023). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv preprint arXiv:2308.00352.
- [7] Qian, C., Cong, X., Yang, C., Chen, W., Su, Y., Xu, J., et al. (2023). ChatDev: Communicative Agents for Software Development. arXiv preprint arXiv:2307.07924.
- [8] Chen, J., Guo, X., Chen, S., Cheung, S. C., & Shen, J. (2025). Multi-Agent Systems for Dataset Adaptation in Software Engineering: Capabilities, Limitations, and Future Directions. arXiv preprint arXiv:2511.21380.
- [9] Benkovich, N., & Valkov, V. (2026). Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering. arXiv preprint arXiv:2602.01465.
- [10] Zhang, W., Zhou, Y., Qu, H., & Li, H. (2026). Loosely-Structured Software: Engineering Context, Structure, and Evolution Entropy in Runtime-Rewired Multi-Agent Systems. arXiv preprint arXiv:2603.15690.
- [11] Cai, Y., Li, R., Liang, P., Shahin, M., & Li, Z. (2025). Designing LLM-based Multi-Agent Systems for Software Engineering Tasks: Quality Attributes, Design Patterns and Rationale. arXiv preprint arXiv:2511.08475.
- [12] Dam, H. K., Mahala, G., Hoda, R., Zheng, X., & Conati, C. (2025). Towards autonomous normative multi-agent systems for Human-AI software engineering teams. arXiv preprint arXiv:2512.02329.
- [13] Tang, Y., & Runkler, T. (2026). LLM-Based Agentic Systems for Software Engineering: Challenges and Opportunities. arXiv preprint arXiv:2601.09822.
- [14] He, J., Treude, C., & Lo, D. (2024). LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead. arXiv preprint arXiv:2404.04834.
- [15] Ronanki, K. (2025). Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering. arXiv preprint arXiv:2505.04251.
- [16] Bowers, B., Khapre, S., & Kalita, J. (2025). Analyzing Code Injection Attacks on LLM-based Multi-Agent Systems in Software Development. arXiv preprint arXiv:2512.21818.
- [17] Phan, H. N., Nguyen, T. N., Nguyen, P. X., & Bui, N. D. Q. (2024). HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale. arXiv preprint arXiv:2409.16299.
- [18]
- [19]
- [20] Weyns, D., & Oquendo, F. (2019). An Architectural Style for Self-Adaptive Multi-Agent Systems. arXiv preprint arXiv:1909.03475.
- [21] Amaral, C. J., Hübner, J. F., & Kampik, T. (2020). Towards Jacamo-rest: A Resource-Oriented Abstraction for Managing Multi-Agent Systems. arXiv preprint arXiv:2006.05619.
- [22]
- [23] Engelmann, D. C., Ferrando, A., Panisson, A. R., Ancona, D., Bordini, R. H., & Mascardi, V. (2022). RV4JaCa – Runtime Verification for Multi-Agent Systems. arXiv preprint arXiv:2207.09708.
- [24] Ferrando, A., & Malvone, V. (2024). VITAMIN: A Compositional Framework for Model Checking of Multi-Agent Systems. arXiv preprint arXiv:2403.02170.
- [25] Owotogbe, J. (2025). Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering. arXiv preprint arXiv:2505.03096.
- [26] Goyal, M., & Bhasin, P. (2025). Moving From Monolithic To Microservices Architecture for Multi-Agent Systems. arXiv preprint arXiv:2505.07838.
- [27] Attrah, S. (2026). Code Broker: Multi-Agent System for Automated Code Quality Assessment. Main Project Repository. https://github.com/Samir-atra/agents_intensive_dev
- [28] Attrah, S. (2026). Code Broker Package: Reusable Python Distribution for Automated Code Assessment. GitHub Repository. https://github.com/Samir-atra/Code_broker_pkg