CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

Enrico Vicario; Marco Becattini; Matteo Minin; Niccolò Caselli; Roberto Verdecchia

arxiv: 2606.18976 · v1 · pith:D53RPZBXnew · submitted 2026-06-17 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-26 20:09 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords software architecture assessment, multi-agent LLM, automated educational feedback, requirements traceability, hallucination mitigation, document extraction, LaTeX report generation

The pith

CAPRA uses multiple LLM agents plus fuzzy evidence matching to generate reliable feedback on student software architecture reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CAPRA as a multi-agent system that extracts text and diagrams from architecture deliverables, checks structural completeness and requirements traceability, then produces personalized LaTeX feedback. It adds a deterministic Evidence Anchoring step and a ConsistencyManager agent to limit hallucinations and keep output grounded in the submitted document. A test on ten student reports found the system met 88.8 percent of eight binary criteria under strict two-rater rules, reached moderate agreement with human evaluators, and finished each report in slightly over four minutes. The authors argue this shows automated architectural feedback is now feasible in education, though human review is still required for subjective judgments.

Core claim

CAPRA coordinates specialized agents with a Python microservice for multi-modal extraction and applies Evidence Anchoring via normalized Levenshtein fuzzy matching together with a ConsistencyManager that cross-verifies and deduplicates findings, allowing the system to produce template-compliant feedback while satisfying 88.8 percent of evaluation criteria on ten student reports and achieving kappa 0.582 agreement with human raters.

What carries the argument

The Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance together with the ConsistencyManager agent that cross-verifies, deduplicates, and merges agent findings.
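
To make the anchoring idea concrete, here is a minimal sketch of fuzzy evidence matching with a normalized Levenshtein similarity. It is an illustration under stated assumptions, not the authors' implementation: the names `anchor_quote` and `Anchor`, the whitespace and case normalization, and the choice to normalize by the longer string are all hypothetical. The default threshold of 0.75 mirrors the value quoted in the simulated rebuttal further down this page.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance over the longer length.

    Assumption: this is one common normalization; the paper may use another.
    """
    a, b = " ".join(a.lower().split()), " ".join(b.lower().split())
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


@dataclass
class Anchor:
    quote: str       # evidence text an agent claims to have seen
    best_span: str   # closest span found in the submitted document
    score: float     # similarity between quote and best_span
    grounded: bool   # True when the score clears the threshold


def anchor_quote(quote: str, spans: list[str], threshold: float = 0.75) -> Anchor:
    """Match a quoted piece of evidence against document spans."""
    best = max(spans, key=lambda s: similarity(quote, s), default="")
    score = similarity(quote, best) if spans else 0.0
    return Anchor(quote, best, score, score >= threshold)
```

A near-verbatim quote with a typo or changed punctuation clears the threshold, while a paraphrase that shares meaning but few characters does not. That second case is the surface-string weakness the referee report below raises.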

If this is right

Each report can be processed in slightly over four minutes without manual extraction of text or diagrams.
Feedback meets 88.8 percent of the eight-criterion taxonomy under strict aggregation.
Moderate inter-rater agreement (kappa 0.582) is reached with human evaluators on the same reports.
Template and tone compliance can be enforced while still flagging specific traceability and completeness issues.
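
The 88.8 percent figure depends on how two raters' binary scores are combined. The sketch below shows one plausible reading of "strict aggregation", in which a (report, criterion) cell counts as satisfied only when both raters mark it satisfied. That reading is an assumption, since this page does not define the rule, and the toy scores are invented for illustration.

```python
def strict_satisfaction_rate(rater_a, rater_b):
    """Share of (report, criterion) cells that both raters marked as satisfied.

    Assumption: "strict" means a cell counts only when BOTH raters say yes.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("raters must score the same number of reports")
    satisfied = 0
    total = 0
    for row_a, row_b in zip(rater_a, rater_b):
        if len(row_a) != len(row_b):
            raise ValueError("raters must score the same criteria per report")
        for a, b in zip(row_a, row_b):
            total += 1
            if a and b:
                satisfied += 1
    if total == 0:
        raise ValueError("no scores supplied")
    return satisfied / total


# Toy data: 2 reports x 8 binary criteria (made up for illustration).
rater_a = [
    [True, True, True, True, True, True, True, False],
    [True, True, True, True, True, True, True, True],
]
rater_b = [
    [True, True, True, True, True, True, False, False],
    [True, True, True, True, True, True, True, True],
]
rate = strict_satisfaction_rate(rater_a, rater_b)
print(f"strict satisfaction: {rate:.1%}")
```

Under this rule a single dissenting rater is enough to fail a cell, so the strict rate is never higher than either rater's individual rate.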

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same anchoring-plus-consistency pattern could be tested on other document types such as requirements specifications or design documents.
Processing time and criterion scores might change if the underlying vision model or fuzzy-matching threshold is altered.
The current eight-criterion taxonomy could be expanded to include additional dimensions such as diagram clarity or notation consistency.

Load-bearing premise

The Evidence Anchoring step and ConsistencyManager will reliably prevent hallucinations and produce educationally accurate feedback on structural completeness and requirements traceability for varied student submissions.

What would settle it

Running the system on a new set of at least 30 architecture reports from a different course or institution and measuring whether the strict two-rater criterion satisfaction rate stays above 80 percent or drops due to missed issues or unsupported claims.
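
Whether a rate "stays above 80 percent" depends on how much uncertainty the sample size leaves. As a rough illustration, the sketch below computes a Wilson score interval for a proportion. It treats each report as one independent trial, which is a simplifying assumption: criteria scored within the same report are likely correlated, and the paper may analyze the data differently. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p_hat over n trials.

    z = 1.96 corresponds to an approximate 95 percent confidence level.
    """
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Compare the interval width for 10 reports against the 30 suggested above.
for n in (10, 30):
    low, high = wilson_interval(0.888, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

With the same observed rate, the interval at 30 reports is noticeably narrower than at 10, which is the practical reason a larger replication would be more decisive.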

Figures

Figures reproduced from arXiv: 2606.18976 by Enrico Vicario, Marco Becattini, Matteo Minin, Niccolò Caselli, Roberto Verdecchia.

**Figure 1.** CAPRA system architecture: four-stage pipeline from PDF ingestion (Document Parsing), through parallel multi-agent evaluation (Verification Agents), evidence-anchored deduplication (Evidence Anchoring), to final report generation (Report Generation).

**Figure 2.** LLM Compactor: similar features within a cluster (e.g., “UI mockups for interfaces”, “Navigation menu design”, “Input validation forms”) are merged by an LLM into a single canonical Summary Feature with a structured checklist used by the FeatureCheckAgent.

**Figure 3.** Confidence modulation function: the multiplier applied to the initial confidence score C_i as a function of the similarity score S.

Abstract

Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAPRA gives a concrete multi-agent design for grounded feedback on architecture reports, but an evaluation on 10 reports is too thin to support strong claims about reliability.

The main takeaway is that this paper builds a working multi-agent LLM pipeline for reviewing student software architecture deliverables, using a microservice to pull text and UML diagrams, fuzzy Levenshtein anchoring to tie feedback to source material, and a ConsistencyManager to clean up outputs. That combination is presented as new relative to earlier code or essay graders.

The system description is clear on the pieces: specialized agents, vision-enabled extraction, the eight-criterion taxonomy, and the reported runtimes around four minutes per report. The 88.8% criteria satisfaction and moderate kappa of 0.582 on the ten reports give a plausible first signal that the approach can produce template-compliant LaTeX feedback without obvious collapse.

The soft spots sit mainly in the evaluation and the anchoring step. Ten reports is a small base for any performance claim, and the abstract leaves the selection process and full inter-rater details thin. The stress-test worry about normalized Levenshtein missing paraphrased requirements or diagram descriptions looks real; if the match fails, downstream severity or traceability judgments can still drift, and no threshold ablation is shown. The paper itself notes that human oversight stays necessary for subjective parts, which matches the moderate agreement numbers.

This work is for people building automated assessment tools inside software engineering education. A reader already working on multi-agent setups or SE feedback systems can pull the architecture and taxonomy details without much trouble.

It deserves peer review. The implementation is specific enough and the gap it targets is real; the current evidence is preliminary but the design choices are laid out plainly enough that referees can ask for the needed expansions.

Referee Report

2 major / 1 minor

Summary. The manuscript presents CAPRA, a multi-agent LLM system for automated assessment of software architecture deliverables. It coordinates specialized agents with a Python microservice for multi-modal extraction (PyMuPDF and gpt-4o for text and UML diagrams), employs deterministic Evidence Anchoring via normalized Levenshtein fuzzy matching to ground feedback, and uses a ConsistencyManager agent for cross-verification. Feedback is generated as template-compliant LaTeX. A preliminary evaluation on 10 student reports using an eight-criterion taxonomy reports 88.8% criterion satisfaction under strict two-rater aggregation, moderate inter-rater agreement (kappa=0.582), and average processing time slightly over 4 minutes per report. The authors note that human oversight remains necessary for subjective dimensions.
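
The deduplicate-and-merge role described in the summary can be pictured with a small deterministic sketch. In the paper the ConsistencyManager is an agent, so the code below is only a stand-in for the bookkeeping part of that step. The `Finding` fields, the three-level severity scale, and the rule "merge findings that share an anchor and keep the most severe" are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical severity scale; the paper's actual levels are not given here.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Finding:
    agent: str      # which verification agent produced the finding
    anchor: str     # document span the finding was anchored to
    message: str
    severity: str


@dataclass
class MergedFinding:
    anchor: str
    message: str
    severity: str
    agents: list[str] = field(default_factory=list)


def merge_findings(findings: list[Finding]) -> list[MergedFinding]:
    """Collapse findings that point at the same anchored span.

    The most severe message wins, and every contributing agent is recorded
    so that multi-agent agreement stays visible in the merged result.
    """
    merged: dict[str, MergedFinding] = {}
    for f in findings:
        key = " ".join(f.anchor.lower().split())  # normalize case and spacing
        entry = merged.get(key)
        if entry is None:
            merged[key] = MergedFinding(f.anchor, f.message, f.severity, [f.agent])
            continue
        if f.agent not in entry.agents:
            entry.agents.append(f.agent)
        if SEVERITY_ORDER[f.severity] > SEVERITY_ORDER[entry.severity]:
            entry.severity, entry.message = f.severity, f.message
    return list(merged.values())
```

Keying on the anchored span rather than on the message text means two agents that describe the same problem in different words are still merged, provided anchoring resolved them to the same place in the document.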

Significance. If the anchoring and consistency mechanisms prove reliable across varied submissions, the work could meaningfully advance scalable, personalized feedback in software engineering education for complex tasks like structural completeness and requirements traceability. Strengths include the explicit design to mitigate hallucinations through deterministic grounding, the structured multi-agent coordination, and the use of a binary taxonomy for evaluation. These elements provide a concrete engineering contribution. The preliminary scale of the evaluation, however, constrains the immediate significance pending further validation.

major comments (2)

[Methods (Evidence Anchoring)] Methods section (Evidence Anchoring step): The reliability claim rests on normalized Levenshtein fuzzy matching to anchor feedback to source spans and prevent ungrounded judgments. No ablation study, threshold sensitivity analysis, or failure-case examination is reported for paraphrased requirements or non-identical UML descriptions common in student reports. Because the matching is surface-string based, systematic retrieval failures would directly affect downstream severity and traceability outputs and thereby the reported 88.8% success rate.
[Evaluation] Evaluation section: The performance numbers (88.8% criterion satisfaction, kappa=0.582) derive from only 10 reports. The manuscript does not detail the report selection process, the precise operationalization of each of the eight taxonomy criteria, or the full inter-rater data matrix. These omissions make it difficult to assess whether the moderate agreement and high satisfaction rate generalize or are sensitive to rater subjectivity.
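
For readers who want to see what sits behind a figure such as kappa=0.582, the following sketch computes Cohen's kappa for two raters from first principles. The ten binary labels are invented for illustration and are not the paper's data. By the commonly used Landis and Koch convention, values from 0.41 to 0.60 are read as moderate agreement.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where the raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (1 = criterion satisfied, 0 = not satisfied), made up here.
rater_a = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(f"kappa = {cohen_kappa(rater_a, rater_b):.3f}")
```

Because kappa discounts agreement expected by chance, it can be modest even when raw agreement is high, especially when most labels fall in one category. That is one reason the per-criterion agreement counts requested here matter.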

minor comments (1)

[Abstract] Abstract: The eight-criterion taxonomy is summarized at a high level; a concise enumerated list or reference to a table in the main text would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that greater transparency on the Evidence Anchoring implementation and the evaluation protocol is warranted. We address each major comment below and indicate the revisions we will make.

Referee: [Methods (Evidence Anchoring)] Methods section (Evidence Anchoring step): The reliability claim rests on normalized Levenshtein fuzzy matching to anchor feedback to source spans and prevent ungrounded judgments. No ablation study, threshold sensitivity analysis, or failure-case examination is reported for paraphrased requirements or non-identical UML descriptions common in student reports. Because the matching is surface-string based, systematic retrieval failures would directly affect downstream severity and traceability outputs and thereby the reported 88.8% success rate.

Authors: We accept that the current description is insufficient. The normalized Levenshtein threshold (0.75) was chosen to keep anchoring deterministic and independent of additional LLM calls. In the revised manuscript we will add a dedicated paragraph in the Methods section that (a) reports the threshold selection rationale, (b) presents a post-hoc sensitivity check on the 10 reports showing how criterion satisfaction changes at thresholds 0.65–0.85, and (c) enumerates the observed failure modes (primarily paraphrased functional requirements and diagram labels that differ in wording but not semantics). We will also state explicitly that surface-string matching remains a limitation and that embedding-based retrieval is planned for future versions. These additions will allow readers to assess the robustness of the 88.8% figure without requiring new experiments at this stage. Revision: partial.

Referee: [Evaluation] Evaluation section: The performance numbers (88.8% criterion satisfaction, kappa=0.582) derive from only 10 reports. The manuscript does not detail the report selection process, the precise operationalization of each of the eight taxonomy criteria, or the full inter-rater data matrix. These omissions make it difficult to assess whether the moderate agreement and high satisfaction rate generalize or are sensitive to rater subjectivity.

Authors: We agree that additional methodological detail is needed. The revised manuscript will include: (1) a description of the report selection process (random sample of 10 consenting submissions from one undergraduate software architecture course), (2) an appendix that gives the exact operational definition and scoring rubric for each of the eight criteria together with one positive and one negative example per criterion, and (3) the complete per-criterion inter-rater agreement table (or at minimum the raw agreement counts underlying the reported kappa). These changes will improve reproducibility while preserving the explicitly preliminary framing of the evaluation. Revision: yes.
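
The threshold sensitivity check promised in the first response can be sketched as a simple sweep. The code below is a hedged illustration only: the function names, the normalization by the longer string, and the three example pairs are assumptions, and the sweep values simply span the 0.65–0.85 range mentioned in that response.

```python
def normalized_similarity(a: str, b: str) -> float:
    """1 minus Levenshtein distance over the longer length, after light cleanup."""
    a, b = " ".join(a.lower().split()), " ".join(b.lower().split())
    if not a and not b:
        return 1.0
    row = list(range(len(b) + 1))  # edit distances for the previous row
    for i, ch_a in enumerate(a, start=1):
        prev_diag, row[0] = row[0], i
        for j, ch_b in enumerate(b, start=1):
            prev_diag, row[j] = row[j], min(
                row[j] + 1,                 # deletion
                row[j - 1] + 1,             # insertion
                prev_diag + (ch_a != ch_b)  # substitution (free on a match)
            )
    return 1.0 - row[-1] / max(len(a), len(b))


def grounded_share(pairs, threshold: float) -> float:
    """Fraction of (quote, source span) pairs whose similarity clears the threshold."""
    if not pairs:
        raise ValueError("no pairs supplied")
    return sum(normalized_similarity(q, s) >= threshold for q, s in pairs) / len(pairs)


def sweep(pairs, thresholds=(0.65, 0.70, 0.75, 0.80, 0.85)):
    """Grounded share at each threshold, to expose sensitivity to the cut-off."""
    return {t: grounded_share(pairs, t) for t in thresholds}


# Hypothetical pairs: exact quote, near-verbatim quote, and a paraphrase.
pairs = [
    ("login requires two-factor authentication", "login requires two-factor authentication"),
    ("login requires two factor authentication.", "login requires two-factor authentication"),
    ("users must confirm identity with a second device", "login requires two-factor authentication"),
]
for threshold, share in sweep(pairs).items():
    print(f"threshold {threshold:.2f}: {share:.0%} grounded")
```

In this toy set the paraphrase fails at every threshold in the range, which illustrates why lowering the cut-off alone may not rescue semantically equivalent but differently worded evidence.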

Circularity Check

0 steps flagged

No significant circularity; metrics from independent human evaluation

The paper reports empirical metrics (88.8% criterion satisfaction, kappa=0.582) obtained via external human-rater comparison against a fixed eight-criterion taxonomy. These quantities are not derived from or equivalent to any internal system parameters, fitted values, or self-citations. No equations, predictions, or uniqueness claims reduce to inputs by construction. The Evidence Anchoring mechanism is a design choice whose reliability is externally tested rather than tautological. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the system description relies on standard LLM capabilities, existing libraries (PyMuPDF, gpt-4o), and fuzzy matching without introducing new postulated constructs.
