WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems
Pith reviewed 2026-05-10 13:37 UTC · model grok-4.3
The pith
A multi-agent framework clarifies incomplete test scenarios and applies equivalence partitioning, raising the execution success rate of generated web test scripts by 30 to 60 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WebMAC consists of three multi-agent modules that together complete natural language scenario descriptions via interactive clarification, transform those scenarios through equivalence class partitioning to meet adequacy criteria, and generate corresponding test scripts. When evaluated on four web systems, the resulting scripts execute successfully at rates 30 to 60 percent higher than the prior state-of-the-art method, testing efficiency rises by 29 percent, token consumption falls by 47.6 percent, and more errors are detected.
What carries the argument
Three sequential multi-agent modules that perform interactive scenario clarification, equivalence class partitioning for test adequacy, and script generation.
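A minimal sketch of how such a three-stage pipeline could be wired together, assuming an LLM client behind a placeholder call; the prompts, the call_llm helper, and the data shapes below are illustrative inventions, not WebMAC's actual implementation.

```python
# Hypothetical sketch of a three-module pipeline like the one described above.
# call_llm, the prompts, and the data shapes are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-model API call; wire up any LLM client here."""
    raise NotImplementedError

@dataclass
class Scenario:
    description: str
    clarifications: list[str] = field(default_factory=list)

def clarify(scenario: Scenario, ask_user: Callable[[str], str]) -> Scenario:
    """Module 1: interactively complete an underspecified scenario description."""
    question = call_llm(
        f"Name one missing detail in this test scenario, or reply NONE:\n{scenario.description}"
    )
    while question.strip() != "NONE":
        scenario.clarifications.append(f"{question} -> {ask_user(question)}")
        question = call_llm(
            "Name one further missing detail, or reply NONE:\n"
            + scenario.description + "\n" + "\n".join(scenario.clarifications)
        )
    return scenario

def partition(scenario: Scenario) -> list[str]:
    """Module 2: instantiate one concrete test case per equivalence class."""
    cases = call_llm(
        "Partition the scenario's inputs into equivalence classes and emit one "
        "test case per class, one per line:\n" + scenario.description
        + "\n" + "\n".join(scenario.clarifications)
    )
    return [line for line in cases.splitlines() if line.strip()]

def generate_scripts(test_cases: list[str]) -> list[str]:
    """Module 3: convert each instantiated test case into an executable script."""
    return [call_llm(f"Write a Selenium test script for:\n{case}") for case in test_cases]
```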
Load-bearing premise
The interactive multi-agent clarification step reliably produces complete and unbiased scenario descriptions, and the subsequent equivalence partitioning step produces test cases that cover all important behaviors without gaps or invalid tests.
What would settle it
Running WebMAC on a web system whose known critical edge cases and failure modes are documented in advance and finding that it detects no more errors than the baseline method while showing no gain in script success rate.
Original abstract
Scenario testing is an important technique for detecting errors in web systems. Testers draft test scenarios and convert them into test scripts for execution. Early methods relied on testers to convert test scenarios into test scripts. Recent LLM-based scenario testing methods can generate test scripts from natural language descriptions of test scenarios. However, these methods are not only limited by the incompleteness of descriptions but also overlook test adequacy criteria, making it difficult to detect potential errors. To address these limitations, this paper proposes WebMAC, a multi-agent collaborative framework for scenario testing of web systems. WebMAC can complete natural language descriptions of test scenarios through interactive clarification and transform adequate instantiated test scenarios via equivalence class partitioning. WebMAC consists of three multi-agent modules, responsible respectively for completing natural language descriptions of test scenarios, transforming test scenarios, and converting test scripts. We evaluated WebMAC on four web systems. Compared with the SOTA method, WebMAC improves the execution success rate of generated test scripts by 30%-60%, increases testing efficiency by 29%, and reduces token consumption by 47.6%. Furthermore, WebMAC can effectively detect more errors in web systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes WebMAC, a multi-agent collaborative framework for scenario testing of web systems consisting of three modules: one for interactively completing natural-language scenario descriptions, one for transforming scenarios into adequate instantiated tests via equivalence class partitioning, and one for converting them into executable scripts. Evaluated on four web systems, it claims to outperform an unnamed SOTA baseline by raising the execution success rate of generated test scripts by 30-60%, boosting testing efficiency by 29%, cutting token consumption by 47.6%, and detecting more errors.
Significance. If the empirical gains and the reliability of the agent-driven clarification and partitioning steps are substantiated, the work could advance LLM-based automated testing in software engineering by systematically addressing description incompleteness and incorporating test-adequacy criteria, potentially yielding more effective and resource-efficient error detection for web applications.
major comments (3)
- [Abstract and Evaluation] Abstract and Evaluation section: the headline claims of 30-60% success-rate improvement, 29% efficiency gain, and 47.6% token reduction are presented without any description of the SOTA baseline, number of experimental runs, statistical tests, error bars, or raw data tables, rendering the quantitative results unverifiable and load-bearing for the central performance claims.
- [§3.2] §3.2 (Scenario Transformation Module): the description of agent-driven equivalence class partitioning provides no concrete implementation details, coverage criteria, validation metrics, or human-review results confirming that the generated partitions are complete, unbiased, and free of invalid cases; this assumption is load-bearing for the claim that WebMAC detects more errors than SOTA.
- [Evaluation] Evaluation section: generalization from four web systems is asserted without justification of system selection criteria, diversity of scenarios, or analysis of cases where clarification failed or partitioning missed edge cases, undermining the broader applicability claim.
minor comments (2)
- [Abstract] The paper should explicitly name and cite the SOTA method used for comparison to allow reproducibility and fair assessment of novelty.
- [§3] Figure captions and algorithm pseudocode could be expanded to clarify the exact interaction protocol among the three agent modules.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review of our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions to enhance verifiability, methodological transparency, and the discussion of generalizability.
Point-by-point responses
Referee: [Abstract and Evaluation] Abstract and Evaluation section: the headline claims of 30-60% success-rate improvement, 29% efficiency gain, and 47.6% token reduction are presented without any description of the SOTA baseline, number of experimental runs, statistical tests, error bars, or raw data tables, rendering the quantitative results unverifiable and load-bearing for the central performance claims.
Authors: We agree that additional experimental details are required to make the quantitative claims verifiable. In the revised manuscript, we will expand the Evaluation section (and update the abstract if space permits) to name and briefly describe the SOTA baseline, report the exact number of experimental runs performed, include statistical tests (such as paired t-tests or Wilcoxon signed-rank tests with p-values), add error bars to all performance figures, and provide a table of raw or aggregated data for success rates, efficiency, token usage, and error detection counts. These additions will directly address the verifiability concern while preserving the reported improvements. revision: yes
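A minimal sketch of the kind of paired significance test promised here, assuming per-system success rates for both methods are available; all numbers below are placeholders, not data from the paper.

```python
# Sketch of paired significance testing over per-system results. The arrays are
# placeholder success rates for illustration only, not measurements from the paper.
from scipy.stats import ttest_rel, wilcoxon

webmac = [0.82, 0.78, 0.85, 0.80]    # hypothetical per-system script success rates
baseline = [0.45, 0.50, 0.40, 0.48]  # hypothetical SOTA baseline on the same systems

# Wilcoxon signed-rank test: nonparametric, suited to a small number of paired runs
# (though with only four pairs its minimum attainable two-sided p-value is 0.125).
w_stat, w_p = wilcoxon(webmac, baseline)

# Paired t-test as a parametric cross-check on the same pairs.
t_stat, t_p = ttest_rel(webmac, baseline)

print(f"Wilcoxon: W={w_stat:.1f}, p={w_p:.3f}; paired t-test: t={t_stat:.2f}, p={t_p:.3f}")
```

With only four systems, reporting per-run results and effect sizes alongside these tests would matter as much as the p-values themselves.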
Referee: [§3.2] §3.2 (Scenario Transformation Module): the description of agent-driven equivalence class partitioning provides no concrete implementation details, coverage criteria, validation metrics, or human-review results confirming that the generated partitions are complete, unbiased, and free of invalid cases; this assumption is load-bearing for the claim that WebMAC detects more errors than SOTA.
Authors: We acknowledge that §3.2 would benefit from greater specificity. We will revise this section to include concrete implementation details: the exact prompts and decision logic used by the partitioning agents, the equivalence class criteria (e.g., partitioning along input domains, boundary values, and valid/invalid combinations), the coverage criteria applied (such as ensuring each class is instantiated at least once), and internal validation metrics (e.g., partition count, validity rate, and overlap checks performed by the agents). Regarding human review, we did not conduct a large-scale study; we will therefore add a discussion of automated safeguards against bias and invalid cases, plus results from a small pilot human validation if feasible, or explicitly note the reliance on agent-driven checks. This will strengthen the link to improved error detection. revision: partial
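To make the partitioning criteria concrete, here is a textbook-style enumeration of equivalence classes for a single hypothetical form field; the field, its valid range, and the adequacy checks are invented for illustration and are not WebMAC's output.

```python
# Textbook-style equivalence class partitioning for a hypothetical "age" field
# whose valid domain is 18-120. Classes, values, and checks are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class EquivalenceClass:
    name: str
    representative: str  # one concrete input instantiating the class
    valid: bool          # whether the system should accept this input

classes = [
    EquivalenceClass("in-range value", "35", valid=True),
    EquivalenceClass("lower boundary", "18", valid=True),
    EquivalenceClass("upper boundary", "120", valid=True),
    EquivalenceClass("below range", "17", valid=False),
    EquivalenceClass("above range", "121", valid=False),
    EquivalenceClass("non-numeric input", "abc", valid=False),
    EquivalenceClass("empty input", "", valid=False),
]

# Adequacy checks mirroring the criteria named above: every class is instantiated
# exactly once, and both valid and invalid partitions are represented.
assert len({c.name for c in classes}) == len(classes), "duplicate class"
assert any(c.valid for c in classes) and any(not c.valid for c in classes)
```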
Referee: [Evaluation] Evaluation section: generalization from four web systems is asserted without justification of system selection criteria, diversity of scenarios, or analysis of cases where clarification failed or partitioning missed edge cases, undermining the broader applicability claim.
Authors: We will strengthen the Evaluation section by adding explicit justification for selecting the four web systems, including criteria such as diversity of application domains, underlying technologies, and user-interaction complexity. We will also characterize the diversity of the test scenarios employed and include a dedicated analysis of challenging cases, such as scenarios requiring multiple clarification rounds or partitions that initially missed certain edge cases, together with how the framework mitigated these issues. These additions will provide a more balanced view of applicability without overstating the results. revision: yes
Circularity Check
No circularity: empirical claims rest on external evaluation, not self-referential derivations
Full rationale
The paper contains no equations, derivations, or first-principles predictions. Its central claims are empirical performance improvements (success rate, efficiency, token use, error detection) measured on four web systems against an external SOTA baseline. The framework description (three-agent modules for clarification and equivalence partitioning) is presented as a design choice whose adequacy is asserted via experimental results rather than reduced to fitted parameters or self-citations. No load-bearing step reduces to its own inputs by construction; the work is self-contained as an applied engineering contribution.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLMs can reliably complete and clarify natural language test scenario descriptions via multi-agent interaction without introducing errors or biases.
- Domain assumption: Equivalence class partitioning can be automatically applied to instantiated test scenarios to ensure test adequacy.
invented entities (1)
- Three specialized multi-agent modules (description completion, scenario transformation, script conversion); no independent evidence.