pith. machine review for the scientific record.

arxiv: 2604.13559 · v1 · submitted 2026-04-15 · 💻 cs.SE

Recognition: unknown

WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems

Gong Chen, Qing Huang, Xiaoyuan Xie, Zhenyu Wan

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords multi-agent systems · scenario testing · web systems · test script generation · equivalence class partitioning · software testing · test adequacy

The pith

A multi-agent framework clarifies incomplete test scenarios and applies equivalence partitioning to raise web script success rates by 30 to 60 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web testing begins with natural language scenario descriptions that are often incomplete and lack systematic coverage of possible inputs. The paper presents a framework in which separate agent teams first complete those descriptions through back-and-forth clarification, then partition the scenarios into equivalence classes to produce adequate test cases, and finally convert the cases into executable scripts. Prior single-agent LLM methods convert descriptions directly into scripts, skipping both clarification and partitioning, and therefore miss errors. If the approach works as described, testers obtain scripts that run more reliably, consume fewer tokens, and surface more defects across different web applications.

Core claim

WebMAC consists of three multi-agent modules that together complete natural language scenario descriptions via interactive clarification, transform those scenarios through equivalence class partitioning to meet adequacy criteria, and generate corresponding test scripts. When evaluated on four web systems, the resulting scripts execute successfully at rates 30 to 60 percent higher than the prior state-of-the-art method, testing efficiency rises by 29 percent, token consumption falls by 47.6 percent, and more errors are detected.

What carries the argument

Three sequential multi-agent modules that perform interactive scenario clarification, equivalence class partitioning for test adequacy, and script generation.
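The three-module flow can be sketched as plain functions wired in the same clarify → partition → generate order. Everything below (function names, the placeholder clarification, the two-class split) is invented for illustration; the paper's actual modules are LLM-backed agent teams, not deterministic code.

```python
def clarify(scenario: dict, form_fields: list[str]) -> dict:
    """Module 1 stand-in: flag any form fields the scenario description omits."""
    completed = dict(scenario)
    for field in form_fields:
        completed.setdefault(field, f"<ask tester for {field}>")
    return completed

def partition(scenario: dict) -> list[dict]:
    """Module 2 stand-in: expand the scenario into one instantiated test per
    equivalence class (here just a valid variant and one invalid variant)."""
    return [
        {"case": "valid", **scenario},
        {"case": "invalid", **scenario, "username": ""},  # empty field = invalid class
    ]

def generate_script(case: dict) -> str:
    """Module 3 stand-in: render an instantiated test case as a script stub."""
    steps = "; ".join(f"fill {k}={v!r}" for k, v in case.items() if k != "case")
    return f"# {case['case']} case\n{steps}; submit; check oracle"

scenario = {"username": "alice", "password": "s3cret"}
cases = partition(clarify(scenario, ["username", "password", "telephone"]))
scripts = [generate_script(c) for c in cases]
```

The point of the sequencing is that module 2 sees the clarified scenario (including the recovered telephone field), so every generated script exercises a complete form, which is exactly the property the core claim rests on.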

Load-bearing premise

The interactive multi-agent clarification step reliably produces complete and unbiased scenario descriptions, and the subsequent equivalence partitioning step produces test cases that cover all important behaviors without gaps or invalid tests.
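To make concrete what that premise demands, here is a toy equivalence-class table and a weak-coverage test generator; the field names and class definitions are invented for the example, not taken from the paper.

```python
# Hypothetical equivalence classes for two form fields. "Weak" coverage
# means every class appears in at least one test, with all other fields
# held at a valid representative.
classes = {
    "username":  {"valid": "alice", "too_long": "a" * 65, "empty": ""},
    "telephone": {"valid": "6095551623", "non_numeric": "60955abc"},
}

def representative_tests(classes: dict) -> list[dict]:
    """One all-valid test, plus one single-fault test per invalid class."""
    base = {field: cs["valid"] for field, cs in classes.items()}
    tests = [dict(base, _expect="pass")]
    for field, cs in classes.items():
        for name, value in cs.items():
            if name != "valid":  # vary one field at a time from the valid base
                tests.append(dict(base, **{field: value}, _expect="fail"))
    return tests

suite = representative_tests(classes)  # 1 all-valid test + 3 single-fault tests
```

The premise in question is that the agents get the `classes` table itself right: a missing class here silently removes a whole family of tests, which is why the referee asks for validation of the partitions below.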

What would settle it

Running WebMAC on a web system whose critical edge cases and failure modes are documented in advance. Detecting those known defects at a higher script success rate than the baseline would support the claim; finding no more errors than the baseline, with no gain in script success rate, would undercut it.

Figures

Figures reproduced from arXiv: 2604.13559 by Gong Chen, Qing Huang, Xiaoyuan Xie, Zhenyu Wan.

Figure 1
Figure 1. The technical evolution of scenario testing. view at source ↗
Figure 2
Figure 2. Motivation example. (a) Illustrates how an incomplete natural language description of a test scenario leads the LLM to generate an incorrect test script: the description is the input, the LLM converts it into a script, and the script is run against the target web system. (b) Demonstrates how the SOTA method overlooks test adequacy criteria … view at source ↗
Figure 3
Figure 3. Overview of WebMAC. view at source ↗
Figure 4
Figure 4. Clarification process in WebMAC. In the running example, the form has three fields (username, password, telephone), but the scenario mentions only the correct username and password; missing the telephone field, the LLM may generate a script that fails registration, violating the test oracle. view at source ↗
Figure 5
Figure 5. Transformation process in WebMAC. The transformation module is a multi-agent system consisting of a retrieval tool, a PICT tool, and two agents, the Equivalence Class Generator and the Test Oracle Generator, which use context from the clarification module to retrieve equivalence class partitions and generate … view at source ↗
Figure 6
Figure 6. A specific testing process: instantiated test scenarios output by the transformation module are fed iteratively into the testing module. For instance, given the scenario "When I add a person with first name 'John12' and last name 'Smith#' as a new pet owner with address 'MainStreet' with city 'NewYork$' and telephone '609591623'", the agent Coder writes the test script … view at source ↗
Figure 7
Figure 7. The execution success rate of test scripts generated by WebMAC and the SOTA method on four different web systems. WebMAC performs significantly better on incomplete scenario descriptions, where the SOTA method often generates incorrect scripts that fail on submission … view at source ↗
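One piece of machinery named in these captions, the PICT tool of Figure 5, is a pairwise (all-pairs) combinatorial test generator. A greedy stdlib sketch of that coverage goal follows; it is not the actual PICT algorithm, and the parameter table is invented for illustration.

```python
from itertools import combinations, product

def pairwise_suite(params: dict[str, list]) -> list[dict]:
    """Greedily pick full combinations until every pair of values across
    every pair of parameters is covered (the all-pairs criterion)."""
    names = list(params)

    def pairs(test: dict) -> set:
        return {(a, test[a], b, test[b]) for a, b in combinations(names, 2)}

    candidates = [dict(zip(names, combo)) for combo in product(*params.values())]
    uncovered = set().union(*(pairs(t) for t in candidates))
    suite = []
    while uncovered:
        # pick the candidate covering the most still-uncovered pairs
        best = max(candidates, key=lambda t: len(pairs(t) & uncovered))
        suite.append(best)
        uncovered -= pairs(best)
    return suite

params = {"browser": ["chrome", "firefox"],
          "role":    ["admin", "guest"],
          "locale":  ["en", "de"]}
suite = pairwise_suite(params)  # covers all 12 value pairs without the full 8-way product
```

This is why the transformation module can claim adequacy at a lower token cost than exhaustive enumeration: the number of tests grows with the number of value pairs, not the full cross product.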
read the original abstract

Scenario testing is an important technique for detecting errors in web systems. Testers draft test scenarios and convert them into test scripts for execution. Early methods relied on testers to convert test scenarios into test scripts. Recent LLM-based scenario testing methods can generate test scripts from natural language descriptions of test scenarios. However, these methods are not only limited by the incompleteness of descriptions but also overlook test adequacy criteria, making it difficult to detect potential errors. To address these limitations, this paper proposes WebMAC, a multi-agent collaborative framework for scenario testing of web systems. WebMAC can complete natural language descriptions of test scenarios through interactive clarification and transform adequate instantiated test scenarios via equivalence class partitioning. WebMAC consists of three multi-agent modules, responsible respectively for completing natural language descriptions of test scenarios, transforming test scenarios, and converting test scripts. We evaluated WebMAC on four web systems. Compared with the SOTA method, WebMAC improves the execution success rate of generated test scripts by 30%-60%, increases testing efficiency by 29%, and reduces token consumption by 47.6%. Furthermore, WebMAC can effectively detect more errors in web systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes WebMAC, a multi-agent collaborative framework for scenario testing of web systems consisting of three modules: one for interactively completing natural-language scenario descriptions, one for transforming scenarios into adequate instantiated tests via equivalence class partitioning, and one for converting them into executable scripts. Evaluated on four web systems, it claims to outperform an unnamed SOTA baseline by raising generated script execution success rates 30-60%, boosting testing efficiency 29%, cutting token consumption 47.6%, and detecting more errors.

Significance. If the empirical gains and the reliability of the agent-driven clarification and partitioning steps are substantiated, the work could advance LLM-based automated testing in software engineering by systematically addressing description incompleteness and incorporating test-adequacy criteria, potentially yielding more effective and resource-efficient error detection for web applications.

major comments (3)
  1. [Abstract and Evaluation] The headline claims of 30-60% success-rate improvement, 29% efficiency gain, and 47.6% token reduction are presented without any description of the SOTA baseline, number of experimental runs, statistical tests, error bars, or raw data tables, rendering unverifiable the quantitative results that are load-bearing for the central performance claims.
  2. [§3.2] Scenario Transformation Module: the description of agent-driven equivalence class partitioning provides no concrete implementation details, coverage criteria, validation metrics, or human-review results confirming that the generated partitions are complete, unbiased, and free of invalid cases; this assumption is load-bearing for the claim that WebMAC detects more errors than SOTA.
  3. [Evaluation] Generalization from four web systems is asserted without justification of system selection criteria, diversity of scenarios, or analysis of cases where clarification failed or partitioning missed edge cases, undermining the broader applicability claim.
minor comments (2)
  1. [Abstract] The paper should explicitly name and cite the SOTA method used for comparison to allow reproducibility and fair assessment of novelty.
  2. [§3] Figure captions and algorithm pseudocode could be expanded to clarify the exact interaction protocol among the three agent modules.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions to enhance verifiability, methodological transparency, and the discussion of generalizability.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The headline claims of 30-60% success-rate improvement, 29% efficiency gain, and 47.6% token reduction are presented without any description of the SOTA baseline, number of experimental runs, statistical tests, error bars, or raw data tables, rendering unverifiable the quantitative results that are load-bearing for the central performance claims.

    Authors: We agree that additional experimental details are required to make the quantitative claims verifiable. In the revised manuscript, we will expand the Evaluation section (and update the abstract if space permits) to name and briefly describe the SOTA baseline, report the exact number of experimental runs performed, include statistical tests (such as paired t-tests or Wilcoxon signed-rank tests with p-values), add error bars to all performance figures, and provide a table of raw or aggregated data for success rates, efficiency, token usage, and error detection counts. These additions will directly address the verifiability concern while preserving the reported improvements. revision: yes

  2. Referee: [§3.2] §3.2 (Scenario Transformation Module): the description of agent-driven equivalence class partitioning provides no concrete implementation details, coverage criteria, validation metrics, or human-review results confirming that the generated partitions are complete, unbiased, and free of invalid cases; this assumption is load-bearing for the claim that WebMAC detects more errors than SOTA.

    Authors: We acknowledge that §3.2 would benefit from greater specificity. We will revise this section to include concrete implementation details: the exact prompts and decision logic used by the partitioning agents, the equivalence class criteria (e.g., partitioning along input domains, boundary values, and valid/invalid combinations), the coverage criteria applied (such as ensuring each class is instantiated at least once), and internal validation metrics (e.g., partition count, validity rate, and overlap checks performed by the agents). Regarding human review, we did not conduct a large-scale study; we will therefore add a discussion of automated safeguards against bias and invalid cases, plus results from a small pilot human validation if feasible, or explicitly note the reliance on agent-driven checks. This will strengthen the link to improved error detection. revision: partial

  3. Referee: [Evaluation] Evaluation section: generalization from four web systems is asserted without justification of system selection criteria, diversity of scenarios, or analysis of cases where clarification failed or partitioning missed edge cases, undermining the broader applicability claim.

    Authors: We will strengthen the Evaluation section by adding explicit justification for selecting the four web systems, including criteria such as diversity of application domains, underlying technologies, and user-interaction complexity. We will also characterize the diversity of the test scenarios employed and include a dedicated analysis of challenging cases, such as scenarios requiring multiple clarification rounds or partitions that initially missed certain edge cases, together with how the framework mitigated these issues. These additions will provide a more balanced view of applicability without overstating the results. revision: yes
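The internal validation metrics proposed in response 2 (partition count, validity rate, overlap checks) could look roughly like the following; the metric definitions and data are guesses at what such checks might compute, not the authors' implementation.

```python
def validate_partitions(partitions: dict[str, set]) -> dict:
    """partitions maps an equivalence-class name to its set of concrete values.
    Classes should be non-empty and pairwise disjoint."""
    names = list(partitions)
    overlaps = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if partitions[a] & partitions[b]  # shared values break disjointness
    ]
    nonempty = [n for n in names if partitions[n]]
    return {
        "partition_count": len(names),
        "validity_rate": len(nonempty) / len(names),  # share of non-empty classes
        "overlapping_pairs": overlaps,
    }

report = validate_partitions({
    "valid":       {"6095551623"},
    "too_short":   {"123"},
    "non_numeric": {"abc", "123"},  # "123" leaks into two classes
})
```

A non-empty `overlapping_pairs` list would be exactly the kind of agent-generated partition defect the referee worries about, so a check of this shape could plausibly gate the transformation module's output.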

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external evaluation, not self-referential derivations

full rationale

The paper contains no equations, derivations, or first-principles predictions. Its central claims are empirical performance improvements (success rate, efficiency, token use, error detection) measured on four web systems against an external SOTA baseline. The framework description (three-agent modules for clarification and equivalence partitioning) is presented as a design choice whose adequacy is asserted via experimental results rather than reduced to fitted parameters or self-citations. No load-bearing step reduces to its own inputs by construction; the work is self-contained as an applied engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on domain assumptions about LLM interactive capabilities and standard software testing techniques rather than new physical or mathematical axioms; the primary addition is the engineered multi-agent workflow.

axioms (2)
  • domain assumption LLMs can reliably complete and clarify natural language test scenario descriptions via multi-agent interaction without introducing errors or biases
    Invoked by the first module to address incompleteness of descriptions.
  • domain assumption Equivalence class partitioning can be automatically applied to instantiated test scenarios to ensure test adequacy
    Core mechanism of the second module for transforming scenarios.
invented entities (1)
  • Three specialized multi-agent modules (description completion, scenario transformation, script conversion) · no independent evidence
    purpose: Collaborative handling of scenario testing stages
    New software architecture introduced to overcome limitations of prior single-agent or non-partitioning approaches.

pith-pipeline@v0.9.0 · 5503 in / 1411 out tokens · 54770 ms · 2026-05-10T13:37:44.651657+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 7 canonical work pages · 2 internal anchors
