
arxiv: 2606.24815 · v1 · pith:WB4IBDRTnew · submitted 2026-06-23 · 💻 cs.SE · cs.RO

MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models

Pith reviewed 2026-06-25 23:05 UTC · model grok-4.3

classification 💻 cs.SE cs.RO
keywords test oracle generation · multi-agent framework · vision-language-action models · robotic testing · fine-grained oracles · VLA models · automated testing · simulator-grounded oracles

The pith

MANGO generates executable, fine-grained oracles for vision-language-action models that detect a similar number of failures to symbolic oracles while adding accurate fault localization and richer diagnostics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MANGO, a multi-agent framework that automatically generates test oracles for vision-language-action (VLA) models used in robotic control. These models integrate perception, language understanding, and action generation. Existing approaches rely on manually built symbolic oracles that assess only final environment states; they are costly to create and offer limited insight into intermediate steps. MANGO builds a reusable library of atomic tasks, defines simulator-grounded oracles for them, and assembles executable fine-grained oracles for complex tasks by decomposing each instruction into an ordered sequence of atomic actions with corresponding oracles. Evaluation on the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks shows that the generated oracles detect a similar number of failures to the symbolic ones, localize faults accurately, and supply richer diagnostic detail.

Core claim

MANGO is a multi-agent framework that automatically generates fine-grained oracles from natural-language descriptions of robotic tasks. It first generates a reusable library of atomic tasks, then simulator-grounded oracle definitions for each atomic task, and finally produces executable fine-grained oracles by decomposing complex instructions into ordered sequences of atomic actions and corresponding oracles. The framework uses collaborative Generator, Assessor, and Judge agents that iteratively refine generated artifacts through structured feedback. On the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks, MANGO produces oracles that detect a similar number of failures as symbolic oracles.
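To make the decomposition concrete, the following Python sketch shows one way an ordered sequence of atomic oracles could be checked against an execution trace, so that a failure is attributed to the first atomic task that was never satisfied. It is an illustration under stated assumptions, not MANGO's implementation: the names `AtomicOracle`, `Verdict`, and `check_trace`, the dict-based simulator state, and its keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical simulator state: a flat dict of named facts.
State = Dict[str, bool]


@dataclass
class AtomicOracle:
    name: str                       # atomic task this oracle checks
    holds: Callable[[State], bool]  # predicate over a single state


@dataclass
class Verdict:
    passed: bool
    failed_step: Optional[str]  # first atomic task that was never satisfied
    completed: List[str]        # atomic tasks satisfied, in order


def check_trace(oracles: List[AtomicOracle], trace: List[State]) -> Verdict:
    """Check atomic oracles in order against a trace of simulator states."""
    idx = 0
    completed: List[str] = []
    for state in trace:
        # One state may satisfy several consecutive oracles.
        while idx < len(oracles) and oracles[idx].holds(state):
            completed.append(oracles[idx].name)
            idx += 1
    if idx == len(oracles):
        return Verdict(True, None, completed)
    return Verdict(False, oracles[idx].name, completed)


# Toy decomposition of "Put the black bowl in the bottom drawer of the
# cabinet and close it"; the state keys are invented for this example.
oracles = [
    AtomicOracle("place bowl in drawer", lambda s: s["bowl_in_drawer"]),
    AtomicOracle(
        "close drawer",
        lambda s: s["bowl_in_drawer"] and not s["drawer_open"],
    ),
]
trace = [
    {"bowl_in_drawer": False, "drawer_open": True},
    {"bowl_in_drawer": True, "drawer_open": True},  # robot stops here
]
verdict = check_trace(oracles, trace)
print(verdict.passed, verdict.failed_step, verdict.completed)
```

In this toy trace the bowl reaches the drawer but the drawer stays open, so the sketch reports a failure located at the "close drawer" step. An end-state-only check would report the same failure without saying which sub-task was missed.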

What carries the argument

The collaborative Generator, Assessor, and Judge agents that iteratively refine generated artifacts through structured feedback to produce simulator-grounded oracle definitions from natural language task descriptions.
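The refinement loop can be pictured with a short Python sketch. The division of labor shown here (the Generator drafts, the Assessor lists issues, the Judge accepts or returns feedback) and the round limit are assumptions made for illustration; the agents are stubbed as plain functions rather than LLM calls, and none of the names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Review:
    accepted: bool
    feedback: str


# Hypothetical agent signatures; real agents would wrap LLM calls.
Generator = Callable[[str, List[str]], str]  # (task, feedback so far) -> artifact
Assessor = Callable[[str], List[str]]        # artifact -> issues found
Judge = Callable[[str, List[str]], Review]   # (artifact, issues) -> review


def refine(task: str, generate: Generator, assess: Assessor, judge: Judge,
           max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Iterate until the Judge accepts or the round budget is spent."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        artifact = generate(task, feedback)
        issues = assess(artifact)
        review = judge(artifact, issues)
        if review.accepted:
            return artifact, round_no
        feedback.append(review.feedback)
    return None, max_rounds


# Toy stand-ins for the three agents.
def toy_generate(task: str, feedback: List[str]) -> str:
    steps = ["pick bowl", "place bowl in drawer"]
    if feedback:  # later rounds incorporate the Judge's feedback
        steps.append("close drawer")
    return " -> ".join(steps)


def toy_assess(artifact: str) -> List[str]:
    return [] if "close drawer" in artifact else ["missing step: close drawer"]


def toy_judge(artifact: str, issues: List[str]) -> Review:
    return Review(accepted=not issues, feedback="; ".join(issues))


artifact, rounds = refine(
    "Put the black bowl in the bottom drawer of the cabinet and close it",
    toy_generate, toy_assess, toy_judge,
)
print(rounds, artifact)
```

With these stubs the first draft omits the final step, the Judge rejects it, and the second round is accepted. The round cap matters in any such loop: without it, an artifact that never satisfies the Judge would keep the loop running indefinitely.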

Load-bearing premise

The multi-agent collaboration can reliably produce correct simulator-grounded oracles from natural language without introducing errors that invalidate failure detection and localization.

What would settle it

A concrete execution trace on a VLA model where a MANGO-generated oracle misclassifies task success or misses an intermediate failure that a symbolic oracle correctly identifies on the same trace.
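A search for such a trace could be sketched in Python as follows. This is an illustrative harness only: both oracles are toy functions, the generated oracle is deliberately flawed to produce a disagreement, the state keys are invented, and the sketch compares final pass/fail verdicts without looking at intermediate-step localization.

```python
from typing import Callable, Dict, List, Sequence

State = Dict[str, bool]
Trace = Sequence[State]
Oracle = Callable[[Trace], bool]  # True means the task is judged successful


def find_disagreements(traces: Sequence[Trace], symbolic: Oracle,
                       generated: Oracle) -> List[int]:
    """Return indices of traces on which the two oracles disagree."""
    return [i for i, t in enumerate(traces) if symbolic(t) != generated(t)]


def symbolic_oracle(trace: Trace) -> bool:
    # End-state check only, as symbolic oracles are described in the paper.
    final = trace[-1]
    return final["bowl_in_drawer"] and not final["drawer_open"]


def flawed_generated_oracle(trace: Trace) -> bool:
    # Deliberately buggy stand-in: it forgets the "close drawer" condition.
    return any(s["bowl_in_drawer"] for s in trace)


traces = [
    [{"bowl_in_drawer": True, "drawer_open": False}],  # success
    [{"bowl_in_drawer": True, "drawer_open": True}],   # drawer left open
    [{"bowl_in_drawer": False, "drawer_open": True}],  # bowl never placed
]
print(find_disagreements(traces, symbolic_oracle, flawed_generated_oracle))
```

Each returned index is a candidate counterexample that still needs manual inspection, since a disagreement can also mean that the symbolic oracle is the one at fault.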

Figures

Figures reproduced from arXiv: 2606.24815 by Aitor Arrieta, Lionel Briand, Pablo Valle, Shaukat Ali.

Figure 1: Overview of the action generation process in a VLA-enabled robot, mapping multimodal inputs to executable action chunks.
Figure 2: Task execution pipeline for the “Put the black bowl in the bottom drawer of the cabinet and close it” task. Another limitation is that, in the case of long-horizon unsuccessful tasks, it is not possible to know which sub-task the VLA-enabled robotic system failed to complete; this information could eventually be used for debugging and repair.
Figure 3: Overview of MANGO and a fine-grained oracle example for the task “
Figure 4: Overview of the three different agent profiles:
Figure 5: Input and output examples of each agent in Module I
Figure 6: Input and output examples of each agent in Module 2
Figure 7: Input and output examples of each agent in Module 3
Figure 8: Examples of the Candidate Fine-Grained Oracle before refinement iterations and the same Fine-Grained Oracle after refinement iterations
Figure 9: Overview of the different agents used in MANGO Module I.
Figure 10: Overview of the different agents used in MANGO Module II.
Figure 11: Overview of the different agents used in MANGO Module III.
Original abstract

Vision-Language-Action (VLA) models are emerging robotic control systems that integrate perception, language understanding, and action generation in a unified architecture. Existing testing approaches for VLA-enabled robots rely on manually constructed symbolic test oracles that determine task success from final environment states. These oracles are costly to construct, require domain expertise, and are often tightly coupled to specific tasks and environments, limiting scalability and reuse. Furthermore, they provide only end-state assessments of task outcomes, offering limited insight into intermediate behavior and fault localization. To address these limitations, we introduce MANGO, a multi-agent framework that automatically generates fine-grained oracles from natural-language descriptions of robotic tasks. MANGO first generates a reusable library of atomic tasks, then generates simulator-grounded oracle definitions for each atomic task, and finally produces executable fine-grained oracles by decomposing complex instructions into ordered sequences of atomic actions and corresponding oracles. The framework uses collaborative Generator, Assessor, and Judge agents that iteratively refine generated artifacts through structured feedback. We evaluate MANGO on the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks. Results show that MANGO generates executable, fine-grained oracles that detect a similar number of failures as symbolic oracles while accurately localizing them and providing richer diagnostic information. Through ablation studies, we further analyzed component contributions and the effect of initial task set, while preserving oracle quality. Overall, the results show the feasibility and effectiveness of test oracle generation for VLA-enabled robots testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MANGO, a multi-agent LLM framework (Generator, Assessor, Judge) that automatically generates a library of atomic tasks and simulator-grounded, executable fine-grained test oracles from natural-language robotic task descriptions for Vision-Language-Action (VLA) models. It decomposes complex instructions into ordered atomic actions with corresponding oracles via iterative refinement, and evaluates the approach on the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks, claiming parity in failure detection with symbolic oracles, superior localization, richer diagnostics, and positive ablation results on component contributions and initial task sets.

Significance. If the quantitative claims hold under rigorous validation, the work could meaningfully advance scalable testing for VLA systems by reducing dependence on costly manual symbolic oracles while adding intermediate diagnostic value. The multi-agent iterative refinement and ablation analysis on task-set sensitivity represent concrete strengths in demonstrating feasibility.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the central claim of 'detect[ing] a similar number of failures as symbolic oracles while accurately localizing them' is load-bearing yet unsupported by any reported counts, precision/recall figures, or statistical comparison in the provided text; without these data the equivalence and localization advantage cannot be assessed.
  2. [Framework / Methodology] Framework description (Generator-Assessor-Judge loop): the iterative refinement from natural language is presented without an explicit correctness mechanism (formal state invariants, exhaustive coverage argument, or human audit protocol) that would ensure generated atomic oracles match simulator semantics; a modest error rate in oracle conditions would directly undermine both the failure-detection parity and the localization claims.
minor comments (1)
  1. [Abstract] The abstract mentions 'ablation studies' and 'preserving oracle quality' but does not specify the metrics used for quality preservation or the exact ablation configurations.
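The kind of per-trace comparison requested in the first major comment could be computed along these lines. The Python sketch below treats the symbolic oracle's verdicts as the reference and is an assumption about how such a comparison might be set up; the verdict lists are invented toy data, not results from the paper.

```python
from typing import Dict, List


def detection_metrics(reference_fail: List[bool],
                      candidate_fail: List[bool]) -> Dict[str, float]:
    """Compare per-trace failure verdicts of a candidate oracle with a reference."""
    if len(reference_fail) != len(candidate_fail):
        raise ValueError("verdict lists must have the same length")
    pairs = list(zip(reference_fail, candidate_fail))
    tp = sum(1 for r, c in pairs if r and c)      # both report a failure
    fp = sum(1 for r, c in pairs if not r and c)  # only the candidate does
    fn = sum(1 for r, c in pairs if r and not c)  # only the reference does
    return {
        "reference_failures": sum(reference_fail),
        "candidate_failures": sum(candidate_fail),
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy verdicts (True = failure reported); invented for illustration.
symbolic_fail = [True, True, False, False, True]
generated_fail = [True, False, False, True, True]
metrics = detection_metrics(symbolic_fail, generated_fail)
print(metrics)
```

In this toy data both oracles report three failures, yet they agree on only two of them, which is why matching failure counts alone cannot establish equivalence.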

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of results and methodology.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim of 'detect[ing] a similar number of failures as symbolic oracles while accurately localizing them' is load-bearing yet unsupported by any reported counts, precision/recall figures, or statistical comparison in the provided text; without these data the equivalence and localization advantage cannot be assessed.

    Authors: We agree that the abstract would benefit from including explicit quantitative support for the central claim. The Evaluation section reports comparative results on the LIBERO_10 and RoboCasa benchmarks, including the number of failures detected by MANGO-generated oracles versus symbolic oracles, along with localization accuracy and diagnostic richness. To make these results immediately accessible, we will revise the abstract to state the specific failure detection counts, localization metrics, and any precision/recall equivalents or statistical comparisons from the evaluation tables and figures. revision: yes

  2. Referee: [Framework / Methodology] Framework description (Generator-Assessor-Judge loop): the iterative refinement from natural language is presented without an explicit correctness mechanism (formal state invariants, exhaustive coverage argument, or human audit protocol) that would ensure generated atomic oracles match simulator semantics; a modest error rate in oracle conditions would directly undermine both the failure-detection parity and the localization claims.

    Authors: The Generator-Assessor-Judge loop provides an empirical correctness mechanism: the Assessor generates and checks oracle conditions against simulator-executable states, while the Judge evaluates alignment with task semantics and issues structured feedback for iterative refinement until convergence. This is simulator-grounded by design, as oracles are directly executable. We acknowledge the absence of formal invariants or exhaustive coverage proofs, which is inherent to LLM-driven generation; instead, we rely on benchmark validation and ablations demonstrating parity in failure detection. We will revise the Framework section to explicitly detail the Judge's validation steps, discuss potential error propagation, and note the empirical safeguards used. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical evaluation stands on independent benchmarks

Full rationale

The paper presents an empirical framework (MANGO) whose central claims rest on experimental results from LIBERO_10 and RoboCasa benchmarks rather than any mathematical derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes appear; the multi-agent process is described procedurally and evaluated externally. No load-bearing self-citation chain or renaming of known results is present in the provided text. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no free parameters, axioms, or invented entities are explicitly quantified or derived in the provided text.

pith-pipeline@v0.9.1-grok · 5806 in / 1122 out tokens · 16609 ms · 2026-06-25T23:05:49.016656+00:00 · methodology


Reference graph

Works this paper leans on

83 extracted references · 22 linked inside Pith

  1. [1]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid,et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning, pp. 2165–2183, PMLR, 2023

  2. [2]

    Openvla: An open- source vision-language-action model,

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Openvla: An open- source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

  3. [3]

    Gr00t n1: An open foundation model for generalist humanoid robots,

    NVIDIA, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Y...

  4. [4]

    π 0: A vision-language-action flow model for general robot control,

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, “π 0: A vision-language-action flow model for general robot control,” 2024

  5. [5]

    Eo- 1: Interleaved vision-text-action pretraining for general robot control,

    D. Qu, H. Song, Q. Chen, Z. Chen, X. Gao, X. Ye, Q. Lv, M. Shi, G. Ren, C. Ruan, M. Yao, H. Yang, J. Bao, B. Zhao, and D. Wang, “Eo- 1: Interleaved vision-text-action pretraining for general robot control,” 2025

  6. [6]

    Vlatest: Testing and evaluating vision-language-action models for robotic ma- nipulation,

    Z. Wang, Z. Zhou, J. Song, Y . Huang, Z. Shu, and L. Ma, “Vlatest: Testing and evaluating vision-language-action models for robotic ma- nipulation,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 1615–1638, 2025

  7. [7]

    Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks,

    S. Zhang, Z. Xu, P. Liu, X. Yu, Y . Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y .-G. Jiang,et al., “Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11142–11152, 2025

  8. [8]

    Libero: Benchmarking knowledge transfer for lifelong robot learning,

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”arXiv preprint arXiv:2306.03310, 2023

  9. [9]

    Evaluating real-world robot manipulation policies in simulation,

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao, “Evaluating real-world robot manipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024

  10. [10]

    Evaluating uncertainty and quality of visual language action-enabled robots,

    P. Valle, C. Lu, S. Ali, and A. Arrieta, “Evaluating uncertainty and quality of visual language action-enabled robots,”arXiv preprint arXiv:2507.17049, 2025

  11. [11]

    The oracle problem in software testing: A survey,

    E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,”IEEE transactions on software engineering, vol. 41, no. 5, pp. 507–525, 2014

  12. [12]

    Libero-plus: A progressive robustness benchmark for visual-language-action models,

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei,et al., “Libero-plus: A progressive robustness benchmark for visual-language-action models,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pp. 38574–38583, 2026

  13. [13]

    Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu, “Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,”arXiv preprint arXiv:2406.02523, 2024

  14. [14]

    Gradient-based learning applied to document recognition,

    Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 2002

  15. [15]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,”Advances in neural informa- tion processing systems, vol. 25, 2012

  16. [16]

    Very deep convolutional networks for large-scale image recognition,

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014

  17. [17]

    Going deeper with convolutions,

    C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015

  18. [18]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  19. [19]

    Gpt-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  20. [20]

    Gemini: a family of highly capable multimodal models,

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican,et al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

  21. [21]

    The llama 3 herd of models,

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan,et al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

  22. [22]

    Mistral 7b,

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. 24

  23. [23]

    Learning fine- grained bimanual manipulation with low-cost hardware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

  24. [24]

    Nebula: Do we evaluate vision-language-action agents correctly?,

    J. Peng, Y . Zhang, Y . Duan, T. Liang, V . Chaudhary, and Y . Yin, “Nebula: Do we evaluate vision-language-action agents correctly?,” 2025. https: //arxiv.org/abs/2510.16263

  25. [25]

    World models,

    D. Ha and J. Schmidhuber, “World models,”arXiv preprint arXiv:1803.10122, vol. 2, no. 3, p. 440, 2018

  26. [26]

    A step toward world models: A survey on robotic manipulation,

    P.-F. Zhang, Y . Cheng, X. Sun, S. Wang, F. Li, L. Zhu, and H. T. Shen, “A step toward world models: A survey on robotic manipulation,”arXiv preprint arXiv:2511.02097, 2025

  27. [27]

    Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer,

    G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, et al., “Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer,”arXiv preprint arXiv:2510.03342, 2025

  28. [28]

    Cosmos world foundation model platform for physical ai,

    N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chattopad- hyay, Y . Chen, Y . Cui, Y . Ding,et al., “Cosmos world foundation model platform for physical ai,”arXiv preprint arXiv:2501.03575, 2025

  29. [29]

    Understanding world or predicting future? a comprehensive survey of world models,

    J. Ding, Y . Zhang, Y . Shang, Y . Zhang, Z. Zong, J. Feng, Y . Yuan, H. Su, N. Li, N. Sukiennik,et al., “Understanding world or predicting future? a comprehensive survey of world models,”ACM Computing Surveys, vol. 58, no. 3, pp. 1–38, 2025

  30. [30]

    Deepseek-v3 technical report,

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan,et al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

  31. [31]

    Openai gpt-5 system card,

    A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram,et al., “Openai gpt-5 system card,”arXiv preprint arXiv:2601.03267, 2025

  32. [32]

    Introducing mistral 3 — mistral ai,

    M. AI, “Introducing mistral 3 — mistral ai,” December 2025. https: //mistral.ai/news/mistral-3/

  33. [33]

    Guidelines for empirical studies in software engineering involving large language models,

    S. Baltes, F. Angermeir, C. Arora, M. M. Bar ´on, C. Chen, L. B ¨ohme, F. Calefato, N. Ernst, D. Falessi, B. Fitzgerald,et al., “Guidelines for empirical studies in software engineering involving large language models,”arXiv preprint arXiv:2508.15503, 2025

  34. [34]

    Sentence-bert: Sentence embeddings using siamese bert-networks,

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pp. 3982–3992, 2019

  35. [35]

    Semeval- 2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,

    D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “Semeval- 2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,” inProceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp. 1–14, 2017

  36. [36]

    Bertscore: Evaluating text generation with bert,

    T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “Bertscore: Evaluating text generation with bert,”arXiv preprint arXiv:1904.09675, 2019

  37. [37]

    Rudolph,Convergence properties of evolutionary algorithms

    G. Rudolph,Convergence properties of evolutionary algorithms. Verlag Dr. Kovaˇc, 1997

  38. [38]

    On the analysis of the (1+ 1) evolutionary algorithm,

    S. Droste, T. Jansen, and I. Wegener, “On the analysis of the (1+ 1) evolutionary algorithm,”Theoretical Computer Science, vol. 276, no. 1- 2, pp. 51–81, 2002

  39. [39]

    Search-based software engineering,

    M. Harman and B. F. Jones, “Search-based software engineering,” Information and software Technology, vol. 43, no. 14, pp. 833–839, 2001

  40. [40]

    A. E. Eiben and J. E. Smith,Introduction to evolutionary computing. Springer, 2015

  41. [41]

    Genetic algorithms in search, optimization, and ma- chine learning. addison,

    D. E. Goldberg, “Genetic algorithms in search, optimization, and ma- chine learning. addison,”Reading, 1989

  42. [42]

    J. H. Holland,Adaptation in natural and artificial systems: an intro- ductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992

  43. [43]

    T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to algorithms. MIT press, 2022

  44. [44]

    How genetic algorithms really work: I. mutation and hillclimbing,

    H. Muhlenbein, “How genetic algorithms really work: I. mutation and hillclimbing,” inProc. 2nd Int. Conf. on Parallel Problem Solving from Nature, 1992, Elsevier, 1992

  45. [45]

    Sentence embedding models for similarity detection of software requirements,

    S. Das, N. Deb, A. Cortesi, and N. Chaki, “Sentence embedding models for similarity detection of software requirements,”SN Computer Science, vol. 2, no. 2, p. 69, 2021

  46. [46]

    On termination criteria of evolutionary algorithms,

    B. J. Jain, H. Pohlheim, and J. Wegener, “On termination criteria of evolutionary algorithms,” inProceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation, pp. 768–768, 2001

  47. [47]

    An orthogonal genetic algorithm with quantization for global numerical optimization,

    Y .-W. Leung and Y . Wang, “An orthogonal genetic algorithm with quantization for global numerical optimization,”IEEE Transactions on Evolutionary computation, vol. 5, no. 1, pp. 41–53, 2001

  48. [48]

    An empirical study of the non-determinism of chatgpt in code generation,

    S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “An empirical study of the non-determinism of chatgpt in code generation,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025

  49. [49]

    Evaluation and benchmarking of llm agents: A survey,

    M. Mohammadi, Y . Li, J. Lo, and W. Yip, “Evaluation and benchmarking of llm agents: A survey,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pp. 6129– 6139, 2025

  50. [50]

    Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,

    Q. Xu, G. Wang, L. Briand, and K. Liu, “Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,”ACM Transactions on Software Engineering and Methodology, 2026

  51. [51]

    Towards standardized bench- marks of llms in software modeling tasks: a conceptual framework: J. c´amara et al.,

    J. C ´amara, L. Burgue ˜no, and J. Troya, “Towards standardized bench- marks of llms in software modeling tasks: a conceptual framework: J. c´amara et al.,”Software and Systems Modeling, vol. 23, no. 6, pp. 1309– 1318, 2024

  52. [52]

    Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’sd indices the most appropriate choices,

    J. Romano, J. D. Kromrey, J. Coraggio, J. Skowronek, and L. Devine, “Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’sd indices the most appropriate choices,” inannual meeting of the Southern Association for Institutional Research, pp. 1–51, Citeseer, 2006

  53. [53]

    W. G. Cochran,Sampling techniques. john wiley & sons, 1977

  54. [54]

    Spatialvla: Exploring spatial representations for visual-language-action model,

    D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang,et al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

  55. [55]

    Thinkact: Vision-language-action reasoning via reinforced visual latent planning,

    C.-P. Huang, Y .-H. Wu, M.-H. Chen, F. Wang, and F.-E. Yang, “Thinkact: Vision-language-action reasoning via reinforced visual latent planning,”Advances in Neural Information Processing Systems, vol. 38, pp. 82782–82802, 2026

  56. [56]

    Automated test oracles: A survey,

    M. Pezze and C. Zhang, “Automated test oracles: A survey,” inAdvances in computers, vol. 95, pp. 1–48, Elsevier, 2014

  57. [57]

    Automatic generation of oracles for exceptional behaviors,

    A. Goffi, A. Gorla, M. D. Ernst, and M. Pezz `e, “Automatic generation of oracles for exceptional behaviors,” inProceedings of the 25th in- ternational symposium on software testing and analysis, pp. 213–224, 2016

  58. [58]

    Evolutionary improvement of assertion oracles,

    V . Terragni, G. Jahangirova, P. Tonella, and M. Pezz `e, “Evolutionary improvement of assertion oracles,” inProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1178–1189, 2020

  59. [59]

    Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,

    C. Menghi, S. Nejati, K. Gaaloul, and L. C. Briand, “Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,” inProceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp. 27–38, 2019

  60. [60]

    Defining and generating multi-level and uncertainty-wise test oracles for cyber-physical systems: P. valle et al.,

    P. Valle, A. Arrieta, L. Han, S. Ali, and T. Yue, “Defining and generating multi-level and uncertainty-wise test oracles for cyber-physical systems: P. valle et al.,”Software and Systems Modeling, vol. 24, no. 3, pp. 679– 704, 2025

  61. [61]

    Toga: A neural method for test oracle generation,

    E. Dinella, G. Ryan, T. Mytkowicz, and S. K. Lahiri, “Toga: A neural method for test oracle generation,” inProceedings of the 44th Interna- tional Conference on Software Engineering, pp. 2130–2141, 2022

  62. [62]

    Using large language models to generate junit tests: An empirical study,

    M. L. Siddiq, J. C. Da Silva Santos, R. H. Tanvir, N. Ulfat, F. Al Rifat, and V . Carvalho Lopes, “Using large language models to generate junit tests: An empirical study,” inProceedings of the 28th international con- ference on evaluation and assessment in software engineering, pp. 313– 322, 2024

  63. [63]

    Togll: Correct and strong test oracle generation with llms,

    S. B. Hossain and M. B. Dwyer, “Togll: Correct and strong test oracle generation with llms,” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pp. 1475–1487, IEEE, 2025

  64. [64]

    Improving deep assertion generation via fine-tuning retrieval-augmented pre-trained language models,

    Q. Zhang, C. Fang, Y . Zheng, Y . Zhang, Y . Zhao, R. Huang, J. Zhou, Y . Yang, T. Zheng, and Z. Chen, “Improving deep assertion generation via fine-tuning retrieval-augmented pre-trained language models,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 7, pp. 1–23, 2025

  65. [65]

    Chatassert: Llm-based test oracle generation with external tools assistance,

    I. Hayet, A. Scott, and M. d’Amorim, “Chatassert: Llm-based test oracle generation with external tools assistance,”IEEE Transactions on Software Engineering, vol. 51, no. 1, pp. 305–319, 2024

  66. [66]

    Augmentest: Enhancing tests with llm-driven oracles,

    S. M. Khandaker, F. Kifetew, D. Prandi, and A. Susi, “Augmentest: Enhancing tests with llm-driven oracles,” in 2025 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 279–289, IEEE, 2025

  67. [67]

    Nexus: Execution-grounded multi-agent test oracle synthesis,

    D. Huang, M. Du, J. M. Zhang, Z. Lin, M. Luo, Q. Zhang, and S.-K. Ng, “Nexus: Execution-grounded multi-agent test oracle synthesis,” arXiv preprint arXiv:2510.26423, 2025

  68. [68]

    Mastor: A multi-agent approach to semantic test oracle generation for restful apis,

    S. Deng, R. Huang, Z. Yang, M. Zhang, X. Xie, and R. Wang, “Mastor: A multi-agent approach to semantic test oracle generation for restful apis,” arXiv preprint arXiv:2606.10465, 2026

  69. [69]

    Data-driven grasp synthesis—a survey,

    J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,” IEEE Transactions on robotics, vol. 30, no. 2, pp. 289–309, 2013

  70. [70]

    Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,

    J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” arXiv preprint arXiv:1703.09312, 2017

  71. [71]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, IEEE, 2009

  72. [72]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014

  73. [73]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002

  74. [74]

    Rouge: A package for automatic evaluation of summaries,

    C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, pp. 74–81, 2004

  75. [75]

    Exploring the limits of vision-language-action manipulations in cross-task generalization,

    J. Zhou, K. Ye, J. Liu, T. Ma, Z. Wang, R. Qiu, K.-Y. Lin, Z. Zhao, and J. Liang, “Exploring the limits of vision-language-action manipulations in cross-task generalization,” arXiv preprint arXiv:2505.15660, 2025

  76. [76]

    From intention to execution: Probing the generalization boundaries of vision-language-action models,

    I. Fang, J. Zhang, S. Tong, and C. Feng, “From intention to execution: Probing the generalization boundaries of vision-language-action models,” arXiv preprint arXiv:2506.09930, 2025

  77. [77]

    Task reconstruction and extrapolation for $\pi_0$ using text latent,

    Q. Li, “Task reconstruction and extrapolation for $\pi_0$ using text latent,” arXiv preprint arXiv:2505.03500, 2025

  78. [78]

    Rlbench: The robot learning benchmark & learning environment,

    S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020

  79. [79]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,

    O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022

  80. [80]

    Ladev: A language-driven testing and evaluation platform for vision-language-action models in robotic manipulation,

    Z. Wang, Z. Zhou, J. Song, Y. Huang, Z. Shu, and L. Ma, “Ladev: A language-driven testing and evaluation platform for vision-language-action models in robotic manipulation,” arXiv preprint arXiv:2410.05191, 2024
