MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models

Aitor Arrieta; Lionel Briand; Pablo Valle; Shaukat Ali

arxiv: 2606.24815 · v1 · pith:WB4IBDRTnew · submitted 2026-06-23 · 💻 cs.SE · cs.RO

MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models

Pablo Valle , Shaukat Ali , Aitor Arrieta , Lionel Briand This is my paper

Pith reviewed 2026-06-25 23:05 UTC · model grok-4.3

classification 💻 cs.SE cs.RO

keywords test oracle generationmulti-agent frameworkvision-language-action modelsrobotic testingfine-grained oraclesVLA modelsautomated testingsimulator-grounded oracles

0 comments

The pith

MANGO generates executable fine-grained oracles for vision-language-action models that detect similar failures as symbolic oracles while adding accurate localization and richer diagnostics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MANGO, a multi-agent framework that automatically generates test oracles for vision-language-action models used in robotic control. These models integrate perception, language understanding, and action generation. Traditional approaches rely on manually built symbolic oracles that assess only final environment states, which are costly to create and offer limited insight into intermediate steps. MANGO creates a library of atomic tasks, defines simulator-grounded oracles for them, and assembles fine-grained executable oracles for complex tasks by breaking instructions into ordered sequences. Evaluation on the LIBERO_10 and RoboCasa benchmarks shows the generated oracles match symbolic ones in failure detection counts, localize faults accurately, and supply more diagnostic details through agent-based refinement.

Core claim

MANGO is a multi-agent framework that automatically generates fine-grained oracles from natural-language descriptions of robotic tasks. It first generates a reusable library of atomic tasks, then simulator-grounded oracle definitions for each atomic task, and finally produces executable fine-grained oracles by decomposing complex instructions into ordered sequences of atomic actions and corresponding oracles. The framework uses collaborative Generator, Assessor, and Judge agents that iteratively refine generated artifacts through structured feedback. On the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks, MANGO produces oracles that detect a similar number of failures as symbolic oracles

What carries the argument

The collaborative Generator, Assessor, and Judge agents that iteratively refine generated artifacts through structured feedback to produce simulator-grounded oracle definitions from natural language task descriptions.

Load-bearing premise

The multi-agent collaboration can reliably produce correct simulator-grounded oracles from natural language without introducing errors that invalidate failure detection and localization.

What would settle it

A concrete execution trace on a VLA model where a MANGO-generated oracle misclassifies task success or misses an intermediate failure that a symbolic oracle correctly identifies on the same trace.

Figures

Figures reproduced from arXiv: 2606.24815 by Aitor Arrieta, Lionel Briand, Pablo Valle, Shaukat Ali.

**Figure 1.** Figure 1: Overview of the action generation process in a VLA-enabled robot, mapping multimodal inputs to executable action chunks [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Task execution pipeline for the “Put the black bowl in the bottom drawer of the cabinet and close it” task Another limitation is that, in the case of long-horizon unsuccessful tasks, it is not possible to know which sub-task the VLA-enabled robotic system failed to complete; this information could eventually be used for debugging and repair. III. FINE-GRAINED ORACLES Existing evaluation approaches for VLA-… view at source ↗

**Figure 3.** Figure 3: Overview of MANGO and a fine-grained oracle example for the task “ [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the three different agent profiles: [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Input and output examples of each agent in Module I [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Input and output examples of each agent in Module 2 [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Input and output examples of each agent in Module 3 [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Examples of the Candidate Fine-Grained Oracle before refinement iterations and the same Fine-Grained Oracle after refinement iterations [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Overview of the different agents used in MANGO Module I. [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

**Figure 10.** Figure 10: Overview of the different agents used in MANGO Module II. [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗

**Figure 11.** Figure 11: Overview of the different agents used in MANGO Module III. [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗

read the original abstract

Vision-Language-Action (VLA) models are emerging robotic control systems that integrate perception, language understanding, and action generation in a unified architecture. Existing testing approaches for VLA-enabled robots rely on manually constructed symbolic test oracles that determine task success from final environment states. These oracles are costly to construct, require domain expertise, and are often tightly coupled to specific tasks and environments, limiting scalability and reuse. Furthermore, they provide only end-state assessments of task outcomes, offering limited insight into intermediate behavior and fault localization. To address these limitations, we introduce MANGO, a multi-agent framework that automatically generates fine-grained oracles from natural-language descriptions of robotic tasks. MANGO first generates a reusable library of atomic tasks, then generates simulator-grounded oracle definitions for each atomic task, and finally produces executable fine-grained oracles by decomposing complex instructions into ordered sequences of atomic actions and corresponding oracles. The framework uses collaborative Generator, Assessor, and Judge agents that iteratively refine generated artifacts through structured feedback. We evaluate MANGO on the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks. Results show that MANGO generates executable, fine-grained oracles that detect a similar number of failures as symbolic oracles while accurately localizing them and providing richer diagnostic information. Through ablation studies, we further analyzed component contributions and the effect of initial task set, while preserving oracle quality. Overall, the results show the feasibility and effectiveness of test oracle generation for VLA-enabled robots testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MANGO gives a concrete multi-agent pipeline for turning task descriptions into fine-grained oracles, but the evaluation provides no evidence that those oracles are actually correct.

read the letter

The main takeaway is that the authors built a Generator-Assessor-Judge loop to decompose natural-language robotic tasks into atomic actions and produce executable oracles grounded in the simulator. This directly tackles the cost of hand-written symbolic oracles for VLA systems.

What the paper does well is lay out a reusable library of atomic tasks and show how iterative agent feedback can refine the outputs. The evaluation on LIBERO_10 and RoboCasa is a reasonable starting point, and the claim that the generated oracles match symbolic ones on failure count while adding localization is at least a clear target.

The soft spots sit in the validation. The abstract and description give no numbers, no breakdown of how many oracles were checked against the simulator, and no discussion of cases where the generated conditions diverged from actual task semantics. If a non-trivial fraction of oracles contain mis-specified pass/fail logic, then the reported parity and localization advantage rest on shaky ground. The stress-test note about missing safeguards is on target here.

This is for people working on testing embodied AI and LLM agents for software engineering tasks. A reader who needs a worked example of multi-agent oracle generation will find the framework useful even if the results section needs tightening.

I would send it to peer review, with the expectation that the authors add concrete checks on oracle correctness.

Referee Report

2 major / 1 minor

Summary. The paper introduces MANGO, a multi-agent LLM framework (Generator, Assessor, Judge) that automatically generates a library of atomic tasks and simulator-grounded, executable fine-grained test oracles from natural-language robotic task descriptions for Vision-Language-Action (VLA) models. It decomposes complex instructions into ordered atomic actions with corresponding oracles via iterative refinement, and evaluates the approach on the LIBERO_10 and RoboCasa Humanoid Tabletop benchmarks, claiming parity in failure detection with symbolic oracles, superior localization, richer diagnostics, and positive ablation results on component contributions and initial task sets.

Significance. If the quantitative claims hold under rigorous validation, the work could meaningfully advance scalable testing for VLA systems by reducing dependence on costly manual symbolic oracles while adding intermediate diagnostic value. The multi-agent iterative refinement and ablation analysis on task-set sensitivity represent concrete strengths in demonstrating feasibility.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the central claim of 'detect[ing] a similar number of failures as symbolic oracles while accurately localizing them' is load-bearing yet unsupported by any reported counts, precision/recall figures, or statistical comparison in the provided text; without these data the equivalence and localization advantage cannot be assessed.
[Framework / Methodology] Framework description (Generator-Assessor-Judge loop): the iterative refinement from natural language is presented without an explicit correctness mechanism (formal state invariants, exhaustive coverage argument, or human audit protocol) that would ensure generated atomic oracles match simulator semantics; a modest error rate in oracle conditions would directly undermine both the failure-detection parity and the localization claims.

minor comments (1)

[Abstract] The abstract mentions 'ablation studies' and 'preserving oracle quality' but does not specify the metrics used for quality preservation or the exact ablation configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of results and methodology.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim of 'detect[ing] a similar number of failures as symbolic oracles while accurately localizing them' is load-bearing yet unsupported by any reported counts, precision/recall figures, or statistical comparison in the provided text; without these data the equivalence and localization advantage cannot be assessed.

Authors: We agree that the abstract would benefit from including explicit quantitative support for the central claim. The Evaluation section reports comparative results on the LIBERO_10 and RoboCasa benchmarks, including the number of failures detected by MANGO-generated oracles versus symbolic oracles, along with localization accuracy and diagnostic richness. To make these results immediately accessible, we will revise the abstract to state the specific failure detection counts, localization metrics, and any precision/recall equivalents or statistical comparisons from the evaluation tables and figures. revision: yes
Referee: [Framework / Methodology] Framework description (Generator-Assessor-Judge loop): the iterative refinement from natural language is presented without an explicit correctness mechanism (formal state invariants, exhaustive coverage argument, or human audit protocol) that would ensure generated atomic oracles match simulator semantics; a modest error rate in oracle conditions would directly undermine both the failure-detection parity and the localization claims.

Authors: The Generator-Assessor-Judge loop provides an empirical correctness mechanism: the Assessor generates and checks oracle conditions against simulator-executable states, while the Judge evaluates alignment with task semantics and issues structured feedback for iterative refinement until convergence. This is simulator-grounded by design, as oracles are directly executable. We acknowledge the absence of formal invariants or exhaustive coverage proofs, which is inherent to LLM-driven generation; instead, we rely on benchmark validation and ablations demonstrating parity in failure detection. We will revise the Framework section to explicitly detail the Judge's validation steps, discuss potential error propagation, and note the empirical safeguards used. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical evaluation stands on independent benchmarks

full rationale

The paper presents an empirical framework (MANGO) whose central claims rest on experimental results from LIBERO_10 and RoboCasa benchmarks rather than any mathematical derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes appear; the multi-agent process is described procedurally and evaluated externally. No load-bearing self-citation chain or renaming of known results is present in the provided text. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no free parameters, axioms, or invented entities are explicitly quantified or derived in the provided text.

pith-pipeline@v0.9.1-grok · 5806 in / 1122 out tokens · 16609 ms · 2026-06-25T23:05:49.016656+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 22 linked inside Pith

[1]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid,et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning, pp. 2165–2183, PMLR, 2023

2023
[2]

Openvla: An open- source vision-language-action model,

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Openvla: An open- source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[3]

Gr00t n1: An open foundation model for generalist humanoid robots,

NVIDIA, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Y...

2025
[4]

π 0: A vision-language-action flow model for general robot control,

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, “π 0: A vision-language-action flow model for general robot control,” 2024

2024
[5]

Eo- 1: Interleaved vision-text-action pretraining for general robot control,

D. Qu, H. Song, Q. Chen, Z. Chen, X. Gao, X. Ye, Q. Lv, M. Shi, G. Ren, C. Ruan, M. Yao, H. Yang, J. Bao, B. Zhao, and D. Wang, “Eo- 1: Interleaved vision-text-action pretraining for general robot control,” 2025

2025
[6]

Vlatest: Testing and evaluating vision-language-action models for robotic ma- nipulation,

Z. Wang, Z. Zhou, J. Song, Y . Huang, Z. Shu, and L. Ma, “Vlatest: Testing and evaluating vision-language-action models for robotic ma- nipulation,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 1615–1638, 2025

2025
[7]

Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks,

S. Zhang, Z. Xu, P. Liu, X. Yu, Y . Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y .-G. Jiang,et al., “Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11142–11152, 2025

2025
[8]

Libero: Benchmarking knowledge transfer for lifelong robot learning,

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”arXiv preprint arXiv:2306.03310, 2023

Pith/arXiv arXiv 2023
[9]

Evaluating real-world robot manipulation policies in simulation,

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao, “Evaluating real-world robot manipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024

Pith/arXiv arXiv 2024
[10]

Evaluating uncertainty and quality of visual language action-enabled robots,

P. Valle, C. Lu, S. Ali, and A. Arrieta, “Evaluating uncertainty and quality of visual language action-enabled robots,”arXiv preprint arXiv:2507.17049, 2025

arXiv 2025
[11]

The oracle problem in software testing: A survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,”IEEE transactions on software engineering, vol. 41, no. 5, pp. 507–525, 2014

2014
[12]

Libero-plus: A progressive robustness benchmark for visual-language-action models,

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei,et al., “Libero-plus: A progressive robustness benchmark for visual-language-action models,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pp. 38574–38583, 2026

2026
[13]

Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu, “Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,”arXiv preprint arXiv:2406.02523, 2024

Pith/arXiv arXiv 2024
[14]

Gradient-based learning applied to document recognition,

Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 2002

2002
[15]

Imagenet classification with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,”Advances in neural informa- tion processing systems, vol. 25, 2012

2012
[16]

Very deep convolutional networks for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014

Pith/arXiv arXiv 2014
[17]

Going deeper with convolutions,

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015

2015
[18]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

2017
[19]

Gpt-4 technical report,

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[20]

Gemini: a family of highly capable multimodal models,

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican,et al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023
[21]

The llama 3 herd of models,

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan,et al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[22]

Mistral 7b,

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. 24

2023
[23]

Learning fine- grained bimanual manipulation with low-cost hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023
[24]

Nebula: Do we evaluate vision-language-action agents correctly?,

J. Peng, Y . Zhang, Y . Duan, T. Liang, V . Chaudhary, and Y . Yin, “Nebula: Do we evaluate vision-language-action agents correctly?,” 2025. https: //arxiv.org/abs/2510.16263

arXiv 2025
[25]

World models,

D. Ha and J. Schmidhuber, “World models,”arXiv preprint arXiv:1803.10122, vol. 2, no. 3, p. 440, 2018

Pith/arXiv arXiv 2018
[26]

A step toward world models: A survey on robotic manipulation,

P.-F. Zhang, Y . Cheng, X. Sun, S. Wang, F. Li, L. Zhu, and H. T. Shen, “A step toward world models: A survey on robotic manipulation,”arXiv preprint arXiv:2511.02097, 2025

arXiv 2025
[27]

Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer,

G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, et al., “Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer,”arXiv preprint arXiv:2510.03342, 2025

Pith/arXiv arXiv 2025
[28]

Cosmos world foundation model platform for physical ai,

N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chattopad- hyay, Y . Chen, Y . Cui, Y . Ding,et al., “Cosmos world foundation model platform for physical ai,”arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[29]

Understanding world or predicting future? a comprehensive survey of world models,

J. Ding, Y . Zhang, Y . Shang, Y . Zhang, Z. Zong, J. Feng, Y . Yuan, H. Su, N. Li, N. Sukiennik,et al., “Understanding world or predicting future? a comprehensive survey of world models,”ACM Computing Surveys, vol. 58, no. 3, pp. 1–38, 2025

2025
[30]

Deepseek-v3 technical report,

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan,et al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[31]

Openai gpt-5 system card,

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram,et al., “Openai gpt-5 system card,”arXiv preprint arXiv:2601.03267, 2025

Pith/arXiv arXiv 2025
[32]

Introducing mistral 3 — mistral ai,

M. AI, “Introducing mistral 3 — mistral ai,” December 2025. https: //mistral.ai/news/mistral-3/

2025
[33]

Guidelines for empirical studies in software engineering involving large language models,

S. Baltes, F. Angermeir, C. Arora, M. M. Bar ´on, C. Chen, L. B ¨ohme, F. Calefato, N. Ernst, D. Falessi, B. Fitzgerald,et al., “Guidelines for empirical studies in software engineering involving large language models,”arXiv preprint arXiv:2508.15503, 2025

Pith/arXiv arXiv 2025
[34]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pp. 3982–3992, 2019

2019
[35]

Semeval- 2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,

D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “Semeval- 2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,” inProceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp. 1–14, 2017

2017
[36]

Bertscore: Evaluating text generation with bert,

T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “Bertscore: Evaluating text generation with bert,”arXiv preprint arXiv:1904.09675, 2019

Pith/arXiv arXiv 1904
[37]

Rudolph,Convergence properties of evolutionary algorithms

G. Rudolph,Convergence properties of evolutionary algorithms. Verlag Dr. Kovaˇc, 1997

1997
[38]

On the analysis of the (1+ 1) evolutionary algorithm,

S. Droste, T. Jansen, and I. Wegener, “On the analysis of the (1+ 1) evolutionary algorithm,”Theoretical Computer Science, vol. 276, no. 1- 2, pp. 51–81, 2002

2002
[39]

Search-based software engineering,

M. Harman and B. F. Jones, “Search-based software engineering,” Information and software Technology, vol. 43, no. 14, pp. 833–839, 2001

2001
[40]

A. E. Eiben and J. E. Smith,Introduction to evolutionary computing. Springer, 2015

2015
[41]

Genetic algorithms in search, optimization, and ma- chine learning. addison,

D. E. Goldberg, “Genetic algorithms in search, optimization, and ma- chine learning. addison,”Reading, 1989

1989
[42]

J. H. Holland,Adaptation in natural and artificial systems: an intro- ductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992

1992
[43]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to algorithms. MIT press, 2022

2022
[44]

How genetic algorithms really work: I. mutation and hillclimbing,

H. Muhlenbein, “How genetic algorithms really work: I. mutation and hillclimbing,” inProc. 2nd Int. Conf. on Parallel Problem Solving from Nature, 1992, Elsevier, 1992

1992
[45]

Sentence embedding models for similarity detection of software requirements,

S. Das, N. Deb, A. Cortesi, and N. Chaki, “Sentence embedding models for similarity detection of software requirements,”SN Computer Science, vol. 2, no. 2, p. 69, 2021

2021
[46]

On termination criteria of evolutionary algorithms,

B. J. Jain, H. Pohlheim, and J. Wegener, “On termination criteria of evolutionary algorithms,” inProceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation, pp. 768–768, 2001

2001
[47]

An orthogonal genetic algorithm with quantization for global numerical optimization,

Y .-W. Leung and Y . Wang, “An orthogonal genetic algorithm with quantization for global numerical optimization,”IEEE Transactions on Evolutionary computation, vol. 5, no. 1, pp. 41–53, 2001

2001
[48]

An empirical study of the non-determinism of chatgpt in code generation,

S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “An empirical study of the non-determinism of chatgpt in code generation,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025

2025
[49]

Evaluation and benchmarking of llm agents: A survey,

M. Mohammadi, Y . Li, J. Lo, and W. Yip, “Evaluation and benchmarking of llm agents: A survey,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pp. 6129– 6139, 2025

2025
[50]

Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,

Q. Xu, G. Wang, L. Briand, and K. Liu, “Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,”ACM Transactions on Software Engineering and Methodology, 2026

2026
[51]

Towards standardized bench- marks of llms in software modeling tasks: a conceptual framework: J. c´amara et al.,

J. C ´amara, L. Burgue ˜no, and J. Troya, “Towards standardized bench- marks of llms in software modeling tasks: a conceptual framework: J. c´amara et al.,”Software and Systems Modeling, vol. 23, no. 6, pp. 1309– 1318, 2024

2024
[52]

Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’sd indices the most appropriate choices,

J. Romano, J. D. Kromrey, J. Coraggio, J. Skowronek, and L. Devine, “Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’sd indices the most appropriate choices,” inannual meeting of the Southern Association for Institutional Research, pp. 1–51, Citeseer, 2006

2006
[53]

W. G. Cochran,Sampling techniques. john wiley & sons, 1977

1977
[54]

Spatialvla: Exploring spatial representations for visual-language-action model,

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang,et al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025
[55]

Thinkact: Vision-language-action reasoning via reinforced visual latent planning,

C.-P. Huang, Y .-H. Wu, M.-H. Chen, F. Wang, and F.-E. Yang, “Thinkact: Vision-language-action reasoning via reinforced visual latent planning,”Advances in Neural Information Processing Systems, vol. 38, pp. 82782–82802, 2026

2026
[56]

Automated test oracles: A survey,

M. Pezze and C. Zhang, “Automated test oracles: A survey,” inAdvances in computers, vol. 95, pp. 1–48, Elsevier, 2014

2014
[57]

Automatic generation of oracles for exceptional behaviors,

A. Goffi, A. Gorla, M. D. Ernst, and M. Pezz `e, “Automatic generation of oracles for exceptional behaviors,” inProceedings of the 25th in- ternational symposium on software testing and analysis, pp. 213–224, 2016

2016
[58]

Evolutionary improvement of assertion oracles,

V . Terragni, G. Jahangirova, P. Tonella, and M. Pezz `e, “Evolutionary improvement of assertion oracles,” inProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1178–1189, 2020

2020
[59]

Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,

C. Menghi, S. Nejati, K. Gaaloul, and L. C. Briand, “Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,” inProceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp. 27–38, 2019

2019
[60]

Defining and generating multi-level and uncertainty-wise test oracles for cyber-physical systems: P. valle et al.,

P. Valle, A. Arrieta, L. Han, S. Ali, and T. Yue, “Defining and generating multi-level and uncertainty-wise test oracles for cyber-physical systems: P. valle et al.,”Software and Systems Modeling, vol. 24, no. 3, pp. 679– 704, 2025

2025
[61]

Toga: A neural method for test oracle generation,

E. Dinella, G. Ryan, T. Mytkowicz, and S. K. Lahiri, “Toga: A neural method for test oracle generation,” inProceedings of the 44th Interna- tional Conference on Software Engineering, pp. 2130–2141, 2022

2022
[62]

Using large language models to generate junit tests: An empirical study,

M. L. Siddiq, J. C. Da Silva Santos, R. H. Tanvir, N. Ulfat, F. Al Rifat, and V . Carvalho Lopes, “Using large language models to generate junit tests: An empirical study,” inProceedings of the 28th international con- ference on evaluation and assessment in software engineering, pp. 313– 322, 2024

2024
[63]

Togll: Correct and strong test oracle generation with llms,

S. B. Hossain and M. B. Dwyer, “Togll: Correct and strong test oracle generation with llms,” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pp. 1475–1487, IEEE, 2025

2025
[64]

Improving deep assertion generation via fine-tuning retrieval-augmented pre-trained language models,

Q. Zhang, C. Fang, Y . Zheng, Y . Zhang, Y . Zhao, R. Huang, J. Zhou, Y . Yang, T. Zheng, and Z. Chen, “Improving deep assertion generation via fine-tuning retrieval-augmented pre-trained language models,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 7, pp. 1–23, 2025

2025
[65]

Chatassert: Llm-based test oracle generation with external tools assistance,

I. Hayet, A. Scott, and M. d’Amorim, “Chatassert: Llm-based test oracle generation with external tools assistance,”IEEE Transactions on Software Engineering, vol. 51, no. 1, pp. 305–319, 2024

2024
[66]

Augmentest: Enhancing tests with llm-driven oracles,

S. M. Khandaker, F. Kifetew, D. Prandi, and A. Susi, “Augmentest: Enhancing tests with llm-driven oracles,” in2025 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 279–289, IEEE, 2025

2025
[67]

Nexus: Execution-grounded multi-agent test oracle synthesis,

D. Huang, M. Du, J. M. Zhang, Z. Lin, M. Luo, Q. Zhang, and S.- K. Ng, “Nexus: Execution-grounded multi-agent test oracle synthesis,” arXiv preprint arXiv:2510.26423, 2025

arXiv 2025
[68]

Mastor: A multi-agent approach to semantic test oracle generation for restful apis,

S. Deng, R. Huang, Z. Yang, M. Zhang, X. Xie, and R. Wang, “Mastor: A multi-agent approach to semantic test oracle generation for restful apis,”arXiv preprint arXiv:2606.10465, 2026. 25

Pith/arXiv arXiv 2026
[69]

Data-driven grasp synthesis—a survey,

J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,”IEEE Transactions on robotics, vol. 30, no. 2, pp. 289–309, 2013

2013
[70]

Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,

J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,”arXiv preprint arXiv:1703.09312, 2017

Pith/arXiv arXiv 2017
[71]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, Ieee, 2009

2009
[72]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean conference on computer vision, pp. 740–755, Springer, 2014

2014
[73]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002

2002
[74]

Rouge: A package for automatic evaluation of summaries,

C.-Y . Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, pp. 74–81, 2004

2004
[75]

Exploring the limits of vision-language-action manipulations in cross-task generalization,

J. Zhou, K. Ye, J. Liu, T. Ma, Z. Wang, R. Qiu, K.-Y . Lin, Z. Zhao, and J. Liang, “Exploring the limits of vision-language-action manipulations in cross-task generalization,”arXiv preprint arXiv:2505.15660, 2025

arXiv 2025
[76]

From intention to execution: Probing the generalization boundaries of vision-language-action mod- els,

I. Fang, J. Zhang, S. Tong, and C. Feng, “From intention to execution: Probing the generalization boundaries of vision-language-action mod- els,”arXiv preprint arXiv:2506.09930, 2025

arXiv 2025
[77]

Task reconstruction and extrapolation for\pi 0using text latent,

Q. Li, “Task reconstruction and extrapolation for\pi 0using text latent,”arXiv preprint arXiv:2505.03500, 2025

Pith/arXiv arXiv 2025
[78]

Rlbench: The robot learning benchmark & learning environment,

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020

2020
[79]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,”IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022

2022
[80]

Ladev: A language-driven testing and evaluation platform for vision- language-action models in robotic manipulation,

Z. Wang, Z. Zhou, J. Song, Y . Huang, Z. Shu, and L. Ma, “Ladev: A language-driven testing and evaluation platform for vision- language-action models in robotic manipulation,”arXiv preprint arXiv:2410.05191, 2024

arXiv 2024

Showing first 80 references.

[1] [1]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid,et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning, pp. 2165–2183, PMLR, 2023

2023

[2] [2]

Openvla: An open- source vision-language-action model,

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Openvla: An open- source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[3] [3]

Gr00t n1: An open foundation model for generalist humanoid robots,

NVIDIA, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Y...

2025

[4] [4]

π 0: A vision-language-action flow model for general robot control,

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, “π 0: A vision-language-action flow model for general robot control,” 2024

2024

[5] [5]

Eo- 1: Interleaved vision-text-action pretraining for general robot control,

D. Qu, H. Song, Q. Chen, Z. Chen, X. Gao, X. Ye, Q. Lv, M. Shi, G. Ren, C. Ruan, M. Yao, H. Yang, J. Bao, B. Zhao, and D. Wang, “Eo- 1: Interleaved vision-text-action pretraining for general robot control,” 2025

2025

[6] [6]

Vlatest: Testing and evaluating vision-language-action models for robotic ma- nipulation,

Z. Wang, Z. Zhou, J. Song, Y . Huang, Z. Shu, and L. Ma, “Vlatest: Testing and evaluating vision-language-action models for robotic ma- nipulation,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 1615–1638, 2025

2025

[7] [7]

Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks,

S. Zhang, Z. Xu, P. Liu, X. Yu, Y . Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y .-G. Jiang,et al., “Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11142–11152, 2025

2025

[8] [8]

Libero: Benchmarking knowledge transfer for lifelong robot learning,

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”arXiv preprint arXiv:2306.03310, 2023

Pith/arXiv arXiv 2023

[9] [9]

Evaluating real-world robot manipulation policies in simulation,

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao, “Evaluating real-world robot manipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024

Pith/arXiv arXiv 2024

[10] [10]

Evaluating uncertainty and quality of visual language action-enabled robots,

P. Valle, C. Lu, S. Ali, and A. Arrieta, “Evaluating uncertainty and quality of visual language action-enabled robots,”arXiv preprint arXiv:2507.17049, 2025

arXiv 2025

[11] [11]

The oracle problem in software testing: A survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,”IEEE transactions on software engineering, vol. 41, no. 5, pp. 507–525, 2014

2014

[12] [12]

Libero-plus: A progressive robustness benchmark for visual-language-action models,

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei,et al., “Libero-plus: A progressive robustness benchmark for visual-language-action models,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pp. 38574–38583, 2026

2026

[13] [13]

Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu, “Robocasa: Large-scale simulation of ev- eryday tasks for generalist robots,”arXiv preprint arXiv:2406.02523, 2024

Pith/arXiv arXiv 2024

[14] [14]

Gradient-based learning applied to document recognition,

Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 2002

2002

[15] [15]

Imagenet classification with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,”Advances in neural informa- tion processing systems, vol. 25, 2012

2012

[16] [16]

Very deep convolutional networks for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014

Pith/arXiv arXiv 2014

[17] [17]

Going deeper with convolutions,

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015

2015

[18] [18]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

2017

[19] [19]

Gpt-4 technical report,

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[20] [20]

Gemini: a family of highly capable multimodal models,

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican,et al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023

[21] [21]

The llama 3 herd of models,

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan,et al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024

[22] [22]

Mistral 7b,

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. 24

2023

[23] [23]

Learning fine- grained bimanual manipulation with low-cost hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023

[24] [24]

Nebula: Do we evaluate vision-language-action agents correctly?,

J. Peng, Y . Zhang, Y . Duan, T. Liang, V . Chaudhary, and Y . Yin, “Nebula: Do we evaluate vision-language-action agents correctly?,” 2025. https: //arxiv.org/abs/2510.16263

arXiv 2025

[25] [25]

World models,

D. Ha and J. Schmidhuber, “World models,”arXiv preprint arXiv:1803.10122, vol. 2, no. 3, p. 440, 2018

Pith/arXiv arXiv 2018

[26] [26]

A step toward world models: A survey on robotic manipulation,

P.-F. Zhang, Y . Cheng, X. Sun, S. Wang, F. Li, L. Zhu, and H. T. Shen, “A step toward world models: A survey on robotic manipulation,”arXiv preprint arXiv:2511.02097, 2025

arXiv 2025

[27] [27]

Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer,

G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, et al., “Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer,”arXiv preprint arXiv:2510.03342, 2025

Pith/arXiv arXiv 2025

[28] [28]

Cosmos world foundation model platform for physical ai,

N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chattopad- hyay, Y . Chen, Y . Cui, Y . Ding,et al., “Cosmos world foundation model platform for physical ai,”arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025

[29] [29]

Understanding world or predicting future? a comprehensive survey of world models,

J. Ding, Y . Zhang, Y . Shang, Y . Zhang, Z. Zong, J. Feng, Y . Yuan, H. Su, N. Li, N. Sukiennik,et al., “Understanding world or predicting future? a comprehensive survey of world models,”ACM Computing Surveys, vol. 58, no. 3, pp. 1–38, 2025

2025

[30] [30]

Deepseek-v3 technical report,

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan,et al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024

[31] [31]

Openai gpt-5 system card,

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram,et al., “Openai gpt-5 system card,”arXiv preprint arXiv:2601.03267, 2025

Pith/arXiv arXiv 2025

[32] [32]

Introducing mistral 3 — mistral ai,

M. AI, “Introducing mistral 3 — mistral ai,” December 2025. https: //mistral.ai/news/mistral-3/

2025

[33] [33]

Guidelines for empirical studies in software engineering involving large language models,

S. Baltes, F. Angermeir, C. Arora, M. M. Bar ´on, C. Chen, L. B ¨ohme, F. Calefato, N. Ernst, D. Falessi, B. Fitzgerald,et al., “Guidelines for empirical studies in software engineering involving large language models,”arXiv preprint arXiv:2508.15503, 2025

Pith/arXiv arXiv 2025

[34] [34]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pp. 3982–3992, 2019

2019

[35] [35]

Semeval- 2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,

D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “Semeval- 2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,” inProceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp. 1–14, 2017

2017

[36] [36]

Bertscore: Evaluating text generation with bert,

T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “Bertscore: Evaluating text generation with bert,”arXiv preprint arXiv:1904.09675, 2019

Pith/arXiv arXiv 1904

[37] [37]

Rudolph,Convergence properties of evolutionary algorithms

G. Rudolph,Convergence properties of evolutionary algorithms. Verlag Dr. Kovaˇc, 1997

1997

[38] [38]

On the analysis of the (1+ 1) evolutionary algorithm,

S. Droste, T. Jansen, and I. Wegener, “On the analysis of the (1+ 1) evolutionary algorithm,”Theoretical Computer Science, vol. 276, no. 1- 2, pp. 51–81, 2002

2002

[39] [39]

Search-based software engineering,

M. Harman and B. F. Jones, “Search-based software engineering,” Information and software Technology, vol. 43, no. 14, pp. 833–839, 2001

2001

[40] [40]

A. E. Eiben and J. E. Smith,Introduction to evolutionary computing. Springer, 2015

2015

[41] [41]

Genetic algorithms in search, optimization, and ma- chine learning. addison,

D. E. Goldberg, “Genetic algorithms in search, optimization, and ma- chine learning. addison,”Reading, 1989

1989

[42] [42]

J. H. Holland,Adaptation in natural and artificial systems: an intro- ductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992

1992

[43] [43]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to algorithms. MIT press, 2022

2022

[44] [44]

How genetic algorithms really work: I. mutation and hillclimbing,

H. Muhlenbein, “How genetic algorithms really work: I. mutation and hillclimbing,” inProc. 2nd Int. Conf. on Parallel Problem Solving from Nature, 1992, Elsevier, 1992

1992

[45] [45]

Sentence embedding models for similarity detection of software requirements,

S. Das, N. Deb, A. Cortesi, and N. Chaki, “Sentence embedding models for similarity detection of software requirements,”SN Computer Science, vol. 2, no. 2, p. 69, 2021

2021

[46] [46]

On termination criteria of evolutionary algorithms,

B. J. Jain, H. Pohlheim, and J. Wegener, “On termination criteria of evolutionary algorithms,” inProceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation, pp. 768–768, 2001

2001

[47] [47]

An orthogonal genetic algorithm with quantization for global numerical optimization,

Y .-W. Leung and Y . Wang, “An orthogonal genetic algorithm with quantization for global numerical optimization,”IEEE Transactions on Evolutionary computation, vol. 5, no. 1, pp. 41–53, 2001

2001

[48] [48]

An empirical study of the non-determinism of chatgpt in code generation,

S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “An empirical study of the non-determinism of chatgpt in code generation,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025

2025

[49] [49]

Evaluation and benchmarking of llm agents: A survey,

M. Mohammadi, Y . Li, J. Lo, and W. Yip, “Evaluation and benchmarking of llm agents: A survey,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pp. 6129– 6139, 2025

2025

[50] [50]

Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,

Q. Xu, G. Wang, L. Briand, and K. Liu, “Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,”ACM Transactions on Software Engineering and Methodology, 2026

2026

[51] [51]

Towards standardized bench- marks of llms in software modeling tasks: a conceptual framework: J. c´amara et al.,

J. C ´amara, L. Burgue ˜no, and J. Troya, “Towards standardized bench- marks of llms in software modeling tasks: a conceptual framework: J. c´amara et al.,”Software and Systems Modeling, vol. 23, no. 6, pp. 1309– 1318, 2024

2024

[52] [52]

Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’sd indices the most appropriate choices,

J. Romano, J. D. Kromrey, J. Coraggio, J. Skowronek, and L. Devine, “Exploring methods for evaluating group differences on the nsse and other surveys: Are the t-test and cohen’sd indices the most appropriate choices,” inannual meeting of the Southern Association for Institutional Research, pp. 1–51, Citeseer, 2006

2006

[53] [53]

W. G. Cochran,Sampling techniques. john wiley & sons, 1977

1977

[54] [54]

Spatialvla: Exploring spatial representations for visual-language-action model,

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang,et al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025

[55] [55]

Thinkact: Vision-language-action reasoning via reinforced visual latent planning,

C.-P. Huang, Y .-H. Wu, M.-H. Chen, F. Wang, and F.-E. Yang, “Thinkact: Vision-language-action reasoning via reinforced visual latent planning,”Advances in Neural Information Processing Systems, vol. 38, pp. 82782–82802, 2026

2026

[56] [56]

Automated test oracles: A survey,

M. Pezze and C. Zhang, “Automated test oracles: A survey,” inAdvances in computers, vol. 95, pp. 1–48, Elsevier, 2014

2014

[57] [57]

Automatic generation of oracles for exceptional behaviors,

A. Goffi, A. Gorla, M. D. Ernst, and M. Pezz `e, “Automatic generation of oracles for exceptional behaviors,” inProceedings of the 25th in- ternational symposium on software testing and analysis, pp. 213–224, 2016

2016

[58] [58]

Evolutionary improvement of assertion oracles,

V . Terragni, G. Jahangirova, P. Tonella, and M. Pezz `e, “Evolutionary improvement of assertion oracles,” inProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1178–1189, 2020

2020

[59] [59]

Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,

C. Menghi, S. Nejati, K. Gaaloul, and L. C. Briand, “Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,” inProceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp. 27–38, 2019

2019

[60] [60]

Defining and generating multi-level and uncertainty-wise test oracles for cyber-physical systems: P. valle et al.,

P. Valle, A. Arrieta, L. Han, S. Ali, and T. Yue, “Defining and generating multi-level and uncertainty-wise test oracles for cyber-physical systems: P. valle et al.,”Software and Systems Modeling, vol. 24, no. 3, pp. 679– 704, 2025

2025

[61] [61]

Toga: A neural method for test oracle generation,

E. Dinella, G. Ryan, T. Mytkowicz, and S. K. Lahiri, “Toga: A neural method for test oracle generation,” inProceedings of the 44th Interna- tional Conference on Software Engineering, pp. 2130–2141, 2022

2022

[62] [62]

Using large language models to generate junit tests: An empirical study,

M. L. Siddiq, J. C. Da Silva Santos, R. H. Tanvir, N. Ulfat, F. Al Rifat, and V . Carvalho Lopes, “Using large language models to generate junit tests: An empirical study,” inProceedings of the 28th international con- ference on evaluation and assessment in software engineering, pp. 313– 322, 2024

2024

[63] [63]

Togll: Correct and strong test oracle generation with llms,

S. B. Hossain and M. B. Dwyer, “Togll: Correct and strong test oracle generation with llms,” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pp. 1475–1487, IEEE, 2025

2025

[64] [64]

Improving deep assertion generation via fine-tuning retrieval-augmented pre-trained language models,

Q. Zhang, C. Fang, Y . Zheng, Y . Zhang, Y . Zhao, R. Huang, J. Zhou, Y . Yang, T. Zheng, and Z. Chen, “Improving deep assertion generation via fine-tuning retrieval-augmented pre-trained language models,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 7, pp. 1–23, 2025

2025

[65] [65]

Chatassert: Llm-based test oracle generation with external tools assistance,

I. Hayet, A. Scott, and M. d’Amorim, “Chatassert: Llm-based test oracle generation with external tools assistance,”IEEE Transactions on Software Engineering, vol. 51, no. 1, pp. 305–319, 2024

2024

[66] [66]

Augmentest: Enhancing tests with llm-driven oracles,

S. M. Khandaker, F. Kifetew, D. Prandi, and A. Susi, “Augmentest: Enhancing tests with llm-driven oracles,” in2025 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 279–289, IEEE, 2025

2025

[67] [67]

Nexus: Execution-grounded multi-agent test oracle synthesis,

D. Huang, M. Du, J. M. Zhang, Z. Lin, M. Luo, Q. Zhang, and S.- K. Ng, “Nexus: Execution-grounded multi-agent test oracle synthesis,” arXiv preprint arXiv:2510.26423, 2025

arXiv 2025

[68] [68]

Mastor: A multi-agent approach to semantic test oracle generation for restful apis,

S. Deng, R. Huang, Z. Yang, M. Zhang, X. Xie, and R. Wang, “Mastor: A multi-agent approach to semantic test oracle generation for restful apis,”arXiv preprint arXiv:2606.10465, 2026. 25

Pith/arXiv arXiv 2026

[69] [69]

Data-driven grasp synthesis—a survey,

J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,”IEEE Transactions on robotics, vol. 30, no. 2, pp. 289–309, 2013

2013

[70] [70]

Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,

J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,”arXiv preprint arXiv:1703.09312, 2017

Pith/arXiv arXiv 2017

[71] [71]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, Ieee, 2009

2009

[72] [72]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean conference on computer vision, pp. 740–755, Springer, 2014

2014

[73] [73]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002

2002

[74] [74]

Rouge: A package for automatic evaluation of summaries,

C.-Y . Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, pp. 74–81, 2004

2004

[75] [75]

Exploring the limits of vision-language-action manipulations in cross-task generalization,

J. Zhou, K. Ye, J. Liu, T. Ma, Z. Wang, R. Qiu, K.-Y . Lin, Z. Zhao, and J. Liang, “Exploring the limits of vision-language-action manipulations in cross-task generalization,”arXiv preprint arXiv:2505.15660, 2025

arXiv 2025

[76] [76]

From intention to execution: Probing the generalization boundaries of vision-language-action mod- els,

I. Fang, J. Zhang, S. Tong, and C. Feng, “From intention to execution: Probing the generalization boundaries of vision-language-action mod- els,”arXiv preprint arXiv:2506.09930, 2025

arXiv 2025

[77] [77]

Task reconstruction and extrapolation for\pi 0using text latent,

Q. Li, “Task reconstruction and extrapolation for\pi 0using text latent,”arXiv preprint arXiv:2505.03500, 2025

Pith/arXiv arXiv 2025

[78] [78]

Rlbench: The robot learning benchmark & learning environment,

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020

2020

[79] [79]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,”IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022

2022

[80] [80]

Ladev: A language-driven testing and evaluation platform for vision- language-action models in robotic manipulation,

Z. Wang, Z. Zhou, J. Song, Y . Huang, Z. Shu, and L. Ma, “Ladev: A language-driven testing and evaluation platform for vision- language-action models in robotic manipulation,”arXiv preprint arXiv:2410.05191, 2024

arXiv 2024