Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (4)
roles: method (1)
polarities: use method (1)
citing papers explorer
- RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements
  RESTestBench shows that LLM-generated REST API test effectiveness drops when interacting with faulty or mutated code, especially for vague requirements, indicating that high-detail requirements make direct SUT interaction unnecessary.
- VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
  VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty. It was tested on GPT and Gemini across four tasks and over 1000 videos, with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie
- Treating Run-time Execution History as a First-Class Citizen: Co-Versioning Run-time Behavior alongside Code
  Behavioral Co-Versioning couples Git history with a queryable Behavioral Archive of run-time observations to enable semantic diffing and behavior-aware analysis of software evolution.
- Towards Reliable Testing of Machine Unlearning
  Causal fuzzing with budgeted interventions can detect residual direct and indirect influence of unlearned data that standard attribution methods miss due to proxies, cancellations, and masking.