TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
Unit test update through LLM-driven context collection and error-type-aware refinement. arXiv preprint arXiv:2509.24419
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.SE (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates mobile app UIs and builds LLM-guided oracles to validate both correct and faulty user-requested functionalities across six apps.
citing papers explorer
- Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
- Automated Functional Testing for Malleable Mobile Application Driven from User Intent
ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates mobile app UIs and builds LLM-guided oracles to validate both correct and faulty user-requested functionalities across six apps.