On the evaluation of large language models in unit test generation
7 Pith papers cite this work.
citation-role summary
citation-polarity summary
roles
background 1
polarities
background 1
representative citing papers
AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.
citing papers explorer
- Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
An orchestrator-driven agentic pipeline using general coding LLMs autoformalizes 32 PutnamBench problems and the main theorems plus proofs from five STOC papers into Lean 4, with two proofs using only the kernel.
- Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study
LLM-based methods achieve higher F1-scores than traditional approaches for equivalent mutant detection in Java and C, with fine-tuned code embeddings performing best and showing cross-lingual generalization.
- Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware
LLM workflow with library-aware doubles and iterative compile-repair produces compilable unit tests for 73 of 76 OpenSIL functions and reaches 98.8% line coverage on a guided subset.
- Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
- PR-Aware Automated Unit Test Generation: Challenges and Opportunities
EvoSuite produced at least one fail-to-pass test for 36% of PRs versus 13% for GPT-4o, but both tools generated no meaningful change-capturing tests for 64% of the PRs evaluated.
- Augmenting unit test suites from integration tests
A static-plus-dynamic analysis technique extracts isolated unit tests from integration tests to improve test suite structure in Node.js projects.