An orchestrator-driven agentic pipeline using general coding LLMs autoformalizes 32 PutnamBench problems and the main theorems plus proofs from five STOC papers into Lean 4, with two proofs using only the kernel.
On the evaluation of large language models in unit test generation
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
LLM-based methods achieve higher F1-scores than traditional approaches for equivalent mutant detection in Java and C, with fine-tuned code embeddings performing best and showing cross-lingual generalization.
LLM workflow with library-aware doubles and iterative compile-repair produces compilable unit tests for 73 of 76 OpenSIL functions and reaches 98.8% line coverage on a guided subset.
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
EvoSuite produced at least one fail-to-pass test for 36% of PRs versus 13% for GPT-4o, but both tools generated no meaningful change-capturing tests for 64% of the PRs evaluated.
A static-plus-dynamic analysis technique extracts isolated unit tests from integration tests to improve test suite structure in Node.js projects.
AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.
citing papers explorer
-
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.