On the evaluation of large language models in unit test generation

Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, Junjie Chen · 2024 · arXiv 2406.18181

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics

cs.AI · 2026-06-30 · accept · novelty 7.0 · 2 refs

An orchestrator-driven agentic pipeline using general coding LLMs autoformalizes 32 PutnamBench problems and the main theorems plus proofs from five STOC papers into Lean 4, with two proofs using only the kernel.

Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

LLM-based methods achieve higher F1-scores than traditional approaches for equivalent mutant detection in Java and C, with fine-tuned code embeddings performing best and showing cross-lingual generalization.

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

cs.SE · 2026-06-18 · unverdicted · novelty 6.0

LLM workflow with library-aware doubles and iterative compile-repair produces compilable unit tests for 73 of 76 OpenSIL functions and reaches 98.8% line coverage on a guided subset.

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization

cs.SE · 2026-05-08 · unverdicted · novelty 6.0

SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.

PR-Aware Automated Unit Test Generation: Challenges and Opportunities

cs.SE · 2026-05-24 · unverdicted · novelty 5.0

EvoSuite produced at least one fail-to-pass test for 36% of PRs versus 13% for GPT-4o, but both tools generated no meaningful change-capturing tests for 64% of the PRs evaluated.

Augmenting unit test suites from integration tests

cs.SE · 2026-04-19 · unverdicted · novelty 5.0

A static-plus-dynamic analysis technique extracts isolated unit tests from integration tests to improve test suite structure in Node.js projects.

AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation

cs.SE · 2025-06-10 · unverdicted · novelty 5.0

AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.

citing papers explorer

Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics · cs.AI · 2026-06-30 · accept · none · ref 14 · 2 links
Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study · cs.SE · 2026-07-01 · unverdicted · none · ref 111
Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware · cs.SE · 2026-06-18 · unverdicted · none · ref 1
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization · cs.SE · 2026-05-08 · unverdicted · none · ref 88
PR-Aware Automated Unit Test Generation: Challenges and Opportunities · cs.SE · 2026-05-24 · unverdicted · none · ref 31
Augmenting unit test suites from integration tests · cs.SE · 2026-04-19 · unverdicted · none · ref 31
