Beyond correctness: Benchmarking multi-dimensional code generation for large language models

· 2024 · arXiv 2407.11470

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 support 1

representative citing papers

Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

cs.SE · 2025-10-17 · unverdicted · novelty 7.0

LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.

MetaLint: Easy-to-Hard Generalization for Code Linting

cs.SE · 2025-07-15 · unverdicted · novelty 7.0

MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.

Subjective Code Preferences in Experts and Large Language Models

cs.HC · 2026-05-24 · unverdicted · novelty 6.0

LLMs frequently reverse their stated coding preferences when shown actual code instead of descriptions, show positional bias, and produce more polarized ratings than human experts on complexity, commenting, modularity, and readability.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

cs.SE · 2026-05-06 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs

cs.SE · 2026-06-07 · unverdicted · novelty 5.0

LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.

"Like Taking the Path of Least Resistance": Exploring the Impact of LLM Interaction on the Creative Process of Programming

cs.HC · 2026-05-13 · conditional · novelty 5.0

LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software cs.SE · 2025-10-17 · unverdicted · none · ref 52
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
MetaLint: Easy-to-Hard Generalization for Code Linting cs.SE · 2025-07-15 · unverdicted · none · ref 55
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
Subjective Code Preferences in Experts and Large Language Models cs.HC · 2026-05-24 · unverdicted · none · ref 12
LLMs frequently reverse their stated coding preferences when shown actual code instead of descriptions, show positional bias, and produce more polarized ratings than human experts on complexity, commenting, modularity, and readability.
LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs cs.SE · 2026-06-07 · unverdicted · none · ref 12
LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.

Beyond correctness: Benchmarking multi-dimensional code generation for large language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer