Journal of Educational and Behavioral Statistics 25, 101–132 (2000)
4 Pith papers cite this work.
fields
cs.SE 4
verdicts
UNVERDICTED 4
representative citing papers
ConcoLixir uses a reactive LLM oracle to improve line coverage in Python concolic testing by 8.6 to 17 percentage points on synthetic, real-world, and library targets.
PerGent, an agentic critique-refinement system for persona generation, reaches 96.9% expert approval in an industrial evaluation at Kinaxis and reproduces more pre-LLM expert content than single-shot baselines.
AutoSLO applies genetic programming inside a monitoring loop to evolve scaling policies that cut resource use in microservices while keeping SLO violations low and short-lived.
citing papers explorer
- Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
  LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
- ConcoLixir: Reactive LLM Discovery Oracles for Python Concolic Testing
  ConcoLixir uses a reactive LLM oracle to improve line coverage in Python concolic testing by 8.6 to 17 percentage points on synthetic, real-world, and library targets.
- Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation
  PerGent, an agentic critique-refinement system for persona generation, reaches 96.9% expert approval in an industrial evaluation at Kinaxis and reproduces more pre-LLM expert content than single-shot baselines.
- Genetic Programming for Self-Adaptive Auto-Scaling of Microservices
  AutoSLO applies genetic programming inside a monitoring loop to evolve scaling policies that cut resource use in microservices while keeping SLO violations low and short-lived.