Large language models for software engineering: Survey and open problems
Canonical reference. 100% of citing Pith papers cite this work as background.
Citation roles: background (5)
Citation polarities: background (5)
citing papers explorer
- Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs
  LLMs achieve strong results on syntax parsing tasks but show limited and variable performance on dynamic reasoning, with a clear performance hierarchy across model scales.
- CornerCase: Automated Extremal Testing of Protocol Implementations using LLMs
  CornerCase automates extremal testing of protocol implementations by using LLMs to extract validity constraints from specs and generating boundary test cases, uncovering 42 anomalies (26 acknowledged as bugs) in HTTP, DNS, BGP, SMTP, and QUIC implementations.
- On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
  Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
- A Taxonomy of Programming Languages for Code Generation
  The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
  LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
- StarCoder 2 and The Stack v2: The Next Generation
  StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
- Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
  LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
- From Theory to Practice: Code Generation Using LLMs for CAPEC and CWE Frameworks
  LLMs generated 615 vulnerable code snippets aligned with CAPEC and CWE frameworks across three languages, with 0.98 cosine similarity between model outputs.
- "Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering
  User study reveals nine LLM failure categories in SE tasks and quantifies abandonment factors from 26 participants.
- Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation
  LLM code generation lacks syntactic robustness on math-formula prompts, but formula-reduction pre-processing raises it from 54.05% to 74.42%.
- Sustainable Code Generation Using Large Language Models: A Systematic Literature Review
  A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.
- Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models
  Smaller LLMs produce functional but limited Python code with variable quantization effects and quality/maintainability concerns that require validation before use.
- A Survey on the Memory Mechanism of Large Language Model based Agents
  A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.