The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
Canonical reference. 76% of citing Pith papers cite this work as background.
abstract
Generative AI tools hold promise to increase human productivity. This paper presents results from a controlled experiment with GitHub Copilot, an AI pair programmer. Recruited software developers were asked to implement an HTTP server in JavaScript as quickly as possible. The treatment group, with access to the AI pair programmer, completed the task 55.8% faster than the control group. Observed heterogenous effects show promise for AI pair programmers to help people transition into software development careers.
roles
background 17
citing papers explorer
- LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models
LibEvoBench benchmark shows LLMs are version-oblivious on evolving APIs, with documentation helping but version specification not.
- Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%
Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
- The software space of science
A network analysis of software mentions in 1.3 million papers identifies 520 tools in eight communities and shows disciplines maintain distinct, stable tool portfolios that are crystallizing toward common sets.
- AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
AgenticFlict is a public dataset of 29K+ textual merge conflicts from AI agent PRs, collected via merge simulation on 107K processed PRs and showing a 27.67% conflict rate with variation across agents.
- From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI
The paper introduces a Triple Debt Model with cognitive debt and intent debt alongside technical debt to address risks from generative AI in software development.
- Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
- "Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs
NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.
- Agentic Much? Adoption of Coding Agents on GitHub
Coding agents reached 22-29% adoption in GitHub projects within months of release, with agent-assisted commits larger and focused on features and bug fixes.
- Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming
Copilot boosts performance in brownfield tasks but decouples from comprehension unless users actively verify generated code, with verification frequency predicting understanding at r=0.96.
- From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
- Mise en Place for Agentic Coding: Deliberate Preparation as Context Engineering Methodology
The Mise en Place methodology uses contextual grounding, collaborative specification, and task decomposition to prepare AI agents for coding tasks, demonstrated in a hackathon where two hours of prep enabled rapid parallel development of a full-stack platform.
- Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study
Analysis of 9,799 human-reviewed agentic PRs shows only 35.7% of rejections reflect clear agent failures, with 31.2% due to workflow constraints and 33.1% lacking clear rationale, plus notable interaction differences across agents.
- One Developer Is All You Need: A Case Study of an AI-Augmented One-Person Squad in a Brownfield Enterprise
Case study reports one staff engineer with four AI agents delivering a four-person-scoped brownfield project in half the planned time under Spec-Driven Development, with high code acceptance and major cost savings.
- Multi-agent AI systems outperform human teams in creativity
Multi-agent LLM teams outperform human teams in creativity (d=1.50) across tasks by producing more novel ideas, with distinct semantic exploration patterns predicting success for each group.
- uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs
uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.
- Generative AI Fuels Solo Entrepreneurship, but Teams Still Lead at the Top
Generative AI boosted solo entrepreneurial entry on Product Hunt after ChatGPT but teams still dominate the top quality tiers.
- SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.
- HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems
HAAS combines governance rules with contextual bandits to adaptively allocate tasks across a five-mode autonomy spectrum, showing that moderate governance improves manufacturing outcomes and that no single setting dominates.
- Upskilling with Generative AI: Practices and Challenges for Freelance Knowledge Workers
Freelancers use generative AI to support exploratory skill acquisition but not as their main resource due to reliability issues, leading to a shift toward survival-oriented upskilling and the emergence of invisible competencies that lack market validation.
- Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
- Generative artificial intelligence reduces social welfare through model collapse
A game-theoretic model shows that individually rational adoption of generative AI causes model collapse that reduces collective social welfare for important tasks, with habit formation creating spillovers from low-stakes to high-value domains.
- BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications
BONSAI introduces a four-layer architecture and four-phase workflow for human-AI co-development of visual analytics applications, shown in case studies to enable efficient novel tool creation and reconstruction from paper descriptions.
- Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
- When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
- REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
- Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild
AI coding assistants introduce code issues that persist in 22.7% of cases across real projects, creating measurable long-term technical debt.
- Agentic Inequality
Introduces the concept of agentic inequality and develops a three-dimensional framework (availability, quality, quantity) to analyze how autonomous AI agents could deepen or mitigate existing divides through scalable goal delegation.
- PatchTrack: A Comprehensive Analysis of ChatGPT's Influence on Pull Request Outcomes
Empirical analysis of 338 PRs with self-admitted ChatGPT usage shows low full integration (median 25%), selective adaptation patterns, and broader influence on developer reasoning during reviews.
- The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot
Analysis of GitHub Copilot usage shows a 5.9% increase in project code contributions offset by 8% more coordination time, yielding net positive effects on code merges with varying impacts on core and peripheral developers.
- StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
- GPT-4 Technical Report
GPT-4 is a scaled Transformer model with post-training alignment that reaches human-level performance on academic and professional benchmarks via infrastructure enabling performance prediction from much smaller models.
- Context-Based Adversarial Attacks on AI Code Generators: Vulnerability Analysis and Implications
Context-based adversarial attacks raise vulnerable code generation in models like GPT-4 and CodeLlama from 3.5% to 37.4%, with 60-100% transferability, and a dual-layer defense reaches 89.1% detection at low false positives.
- The Impact of AI Coding Assistants on Software Engineering: A Longitudinal Study
Longitudinal surveys show AI coding assistants reduce time on code writing but increase supervisory verification tasks, with stable productivity perceptions yet rising reports of worsened developer experience.
- Assistance to Autonomy: A Systematic Literature Review of Agentic AI across the Software Development Life Cycle
Systematic review of agentic AI in the SDLC finds output verifiability drives industrial adoption in later phases, with Planner-Executor-Reviewer as the dominant pattern, plus a new multi-agent LLM screening pipeline for high-volume SLRs.
- A Generative AI Driven Interactive Narrative Serious Game for Stress Relief and Its Randomized Controlled Pilot Study
Pilot study of a ChatGPT-driven narrative game found significant stress reduction (p=0.016) and positive user experience among 20 stressed students.
- A meta-analysis of the effect of generative AI on productivity and learning in programming
Meta-analysis of 23 studies shows moderate productivity gains from GenAI coding assistants (Hedges' g=0.33) but no significant effect on learning (g=0.14).
- The Productivity-Reliability Paradox: Specification-Driven Governance for AI-Augmented Software Development
The Productivity-Reliability Paradox arises because AI code generators produce variable output while developers lack sufficient specification discipline, making governance models focused on specifications the binding constraint rather than model improvements.
- Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering
Agentic AI systems are shifting software engineering from line-level code generation to delegated repository-scale execution under supervision, with SWE-bench performance rising from 1.96% to 78.4% and productivity gains of 13.6-55.8%.
- Relationships Between Trust, Compliance, and Performance for Novice Programmers Using AI Code Generation
Among novice programmers using AI code generators, trust did not predict compliance with suggestions, while performance correlated with both compliance and increased subsequent trust.
- More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
- Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
Agentic Consensus replaces code as the main artifact with a typed property graph world model that maintains commitments and evidence through synchronization operators, shifting evaluation to alignment fidelity and consensus entropy.
- Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure
Sema Code decouples AI coding agents into a programmable npm library with eight mechanisms for isolation, queuing, compression, scheduling, permissions, and integration.
- Generative AI and Two-Tiered Online Mental Health Communities
A quasi-natural experiment on a leading OMHC finds that generative AI integration increases counselor public posting intensity, triggers heterogeneous responses by motivation type, and produces cross-tier spillovers to paid consultations.
- The AI Codebase Maturity Model: From Assisted Coding to Fully Autonomous Systems
The AI Codebase Maturity Model defines six sequential levels of AI-driven development based on feedback loop topologies, validated by experience reports showing 5x PR and 37x issue throughput gains from level 2 to level 6.
- Reproducibility Beyond Artifacts: Interactional Support for Collaborative Machine Learning
Collaborative ML reproducibility requires socio-technical interactional support beyond artifacts, demonstrated via a clinical deployment and addressed by a proposed two-layer system with an AI semantic interface.
- EcoAssist: Embedding Sustainability into AI-Assisted Frontend Development
EcoAssist embeds energy estimation and optimization into AI-assisted frontend coding, reducing website energy use by 13-16% in benchmarks while preserving developer productivity.
- The Fast and Spurious: Developer Productivity with GenAI
Survey of 415 developers finds GenAI accelerates coding output but redistributes effort into review and verification, making net productivity gains appear spurious at current adoption levels.
- Vibe Coding in Product Teams: Reconfiguring AI-Assisted Workflows, Prototyping, and Collaboration
Interviews reveal a four-stage vibe coding workflow that accelerates prototyping while introducing tensions between quick efficiency and reflective design intention, plus asymmetries in trust and ownership.
- Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap
Comparative review of AI coding tool ToS shows responsibility for code quality and compliance shifted to users, with policy misalignment for autonomous agents, plus a research roadmap.
- HiLSVA: Design and Evaluation of a Human-in-the-Loop Agentic System for Scientific Visualization
HiLSVA introduces a plan-first multi-agent LLM system for scientific visualization that incorporates explicit human oversight, stepwise provenance, and learn-at-test-time adaptation, evaluated via case studies and a 12-participant user study.