Canonical reference
Still just personal assistants? - A multiple case study of generative AI adoption in software organizations
Canonical reference. 100% of citing Pith papers cite this work as background.
citation-role summary: background 7
citation-polarity summary: background 7
citing papers explorer
- Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
  The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.
- Quantum Mutant Equivalence via Transpilation
  TBE identifies 32.1% of 92,011 equivalent surviving quantum mutants (29,536) via OpenQASM comparison after transpilation, reporting 100% precision and 82% accuracy on 348,299 mutants.
- ReproBreak: A Dataset of Reproducible Web Locator Breaks
  ReproBreak provides 449 verified locator breaks from real web test commits along with scripts to reproduce them automatically.
- LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda
  A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.
- Evaluation-Strategy Gap in Fault Diagnosis of Deep Learning Programs
  Using a corpus of 5542 fault-injected traces from 38 DL programs, the study finds a 0.19 balanced accuracy gap in fault diagnosis between within-program and cross-program evaluation caused by program-specific feature structures.
- From Early Adoption to Sustained Use: Understanding GenAI Usage Among Software Developers in Italian SMEs
  Mixed-methods study adapting UTAUT2 shows individual-level perceptions predict continued GenAI use in Italian SME developers (R²=0.647) while social and organisational factors do not.
- Exploring Creativity in Human-Human-LLM Collaborative Software Design
  Creativity in human-LLM collaborative software design emerges primarily from human traits and interactions, with LLMs providing supplementary novel ideas but occasionally hindering progress.
- The Impact of Documentation on Test Engagement in Pull Requests in OSS
  Documentation on testing in 160 OSS repositories shows a weak positive correlation (ρ=0.36) with higher test engagement ratios in pull requests, strengthening to moderate in high-activity repos.
- Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts
  STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.
- Software Engineering Podcasts: An Empirical Study of Their Potential as a Research Resource
  Empirical study of SE podcasts via content analysis and researcher survey to evaluate their value as a resource for empirical software engineering research.
- Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning
  MNAL reduces human effort in bug report labeling by up to 95.8% for readability and 196% for identifiability while improving identification performance and working with various neural models.
- Engineering for Crisis Management: A User-Centred Analysis of Disaster Mobile Applications
  Empirical study of 70 disaster apps finds heavy focus on response phase, limited preparedness/recovery support, and user issues in reliability/usability, yielding design recommendations.
- AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research