An analysis and survey of the development of mutation testing
7 Pith papers cite this work.
citing papers explorer
- Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
  The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.
- Articulate but Wrong: Self-Review Failures in LLM-Based Code Modernization
  LLM code modernizers produce semantic drift in 39.7% of legacy-Python-2 cases and endorse 31.7% of those drifts in self-review, with rates varying widely across models but not tracking capability.
- Robust Mutation Analysis of Quantum Programs Under Noise
  Noise from quantum hardware simulators significantly alters mutant detection distances, making equivalent mutants harder to separate from faults, with output-distribution metrics reaching 73.03% accuracy and 74.89% F1-score under device-specific thresholds.
- Probabilistic Condition, Decision and Path Coverage of Circuit-based Quantum Programs
  Quantum circuits show high average condition (97.56%) and decision (97.63%) coverage but lower path coverage (71.84%), with probabilistic versions adding confidence levels (averages 88.87%, 88.65%, 37.18%); mutation testing reveals weak or no correlation between structural coverage and fault finding.
- Quality-Driven Selective Mutation for Deep Learning
  A dual-axis quality framework ranks DL mutation operators by statistical resistance and Jaccard-based realism to real faults, enabling up to 55.6% fewer mutants on held-out validation data without dropping baseline performance.
- QuanForge: A Mutation Testing Framework for Quantum Neural Networks
  QuanForge introduces statistical mutation killing and nine post-training mutation operators for QNNs to distinguish test suites and localize vulnerable circuit regions.
- MutDafny: A Mutation-Based Approach to Assess Dafny Specifications
  MutDafny uses 40 mutation operators on 794 real-world Dafny programs to detect weak specifications, manually confirming five such cases at a rate of one per 241 lines.