Ahilan Ayyachamy Nadar Ponnusamy
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
Citing papers
- ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java
  ScarfBench supplies 204 cross-framework Java migration tasks where the best agent passes only 15.3% of focused tests and 12.2% of whole-application tests.
- Reproduction Test Generation for Java SWE Issues
  Introduces the first benchmark for Java reproduction test generation from repository issues and adapts an existing Python tool to achieve high performance on it.
- Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
  Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations: rankings invert, no model is universally best, and inter-judge agreement is low.