SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
XC ode E val: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5roles
baseline 1polarities
baseline 1representative citing papers
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
CodeGolf Bench is a dynamic benchmark for LLM concise code generation in 60 languages, showing reasoning models reach 70.97% average human percentile on Python and C++ tasks while non-reasoning models lag.
citing papers explorer
-
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
-
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
-
Do not copy and paste! Rewriting strategies for code retrieval
Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
-
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
CodeGolf Bench is a dynamic benchmark for LLM concise code generation in 60 languages, showing reasoning models reach 70.97% average human percentile on Python and C++ tasks while non-reasoning models lag.
- Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment