CodeAlchemy: Synthetic Code Rewriting at Scale
Pith reviewed 2026-06-27 16:09 UTC · model grok-4.3
The pith
Synthetic rewriting of public code lets 3B models outperform 27B and 32B frontier models on code benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeAlchemy transforms publicly sourced code into semantically rich training data using five strategies: CodeEnhance for quality-aware rewriting, CodeQA for template-based problems, CodeDev for developer tasks, CodeDialogue for multi-turn conversations, and CodeTrace for execution traces. Processing 3 corpora across 15 languages yields 500B+ synthetic tokens plus 350B reasoning tokens. The resulting 3B models reach 83.5% on HumanEval, 63.2% on MBPP, an 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming much larger models such as the 27B Gemma-3 and 32B Granite-4.0. On TraceEval, frontier models such as Claude Sonnet 4.5 reach only 5.6% exact match.
What carries the argument
The CodeAlchemy pipeline: five data-transformation strategies that enrich raw code with semantic, task-oriented, and execution-based signals.
If this is right
- Small models can achieve high performance on code tasks without needing massive scale if trained on appropriately transformed data.
- Execution trace prediction reveals significant gaps in semantic understanding for even the largest models.
- New benchmarks like DevEval and TraceEval provide better measures of real-world code capabilities.
- Synthetic data generation at this scale is feasible from public code corpora.
- Performance on HumanEval and MBPP can be pushed higher through these methods.
Where Pith is reading between the lines
- If the gains hold, focusing on data synthesis could be more efficient than scaling model size for specialized domains like code.
- This approach might generalize to other structured data types beyond code.
- One could test whether applying CodeTrace to other programming languages or libraries further improves library knowledge.
- The low scores on TraceEval suggest that current training paradigms miss key aspects of program semantics.
Load-bearing premise
The performance improvements are due to the richer semantic signals from the synthetic strategies rather than differences in training procedures, model architecture, or benchmark construction.
What would settle it
Training an identical 3B model on an equal volume of the original raw code, without the five rewriting strategies, and checking whether it matches or exceeds the reported benchmark scores.
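One way to pin this control down is as a pair of run specifications that differ only in their data source. The sketch below is illustrative: `RunSpec`, its field names, and every placeholder value are hypothetical and not taken from the paper. The point is only that a check like `differs_only_in_data` should pass before the two runs are compared.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical description of one training run (field names are assumptions)."""
    data_source: str   # "raw" or "synthetic"
    token_budget: int  # total training tokens
    params: int        # model size in parameters
    seed: int
    lr_schedule: str


def differs_only_in_data(a: RunSpec, b: RunSpec) -> bool:
    """True when the two runs are identical except for the data source."""
    da, db = asdict(a), asdict(b)
    changed = {key for key in da if da[key] != db[key]}
    return changed == {"data_source"}


# Placeholder values for illustration only; not figures from the paper.
treatment = RunSpec("synthetic", 100_000_000_000, 3_000_000_000, 0, "cosine")
control = RunSpec("raw", 100_000_000_000, 3_000_000_000, 0, "cosine")
print(differs_only_in_data(treatment, control))
```

Any gap between the two runs can then be attributed to the data transformation rather than to token count, seed, or schedule.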
Original abstract
Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.
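The abstract quotes two different TraceEval metrics: exact match for the frontier model and ROUGE-2 for the 3B models. The sketch below shows how the two behave on the same prediction. It assumes whitespace tokenization and a bigram-overlap F1, which is one common ROUGE-2 variant; the paper's exact scoring setup is not stated here, and the example strings are made up.

```python
from collections import Counter


def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(prediction: str, reference: str) -> float:
    """Bigram-overlap F1 on whitespace tokens (an assumed ROUGE-2 variant)."""
    pred = bigrams(prediction.split())
    ref = bigrams(reference.split())
    overlap = sum((pred & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == reference.strip()


# Made-up strings standing in for a predicted and a reference trace.
reference = "a b c d"
prediction = "a b c e"
print(exact_match(prediction, reference), round(rouge2_f1(prediction, reference), 2))
```

A prediction that gets three of four tokens right scores zero on exact match but about 0.67 on this ROUGE-2 variant, so figures reported under the two metrics are not directly comparable.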
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeAlchemy, a framework that applies five strategies (CodeEnhance, CodeQA, CodeDev, CodeDialogue, CodeTrace) to three public code corpora across 15 languages, generating over 850B tokens of synthetic and reasoning data. It reports that 3B models trained on this data reach 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on the introduced DevEval benchmark, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models up to 10x larger, while also showing that even large models score low (e.g., 5.6% exact match) on TraceEval.
Significance. If the performance attribution to the synthetic strategies is substantiated, the work would be significant for showing that large-scale, semantically targeted synthetic code data can close capability gaps for small models on code tasks and for introducing DevEval and TraceEval as diagnostics that expose limitations in current models' execution and developer-task understanding. The reported scale (1.3M+ instrumented files, 500B+ synthetic tokens) is a notable engineering contribution.
Major comments (3)
- [§4] Experimental results and abstract: The headline performance numbers for the 3B models are presented without any ablation that trains an identical 3B model on an equal volume of raw tokens drawn from the same three source corpora (without the five rewriting/QA/trace steps). This control is required to support the central claim that the strategies supply richer semantic signals rather than gains arising from token count or unstated training details.
- [§3.5] CodeTrace description: The claim that CodeTrace captures control flow, state tracking, and library knowledge rests on instrumenting 1.3M+ files, but the section provides no quantitative details on execution success rate, error filtering rules, or coverage statistics across the 5K libraries; without these, it is unclear whether the 350B reasoning tokens are high-quality or dominated by failed/partial traces.
- [§4.3] Benchmark construction: DevEval and TraceEval are introduced as new evaluations, yet the paper does not report inter-annotator agreement, contamination checks against the training corpora, or exclusion criteria for the test items; these omissions affect the interpretability of the 8.09% win rate and 15.36 ROUGE-2 scores.
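The statistics requested for §3.5 are cheap to compute once each execution attempt is logged. A minimal sketch, assuming a hypothetical per-file record with `language`, `status`, and `libraries` fields (the paper does not describe its logging schema, and the records below are toy data):

```python
from collections import Counter, defaultdict


def trace_stats(records):
    """Per-language success rate and count of distinct libraries seen in successful runs."""
    attempts, successes = Counter(), Counter()
    libs = defaultdict(set)
    for rec in records:
        lang = rec["language"]
        attempts[lang] += 1
        if rec["status"] == "ok":
            successes[lang] += 1
            libs[lang].update(rec["libraries"])
    return {
        lang: {
            "success_rate": successes[lang] / attempts[lang],
            "libraries_covered": len(libs[lang]),
        }
        for lang in attempts
    }


# Toy records for illustration only; not data from the paper.
records = [
    {"language": "python", "status": "ok", "libraries": ["numpy"]},
    {"language": "python", "status": "timeout", "libraries": ["pandas"]},
    {"language": "go", "status": "ok", "libraries": []},
]
stats = trace_stats(records)
print(stats)
```

Reporting these two numbers per language would show directly whether failed or partial traces make up a large share of the pipeline's output.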
Minor comments (2)
- [§2] The abstract and §2 cite prior synthetic-data work but omit direct comparison tables showing token volumes or strategy differences relative to the largest prior code-synthesis efforts.
- [§3] Notation for the five strategies is introduced in §3 but not consistently referenced in the results tables, making it hard to map which strategy contributes to which benchmark gain.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for strengthening the attribution of gains to our synthetic strategies, the documentation of CodeTrace, and the validation of the new benchmarks. We address each major comment below and will incorporate revisions accordingly.
Point-by-point responses
- Referee: [§4] Experimental results and abstract: The headline performance numbers for the 3B models are presented without any ablation that trains an identical 3B model on an equal volume of raw tokens drawn from the same three source corpora (without the five rewriting/QA/trace steps). This control is required to support the central claim that the strategies supply richer semantic signals rather than gains arising from token count or unstated training details.
  Authors: We agree that this ablation is essential to isolate the contribution of the five strategies from raw token volume or training details. In the revised manuscript, we will add results from training an identical 3B model on an equivalent volume of raw tokens from the three source corpora using the same training configuration. This will directly test the claim that the rewriting, QA, developer-task, dialogue, and trace strategies provide richer semantic signals. revision: yes
- Referee: [§3.5] CodeTrace description: The claim that CodeTrace captures control flow, state tracking, and library knowledge rests on instrumenting 1.3M+ files, but the section provides no quantitative details on execution success rate, error filtering rules, or coverage statistics across the 5K libraries; without these, it is unclear whether the 350B reasoning tokens are high-quality or dominated by failed/partial traces.
  Authors: We acknowledge that quantitative details on execution success, filtering, and coverage are needed to substantiate the quality of the traces. We will expand §3.5 in the revision to include these statistics (success rates, error filtering rules, and library coverage across the 5K libraries), which were generated during the instrumentation process but omitted for space. This will clarify that the 350B reasoning tokens are not dominated by failed traces. revision: yes
- Referee: [§4.3] Benchmark construction: DevEval and TraceEval are introduced as new evaluations, yet the paper does not report inter-annotator agreement, contamination checks against the training corpora, or exclusion criteria for the test items; these omissions affect the interpretability of the 8.09% win rate and 15.36 ROUGE-2 scores.
  Authors: We agree these details are required for full interpretability of the new benchmarks. In the revised §4.3, we will report inter-annotator agreement scores, describe the contamination checks performed against the training corpora, and specify the exclusion criteria used for test items. These additions will strengthen confidence in the reported DevEval win rate and TraceEval ROUGE-2 scores. revision: yes
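A contamination check of the kind promised for §4.3 is often done by n-gram overlap between each test item and the training corpus. The sketch below is one such approach, not the authors' method: the whitespace tokenization, the default `n = 8`, and the function names are all assumptions, and a real corpus would need an index rather than a linear scan.

```python
def ngrams(text: str, n: int) -> set:
    """Set of n-token windows over whitespace tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_item: str, corpus_docs, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in any corpus document."""
    item = ngrams(test_item, n)
    if not item:
        return 0.0  # item shorter than n tokens: nothing to compare
    seen = set()
    for doc in corpus_docs:
        seen |= item & ngrams(doc, n)
    return len(seen) / len(item)


# Toy strings for illustration only.
item = "def add(a, b): return a + b"
leaky_corpus = ["x = 1", "def add(a, b): return a + b  # helper"]
clean_corpus = ["x = 1", "print('hello')"]
print(contamination_rate(item, leaky_corpus, n=3))
print(contamination_rate(item, clean_corpus, n=3))
```

Test items whose rate exceeds a chosen threshold would be excluded or reported separately; the threshold itself is a judgment call that the revised paper would need to state.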
Circularity Check
No significant circularity; purely empirical claims without derivations or self-referential reductions
Full rationale
The paper describes an empirical pipeline for generating 500B+ synthetic tokens via five named strategies (CodeEnhance, CodeQA, etc.) applied to public corpora, followed by model training and benchmark reporting (HumanEval 83.5%, MBPP 63.2%, etc.). No equations, parameter fits, uniqueness theorems, or derivation chains appear in the provided text. Performance attribution is presented as an experimental outcome rather than a quantity defined in terms of the input data or prior self-citations. The central premise (synthetic rewriting supplies richer signals) is an empirical hypothesis tested via reported results, not a self-definitional or fitted-input reduction. This matches the default expectation of no circularity for non-derivational empirical work.