
arxiv: 2606.10087 · v1 · pith:ZM42HKS5new · submitted 2026-06-08 · 💻 cs.CL · cs.LG

CodeAlchemy: Synthetic Code Rewriting at Scale

Pith reviewed 2026-06-27 16:09 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords synthetic data · code generation · large language models · code rewriting · execution traces · benchmarking · data augmentation · program understanding

The pith

Synthetic rewriting of public code lets 3B models outperform 27B and 32B frontier models on code benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CodeAlchemy, a framework that applies five strategies to convert raw code into semantically richer synthetic training data at massive scale. By generating over 850 billion tokens (500B+ synthetic tokens plus 350B reasoning tokens), including execution traces and multi-turn dialogues, from code in 15 languages, the authors train 3B models that outperform much larger models on standard code benchmarks and on two new benchmarks they introduce. A sympathetic reader would care because this suggests that data quality and format can matter more than model size for building capable code models, potentially lowering the cost of high-performing systems. The new benchmarks highlight gaps in current models' ability to track execution and handle developer tasks.

Core claim

CodeAlchemy transforms publicly sourced code into semantically-rich training data using CodeEnhance for quality-aware rewriting, CodeQA for template-based problems, CodeDev for developer tasks, CodeDialogue for multi-turn conversations, and CodeTrace for execution traces. Processing 3 corpora yields 500B+ synthetic tokens plus 350B reasoning tokens. The resulting 3B models reach 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming much larger models such as the 27B Gemma-3 and 32B Granite-4.0, while frontier models such as Claude Sonnet 4.5 reach only 5.6% exact match on TraceEval.
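The two TraceEval figures quoted from the paper use different metrics: exact match for the frontier models and ROUGE-2 for the 3B models. The sketch below shows how the two can diverge on the same prediction. It assumes whitespace tokenization and a bigram F1 variant of ROUGE-2; the paper's exact scoring setup is not given here, and the trace lines are made-up examples.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the two traces agree token for token (whitespace-normalized)."""
    return prediction.split() == reference.split()


def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(prediction: str, reference: str) -> float:
    """Bigram-overlap F1 between a predicted and a reference trace."""
    pred = bigrams(prediction.split())
    ref = bigrams(reference.split())
    overlap = sum((pred & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical trace lines; only the final state differs.
reference = "TRACE:LOOP:L3:i=0 TRACE:LOOP:L3:i=1 TRACE:RET:L5:sum=1"
prediction = "TRACE:LOOP:L3:i=0 TRACE:LOOP:L3:i=1 TRACE:RET:L5:sum=2"

print(exact_match(prediction, reference))  # False
print(rouge2_f1(prediction, reference))    # 0.5
```

A single wrong final state gives zero credit under exact match but partial credit under bigram overlap, so the two headline numbers are not directly comparable.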

What carries the argument

The CodeAlchemy pipeline consisting of the five data transformation strategies that enrich raw code with semantic, task-oriented, and execution-based signals.

If this is right

  • Small models can achieve high performance on code tasks without needing massive scale if trained on appropriately transformed data.
  • Execution trace prediction reveals significant gaps in semantic understanding for even the largest models.
  • New benchmarks like DevEval and TraceEval provide better measures of real-world code capabilities.
  • Synthetic data generation at this scale is feasible from public code corpora.
  • Performance on HumanEval and MBPP can be pushed higher through these methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the gains hold, focusing on data synthesis could be more efficient than scaling model size for specialized domains like code.
  • This approach might generalize to other structured data types beyond code.
  • One could test whether applying CodeTrace to other programming languages or libraries further improves library knowledge.
  • The low scores on TraceEval suggest that current training paradigms miss key aspects of program semantics.

Load-bearing premise

The performance improvements are due to the richer semantic signals from the synthetic strategies rather than differences in training procedures, model architecture, or benchmark construction.

What would settle it

Training an identical 3B model on the original raw code corpora without the five rewriting strategies and measuring if it matches or exceeds the reported benchmark scores.
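One way to set up the raw-code side of such a control is to draw raw documents until their token count matches the synthetic mix. This is a minimal sketch under stated assumptions: `count_tokens` is a whitespace placeholder for the model's real tokenizer, and `sample_token_matched` and the budget value are hypothetical names and numbers, not anything taken from the paper.

```python
import random


def count_tokens(text: str) -> int:
    # Placeholder: whitespace tokens stand in for the model's real tokenizer.
    return len(text.split())


def sample_token_matched(raw_docs, token_budget, seed=0):
    """Pick raw documents in a seeded random order without exceeding the budget.

    Returns the chosen document indices and the number of tokens used.
    """
    rng = random.Random(seed)
    order = list(range(len(raw_docs)))
    rng.shuffle(order)

    chosen, used = [], 0
    for idx in order:
        n = count_tokens(raw_docs[idx])
        if used + n > token_budget:
            continue  # skip documents that would overshoot the budget
        chosen.append(idx)
        used += n
        if used == token_budget:
            break
    return chosen, used


raw_docs = ["a b c", "d e", "f", "g h i j"]
chosen, used = sample_token_matched(raw_docs, token_budget=6)
print(chosen, used)
```

Fixing the seed keeps the raw baseline reproducible, so any benchmark difference can be attributed to the data format rather than to which documents happened to be drawn.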

Original abstract

Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.
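To make "capturing control flow and state tracking" concrete, the sketch below records which lines of a function run and the local variables visible at each step. It is a Python-only, in-process illustration using the standard `sys.settrace` hook, not the authors' instrumentation, which the abstract describes as covering 14 languages. The helper `trace_calls` and the toy function `running_total` are hypothetical.

```python
import sys


def trace_calls(func, *args):
    """Run func(*args) and record (line offset, locals snapshot) per executed line."""
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames belonging to other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, events


def running_total(values):
    total = 0
    for v in values:
        total += v
    return total


result, events = trace_calls(running_total, [2, 3])
for offset, local_vars in events:
    print(offset, local_vars)
```

The sequence of line offsets shows the loop being revisited (control flow), while the snapshots show `total` changing between steps (state). A trace-prediction task would ask a model to produce such a sequence without running the code.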

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CodeAlchemy, a framework that applies five strategies (CodeEnhance, CodeQA, CodeDev, CodeDialogue, CodeTrace) to three public code corpora across 15 languages, generating over 850B tokens of synthetic and reasoning data. It reports that 3B models trained on this data reach 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on the introduced DevEval benchmark, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models up to 10x larger, while also showing that even large models score low (e.g., 5.6% exact match) on TraceEval.

Significance. If the performance attribution to the synthetic strategies is substantiated, the work would be significant for showing that large-scale, semantically targeted synthetic code data can close capability gaps for small models on code tasks and for introducing DevEval and TraceEval as diagnostics that expose limitations in current models' execution and developer-task understanding. The reported scale (1.3M+ instrumented files, 500B+ synthetic tokens) is a notable engineering contribution.

major comments (3)
  1. [§4] §4 (Experimental results) and abstract: The headline performance numbers for the 3B models are presented without any ablation that trains an identical 3B model on an equal volume of raw tokens drawn from the same three source corpora (without the five rewriting/QA/trace steps). This control is required to support the central claim that the strategies supply richer semantic signals rather than gains arising from token count or unstated training details.
  2. [§3.5] §3.5 (CodeTrace description): The claim that CodeTrace captures control flow, state tracking, and library knowledge rests on instrumenting 1.3M+ files, but the section provides no quantitative details on execution success rate, error filtering rules, or coverage statistics across the 5K libraries; without these, it is unclear whether the 350B reasoning tokens are high-quality or dominated by failed/partial traces.
  3. [§4.3] §4.3 (Benchmark construction): DevEval and TraceEval are introduced as new evaluations, yet the paper does not report inter-annotator agreement, contamination checks against the training corpora, or exclusion criteria for the test items; these omissions affect the interpretability of the 8.09% win rate and 15.36 ROUGE-2 scores.
minor comments (2)
  1. [§2] The abstract and §2 cite prior synthetic-data work but omit direct comparison tables showing token volumes or strategy differences relative to the largest prior code-synthesis efforts.
  2. [§3] Notation for the five strategies is introduced in §3 but not consistently referenced in the results tables, making it hard to map which strategy contributes to which benchmark gain.
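The statistics requested in major comment 2 could be derived from per-file execution records along the following lines. This is a sketch under assumed names: the `TraceRun` record, its `status` labels, and the sample data are hypothetical and do not describe the paper's actual logs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceRun:
    language: str
    library: str
    status: str  # hypothetical labels: "ok", "error", "timeout"


def summarize(runs):
    """Return overall success rate, per-language success rates, and covered libraries."""
    total = Counter(r.language for r in runs)
    ok = Counter(r.language for r in runs if r.status == "ok")
    per_language = {lang: ok[lang] / n for lang, n in total.items()}
    overall = sum(ok.values()) / len(runs) if runs else 0.0
    # A library counts as covered only if at least one run using it succeeded.
    covered = {r.library for r in runs if r.status == "ok"}
    return overall, per_language, covered


runs = [
    TraceRun("python", "numpy", "ok"),
    TraceRun("python", "requests", "timeout"),
    TraceRun("rust", "serde", "ok"),
    TraceRun("rust", "serde", "error"),
]
overall, per_language, covered = summarize(runs)
print(overall, per_language, sorted(covered))
```

Reporting the success rate per language rather than only in aggregate would show whether the trace data is dominated by a few languages that are easy to execute.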

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for strengthening the attribution of gains to our synthetic strategies, the documentation of CodeTrace, and the validation of the new benchmarks. We address each major comment below and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental results) and abstract: The headline performance numbers for the 3B models are presented without any ablation that trains an identical 3B model on an equal volume of raw tokens drawn from the same three source corpora (without the five rewriting/QA/trace steps). This control is required to support the central claim that the strategies supply richer semantic signals rather than gains arising from token count or unstated training details.

    Authors: We agree that this ablation is essential to isolate the contribution of the five strategies from raw token volume or training details. In the revised manuscript, we will add results from training an identical 3B model on an equivalent volume of raw tokens from the three source corpora using the same training configuration. This will directly test the claim that the rewriting, QA, developer-task, dialogue, and trace strategies provide richer semantic signals. revision: yes

  2. Referee: [§3.5] §3.5 (CodeTrace description): The claim that CodeTrace captures control flow, state tracking, and library knowledge rests on instrumenting 1.3M+ files, but the section provides no quantitative details on execution success rate, error filtering rules, or coverage statistics across the 5K libraries; without these, it is unclear whether the 350B reasoning tokens are high-quality or dominated by failed/partial traces.

    Authors: We acknowledge that quantitative details on execution success, filtering, and coverage are needed to substantiate the quality of the traces. We will expand §3.5 in the revision to include these statistics (success rates, error filtering rules, and library coverage across the 5K libraries), which were generated during the instrumentation process but omitted for space. This will clarify that the 350B reasoning tokens are not dominated by failed traces. revision: yes

  3. Referee: [§4.3] §4.3 (Benchmark construction): DevEval and TraceEval are introduced as new evaluations, yet the paper does not report inter-annotator agreement, contamination checks against the training corpora, or exclusion criteria for the test items; these omissions affect the interpretability of the 8.09% win rate and 15.36 ROUGE-2 scores.

    Authors: We agree these details are required for full interpretability of the new benchmarks. In the revised §4.3, we will report inter-annotator agreement scores, describe the contamination checks performed against the training corpora, and specify the exclusion criteria used for test items. These additions will strengthen confidence in the reported DevEval win rate and TraceEval ROUGE-2 scores. revision: yes
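A contamination check of the kind discussed in the third exchange is often done with n-gram overlap between test items and training documents. The sketch below is one simple variant, assuming whitespace tokenization and a small `n` for the toy example. The function names and sample strings are hypothetical, and the paper's own procedure is not described here.

```python
def ngrams(text: str, n: int = 8):
    """Set of whitespace-token n-grams in the text (empty if the text is too short)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_item: str, training_docs, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in any training document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)


training_docs = ["def add(a, b): return a + b"]
near_copy = "def add(a, b): return a - b"
unrelated = "print('hello world') # unrelated text here"

print(contamination_rate(near_copy, training_docs, n=3))  # 0.6
print(contamination_rate(unrelated, training_docs, n=3))  # 0.0
```

Test items above a chosen overlap threshold would be excluded or reported separately. The threshold and `n` are design choices that a revised §4.3 would need to state.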

Circularity Check

0 steps flagged

No significant circularity; purely empirical claims without derivations or self-referential reductions

Full rationale

The paper describes an empirical pipeline for generating 500B+ synthetic tokens via five named strategies (CodeEnhance, CodeQA, etc.) applied to public corpora, followed by model training and benchmark reporting (HumanEval 83.5%, MBPP 63.2%, etc.). No equations, parameter fits, uniqueness theorems, or derivation chains appear in the provided text. Performance attribution is presented as an experimental outcome rather than a quantity defined in terms of the input data or prior self-citations. The central premise (synthetic rewriting supplies richer signals) is an empirical hypothesis tested via reported results, not a self-definitional or fitted-input reduction. This matches the default expectation of no circularity for non-derivational empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the unstated premise that the five synthetic strategies produce high-quality semantically rich data whose benefits are not artifacts of benchmark construction or training procedure; no explicit free parameters, standard mathematical axioms, or new postulated entities are named in the abstract.

