Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration
Pith reviewed 2026-06-29 09:28 UTC · model grok-4.3
The pith
Adaptive routing reduces input tokens by 8.72% and output tokens by 5.49% in LLM Oracle-to-PostgreSQL migration while keeping semantic match at 88.40%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Token optimization in LLM-based Oracle2PostgreSQL migration must be evaluated as a constrained transformation problem balancing cost, syntactic validity, semantic preservation, and structural fidelity. Among the twelve strategies tested, adaptive routing provides the best practical trade-off, reducing input tokens by 8.72% and output tokens by 5.49% while maintaining 88.40% Semantic Match and increasing Token Efficiency by 6.67%. Aggressive schema distillation increases Token Efficiency by 132.22% but decreases Semantic Match by 44.50 percentage points.
What carries the argument
Adaptive routing, the strategy that dynamically selects among optimization paths according to query characteristics to balance token reduction against semantic preservation.
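A minimal sketch of what such routing could look like, assuming simple keyword-based query features. The paper's actual routing heuristics are not given on this page, so the regular expressions, the `prune_threshold` value, and the mapping from features to strategies are all hypothetical; only the strategy names come from the abstract.

```python
import re

# Hypothetical feature detectors; the paper's real routing rules are not shown here.
PROCEDURAL = re.compile(r"\b(DECLARE|BEGIN|EXCEPTION|CURSOR|LOOP)\b", re.IGNORECASE)
SIMPLE_DDL = re.compile(
    r"^\s*(CREATE\s+TABLE|ALTER\s+TABLE|CREATE\s+INDEX)\b", re.IGNORECASE
)


def route(sql: str, schema_context_tokens: int, prune_threshold: int = 2000) -> str:
    """Pick an optimization strategy name for one Oracle artefact.

    schema_context_tokens is the (assumed) token count of the schema text
    that would otherwise be sent along with the query.
    """
    if PROCEDURAL.search(sql):
        # Procedural logic is the riskiest to compress: keep the full context.
        return "baseline"
    if SIMPLE_DDL.match(sql):
        # Simple DDL tolerates a condensed schema representation.
        return "schema_distillation"
    if schema_context_tokens > prune_threshold:
        # Large schema context: drop parts not needed by this query.
        return "context_pruning"
    # Small, plain queries: cheap whitespace and comment removal is enough.
    return "minification"


if __name__ == "__main__":
    print(route("SELECT * FROM orders", schema_context_tokens=5000))
```

The ordering of the checks encodes a conservative bias: the most lossy path is only reachable for inputs that carry no procedural markers.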
If this is right
- Mild context pruning preserves semantic quality nearly at the baseline level, reaching 89.75% Semantic Match on the 100-query sample.
- Aggressive schema distillation raises Token Efficiency substantially but produces a 44.50-percentage-point drop in Semantic Match.
- Direct inclusion of large schema and procedural artefacts raises cost and risks quality degradation, so selective strategies are required.
- Token optimization must account for dialect-specific semantic differences and the risk of drift during transformation.
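The difference between percentage points and percent matters when reading these figures. The sketch below works the arithmetic, assuming the 44.50-point drop is measured against the 89.80% baseline reported for the 100-query sample; the page does not state the distilled score directly, so the derived values are illustrative.

```python
def after_point_drop(baseline_pct: float, drop_points: float) -> float:
    """Score left after an absolute drop expressed in percentage points."""
    return baseline_pct - drop_points


def relative_drop_pct(baseline_pct: float, drop_points: float) -> float:
    """The same drop expressed as a percentage of the baseline score."""
    return drop_points / baseline_pct * 100.0


baseline_sm = 89.80  # baseline Semantic Match, 100-query sample (from the abstract)
drop_pp = 44.50      # reported drop for aggressive schema distillation

distilled_sm = after_point_drop(baseline_sm, drop_pp)  # assumed reference point
relative = relative_drop_pct(baseline_sm, drop_pp)

print(f"{distilled_sm:.2f}% Semantic Match, a {relative:.2f}% relative drop")
```

Under that assumption the distilled Semantic Match is 45.30%, which is a relative loss of about 49.55% of the baseline score.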
Where Pith is reading between the lines
- The same routing logic could be tested on migrations involving other SQL dialects or non-SQL codebases.
- Production systems would likely need to combine automated routing with human review to catch cases where metrics miss semantic errors.
- Future experiments could measure whether the observed token savings hold when the input includes full stored procedures rather than isolated queries.
Load-bearing premise
That automated metrics such as Semantic Match accurately reflect true semantic preservation in the migrated queries and that results from samples of 10 and 100 queries generalize to full production workloads.
What would settle it
Apply the top-performing strategies to a production-scale collection of Oracle queries, then compare the automated Semantic Match scores against manual expert verification of whether the PostgreSQL output preserves the original query intent and behavior.
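One way to run that comparison is to reduce both the automated score and the expert verdict to a pass/fail label per query and measure how often they agree. The sketch below computes raw agreement and Cohen's kappa; the binary labels and the idea of thresholding Semantic Match into pass/fail are assumptions made for illustration, not part of the paper.

```python
def agreement_and_kappa(auto: list[int], expert: list[int]) -> tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two binary label lists."""
    if len(auto) != len(expert) or not auto:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(auto)
    observed = sum(a == e for a, e in zip(auto, expert)) / n
    p_auto = sum(auto) / n      # share of queries the metric marks as preserved
    p_expert = sum(expert) / n  # share of queries the expert marks as preserved
    # Agreement expected by chance if the two raters were independent.
    expected = p_auto * p_expert + (1 - p_auto) * (1 - p_expert)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa


if __name__ == "__main__":
    # Hypothetical labels: 1 = semantics preserved, 0 = not preserved.
    auto_labels = [1, 1, 0, 1, 0, 1]
    expert_labels = [1, 0, 0, 1, 0, 1]
    obs, kappa = agreement_and_kappa(auto_labels, expert_labels)
    print(f"agreement={obs:.2f}, kappa={kappa:.2f}")
```

A kappa well below the raw agreement would indicate that much of the apparent agreement comes from both raters passing most queries, which is the likely situation when Semantic Match sits near 89%.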
Original abstract
LLMs are increasingly used for software modernization, code translation, and database migration. However, LLM-based Oracle2PostgreSQL migration remains constrained by high token consumption, long-context degradation, dialect-specific semantic differences, and the risk of semantic drift during query transformation. Direct inclusion of large Oracle SQL/PL-SQL artefacts, schema definitions, procedural logic, and migration instructions into the model context increases cost and may reduce generation quality. This paper shows token optimization as a constrained transformation problem in LLM-based Oracle2PostgreSQL migration. The study formalizes and evaluates twelve token optimization strategies: baseline representation, context pruning, minification, DSL-based semantic compression, metadata augmentation, context refactoring, schema distillation, adaptive routing, AST-based minification, identifier masking, output constraint enforcement, and hybrid optimization. The strategies are evaluated on samples of 10 and 100 Oracle SQL queries using Valid Syntax Rate, Exact Match, Semantic Match, CodeBLEU, and Token Efficiency. The results show that mild context pruning preserves semantic quality almost at the baseline level, achieving 89.75% Semantic Match on the 100-query sample compared with 89.80% for the unoptimized baseline. Adaptive routing provides the best practical trade-off, reducing input tokens by 8.72% and output tokens by 5.49% while maintaining 88.40% Semantic Match and increasing Token Efficiency by 6.67%. Aggressive schema distillation increases Token Efficiency by 132.22% but results in a 44.50-percentage-point decrease in Semantic Match. The findings demonstrate that token optimization cannot be treated as simple prompt shortening; it must be evaluated as a multi-objective migration problem balancing cost, syntactic validity, semantic preservation, and structural fidelity.
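To make one of the listed strategies concrete, here is a minimal sketch of comment and whitespace minification for a SQL string. It is not the paper's implementation: the tokenizing regular expression is an assumption, and the whitespace-split token count is only a rough stand-in for a real model tokenizer.

```python
import re

# Match literals and quoted identifiers first so that "--" or "/*" inside
# them is never mistaken for a comment.
_PIECE = re.compile(
    r"""
      (?P<string>'(?:[^']|'')*')     # single-quoted literal, '' escapes a quote
    | (?P<quoted>"[^"]*")            # double-quoted identifier
    | (?P<line_comment>--[^\n]*)     # comment running to end of line
    | (?P<block_comment>/\*.*?\*/)   # block comment
    | (?P<space>\s+)                 # any run of whitespace
    | (?P<other>.)                   # everything else, one character at a time
    """,
    re.VERBOSE | re.DOTALL,
)


def minify_sql(sql: str) -> str:
    """Drop comments and collapse whitespace outside quoted text."""
    out: list[str] = []
    for match in _PIECE.finditer(sql):
        if match.lastgroup in ("line_comment", "block_comment", "space"):
            # Replace with a single separator, never two in a row.
            if out and out[-1] != " ":
                out.append(" ")
        else:
            out.append(match.group())
    return "".join(out).strip()


def rough_token_count(text: str) -> int:
    """Crude proxy for token count; a real tokenizer will differ."""
    return len(text.split())


if __name__ == "__main__":
    original = "SELECT  a, -- pick a\n  'x -- y'  /* note */ FROM t"
    print(minify_sql(original))
```

Because quoted text is matched before comments, the literal `'x -- y'` survives unchanged while the real comments are removed.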
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates twelve token optimization strategies (baseline, context pruning, minification, DSL compression, metadata augmentation, refactoring, schema distillation, adaptive routing, AST minification, identifier masking, output constraints, hybrid) for LLM-based Oracle-to-PostgreSQL migration. It reports empirical results on samples of 10 and 100 queries using Valid Syntax Rate, Exact Match, Semantic Match, CodeBLEU, and Token Efficiency, concluding that mild pruning nearly preserves baseline semantic quality (89.75% vs 89.80% Semantic Match) while adaptive routing offers the best trade-off (8.72% input / 5.49% output token reduction, 88.40% Semantic Match, +6.67% Token Efficiency) and aggressive distillation boosts efficiency at the cost of semantic match.
Significance. If the automated metrics prove reliable proxies for migration correctness, the systematic comparison of twelve strategies supplies actionable guidance on balancing token cost against syntactic validity and semantic fidelity in LLM-driven database modernization. The multi-strategy design and explicit multi-objective framing are strengths.
major comments (3)
- [Abstract / Evaluation] Semantic Match is reported at 88.40% for adaptive routing and 89.75% for mild pruning on the n=100 sample, yet the manuscript supplies no definition, computation procedure, or correlation study with human judgment or execution equivalence on actual PostgreSQL instances; without this, the central trade-off claim cannot be assessed.
- [Abstract] Results are presented on samples of 10 and 100 queries with no description of query selection criteria, stratification by procedural complexity or schema dependence, statistical significance tests, or error bars; this directly undermines the generalization to “real migration workloads” asserted in the abstract.
- [Abstract] The headline conclusion that adaptive routing is the “best practical trade-off” rests entirely on the unvalidated automated scores (Semantic Match, CodeBLEU, Token Efficiency); no execution-based equivalence checks or expert review of generated PostgreSQL code are reported, leaving the semantic-preservation component of the multi-objective claim unsupported.
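The error bars the second comment asks for can be approximated even on a 100-query sample. The sketch below computes a percentile bootstrap confidence interval for a mean metric; the per-query scores are hypothetical (88 passes and 12 failures), since the page does not say whether Semantic Match is scored per query as pass/fail or on a graded scale.

```python
import random
from statistics import mean


def bootstrap_ci(
    scores: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-query outcomes for a 100-query sample.
    scores = [1.0] * 88 + [0.0] * 12
    low, high = bootstrap_ci(scores)
    print(f"mean={mean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals of two strategies overlap heavily, a gap such as 89.75% versus 88.40% should not be read as a real difference without a larger sample.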
minor comments (2)
- The twelve strategies are listed but their precise implementations (e.g., exact pruning rules, DSL grammar, routing heuristics) are not cross-referenced to any appendix or repository, hindering reproducibility.
- Notation for the five metrics is introduced without explicit formulas or pseudocode, even though the paper positions the work as a constrained optimization problem.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and outline targeted revisions to improve transparency and rigor without altering the core empirical findings.
Point-by-point responses
- Referee: [Abstract / Evaluation] Semantic Match is reported at 88.40% for adaptive routing and 89.75% for mild pruning on the n=100 sample, yet the manuscript supplies no definition, computation procedure, or correlation study with human judgment or execution equivalence on actual PostgreSQL instances; without this, the central trade-off claim cannot be assessed.
  Authors: We agree that an explicit definition and computation procedure for Semantic Match are required. The manuscript relies on this metric without sufficient elaboration. We will add a new subsection under Evaluation that defines Semantic Match as the proportion of outputs preserving core semantic elements (table/column references, predicates, and result cardinality) via automated structural comparison, and we will cite the exact procedure used. We will also add an explicit limitations paragraph noting the absence of human correlation studies or PostgreSQL execution equivalence checks. These changes will make the trade-off claims assessable.
  revision: yes
- Referee: [Abstract] Results are presented on samples of 10 and 100 queries with no description of query selection criteria, stratification by procedural complexity or schema dependence, statistical significance tests, or error bars; this directly undermines the generalization to “real migration workloads” asserted in the abstract.
  Authors: We acknowledge the missing methodological details. The queries were sampled from a corpus of 500 representative Oracle statements, but selection criteria and any stratification were omitted. We will expand the Evaluation section to describe the sampling process, note the absence of formal statistical tests due to sample size, and include error bars on revised figures where variance can be computed. We will also revise the abstract to replace the phrase “real migration workloads” with “typical migration workloads” to avoid overgeneralization.
  revision: yes
- Referee: [Abstract] The headline conclusion that adaptive routing is the “best practical trade-off” rests entirely on the unvalidated automated scores (Semantic Match, CodeBLEU, Token Efficiency); no execution-based equivalence checks or expert review of generated PostgreSQL code are reported, leaving the semantic-preservation component of the multi-objective claim unsupported.
  Authors: We agree that the multi-objective claim would be stronger with runtime validation. The study deliberately uses established automated proxies (Semantic Match, CodeBLEU) drawn from the code-translation literature. We will revise the abstract and add a dedicated Limitations section that explicitly qualifies all semantic claims as being based on these automated scores rather than execution equivalence or expert review. This framing preserves the reported trade-off while making its evidential basis transparent; new execution experiments lie outside the current scope.
  revision: partial
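The first response describes Semantic Match as the share of outputs that preserve table references, predicates, and result cardinality under automated structural comparison. The sketch below illustrates that idea in a deliberately reduced form: it compares only table references and top-level `AND` predicates using regular expressions, and it skips result cardinality, which would need query execution. Every function name and pattern here is an assumption, not the authors' procedure.

```python
import re


def tables(sql: str) -> set[str]:
    """Table names that follow FROM or JOIN, lower-cased."""
    found = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", sql, re.IGNORECASE)
    return {name.lower() for name in found}


def predicates(sql: str) -> set[str]:
    """Top-level AND-separated WHERE conditions, whitespace-normalized."""
    match = re.search(
        r"\bWHERE\b(.*?)(?:\bGROUP\s+BY\b|\bORDER\s+BY\b|\bLIMIT\b|\bFETCH\b|$)",
        sql,
        re.IGNORECASE | re.DOTALL,
    )
    if not match:
        return set()
    parts = re.split(r"\bAND\b", match.group(1), flags=re.IGNORECASE)
    cleaned = (re.sub(r"\s+", "", part).rstrip(";").lower() for part in parts)
    return {part for part in cleaned if part}


def structurally_equal(reference: str, generated: str) -> bool:
    """True when both queries share the same tables and predicates."""
    return tables(reference) == tables(generated) and predicates(
        reference
    ) == predicates(generated)


def semantic_match(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (reference, generated) pairs that are structurally equal."""
    if not pairs:
        return 0.0
    return sum(structurally_equal(ref, gen) for ref, gen in pairs) / len(pairs)


if __name__ == "__main__":
    ref = "SELECT id FROM emp WHERE dept = 10 AND sal > 100"
    good = "select id\nfrom EMP where sal>100 and dept=10;"
    bad = "SELECT id FROM emp WHERE dept = 10"
    print(semantic_match([(ref, good), (ref, bad)]))
```

A comparison this shallow ignores subqueries, `OR` logic, and dialect-specific functions, which is exactly why the referee asks for validation against execution results or expert review.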
Circularity Check
No circularity: purely empirical comparison of strategies on external metrics
Full rationale
The paper reports experimental results from applying twelve token optimization strategies to samples of 10 and 100 Oracle queries, measuring outcomes with externally defined metrics (Valid Syntax Rate, Exact Match, Semantic Match, CodeBLEU, Token Efficiency). No equations, derivations, predictions, or first-principles claims appear that reduce to fitted parameters, self-definitions, or self-citation chains. All reported percentages (e.g., 8.72% input token reduction, 88.40% Semantic Match) are direct observations from the evaluation runs, not outputs forced by construction from the inputs. The work is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes.
Reference graph
Works this paper leans on
- [1] If maximum semantic preservation is required, use Baseline or Context Pruning.
- [2] If scalable migration is required with moderate token savings, use Adaptive Routing.
- [3] If the input is simple DDL with low procedural complexity, Distillation may be used cautiously.
- [4] If identifiers carry domain semantics, avoid Identifier Masking.
- [5] If output cost is the main constraint, use Prompt Restricted only after verifying VSR and SM.
- [6] Avoid applying aggressive compression uniformly across heterogeneous SQL/PL-SQL artefacts.
- [7] Always combine token metrics with semantic and syntactic metrics. CONCLUSIONS: This paper investigated token optimization strategies for LLM-based Oracle-to-PostgreSQL migration. The study formalized token optimization as a constrained transformation problem inside specification-driven development and evaluated twelve strategies using VSR, EM, SM, Code...