Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study

Mingwei Liu, Zheng Pei, Yanlin Wang, Zihao Wang, Zikang Li, Enci Lin, Xin Peng, Zibin Zheng

classification 💻 cs.SE

keywords data, framework, apikg4syn, code, development, fine-tuning, harmonyos, low-resource

In low-resource framework development (e.g., HarmonyOS), large language models (LLMs) often lack sufficient pre-training exposure, resulting in poor code generation performance. Although they generally preserve programming logic across languages, they frequently fail on framework-specific APIs and syntax, revealing a gap between learned algorithmic knowledge and unfamiliar framework conventions. Consequently, even advanced models such as GPT-4o struggle to produce correct code without prior exposure. To address these challenges, we propose APIKG4Syn, a framework that leverages API knowledge graphs to synthesize API-oriented question-code pairs without requiring executable environments. It incorporates both single-API and multi-API information, with the latter guided by uncertainty estimation (UE) and Monte Carlo Tree Search (MCTS), to construct high-quality fine-tuning data. For evaluation, we select HarmonyOS as a case study due to its accessible documentation and growing ecosystem, and build the first benchmark for HarmonyOS code generation. Experimental results show that fine-tuning Qwen2.5-Coder-7B on data synthesized by APIKG4Syn achieves a pass@1 of 25.00%, outperforming untuned GPT-4o (17.59%). We further observe that larger volumes of data generated by APIKG4Syn consistently lead to better fine-tuning performance, and that the optimal single-API to multi-API ratio is 8:2. Ablation studies also confirm the necessity and effectiveness of each component in our framework. These findings highlight the effectiveness of API-oriented data in enhancing LLM performance for low-resource software development scenarios.
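
The abstract reports results as pass@1. As a rough illustration of how such a score is commonly computed, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), averaged over problems. Whether the paper uses exactly this estimator or sampling setup is not stated in the abstract, so treat it as an assumption; the function names and the toy input are hypothetical and are not the paper's data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples for the problem
    c: number of those samples that pass all tests
    k: evaluation budget (k=1 gives pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k draw contains a passing sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical toy input: four problems, one sample each, one solved.
    toy = [(1, 1), (1, 0), (1, 0), (1, 0)]
    print(f"pass@1 = {mean_pass_at_k(toy, k=1):.2%}")
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose one generated solution passes, which is the simplest way to read the percentages quoted above.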

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.