Mining Useful General Data for Low-Resource Domain Adaptation

Hongcheng Liu; Pingjie Wang; Shuo Tang; Yanfeng Wang; Yaxin Du; Yusheng Liao; Yu Wang; Ziqing Fan

arxiv: 2511.07380 · v2 · pith:6RYCS34Nnew · submitted 2025-11-10 · 💻 cs.CL

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

keywords: data, domain, adaptation, general-domain, fine-tuning, low-resource, ntk-selector, useful

Adapting large language models (LLMs) to low-resource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question-answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation? Our initial findings show that general-domain chain-of-thought data contains useful auxiliary signals for domain adaptation, even without careful selection. This observation motivates a new paradigm for domain adaptation beyond exclusive reliance on domain-specific data. To systematically identify the most beneficial general-domain samples, we propose NTK-Selector, motivated by the Neural Tangent Kernel's ability to capture alignment in training dynamics. Since directly applying NTK to pretrained LLMs is impractical, we introduce a Jacobian-free NTK approximation and empirically demonstrate stable NTK-like behavior during fine-tuning. Extensive experiments across medical, financial, legal, and psychological domains demonstrate that NTK-Selector consistently outperforms domain-only fine-tuning and existing data selection baselines. In particular, NTK-Selector achieves gains of +8.7 and +5.1 points on Llama3-8B-Instruct and Qwen3-8B, respectively, compared to only +0.8 and +0.9 points from domain-only fine-tuning.
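
To make the selection idea concrete, here is a minimal sketch of kernel-based data scoring. It is not the paper's method: it uses the standard empirical NTK, k(x, x') = <grad_theta f(x), grad_theta f(x')>, computed with explicit gradients on a toy two-layer network, whereas the abstract describes a Jacobian-free approximation for pretrained LLMs whose details are not given on this page. The toy model, the function names `param_gradient`, `ntk`, and `select_top_k`, and the mean-kernel scoring rule are all assumptions for illustration.

```python
import math
import random


def param_gradient(x, W, a):
    """Gradient of the toy model f(x) = sum_j a[j] * tanh(W[j] . x)
    with respect to all parameters, flattened into one list."""
    grad = []
    for w_j, a_j in zip(W, a):
        h = math.tanh(sum(wi * xi for wi, xi in zip(w_j, x)))
        grad.append(h)  # df/da_j
        grad.extend(a_j * (1.0 - h * h) * xi for xi in x)  # df/dW_j
    return grad


def ntk(x1, x2, W, a):
    """Empirical NTK entry: inner product of the two parameter gradients."""
    g1 = param_gradient(x1, W, a)
    g2 = param_gradient(x2, W, a)
    return sum(u * v for u, v in zip(g1, g2))


def select_top_k(general, domain, W, a, k):
    """Score each general-domain sample by its mean kernel value against
    the domain samples (an assumed scoring rule) and return the indices
    of the k highest-scoring samples plus all scores."""
    scores = [
        sum(ntk(g, d, W, a) for d in domain) / len(domain)
        for g in general
    ]
    ranked = sorted(range(len(general)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: 3-dimensional inputs, 4 hidden units.
rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
a = [rng.gauss(0.0, 1.0) for _ in range(4)]
domain_pool = [[rng.gauss(1.0, 0.3) for _ in range(3)] for _ in range(5)]
general_pool = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20)]

chosen, all_scores = select_top_k(general_pool, domain_pool, W, a, k=5)
print("selected general-domain indices:", chosen)
```

A high kernel value means a gradient step on the general-domain sample moves the model's output on the domain sample in the same direction, which is the sense of "alignment in training dynamics" this sketch assumes. Explicit per-sample gradients are only feasible for a model this small, which is consistent with the abstract's remark that applying NTK directly to pretrained LLMs is impractical.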

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Gradient-Flow Optimization as Dynamic Random-Effects Inference: Testing and Early Stopping with Applications to Deep Learning
stat.ML · 2026-05 · unverdicted · novelty 7.0

Fixed-operator squared-error gradient flow is exactly equivalent to the empirical-Bayes posterior mean under a matching random-effects model, enabling REML-based early stopping with asymptotic prediction optimality gu...

On the Difficulty of Learning a Meta-network for Training Data Selection
cs.LG · 2026-05 · unverdicted · novelty 4.0

Analysis of MTS reveals poor GSNR and uninformative features; larger batches and distribution-based features yield 5.49% and 2.89% gains on benchmarks.