Few-shot fine-tuning vs
6 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (6)
citing papers explorer
- Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
  A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
- Emergence of Context Characteristics Sensitivity in Large Language Models
  Experiments on four models and three datasets show SFT increases sensitivity to easy contexts while later stages (DPO, RLVR) can reinforce or reverse those preferences depending on the dataset.
- Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
  An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
- Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification
  On a controlled Turkish dataset of 147 examples, few-shot prompting lets some LLMs match or beat a supervised BERT baseline for LVC detection, though results are highly sensitive to prompt design.
- Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model
  LLaMA 3.1 extracts visual rating scores from Dutch neuroradiology reports with 87-96% balanced accuracy but only 66-80% on numerical counts, with few-shot prompting raising the latter to 81-92%.
- Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation