pith. machine review for the scientific record.

arxiv: 2604.17794 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI


Bridging the Reasoning Gap in Vietnamese with Small Language Models via Test-Time Scaling

Bui Nguyen Quoc Trinh, Bui The Trung, Do Minh Duc, Nguyen Van Vinh


Pith reviewed 2026-05-10 04:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords small language models · Vietnamese · supervised fine-tuning · reasoning gap · chain-of-thought · elementary mathematics · test-time scaling · LLM judge

The pith

Supervised fine-tuning unlocks coherent reasoning explanations in small Vietnamese language models with a 77% quality gain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Small language models often hold latent knowledge for tasks like Vietnamese elementary math but struggle to express it coherently. This paper shows that supervised fine-tuning on a localized, high-fidelity dataset acts as the key unlocker for turning raw calculations into clear, pedagogical explanations. It also demonstrates that simple chain-of-thought prompting with self-consistency outperforms more structured agentic approaches, because the latter overload the model's limited capacity. These results offer a practical path for deploying capable reasoning on resource-constrained devices in non-English languages.

Core claim

The Qwen3-1.7B base model shows robust latent knowledge (Accuracy: 4.05/5.00) yet suffers from a severe formatting gap in communication. Supervised Fine-Tuning on the Vi-S1K dataset functions as a critical reasoning unlocker, producing a 77% improvement in Explanation Quality and connecting raw calculation to pedagogical coherence. Prompting analysis finds that frameworks like ReAct impose a cognitive tax on small capacity, while pure Chain-of-Thought combined with Self-Consistency performs better, establishing SFT with simplified test-time scaling as the preferred strategy for edge deployment.

What carries the argument

Supervised Fine-Tuning (SFT) on the Vi-S1K localized reasoning dataset, which acts as a reasoning unlocker to convert latent calculation ability into coherent explanations.

Load-bearing premise

The LLM-as-a-Judge protocol reliably measures explanation quality and pedagogical coherence without introducing bias from the judge model or translation artifacts in the dataset.

What would settle it

A blind human evaluation of explanation samples from the base model and the SFT model on the Vi-Elementary-Bench to determine whether the reported 77% gain in quality holds under human assessment.
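One way to make such a blind evaluation quantitative is to measure agreement between the LLM judge and human raters, for instance with Cohen's kappa over binned quality scores. A self-contained sketch — the rater data below is invented for illustration, not taken from the paper:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n               # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative only: LLM-judge vs. human scores binned into low/mid/high.
judge = ["high", "high", "mid", "low", "high", "mid"]
human = ["high", "mid", "mid", "low", "high", "mid"]
print(round(cohens_kappa(judge, human), 3))  # → 0.739
```

A kappa well above chance on such binned scores would support the judge axiom; a low kappa would undercut the reported 77% gain.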

Figures

Figures reproduced from arXiv: 2604.17794 by Bui Nguyen Quoc Trinh, Bui The Trung, Do Minh Duc, Nguyen Van Vinh.

Figure 1
Figure 1. The automated data construction pipeline for Vi-S1K. The process leverages Gemini 2.5 Flash-Lite with context-aware system prompts to preserve chain-of-thought reasoning, followed by rigorous terminology normalization and cultural adaptation layers to ensure pedagogical alignment with the Vietnamese educational curriculum.
Figure 3
Figure 3. The effect of Test-Time Scaling via Self-Consistency. Both models benefit from increasing the number of reasoning paths (k), with the fine-tuned model consistently outperforming the base model across all configurations.
Figure 2
Figure 2. Comparison of Accuracy and Explanation Quality between the base Qwen3-1.7B and the fine-tuned model. While Accuracy sees a modest gain (+0.50), Explanation Quality improves drastically (+2.00), demonstrating the "reasoning unlocker" effect of the Vi-S1K dataset. Scores do not reach the ceiling (e.g., 4.9 or 5.0); the model still exhibits occasional hallucinations in complex logic puzzles.
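The figure's numbers allow a sanity check on the headline claim: a +2.00 absolute gain corresponds to a 77% relative improvement only if the base Explanation Quality score was roughly 2.6/5. That base value is inferred here, not stated on this page; the check assumes "77% improvement" means relative gain over the base score.

```python
# Hypothetical base score implied by the reported +2.00 gain and 77% figure.
base_quality = 2.6          # inferred, not stated in the source
absolute_gain = 2.0         # reported in Figure 2
relative_gain = absolute_gain / base_quality
print(f"{relative_gain:.0%}")  # → 77%
```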
Original abstract

The democratization of ubiquitous AI hinges on deploying sophisticated reasoning capabilities on resource-constrained devices. However, Small Language Models (SLMs) often face a "reasoning gap", particularly in non-English languages like Vietnamese, where they struggle to maintain coherent chains of thought. This paper investigates Test-Time Scaling strategies for the Qwen3-1.7B architecture within the context of Vietnamese Elementary Mathematics. We introduce Vi-S1K, a high-fidelity reasoning dataset localized via a Gemini 2.5 Flash-Lite powered pipeline, and Vi-Elementary-Bench, a dual-resource benchmark for rigorous evaluation. Using an LLM-as-a-Judge protocol, we reveal that the base model possesses robust latent knowledge (Accuracy: 4.05/5.00) but suffers from a severe "formatting gap" in communication. Supervised Fine-Tuning (SFT) acts as a critical "reasoning unlocker", yielding a 77% improvement in Explanation Quality and bridging the gap between raw calculation and pedagogical coherence. Furthermore, our analysis of prompting strategies uncovers a significant trade-off: structured frameworks like ReAct impose a "cognitive tax" on the 1.7B parameter capacity, degrading performance relative to pure Chain-of-Thought (CoT) combined with Self-Consistency. These findings establish a deployment hierarchy for SLMs, demonstrating that SFT combined with simplified test-time scaling is superior to complex agentic workflows for edge-based reasoning.
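The CoT + Self-Consistency strategy the abstract favors is simple to state: sample k independent reasoning chains at nonzero temperature and majority-vote over the final answers. A minimal sketch — the `solve_once` stub and its answer distribution are invented stand-ins for an actual model call:

```python
from collections import Counter
import random

def solve_once(problem: str, seed: int) -> int:
    """Stand-in for one sampled chain-of-thought: a real system would decode
    the model at temperature > 0 and parse the final numeric answer."""
    random.seed(seed)
    # Invented answer distribution: mostly the correct answer (8), sometimes not.
    return 8 if random.random() > 0.3 else random.choice([7, 9])

def self_consistency(problem: str, k: int) -> int:
    """Sample k reasoning paths and majority-vote over their final answers."""
    answers = [solve_once(problem, seed) for seed in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("Vietnamese elementary age puzzle", k=8))  # → 8
```

Increasing k trades compute for reliability, which is exactly the axis Figure 3 sweeps.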

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. This paper explores test-time scaling strategies for the Qwen3-1.7B small language model on Vietnamese elementary mathematics tasks. It introduces the Vi-S1K dataset, created through a Gemini-powered localization pipeline, and the Vi-Elementary-Bench benchmark. The authors claim that the base model demonstrates high latent knowledge (4.05/5) but poor communication skills, and that supervised fine-tuning (SFT) serves as a 'reasoning unlocker' providing a 77% improvement in explanation quality according to an LLM-as-a-Judge evaluation. They also report that simpler prompting like Chain-of-Thought with Self-Consistency outperforms complex agentic approaches such as ReAct due to a 'cognitive tax' on small models.

Significance. If the reported improvements are robustly validated, the work would be significant for advancing reasoning capabilities in small models for non-English languages. It offers practical insights into when SFT is beneficial versus relying on test-time methods, and the new dataset and benchmark could facilitate further research in multilingual AI. The emphasis on deployment hierarchies for resource-constrained settings addresses an important gap in the field.

major comments (3)
  1. [Abstract] The 77% improvement in Explanation Quality is a key result, but the abstract does not specify the LLM judge model, the detailed rubric for assessing 'pedagogical coherence', any human validation or inter-annotator agreement, or statistical tests. This information is necessary to substantiate the claim that SFT bridges the gap between raw calculation and coherent explanations.
  2. [Abstract] Details on experimental setup are missing, including the baselines compared against, the train/test splits for Vi-S1K, the size of the dataset, and how Vi-Elementary-Bench is constructed. These omissions make it difficult to evaluate the reliability of the performance trade-offs between prompting strategies.
  3. [Abstract] The use of Gemini 2.5 Flash-Lite for localizing the dataset raises questions about potential stylistic bias in the evaluation if the judge model shares similar characteristics; the paper should address whether the judge favors outputs aligned with the localization style.
minor comments (1)
  1. The abstract employs colloquial phrases such as 'reasoning unlocker' and 'cognitive tax'; these should be defined more formally in the introduction or methods section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of clarity and rigor in our presentation. We address each major comment point by point below, committing to revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The 77% improvement in Explanation Quality is a key result, but the abstract does not specify the LLM judge model, the detailed rubric for assessing 'pedagogical coherence', any human validation or inter-annotator agreement, or statistical tests. This information is necessary to substantiate the claim that SFT bridges the gap between raw calculation and coherent explanations.

    Authors: We agree that the abstract, due to length constraints, omits key methodological details supporting this central claim. The LLM judge model, rubric for pedagogical coherence, human validation procedures, inter-annotator agreement, and statistical tests are described in full in Section 4.2 and the appendices. We will revise the abstract to name the judge model and briefly reference the evaluation protocol, directing readers to the main text for complete substantiation. revision: yes

  2. Referee: [Abstract] Details on experimental setup are missing, including the baselines compared against, the train/test splits for Vi-S1K, the size of the dataset, and how Vi-Elementary-Bench is constructed. These omissions make it difficult to evaluate the reliability of the performance trade-offs between prompting strategies.

    Authors: We acknowledge that the abstract lacks these experimental details, which are essential for assessing the reported trade-offs. The baselines, Vi-S1K train/test splits, dataset size, and Vi-Elementary-Bench construction are fully specified in Sections 3.1, 3.2, and 4.1. We will update the abstract with a concise summary of the setup and key elements to improve evaluability while maintaining brevity. revision: yes

  3. Referee: [Abstract] The use of Gemini 2.5 Flash-Lite for localizing the dataset raises questions about potential stylistic bias in the evaluation if the judge model shares similar characteristics; the paper should address whether the judge favors outputs aligned with the localization style.

    Authors: This concern about possible stylistic bias is valid and merits explicit treatment. Localization used Gemini 2.5 Flash-Lite, while the judge follows a distinct model and rubric centered on reasoning content and coherence. We will add a dedicated paragraph in the Limitations section of the revised manuscript to discuss this potential issue, describe mitigation steps via prompt design, and report any relevant sensitivity analyses. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical measurements support the claims.

Full rationale

The paper's derivation proceeds from dataset construction (Vi-S1K via Gemini pipeline), benchmark creation (Vi-Elementary-Bench), model training (SFT on Qwen3-1.7B), and evaluation via LLM-as-a-Judge protocol. The 77% Explanation Quality improvement is reported as a measured delta between base-model and SFT outputs on the same judge rubric, not a quantity presupposed by definition or obtained by fitting a parameter that is then renamed as a prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core result. The chain from inputs (dataset, models, prompts) to outputs (accuracy and quality scores) remains externally falsifiable against the benchmark and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the unverified reliability of the LLM judge and the fidelity of the Gemini-localized dataset; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: the LLM-as-a-Judge protocol produces unbiased and accurate scores for explanation quality and pedagogical coherence
    Used to quantify the 77% improvement and base-model latent knowledge score of 4.05/5.00
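This axiom is load-bearing because every headline number is an average over the judge's rubric. A sketch of the scoring arithmetic — the criterion names follow the paper's appendix, but the example scores are invented:

```python
# Five criteria from the paper's LLM-as-a-Judge rubric, each scored 1-5.
# The example scores below are illustrative only.
scores = {
    "accuracy": 4,
    "completeness": 4,
    "explanation": 3,
    "argumentation": 4,
    "cultural_context": 5,
}
overall = sum(scores.values()) / len(scores)
print(f"overall: {overall:.2f}/5")  # → overall: 4.00/5
```

Any systematic judge bias on even one criterion shifts every reported aggregate, which is why the referee's third major comment matters.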

pith-pipeline@v0.9.0 · 5571 in / 1202 out tokens · 47129 ms · 2026-05-10T04:32:05.692603+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 6 canonical work pages · 5 internal anchors

  1. [1]

    Scaling laws for neural language models,

    J. Kaplan et al., “Scaling laws for neural language models,” arXiv: arXiv:2001.08361, 2020.

    (Entries [2]–[4] are fragments of a worked example from the paper, not external works. Problem: “Currently the brother is 15 years old. Last year, the brother was twice as old as the younger sibling. How old is the sibling this year?” Solution: last year the brother was 15 − 1 = 14, so the sibling was 14 ÷ 2 = 7; this year the sibling is 7 + 1 = 8. The accompanying analysis notes that under ReAct the model identifies the logic but fails to maintain the required Thought/Action/Observation structure, leading to “context crowding” and generation failure, while the fine-tuned CoT model focuses purely on the arithmetic.)

  5. [5]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Adv. Neural Inf. Process. Syst., vol. 35, pp. 24824–24837, 2022

  6. [6]

    s1: Simple test-time scaling,

    N. Muennighoff et al., “s1: Simple test-time scaling,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 20286–20332. https://aclanthology.org/2025.emnlp-main.1025/

  7. [8]

    ReAct: Synergizing reasoning and acting in language models,

    S. Yao et al., “ReAct: Synergizing reasoning and acting in language models,” in The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=WE_vluYUL-X

  8. [9]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Oct. 16, 2021, arXiv: arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685

  9. [10]

    Language Models are Few-Shot Learners

    T. B. Brown et al., “Language Models are Few-Shot Learners,” July 22, 2020, arXiv: arXiv:2005.14165. doi: 10.48550/arXiv.2005.14165

  10. [11]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    X. Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” Mar. 07, 2023, arXiv: arXiv:2203.11171. doi: 10.48550/arXiv.2203.11171

  11. [12]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team et al., “Gemini: A Family of Highly Capable Multimodal Models,” May 09, 2025, arXiv: arXiv:2312.11805. doi: 10.48550/arXiv.2312.11805

  12. [13]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    L. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Dec. 24, 2023, arXiv: arXiv:2306.05685. doi: 10.48550/arXiv.2306.05685

  13. [14]–[23]

    Fragments of the paper's Appendix A — the system prompt (presented in Vietnamese) used to instruct Gemini 2.5 Flash-Lite as the LLM judge — captured by extraction as references rather than external works. Translated, the rubric scores each solution from 1 to 5 on five criteria:

    • Accuracy: compare the final result with the reference answer, judging both the solution process and the final result (5 = fully correct result and method; 4 = correct result with minor slips that do not change the answer or cause serious misunderstanding; 3 = correct result but flawed method; …).
    • Completeness: all requirements of the problem addressed, no sub-questions skipped, clear conclusion (5 = everything addressed; 4 = a small detail or the conclusion slightly unclear without losing the main point; …).
    • Explanation: clear, coherent wording; correct and natural Vietnamese mathematical terminology; every step understandable to an elementary-school student.
    • Argumentation / logical structure: tight reasoning throughout, no contradictions, no hallucinations, no wrong formulas (4 = mostly sound logic with a few under-explained or skipped steps that do not lead to serious error).
    • Cultural context: presentation matching Vietnamese elementary textbook and classroom conventions, e.g. “Bài giải: …” (“Solution: …”), “Vậy số … là: …” (“So the number of … is: …”); 4 = mostly appropriate with a few slightly “Western” phrasings.

    The prompt closes by requesting each criterion score (?/5), an average score, and a brief justification per criterion. A running header in these fragments identifies the venue: National Scientific Conference on Artificial Intelligence (FJCAI), Cần Thơ, 27–28 March 2026.