Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation

Deshan Sumanathilaka; Lakshan Cooray; Pattigadapa Venkatesh Raju

arxiv: 2602.00665 · v3 · submitted 2026-01-31 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-16 08:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords small language models · multi-turn question answering · customer service · context summarization · dialogue continuity · synthetic data evaluation · instruction-tuned models · comparative study

The pith

Some small language models achieve near-large-model performance on multi-turn customer-service QA using context summaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates whether small language models can effectively handle multi-turn question answering in customer service scenarios when provided with summarized conversation history. It tests nine instruction-tuned SLMs against three commercial LLMs on synthetic data using similarity metrics and qualitative analysis. The findings indicate significant variation among SLMs, with some performing close to LLMs in maintaining dialogue continuity and context, while others fall short. This matters because SLMs offer a computationally efficient alternative to LLMs for practical deployments where resources are limited. A conversation stage-based analysis further reveals how models behave across different phases of interactions.

Core claim

The study demonstrates that instruction-tuned low-parameterized SLMs exhibit notable variation in their ability to handle context-summarized multi-turn customer-service QA. Some models achieve performance levels comparable to commercial LLMs in preserving essential conversational state and responding appropriately, as measured by lexical and semantic similarity metrics along with human and LLM-as-judge evaluations. Others struggle with dialogue continuity and contextual alignment. The use of a history summarization strategy is central to enabling this evaluation.
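The lexical side of that measurement can be illustrated with a token-overlap F1 between a model response and a reference answer. This is a minimal sketch, not the paper's metric: the abstract does not name the specific lexical or semantic metrics used, and the `tokens` and `token_f1` helpers below are hypothetical. A semantic metric would additionally require an embedding model, which is left out here.

```python
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate response and a reference response."""
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only; these are not taken from the paper's dataset.
score = token_f1("Your refund is processed", "Refund processed")
print(round(score, 3))
```

A score of this kind rewards shared wording only, so two responses that agree in meaning but differ in phrasing can score low, which is presumably why the study pairs lexical metrics with semantic and judge-based ones.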

What carries the argument

A history summarization strategy that condenses prior dialogue turns to maintain essential conversational state for ongoing multi-turn interactions.
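As a rough illustration of the idea, the sketch below folds prior turns into a running summary and then builds the prompt from that summary plus the newest customer message, rather than from the full transcript. Everything here is an assumption for illustration: the paper's actual summarizer, prompt template, and turn format are not given in this review, and `naive_summarizer` is a hypothetical stand-in for a model-based summarizer.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def naive_summarizer(previous_summary: str, turn: Turn) -> str:
    """Hypothetical stand-in for a model call: appends a clipped note for each turn."""
    speaker, text = turn
    note = f"{speaker}: {text[:60]}"
    return f"{previous_summary} | {note}" if previous_summary else note


def build_prompt(
    history: List[Turn],
    question: str,
    summarize: Callable[[str, Turn], str] = naive_summarizer,
) -> str:
    """Fold prior turns into one running summary, then pair it with the new question."""
    summary = ""
    for turn in history:
        summary = summarize(summary, turn)
    return (
        "Conversation summary so far:\n"
        f"{summary or '(no prior turns)'}\n\n"
        f"Customer: {question}\n"
        "Agent:"
    )


# Illustrative dialogue only; not drawn from the paper's synthetic data.
history = [
    ("Customer", "My order 1234 arrived damaged."),
    ("Agent", "Sorry about that. Would you like a refund or a replacement?"),
]
prompt = build_prompt(history, "A replacement, please. How long will it take?")
print(prompt)
```

Passing the summarizer in as a callable keeps the prompt assembly independent of whichever model produces the summary, so the same loop could be reused when comparing summarization variants.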

If this is right

SLMs can serve as efficient alternatives in resource-constrained customer-service QA systems when they perform near LLM levels.
Model selection is critical as performance varies widely among instruction-tuned SLMs.
Qualitative analysis across conversation stages can identify specific weaknesses in dialogue handling.
History summarization preserves context sufficiently for some models to maintain coherence in extended interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deploying the stronger SLMs could lower computational costs and latency in real customer service applications.
Testing these models on actual user-generated dialogues rather than synthetic ones would provide stronger validation.
Similar summarization techniques might improve SLM performance in other multi-turn dialogue domains like technical support.
Further refinement of summarization could narrow gaps to full LLM performance in continuity tasks.

Load-bearing premise

The synthetic data and history summarization strategy used in the evaluation sufficiently represent the contextual demands and continuity challenges of real-world multi-turn customer-service interactions.

What would settle it

A direct comparison showing that even the best SLM scores substantially lower than LLMs on a dataset of real customer-service conversations would disprove the near-LLM performance claim.

Original abstract

Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Some small models get close to LLMs on this synthetic customer-service task, but the lack of real-data validation keeps the practical takeaway limited.

The paper's core finding is that a subset of instruction-tuned SLMs can match or approach commercial LLMs on multi-turn customer-service QA when conversation history is summarized, while others lose continuity and context. The authors run nine small models against three large ones using lexical and semantic metrics, plus human and LLM-as-judge scoring, and break results down by conversation stage. That stage analysis is a useful addition because it shows where models start to drift as the dialogue lengthens. The setup is straightforward to follow and the synthetic data pipeline is described at a level that supports replication of the exact experiment. Credit to the authors for focusing on a concrete deployment constraint rather than chasing another general benchmark.

The main limitation is the data itself. All results come from generated conversations, and the paper does not include a fidelity check or transfer test on actual customer logs. Without that, the observed variation and the cases of near-LLM performance could be tied to how clean or scripted the synthetic turns are. Real service dialogues often have interruptions, topic jumps, and incomplete utterances that the current setup may not reproduce. The abstract also stays high-level on the numbers, so the full paper needs to show the actual scores, variance, and statistical comparisons to make the variation claim convincing.

This work is aimed at teams trying to run support systems on modest hardware. It gives a practical signal on which small models are worth testing first in that domain. It deserves peer review because the experiment is well-scoped and the question has clear engineering value, even if the synthetic-data gap will need attention in revision.

Referee Report

3 major / 1 minor

Summary. The manuscript evaluates nine instruction-tuned small language models (SLMs) against three commercial LLMs on context-summarized multi-turn customer-service QA. It generates synthetic dialogues, applies a history summarization strategy to retain conversational state, and measures performance with lexical/semantic similarity metrics plus human and LLM-as-judge qualitative assessments, including a conversation-stage analysis. Results indicate substantial variation across SLMs, with some approaching LLM-level performance while others fail to maintain continuity and contextual alignment.

Significance. If the synthetic data and summarization approach prove representative, the work usefully demonstrates that selected low-parameter models can deliver practical efficiency gains for customer-service QA without sacrificing much dialogue coherence. The stage-based qualitative framework is a constructive addition for diagnosing where models lose context. The absence of real-log validation, however, caps the strength of claims about deployment readiness.

major comments (3)

[Synthetic Data Construction and Evaluation Setup] The central claim that certain SLMs achieve near-LLM performance rests on synthetic multi-turn conversations. No fidelity metrics (e.g., rates of topic drift, state loss, or interruption patterns) or transfer experiment on authentic customer-service transcripts are reported, leaving open the possibility that observed continuity results are artifacts of the generation process rather than intrinsic model capability.
[Results and Quantitative Reporting] The abstract and results sections describe 'notable variation' and 'near-LLM performance' without supplying concrete metric values, confidence intervals, or statistical tests. This omission prevents readers from judging the practical magnitude of the reported differences between SLMs and LLMs.
[Methodology: History Summarization] The history summarization strategy is presented at a high level; missing details include the exact summarization prompt template, token-budget rules, and any ablation showing how summarization affects downstream continuity scores. These elements are load-bearing for the continuity claims.
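The second major comment asks for confidence intervals and statistical tests. One common way to supply them for per-dialogue scores is a paired percentile bootstrap over the SLM-minus-LLM differences. The sketch below is an assumption about how such a check could be run, not the authors' procedure; `paired_bootstrap_ci` is a hypothetical helper and the scores are invented for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(slm_scores, llm_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (SLM minus LLM)."""
    if len(slm_scores) != len(llm_scores) or not slm_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [s - l for s, l in zip(slm_scores, llm_scores)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Invented per-dialogue similarity scores; NOT results from the paper.
slm = [0.81, 0.74, 0.90, 0.66, 0.79, 0.85]
llm = [0.84, 0.80, 0.91, 0.73, 0.78, 0.88]
point, lower, upper = paired_bootstrap_ci(slm, llm)
print(f"mean difference {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the gap between the two models is unlikely to be resampling noise alone; if it straddles zero, a "near-LLM" reading is at least compatible with the data. Pairing matters because both models answer the same dialogues.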

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one representative quantitative result (e.g., average semantic similarity for the best SLM versus the LLM baseline).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and committing to revisions where appropriate.

Referee: [Synthetic Data Construction and Evaluation Setup] The central claim that certain SLMs achieve near-LLM performance rests on synthetic multi-turn conversations. No fidelity metrics (e.g., rates of topic drift, state loss, or interruption patterns) or transfer experiment on authentic customer-service transcripts are reported, leaving open the possibility that observed continuity results are artifacts of the generation process rather than intrinsic model capability.

Authors: We agree that validating the synthetic data is important. Our synthetic dialogues were generated using a multi-stage process designed to emulate real customer-service scenarios, including topic progression and state tracking. To address this, we will add fidelity metrics such as topic drift rates and state loss percentages in the revised manuscript. Regarding transfer to authentic transcripts, privacy regulations prevent us from accessing real customer-service logs, so we cannot perform such an experiment. We will instead discuss this as a limitation and propose it as future work. revision: partial

Referee: [Results and Quantitative Reporting] The abstract and results sections describe 'notable variation' and 'near-LLM performance' without supplying concrete metric values, confidence intervals, or statistical tests. This omission prevents readers from judging the practical magnitude of the reported differences between SLMs and LLMs.

Authors: We acknowledge the need for specific quantitative details. Although the full paper contains tables with exact scores, we will update the abstract and main results section to highlight key numerical values (e.g., average semantic similarity scores for top SLMs vs. LLMs), include confidence intervals, and report statistical significance tests. This revision will make the magnitude of differences clear. revision: yes

Referee: [Methodology: History Summarization] The history summarization strategy is presented at a high level; missing details include the exact summarization prompt template, token-budget rules, and any ablation showing how summarization affects downstream continuity scores. These elements are load-bearing for the continuity claims.

Authors: Thank you for pointing this out. We will include the full summarization prompt template in the appendix, specify the token budget (e.g., summary limited to 400 tokens within a 2048-token context window), and add an ablation experiment showing the impact of summarization on continuity metrics such as context alignment scores. These additions will be made in the revised manuscript. revision: yes
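The token-budget rule mentioned in that response can be sketched as a simple check before each model call. The 400 and 2048 figures are the example values quoted in the simulated rebuttal, not confirmed settings from the paper. Whitespace splitting is a crude stand-in for a real tokenizer, and `clip_summary` and `remaining_budget` are hypothetical names.

```python
SUMMARY_BUDGET = 400    # example value quoted in the rebuttal, not a confirmed setting
CONTEXT_WINDOW = 2048   # example value quoted in the rebuttal, not a confirmed setting


def count_tokens(text: str) -> int:
    """Approximate token count by whitespace splitting (a real tokenizer will differ)."""
    return len(text.split())


def clip_summary(summary: str, budget: int = SUMMARY_BUDGET) -> str:
    """Keep only the leading `budget` tokens; a real system might re-summarize instead."""
    words = summary.split()
    if len(words) <= budget:
        return summary
    return " ".join(words[:budget])


def remaining_budget(summary: str, question: str, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left in the context window for the model's answer."""
    return window - count_tokens(summary) - count_tokens(question)


# Illustrative input only: a 500-token placeholder summary.
long_summary = " ".join(["word"] * 500)
clipped = clip_summary(long_summary)
print(count_tokens(clipped), remaining_budget(clipped, "Where is my order?"))
```

Hard truncation is the simplest way to enforce a cap, but it can drop later details of the dialogue, which is one reason the requested ablation on summarization would be informative.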

Circularity Check

0 steps flagged

No significant circularity in empirical SLM evaluation

The paper reports a purely empirical comparison of nine instruction-tuned SLMs against three LLMs on synthetic multi-turn customer-service QA tasks. Evaluation relies on lexical/semantic similarity metrics, human judgments, and LLM-as-a-judge protocols applied to generated conversations and history summaries. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. Central claims about performance variation rest on observable experimental outcomes rather than any reduction to inputs by construction. The synthetic data generation and summarization strategy is presented as a methodological choice, not a self-defining loop, and the results are framed as externally verifiable through the stated metrics and judges.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the validity of synthetic data as a proxy for real customer-service dialogues and on the chosen lexical/semantic metrics plus human/LLM judges as faithful measures of contextual understanding. No free parameters, axioms, or invented entities are stated in the abstract.
