pith. machine review for the scientific record. sign in

arxiv: 2605.11317 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI

Recognition: no theorem link

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multi-turn dialogueLLM servingsmall language modellocal response manifoldLoRA fine-tuningsoft promptsmodel adaptationgated switching
0
0 comments X

The pith

SOMA adapts smaller language models to conversation-specific regions after initial turns to cut serving costs while preserving quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework called SOMA for efficient multi-turn LLM serving. It uses the first few dialogue turns to estimate a local response manifold, which captures the semantic directions of the ongoing conversation. Then it adapts a smaller surrogate model to this local region using soft prompts and localized LoRA fine-tuning, allowing the small model to handle subsequent turns. A gate decides when to switch and allows rollback if drift is detected. This matters because full history concatenation on large models is expensive in latency and memory, and existing methods struggle with the quality-efficiency trade-off.

Core claim

The paper claims that by estimating a local response manifold from early turns and distilling it into a small language model via anti-degeneration controlled soft prompts and LoRA adaptation, the surrogate can serve the rest of the multi-turn conversation efficiently with maintained quality, supported by a theoretical analysis and a gating mechanism for drift.

What carries the argument

The local response manifold, defined as the set of likely response directions in the current conversation context, which is mined using soft prompts that maximize divergence between large and small models, then distilled into localized LoRA fine-tuning of the surrogate model.

If this is right

  • Multi-turn sessions incur lower latency and memory costs after the initial turns by routing to the adapted small model.
  • Response quality remains comparable because the surrogate is tuned specifically to the local semantic region.
  • The gate with rollback provides safety against quality drops due to conversation drift.
  • The theoretical analysis validates the key components like the divergence maximization and anti-degeneration control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to other adaptive serving scenarios, such as switching between models based on query complexity in single-turn settings.
  • If the manifold estimation proves robust across domains, it might reduce reliance on large models in long interactive applications like virtual assistants.
  • Future work could test the framework with different small-large model pairs to see how size gap affects adaptation success.

Load-bearing premise

A stable local response manifold exists and can be reliably estimated from only the early turns of a session, allowing the surrogate model and gate to handle later turns without undetected quality loss.

What would settle it

If experiments show that after the switch, response quality degrades significantly in a substantial fraction of sessions without the gate triggering rollback, or if the estimated manifold does not capture the full range of later responses.

Figures

Figures reproduced from arXiv: 2605.11317 by Qiong Wu, Tyler Derr, Xueqi Cheng, Xugui Zhou, Yushun Dong, Zhengyi Zhou.

Figure 1
Figure 1. Figure 1: Relative average token count per turn, normalized by Turn 1. Across four dialogue datasets, token usage drops after the early turns and then forms a long tail. Efficient multi-turn serving depends on how the dia￾logue state evolves over time. In standard LLM serv￾ing, each new request is processed together with the full previous history, so the computational cost grows with the length of the conversation. … view at source ↗
Figure 2
Figure 2. Figure 2: Efficiency and component analysis. SOMA reduces token usage after switching, and [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Instructions for the LLM to filter context-dependent dialogue. [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Instructions for the LLM judge to evaluate the response similarity. [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 9
Figure 9. Figure 9: Variance of response similarity versus warm-start window [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
read the original abstract

Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SOMA, a framework for efficient multi-turn LLM serving. It estimates a local response manifold from early conversation turns by mining divergence-maximizing cases via soft prompts between a large LLM and a small surrogate, distills these into localized LoRA fine-tuning of the surrogate (removing prompts at inference), and uses a simple gate for one-time switching with rollback on detected drift. The work includes theoretical analysis of key components and reports extensive experiments demonstrating effectiveness, with code released.

Significance. If the central assumptions hold, SOMA addresses a practical deployment challenge by trading off quality and efficiency in multi-turn settings through early-turn adaptation of smaller models, potentially lowering latency, memory, and API costs. The open-sourced code and theoretical analysis are strengths that support reproducibility and deeper understanding of the components.

major comments (2)
  1. [Abstract and theoretical analysis] The central claim rests on the stability of the local response manifold estimated from early turns and the gate's ability to prevent undetected quality drift. However, the theoretical analysis does not provide a formal definition of the manifold, bounds on its coverage radius, or characterization of the gate's false-negative rate under gradual topic shifts or session evolution (see abstract description of the framework and theoretical analysis section).
  2. [Experiments] The experiments claim to show effectiveness of the efficiency-quality tradeoff, but the manuscript provides no details on baselines, evaluation metrics, error bars, session-length distributions, or how data were selected to test drift scenarios. This makes it impossible to assess whether the surrogate plus gate maintains quality without significant undetected degradation (see experiments section).
minor comments (2)
  1. [Abstract] The abstract could more precisely state the scope of the theoretical analysis (e.g., which components receive formal treatment) and the exact conditions under which the gate triggers rollback.
  2. [Method] Notation for the soft-prompt mining and anti-degeneration control could be clarified with explicit equations or pseudocode in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the theoretical foundations and experimental reporting. We address each major comment below and have revised the manuscript to incorporate additional formalization and details.

read point-by-point responses
  1. Referee: [Abstract and theoretical analysis] The central claim rests on the stability of the local response manifold estimated from early turns and the gate's ability to prevent undetected quality drift. However, the theoretical analysis does not provide a formal definition of the manifold, bounds on its coverage radius, or characterization of the gate's false-negative rate under gradual topic shifts or session evolution (see abstract description of the framework and theoretical analysis section).

    Authors: We agree that a more rigorous formalization would strengthen the central claim. The existing theoretical analysis examines stability through the divergence-maximizing soft prompts and anti-degeneration control, but does not supply an explicit set-theoretic definition of the manifold or radius bounds. In the revision we have added: (i) a formal definition of the local response manifold as the divergence ball around early-turn response embeddings; (ii) coverage-radius bounds derived from the Lipschitz constant of the response mapping and the prompt-optimization objective; and (iii) a characterization of the gate's false-negative rate under gradual shifts, modeled via a Markovian topic-evolution process with concentration inequalities. These additions appear in the updated theoretical analysis section. revision: yes

  2. Referee: [Experiments] The experiments claim to show effectiveness of the efficiency-quality tradeoff, but the manuscript provides no details on baselines, evaluation metrics, error bars, session-length distributions, or how data were selected to test drift scenarios. This makes it impossible to assess whether the surrogate plus gate maintains quality without significant undetected degradation (see experiments section).

    Authors: We acknowledge that the experimental section lacked sufficient transparency. In the revised manuscript we have inserted: (1) explicit descriptions of all baselines (full-context LLM, prompt-only surrogates, and competing adaptation methods); (2) the complete set of metrics (response quality via automated and human ratings, latency, memory footprint, and drift-detection accuracy); (3) error bars computed over five independent runs with different random seeds; (4) session-length statistics (mean, variance, and distribution) drawn from the evaluation corpora; and (5) a precise account of drift-scenario construction, including both synthetic topic-shift dialogues and real multi-turn sessions with evolving context. These changes allow readers to evaluate the quality-stability tradeoff directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework steps are independent of inputs.

full rationale

The paper defines a concrete pipeline: early-turn manifold estimation via divergence-maximizing soft prompts, anti-degeneration stabilization, LoRA distillation of mined cases, and a one-time gate with rollback. Theoretical analysis is supplied for components, and effectiveness is shown via experiments. No equation or claim reduces a prediction to a fitted input by construction, nor does any load-bearing premise rest on a self-citation chain whose validity is internal to the paper. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted or audited. The 'local response manifold' is a conceptual construct but is not presented as a new postulated entity with independent evidence.

pith-pipeline@v0.9.0 · 5510 in / 1097 out tokens · 39431 ms · 2026-05-13T01:33:38.336710+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Multi-character dialogue dataset

    agentlans. Multi-character dialogue dataset. Hugging Face Dataset, 2024. CC-BY-4.0 license

  3. [3]

    Beyond the bubble: How context -aware memory systems are changing the game in 2025

    Shalini Ananda. Beyond the bubble: How context -aware memory systems are changing the game in 2025. Tribe AI Applied AI Blog, April 2025

  4. [4]

    Introducing claude

    Anthropic. Introducing claude. Anthropic Blog, March 2023. https://www.anthropic. com/news/introducing-claude

  5. [5]

    Prompt caching with claude

    Anthropic. Prompt caching with claude. https://www.anthropic.com/news/ prompt-caching, August 2024. Accessed 2025-06-12

  6. [6]

    Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

  7. [7]

    Towards efficient multi-llm inference: Characterization and analysis of llm routing and hierarchical techniques.arXiv preprint arXiv:2506.06579, 2025

    Adarsh Prasad Behera, Jaya Prakash Champati, Roberto Morabito, Sasu Tarkoma, and James Gross. Towards efficient multi-llm inference: Characterization and analysis of llm routing and hierarchical techniques.arXiv preprint arXiv:2506.06579, 2025

  8. [8]

    Laplacian eigenmaps for dimensionality reduction and data representation.Neural computation, 15(6):1373–1396, 2003

    Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation.Neural computation, 15(6):1373–1396, 2003

  9. [9]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  10. [10]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024

  11. [11]

    Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations.arXiv preprint arXiv:2402.11975, 2024

    Nuo Chen, Hongguang Li, Juhua Huang, Baoyuan Wang, and Jia Li. Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations.arXiv preprint arXiv:2402.11975, 2024

  12. [12]

    Diffusion maps.Applied and computational harmonic analysis, 21(1):5–30, 2006

    Ronald R Coifman and Stéphane Lafon. Diffusion maps.Applied and computational harmonic analysis, 21(1):5–30, 2006

  13. [13]

    Hybrid llm: Cost-efficient and quality- aware query routing.arXiv preprint arXiv:2404.14618, 2024

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality- aware query routing.arXiv preprint arXiv:2404.14618, 2024

  14. [14]

    Towards next-generation intelligent assistants leveraging llm techniques

    Xin Luna Dong, Seungwhan Moon, Yifan Ethan Xu, Kshitiz Malik, and Zhou Yu. Towards next-generation intelligent assistants leveraging llm techniques. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5792–5793, 2023

  15. [15]

    Non-uniform point cloud upsampling via local manifold distribution.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 8(1):1–15, 2025

    Yao Hui Fang and Xing Ce Wang. Non-uniform point cloud upsampling via local manifold distribution.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 8(1):1–15, 2025

  16. [16]

    {Cost-Efficient} large language model serving for multi- turn conversations with {CachedAttention}

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. {Cost-Efficient} large language model serving for multi- turn conversations with {CachedAttention}. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, 2024

  17. [17]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 10

  18. [18]

    Hipporag: Neurobiologically inspired long-term memory for large language models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  19. [19]

    Decoupling strategy and generation in negotiation dialogues.arXiv preprint arXiv:1808.09637, 2018

    He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. Decoupling strategy and generation in negotiation dialogues.arXiv preprint arXiv:1808.09637, 2018

  20. [20]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  21. [21]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  22. [22]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019

  23. [23]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  24. [24]

    Accelerating llm serving for multi-turn dialogues with efficient resource management

    Jinwoo Jeong and Jeongseob Ahn. Accelerating llm serving for multi-turn dialogues with efficient resource management. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1–15, 2025

  25. [25]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025

  26. [26]

    Repetition in repetition out: Towards understanding neural text degeneration from the data perspective.Advances in Neural Information Processing Systems, 36:72888–72903, 2023

    Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, and Yixuan Su. Repetition in repetition out: Towards understanding neural text degeneration from the data perspective.Advances in Neural Information Processing Systems, 36:72888–72903, 2023

  27. [27]

    Llm as clinical graph structure refiner: Enhancing representation learning in eeg seizure diagnosis, 2026

    Lincan Li, Zheng Chen, and Yushun Dong. Llm as clinical graph structure refiner: Enhancing representation learning in eeg seizure diagnosis, 2026

  28. [28]

    Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

    Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717, 2025

  29. [29]

    arXiv preprint arXiv:2404.00971 , year=

    Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and evaluating hallucinations in llm-powered code generation. arXiv preprint arXiv:2404.00971, 2024

  30. [30]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

  31. [31]

    Locally typical sampling

    Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102–121, 2023

  32. [32]

    Enhancing llm intelligence with arm-rag: Auxiliary rationale memory for retrieval augmented generation.arXiv preprint arXiv:2311.04177, 2023

    Eric Melz. Enhancing llm intelligence with arm-rag: Auxiliary rationale memory for retrieval augmented generation.arXiv preprint arXiv:2311.04177, 2023

  33. [33]

    Accelerating multi-turn llm serving with multi-tier caching and smarter scheduling

    Don Moon. Accelerating multi-turn llm serving with multi-tier caching and smarter scheduling. Medium (Byte-Sized AI), April 2025

  34. [34]

    Enhancing recommendation systems with hybrid manifold regu- larized knowledge graph

    Giang Ngo and Nhi NY V o. Enhancing recommendation systems with hybrid manifold regu- larized knowledge graph. In2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2023

  35. [35]

    Gonzalez, M Waleed Kadous, and Ion Stoica

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2024. 11

  36. [36]

    gpt-oss-120b & gpt-oss-20b model card, 2025

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025

  37. [37]

    Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression.arXiv preprint arXiv:2403.12968, 2024

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression.arXiv preprint arXiv:2403.12968, 2024

  38. [38]

    Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering.arXiv preprint arXiv:2312.06024, 2023

    Soya Park and Chinmay Kulkarni. Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering.arXiv preprint arXiv:2312.06024, 2023

  39. [39]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  40. [40]

    Nonlinear dimensionality reduction by locally linear embedding.science, 290(5500):2323–2326, 2000

    Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding.science, 290(5500):2323–2326, 2000

  41. [41]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

  42. [42]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

  43. [43]

    Large language model routing with benchmark datasets.arXiv preprint arXiv:2309.15789, 2023

    Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thomp- son, and Mikhail Yurochkin. Large language model routing with benchmark datasets.arXiv preprint arXiv:2309.15789, 2023

  44. [44]

    Dialogue act modeling for automatic tagging and recognition of conversational speech.Computational linguistics, 26(3):339–373, 2000

    Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. Dialogue act modeling for automatic tagging and recognition of conversational speech.Computational linguistics, 26(3):339–373, 2000

  45. [45]

    Zernet: Convolutional neural networks on arbitrary surfaces via zernike local tangent space estimation

    Zhiyu Sun, Ethan Rooke, Jerome Charton, Yusen He, Jia Lu, and Stephen Baek. Zernet: Convolutional neural networks on arbitrary surfaces via zernike local tangent space estimation. InComputer Graphics Forum, volume 39, pages 204–216. Wiley Online Library, 2020

  46. [46]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  47. [47]

    Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

  48. [48]

    Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025

    Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, and Liang Ding. Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025

  49. [49]

    Neural text generation with unlikelihood training.arXiv preprint arXiv:1908.04319, 2019

    Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training.arXiv preprint arXiv:1908.04319, 2019

  50. [50]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024.URL https://arxiv. org/abs/2309.17453, page 1, 2024

  51. [51]

    Loco: Local contrastive representation learning.Advances in neural information processing systems, 33:11142–11153, 2020

    Yuwen Xiong, Mengye Ren, and Raquel Urtasun. Loco: Local contrastive representation learning.Advances in neural information processing systems, 33:11142–11153, 2020

  52. [52]

    Remedi: Resources for multi-domain, multi-service, medical dialogues

    Guojun Yan, Jiahuan Pei, Pengjie Ren, Zhaochun Ren, Xin Xin, Huasheng Liang, Maarten De Rijke, and Zhumin Chen. Remedi: Resources for multi-domain, multi-service, medical dialogues. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3013–3024, 2022

  53. [53]

    A survey on recent advances in llm-based multi-turn dialogue systems.arXiv preprint arXiv:2402.18013, 2024

    Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A survey on recent advances in llm-based multi-turn dialogue systems.arXiv preprint arXiv:2402.18013, 2024. 12

  54. [54]

    Healthmamba: An uncertainty-aware spa- tiotemporal graph state space model for effective and reliable healthcare facility visit prediction, 2026

    Dahai Yu, Lin Jiang, Rongchao Xu, and Guang Wang. Healthmamba: An uncertainty-aware spa- tiotemporal graph state space model for effective and reliable healthcare facility visit prediction, 2026

  55. [55]

    Contrastive learning of global and local video representations.Advances in Neural Information Processing Systems, 34:7025–7040, 2021

    Zhaoyang Zeng, Daniel McDuff, Yale Song, et al. Contrastive learning of global and local video representations.Advances in Neural Information Processing Systems, 34:7025–7040, 2021

  56. [56]

    Linear local tangent space alignment and application to face recognition.Neurocomputing, 70(7-9):1547–1553, 2007

    Tianhao Zhang, Jie Yang, Deli Zhao, and Xinliang Ge. Linear local tangent space alignment and application to face recognition.Neurocomputing, 70(7-9):1547–1553, 2007

  57. [57]

    Amplifying your social media presence: Personalized influential content generation with llms.arXiv preprint arXiv:2505.01698, 2025

    Yuying Zhao, Yu Wang, Xueqi Cheng, Anne Marie Tumlin, Yunchao Liu, Damin Xia, Meng Jiang, and Tyler Derr. Amplifying your social media presence: Personalized influential content generation with llms.arXiv preprint arXiv:2505.01698, 2025

  58. [58]

    wiggle room

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023. 13 Technical Appendices and Supplementary Material A Notations This section summarizes all n...