arxiv: 2605.11317 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI

Recognition: no theorem link

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

Xueqi Cheng , Qiong Wu , Zhengyi Zhou , Xugui Zhou , Tyler Derr , Yushun Dong

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multi-turn dialogueLLM servingsmall language modellocal response manifoldLoRA fine-tuningsoft promptsmodel adaptationgated switching

0 comments

The pith

SOMA adapts smaller language models to conversation-specific regions after initial turns to cut serving costs while preserving quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework called SOMA for efficient multi-turn LLM serving. It uses the first few dialogue turns to estimate a local response manifold, which captures the semantic directions of the ongoing conversation. Then it adapts a smaller surrogate model to this local region using soft prompts and localized LoRA fine-tuning, allowing the small model to handle subsequent turns. A gate decides when to switch and allows rollback if drift is detected. This matters because full history concatenation on large models is expensive in latency and memory, and existing methods struggle with the quality-efficiency trade-off.

Core claim

The paper claims that by estimating a local response manifold from early turns and distilling it into a small language model via anti-degeneration controlled soft prompts and LoRA adaptation, the surrogate can serve the rest of the multi-turn conversation efficiently with maintained quality, supported by a theoretical analysis and a gating mechanism for drift.

What carries the argument

The local response manifold, defined as the set of likely response directions in the current conversation context, which is mined using soft prompts that maximize divergence between large and small models, then distilled into localized LoRA fine-tuning of the surrogate model.

If this is right

Multi-turn sessions incur lower latency and memory costs after the initial turns by routing to the adapted small model.
Response quality remains comparable because the surrogate is tuned specifically to the local semantic region.
The gate with rollback provides safety against quality drops due to conversation drift.
The theoretical analysis validates the key components like the divergence maximization and anti-degeneration control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to other adaptive serving scenarios, such as switching between models based on query complexity in single-turn settings.
If the manifold estimation proves robust across domains, it might reduce reliance on large models in long interactive applications like virtual assistants.
Future work could test the framework with different small-large model pairs to see how size gap affects adaptation success.

Load-bearing premise

A stable local response manifold exists and can be reliably estimated from only the early turns of a session, allowing the surrogate model and gate to handle later turns without undetected quality loss.

What would settle it

If experiments show that after the switch, response quality degrades significantly in a substantial fraction of sessions without the gate triggering rollback, or if the estimated manifold does not capture the full range of later responses.

Figures

Figures reproduced from arXiv: 2605.11317 by Qiong Wu, Tyler Derr, Xueqi Cheng, Xugui Zhou, Yushun Dong, Zhengyi Zhou.

**Figure 1.** Figure 1: Relative average token count per turn, normalized by Turn 1. Across four dialogue datasets, token usage drops after the early turns and then forms a long tail. Efficient multi-turn serving depends on how the dialogue state evolves over time. In standard LLM serving, each new request is processed together with the full previous history, so the computational cost grows with the length of the conversation. … view at source ↗

**Figure 2.** Figure 2: Efficiency and component analysis. SOMA reduces token usage after switching, and [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Instructions for the LLM to filter context-dependent dialogue. [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 4.** Figure 4: Instructions for the LLM judge to evaluate the response similarity. [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

**Figure 9.** Figure 9: Variance of response similarity versus warm-start window [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

read the original abstract

Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SOMA adapts a small surrogate via early-turn LoRA distillation and a rollback gate to cut multi-turn serving costs, but the quality hold depends on an untested assumption that the local manifold stays stable.

read the letter

The main point is that SOMA estimates a local response manifold from the opening turns of a conversation, then switches to a smaller model fine-tuned with localized LoRA on divergence-maximizing cases mined via soft prompts, plus an anti-degeneration control and a simple gate that rolls back to the large model on detected drift. They add a theoretical analysis for the pieces and release code at the GitHub link. That combination of prompt-based mining, prompt-free inference after distillation, and the one-time gate looks like the concrete new element relative to plain prompt tuning or standard LoRA serving work. The framing of the cost problem with full-history concatenation is clear and practical, and shipping the code counts as real evidence that others can check the implementation directly. Experiments are claimed to demonstrate effectiveness, which is the right direction for a deployment-focused paper. The soft spots sit mainly around the central assumption. The abstract gives no formal definition of the manifold, no radius or coverage analysis, and no numbers on how often the gate misses gradual topic shifts or slow drift. If real sessions move outside the early-turn estimate, the surrogate could degrade without the gate catching it, and the stress-test note on false-negative rates for the gate lands as a legitimate open question. Without seeing the exact baselines, metrics, session lengths, or error bars in the full experiments, it is hard to tell how much the results actually close that gap. This is aimed at engineers and researchers who build or tune multi-turn chat systems and care about latency and API spend. A reader who wants concrete serving tricks and reproducible code will find something usable here even if the evaluation needs tightening. It is worth sending for peer review so the experiments and drift analysis get proper scrutiny rather than desk-rejecting an idea that targets a genuine deployment pain point.

Referee Report

2 major / 2 minor

Summary. The paper proposes SOMA, a framework for efficient multi-turn LLM serving. It estimates a local response manifold from early conversation turns by mining divergence-maximizing cases via soft prompts between a large LLM and a small surrogate, distills these into localized LoRA fine-tuning of the surrogate (removing prompts at inference), and uses a simple gate for one-time switching with rollback on detected drift. The work includes theoretical analysis of key components and reports extensive experiments demonstrating effectiveness, with code released.

Significance. If the central assumptions hold, SOMA addresses a practical deployment challenge by trading off quality and efficiency in multi-turn settings through early-turn adaptation of smaller models, potentially lowering latency, memory, and API costs. The open-sourced code and theoretical analysis are strengths that support reproducibility and deeper understanding of the components.

major comments (2)

[Abstract and theoretical analysis] The central claim rests on the stability of the local response manifold estimated from early turns and the gate's ability to prevent undetected quality drift. However, the theoretical analysis does not provide a formal definition of the manifold, bounds on its coverage radius, or characterization of the gate's false-negative rate under gradual topic shifts or session evolution (see abstract description of the framework and theoretical analysis section).
[Experiments] The experiments claim to show effectiveness of the efficiency-quality tradeoff, but the manuscript provides no details on baselines, evaluation metrics, error bars, session-length distributions, or how data were selected to test drift scenarios. This makes it impossible to assess whether the surrogate plus gate maintains quality without significant undetected degradation (see experiments section).

minor comments (2)

[Abstract] The abstract could more precisely state the scope of the theoretical analysis (e.g., which components receive formal treatment) and the exact conditions under which the gate triggers rollback.
[Method] Notation for the soft-prompt mining and anti-degeneration control could be clarified with explicit equations or pseudocode in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the theoretical foundations and experimental reporting. We address each major comment below and have revised the manuscript to incorporate additional formalization and details.

read point-by-point responses

Referee: [Abstract and theoretical analysis] The central claim rests on the stability of the local response manifold estimated from early turns and the gate's ability to prevent undetected quality drift. However, the theoretical analysis does not provide a formal definition of the manifold, bounds on its coverage radius, or characterization of the gate's false-negative rate under gradual topic shifts or session evolution (see abstract description of the framework and theoretical analysis section).

Authors: We agree that a more rigorous formalization would strengthen the central claim. The existing theoretical analysis examines stability through the divergence-maximizing soft prompts and anti-degeneration control, but does not supply an explicit set-theoretic definition of the manifold or radius bounds. In the revision we have added: (i) a formal definition of the local response manifold as the divergence ball around early-turn response embeddings; (ii) coverage-radius bounds derived from the Lipschitz constant of the response mapping and the prompt-optimization objective; and (iii) a characterization of the gate's false-negative rate under gradual shifts, modeled via a Markovian topic-evolution process with concentration inequalities. These additions appear in the updated theoretical analysis section. revision: yes
Referee: [Experiments] The experiments claim to show effectiveness of the efficiency-quality tradeoff, but the manuscript provides no details on baselines, evaluation metrics, error bars, session-length distributions, or how data were selected to test drift scenarios. This makes it impossible to assess whether the surrogate plus gate maintains quality without significant undetected degradation (see experiments section).

Authors: We acknowledge that the experimental section lacked sufficient transparency. In the revised manuscript we have inserted: (1) explicit descriptions of all baselines (full-context LLM, prompt-only surrogates, and competing adaptation methods); (2) the complete set of metrics (response quality via automated and human ratings, latency, memory footprint, and drift-detection accuracy); (3) error bars computed over five independent runs with different random seeds; (4) session-length statistics (mean, variance, and distribution) drawn from the evaluation corpora; and (5) a precise account of drift-scenario construction, including both synthetic topic-shift dialogues and real multi-turn sessions with evolving context. These changes allow readers to evaluate the quality-stability tradeoff directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework steps are independent of inputs.

full rationale

The paper defines a concrete pipeline: early-turn manifold estimation via divergence-maximizing soft prompts, anti-degeneration stabilization, LoRA distillation of mined cases, and a one-time gate with rollback. Theoretical analysis is supplied for components, and effectiveness is shown via experiments. No equation or claim reduces a prediction to a fitted input by construction, nor does any load-bearing premise rest on a self-citation chain whose validity is internal to the paper. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted or audited. The 'local response manifold' is a conceptual construct but is not presented as a new postulated entity with independent evidence.

pith-pipeline@v0.9.0 · 5510 in / 1097 out tokens · 39431 ms · 2026-05-13T01:33:38.336710+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 10 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Multi-character dialogue dataset

agentlans. Multi-character dialogue dataset. Hugging Face Dataset, 2024. CC-BY-4.0 license

work page 2024
[3]

Beyond the bubble: How context -aware memory systems are changing the game in 2025

Shalini Ananda. Beyond the bubble: How context -aware memory systems are changing the game in 2025. Tribe AI Applied AI Blog, April 2025

work page 2025
[4]

Introducing claude

Anthropic. Introducing claude. Anthropic Blog, March 2023. https://www.anthropic. com/news/introducing-claude

work page 2023
[5]

Prompt caching with claude

Anthropic. Prompt caching with claude. https://www.anthropic.com/news/ prompt-caching, August 2024. Accessed 2025-06-12

work page 2024
[6]

Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

work page arXiv 2024
[7]

Towards efficient multi-llm inference: Characterization and analysis of llm routing and hierarchical techniques.arXiv preprint arXiv:2506.06579, 2025

Adarsh Prasad Behera, Jaya Prakash Champati, Roberto Morabito, Sasu Tarkoma, and James Gross. Towards efficient multi-llm inference: Characterization and analysis of llm routing and hierarchical techniques.arXiv preprint arXiv:2506.06579, 2025

work page arXiv 2025
[8]

Laplacian eigenmaps for dimensionality reduction and data representation.Neural computation, 15(6):1373–1396, 2003

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation.Neural computation, 15(6):1373–1396, 2003

work page 2003
[9]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[10]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024

work page 2024
[11]

Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations.arXiv preprint arXiv:2402.11975, 2024

Nuo Chen, Hongguang Li, Juhua Huang, Baoyuan Wang, and Jia Li. Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations.arXiv preprint arXiv:2402.11975, 2024

work page arXiv 2024
[12]

Diffusion maps.Applied and computational harmonic analysis, 21(1):5–30, 2006

Ronald R Coifman and Stéphane Lafon. Diffusion maps.Applied and computational harmonic analysis, 21(1):5–30, 2006

work page 2006
[13]

Hybrid llm: Cost-efficient and quality- aware query routing.arXiv preprint arXiv:2404.14618, 2024

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality- aware query routing.arXiv preprint arXiv:2404.14618, 2024

work page arXiv 2024
[14]

Towards next-generation intelligent assistants leveraging llm techniques

Xin Luna Dong, Seungwhan Moon, Yifan Ethan Xu, Kshitiz Malik, and Zhou Yu. Towards next-generation intelligent assistants leveraging llm techniques. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5792–5793, 2023

work page 2023
[15]

Non-uniform point cloud upsampling via local manifold distribution.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 8(1):1–15, 2025

Yao Hui Fang and Xing Ce Wang. Non-uniform point cloud upsampling via local manifold distribution.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 8(1):1–15, 2025

work page 2025
[16]

{Cost-Efficient} large language model serving for multi- turn conversations with {CachedAttention}

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. {Cost-Efficient} large language model serving for multi- turn conversations with {CachedAttention}. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, 2024

work page 2024
[17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Hipporag: Neurobiologically inspired long-term memory for large language models

Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[19]

Decoupling strategy and generation in negotiation dialogues.arXiv preprint arXiv:1808.09637, 2018

He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. Decoupling strategy and generation in negotiation dialogues.arXiv preprint arXiv:1808.09637, 2018

work page arXiv 2018
[20]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[22]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[23]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

work page 2022
[24]

Accelerating llm serving for multi-turn dialogues with efficient resource management

Jinwoo Jeong and Jeongseob Ahn. Accelerating llm serving for multi-turn dialogues with efficient resource management. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1–15, 2025

work page 2025
[25]

LLMs Get Lost In Multi-Turn Conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025

work page internal anchor Pith review arXiv 2025
[26]

Repetition in repetition out: Towards understanding neural text degeneration from the data perspective.Advances in Neural Information Processing Systems, 36:72888–72903, 2023

Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, and Yixuan Su. Repetition in repetition out: Towards understanding neural text degeneration from the data perspective.Advances in Neural Information Processing Systems, 36:72888–72903, 2023

work page 2023
[27]

Llm as clinical graph structure refiner: Enhancing representation learning in eeg seizure diagnosis, 2026

Lincan Li, Zheng Chen, and Yushun Dong. Llm as clinical graph structure refiner: Enhancing representation learning in eeg seizure diagnosis, 2026

work page 2026
[28]

Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

arXiv preprint arXiv:2404.00971 , year=

Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and evaluating hallucinations in llm-powered code generation. arXiv preprint arXiv:2404.00971, 2024

work page arXiv 2024
[30]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[31]

Locally typical sampling

Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102–121, 2023

work page 2023
[32]

Enhancing llm intelligence with arm-rag: Auxiliary rationale memory for retrieval augmented generation.arXiv preprint arXiv:2311.04177, 2023

Eric Melz. Enhancing llm intelligence with arm-rag: Auxiliary rationale memory for retrieval augmented generation.arXiv preprint arXiv:2311.04177, 2023

work page arXiv 2023
[33]

Accelerating multi-turn llm serving with multi-tier caching and smarter scheduling

Don Moon. Accelerating multi-turn llm serving with multi-tier caching and smarter scheduling. Medium (Byte-Sized AI), April 2025

work page 2025
[34]

Enhancing recommendation systems with hybrid manifold regu- larized knowledge graph

Giang Ngo and Nhi NY V o. Enhancing recommendation systems with hybrid manifold regu- larized knowledge graph. In2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2023

work page 2023
[35]

Gonzalez, M Waleed Kadous, and Ion Stoica

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2024. 11

work page 2024
[36]

gpt-oss-120b & gpt-oss-20b model card, 2025

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025

work page 2025
[37]

Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression.arXiv preprint arXiv:2403.12968, 2024

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression.arXiv preprint arXiv:2403.12968, 2024

work page arXiv 2024
[38]

Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering.arXiv preprint arXiv:2312.06024, 2023

Soya Park and Chinmay Kulkarni. Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering.arXiv preprint arXiv:2312.06024, 2023

work page arXiv 2023
[39]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

work page 2019
[40]

Nonlinear dimensionality reduction by locally linear embedding.science, 290(5500):2323–2326, 2000

Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding.science, 290(5500):2323–2326, 2000

work page 2000
[41]

Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

work page 2023
[42]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

work page 2024
[43]

Large language model routing with benchmark datasets.arXiv preprint arXiv:2309.15789, 2023

Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thomp- son, and Mikhail Yurochkin. Large language model routing with benchmark datasets.arXiv preprint arXiv:2309.15789, 2023

work page arXiv 2023
[44]

Dialogue act modeling for automatic tagging and recognition of conversational speech.Computational linguistics, 26(3):339–373, 2000

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. Dialogue act modeling for automatic tagging and recognition of conversational speech.Computational linguistics, 26(3):339–373, 2000

work page 2000
[45]

Zernet: Convolutional neural networks on arbitrary surfaces via zernike local tangent space estimation

Zhiyu Sun, Ethan Rooke, Jerome Charton, Yusen He, Jia Lu, and Stephen Baek. Zernet: Convolutional neural networks on arbitrary surfaces via zernike local tangent space estimation. InComputer Graphics Forum, volume 39, pages 204–216. Wiley Online Library, 2020

work page 2020
[46]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

work page 2008
[48]

Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025

Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, and Liang Ding. Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025

work page 2025
[49]

Neural text generation with unlikelihood training.arXiv preprint arXiv:1908.04319, 2019

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training.arXiv preprint arXiv:1908.04319, 2019

work page arXiv 1908
[50]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024.URL https://arxiv. org/abs/2309.17453, page 1, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Loco: Local contrastive representation learning.Advances in neural information processing systems, 33:11142–11153, 2020

Yuwen Xiong, Mengye Ren, and Raquel Urtasun. Loco: Local contrastive representation learning.Advances in neural information processing systems, 33:11142–11153, 2020

work page 2020
[52]

Remedi: Resources for multi-domain, multi-service, medical dialogues

Guojun Yan, Jiahuan Pei, Pengjie Ren, Zhaochun Ren, Xin Xin, Huasheng Liang, Maarten De Rijke, and Zhumin Chen. Remedi: Resources for multi-domain, multi-service, medical dialogues. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3013–3024, 2022

work page 2022
[53]

A survey on recent advances in llm-based multi-turn dialogue systems.arXiv preprint arXiv:2402.18013, 2024

Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A survey on recent advances in llm-based multi-turn dialogue systems.arXiv preprint arXiv:2402.18013, 2024. 12

work page arXiv 2024
[54]

Healthmamba: An uncertainty-aware spa- tiotemporal graph state space model for effective and reliable healthcare facility visit prediction, 2026

Dahai Yu, Lin Jiang, Rongchao Xu, and Guang Wang. Healthmamba: An uncertainty-aware spa- tiotemporal graph state space model for effective and reliable healthcare facility visit prediction, 2026

work page 2026
[55]

Contrastive learning of global and local video representations.Advances in Neural Information Processing Systems, 34:7025–7040, 2021

Zhaoyang Zeng, Daniel McDuff, Yale Song, et al. Contrastive learning of global and local video representations.Advances in Neural Information Processing Systems, 34:7025–7040, 2021

work page 2021
[56]

Linear local tangent space alignment and application to face recognition.Neurocomputing, 70(7-9):1547–1553, 2007

Tianhao Zhang, Jie Yang, Deli Zhao, and Xinliang Ge. Linear local tangent space alignment and application to face recognition.Neurocomputing, 70(7-9):1547–1553, 2007

work page 2007
[57]

Amplifying your social media presence: Personalized influential content generation with llms.arXiv preprint arXiv:2505.01698, 2025

Yuying Zhao, Yu Wang, Xueqi Cheng, Anne Marie Tumlin, Yunchao Liu, Damin Xia, Meng Jiang, and Tyler Derr. Amplifying your social media presence: Personalized influential content generation with llms.arXiv preprint arXiv:2505.01698, 2025

work page arXiv 2025
[58]

wiggle room

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023. 13 Technical Appendices and Supplementary Material A Notations This section summarizes all n...

work page 2023