RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

Anirudh Sekar

arXiv: 2606.09937 · v1 · pith:MVBWYJXHnew · submitted 2026-06-07 · cs.LG · cs.AI · cs.CL


Pith reviewed 2026-06-27 18:27 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords KV cache sharing, early exit, multi-step reasoning, LLM inference optimization, training-free acceleration, attention similarity, entropy stabilization


The pith

RKSC shares KV caches across similar reasoning branches and skips or shortens verification when generation confidence is high, reporting a mean 3.008x speedup over a no-cache baseline with a 0.37% early-exit error rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RKSC, a training-free framework that removes repeated computation in multi-branch LLM reasoning by sharing prefix KV caches only when hidden states are similar and by skipping or truncating verification passes when generation confidence is decisive or per-layer entropy stabilizes. It builds three components: ASKS for similarity-based sharing that generalizes exact prefix caching, CGEE for two forms of early exit, and RSBCM for bounded cache eviction. Experiments across five model families, four benchmarks, and 1,000 problems report a mean 3.008x speedup over a no-cache baseline and 1.66x over vLLM-style prefix caching, with six CGEE-induced errors in 1,616 verify calls. The method requires no fine-tuning or architecture changes.

Core claim

RKSC eliminates structural redundancies in multi-branch reasoning by computing each prefix KV cache once and broadcasting it to semantically similar branches via hidden-state cosine similarity, while skipping or shortening verification forward passes when per-branch confidence is decisive or per-layer entropy has stabilized, all managed by attention-weighted depth-priority eviction to keep cache size bounded.

What carries the argument

ASKS broadcasts a single computed prefix KV cache to branches whose hidden states exceed a cosine-similarity threshold; CGEE combines full-pass skipping on decisive confidence with mid-layer termination when entropy stabilizes.
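
The sharing rule described here can be illustrated with a minimal sketch. It assumes each branch is summarized by one hidden-state vector and that sharing is decided by comparing its cosine similarity with the root hidden state against a threshold `tau`; the function names and the toy vectors are hypothetical and are not taken from the RKSC code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_sharing_branches(root_hidden, branch_hiddens, tau):
    """Indices of branches whose similarity to the root meets the threshold.

    Illustrative assumption: only these branches would receive the shared
    prefix KV cache; the others would recompute their own prefix.
    """
    return [
        b for b, hidden in enumerate(branch_hiddens)
        if cosine_similarity(root_hidden, hidden) >= tau
    ]

# Toy example: branch 1 is orthogonal to the root and is excluded.
shared = select_sharing_branches([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], tau=0.7)
print(shared)  # [0, 2]
```

With `tau` set to 1.0 the gate admits only branches whose hidden state points in exactly the root's direction, which is the sense in which a similarity test can be read as relaxing an exact-match test.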

If this is right

Prefix KV computation can be performed once per semantic cluster rather than once per branch.
Verification forward passes can be omitted entirely when all branches show high confidence.
Verification can stop at an intermediate layer once entropy no longer changes.
Cache size remains bounded without losing high-value entries through attention-weighted eviction.
The same mechanisms apply to any transformer backbone without retraining.
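
The two verification shortcuts in the list above can be sketched as a three-way decision. This is an illustration under stated assumptions, not the paper's rule: "decisive" is taken to mean that the lowest branch confidence clears `conf_threshold`, and "stabilized" is taken to mean that entropy changed by less than `eps` for `patience` consecutive layers. All names and thresholds are hypothetical, and a real implementation would observe entropies layer by layer during the pass rather than receive the full list up front.

```python
def first_stable_layer(entropies, eps, patience):
    """First layer index where entropy moved by < eps for `patience` layers in a row."""
    run = 0
    for layer in range(1, len(entropies)):
        if abs(entropies[layer] - entropies[layer - 1]) < eps:
            run += 1
            if run >= patience:
                return layer
        else:
            run = 0
    return None

def choose_verify_path(confidences, entropies, conf_threshold, eps, patience):
    """Return ("skip", None), ("early_exit", layer), or ("full", None)."""
    # Path 1 (assumed criterion): every branch is confident, so skip verification.
    if confidences and min(confidences) >= conf_threshold:
        return ("skip", None)
    # Path 2: stop at an intermediate layer once entropy has stabilized.
    layer = first_stable_layer(entropies, eps, patience)
    if layer is not None and layer < len(entropies) - 1:
        return ("early_exit", layer)
    # Path 3: no shortcut applies, run the full verification pass.
    return ("full", None)

entropies = [3.0, 2.0, 1.2, 1.19, 1.185, 1.18]
print(choose_verify_path([0.6, 0.9], entropies, conf_threshold=0.8, eps=0.05, patience=2))
# ('early_exit', 4)
```

The `patience` parameter is one way to avoid exiting on a single flat step; whether RKSC uses a window of this kind is not stated in the abstract.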

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same similarity and entropy signals might be reused to prune low-value branches before they generate many tokens.
Combining RKSC with speculative decoding could compound the latency reduction on long reasoning chains.
The low error rate on the tested problems suggests the proxies may generalize to other search-based inference patterns such as beam search or tree-of-thoughts.

Load-bearing premise

Cosine similarity of hidden states and stabilization of per-layer entropy are reliable signals that sharing the cache or exiting early will not alter the final answer on unseen problems.

What would settle it

A controlled test on a held-out benchmark where RKSC produces more than a few percent additional incorrect final answers compared with full verification would show the proxies are unsafe.
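
A check of this kind could be scored as in the sketch below. It assumes per-problem correctness flags are available for both full verification and RKSC on the same held-out problems; the function name and the example flags are hypothetical.

```python
def answer_regressions(full_correct, rksc_correct):
    """Compare per-problem correctness of RKSC against full verification.

    Returns (regressions, net_added_error_rate), where regressions counts
    problems that full verification answers correctly but RKSC does not, and
    net_added_error_rate is the difference in error rates (RKSC minus full).
    """
    if len(full_correct) != len(rksc_correct):
        raise ValueError("both runs must cover the same problems")
    n = len(full_correct)
    if n == 0:
        raise ValueError("need at least one problem")
    regressions = sum(1 for f, r in zip(full_correct, rksc_correct) if f and not r)
    net_added_error_rate = (sum(full_correct) - sum(rksc_correct)) / n
    return regressions, net_added_error_rate

# Toy flags for four problems: RKSC loses one answer that full verification gets right.
print(answer_regressions([True, True, False, True], [True, False, False, True]))
# (1, 0.25)
```

Reporting regressions separately from the net rate matters because answers that flip in both directions can cancel in the aggregate.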

Figures

Figures reproduced from arXiv: 2606.09937 by Anirudh Sekar.

**Figure 1.** RKSC pipeline overview. A single root forward computes the shared KV cache C and root hidden states. ASKS broadcasts C only to branches with similarity σ_b ≥ τ. All B branches decode in one batched call while accumulating generation confidence. CGEE selects among three paths: verify skip (confidence decisive), early-exit verify (entropy stabilised at l*), or full verify.


We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.
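
The headline figures in the abstract can be checked with a few lines of arithmetic. The error rate follows directly from the reported counts; the latency fraction is only a rough reading, since it treats the mean speedup as if it were a single ratio of total runtimes, which the abstract does not state.

```python
errors = 6            # CGEE-induced errors reported in the abstract
verify_calls = 1616   # verify calls reported in the abstract
mean_speedup = 3.008  # mean speedup over the No-KV baseline

error_rate_pct = 100.0 * errors / verify_calls
# Rough assumption: a speedup s corresponds to 1/s of the baseline latency.
latency_fraction = 1.0 / mean_speedup

print(f"{error_rate_pct:.2f}%")    # 0.37%
print(f"{latency_fraction:.3f}")   # 0.332
```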

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RKSC generalizes exact-match KV caching to hidden-state cosine similarity and layers on two early-exit rules, but the 3x speedup and 0.37% error rest on details that are not inspectable from the abstract.


The main point is that this paper takes the exact-prefix caching already in vLLM and SGLang and replaces the exact-match test with cosine similarity on hidden states so that semantically close reasoning branches can share the KV cache. It adds CGEE, which skips or truncates verification passes when confidence is high or per-layer entropy stops changing, plus a simple attention-weighted eviction rule to keep the cache from growing forever.

The concrete new pieces are the attention-similarity sharing rule, the dual confidence-plus-entropy exit conditions, and the block cache manager that prioritizes depth. The work stays training-free and ships code, which is useful for anyone already running multi-branch setups like tree search or agent trajectories. The reported numbers (roughly 3x over no cache, 1.66x over prefix caching, six errors in 1,616 verify calls) are the kind of practical figure people care about for deployment cost.

The soft spots are exactly where the stress-test note flags them. The abstract gives no per-benchmark accuracy tables against the full-compute baseline, no description of how the similarity or entropy thresholds were chosen, and no breakdown of whether the six errors are the only quality deviations. If the cosine proxy or entropy stabilization correlates poorly with final answer correctness on out-of-distribution branches, the speedup numbers become unsafe to rely on. Without those checks visible, the central claim stays provisional.

This is for people who maintain LLM serving stacks and need incremental wins on reasoning workloads. A reader who already knows vLLM-style caching will see immediately what is being extended and where the risk lies.

It is worth sending to peer review. The ideas are implementable, the target problem is real, and the evaluation gaps are fixable with standard additional tables rather than fundamental redesign.

Referee Report

2 major / 2 minor

Summary. The paper introduces RKSC, a training-free inference framework for multi-branch LLM reasoning that eliminates redundancies via three components: ASKS (which shares prefix KV caches across branches using hidden-state cosine similarity, generalizing exact prefix caching), CGEE (which skips or early-exits verification passes based on decisive generation confidence or stabilized per-layer entropy), and RSBCM (an attention-weighted eviction policy to bound cache growth). Experiments across five 7B-10B model families, four benchmarks, and 1,000 problems report a mean 3.008x speedup over a No-KV baseline (peak 3.990x), 1.66x over vLLM-style prefix caching, and a 0.37% CGEE-induced error rate (6 errors in 1,616 verify calls), with no fine-tuning required.

Significance. If the empirical claims hold, RKSC offers a practical, training-free way to accelerate multi-step reasoning pipelines by safely sharing computation across similar branches and exiting early, while generalizing existing prefix-caching techniques. The availability of code at the cited GitHub repository is a positive factor for reproducibility.

major comments (2)

[Abstract / Experiments] Abstract and experimental results: the central claim that RKSC preserves answer quality (and thus that the reported speedups are safe) rests on 6 errors out of 1,616 verify calls, yet no per-benchmark end-to-end accuracy tables versus the No-KV baseline are provided, nor is there analysis confirming that the 6 errors are the only quality deviations across all 1,000 problems.
[ASKS / CGEE] ASKS and CGEE sections: the paper does not specify how the cosine-similarity threshold for KV sharing or the entropy-stabilization criterion for early exit are chosen (fixed values, per-model tuning, or otherwise), which is load-bearing for assessing whether the proxies reliably avoid altering final answers on unseen reasoning paths.

minor comments (2)

[RSBCM] The description of RSBCM could clarify the exact attention-weighting formula and eviction priority rule with a short pseudocode snippet or equation.
[Experiments] Figure captions or tables should explicitly state the exact vLLM configuration replicated for the 1.66x comparison (e.g., whether it includes the same block size or eviction policy).
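
To make the first minor comment concrete, the sketch below shows the kind of snippet a reader might expect for RSBCM. It is purely hypothetical: the abstract gives no formula, so the priority `attention * (1 + depth)`, the choice to retain deeper blocks, and every field name here are assumptions made for illustration, not the paper's rule.

```python
def evict_to_capacity(blocks, capacity):
    """Keep the `capacity` highest-priority cache blocks and evict the rest.

    Each block is a dict with hypothetical fields:
      "id"        - block identifier
      "attention" - accumulated attention mass the block has received
      "depth"     - depth of the block in the reasoning tree
    """
    def priority(block):
        # Assumed scoring: attention mass scaled up for deeper blocks.
        return block["attention"] * (1 + block["depth"])

    # Highest priority first; ties are broken in favour of deeper blocks.
    ranked = sorted(blocks, key=lambda b: (priority(b), b["depth"]), reverse=True)
    kept = [b["id"] for b in ranked[:capacity]]
    evicted = [b["id"] for b in ranked[capacity:]]
    return kept, evicted

blocks = [
    {"id": "a", "attention": 0.5, "depth": 0},  # priority 0.5
    {"id": "b", "attention": 0.2, "depth": 3},  # priority 0.8
    {"id": "c", "attention": 0.1, "depth": 1},  # priority 0.2
]
print(evict_to_capacity(blocks, capacity=2))  # (['b', 'a'], ['c'])
```

Stating the rule at this level of detail would let readers see how attention mass and depth trade off, which is the ambiguity the comment raises.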

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements for stronger empirical validation and methodological clarity.


Referee: [Abstract / Experiments] Abstract and experimental results: the central claim that RKSC preserves answer quality (and thus that the reported speedups are safe) rests on 6 errors out of 1,616 verify calls, yet no per-benchmark end-to-end accuracy tables versus the No-KV baseline are provided, nor is there analysis confirming that the 6 errors are the only quality deviations across all 1,000 problems.

Authors: We agree that the manuscript would benefit from more granular evidence on answer quality preservation. The current version reports only the aggregate CGEE-induced error rate without per-benchmark end-to-end accuracy tables or explicit confirmation that the six errors represent all deviations. In the revised manuscript, we will add tables comparing final answer accuracy for RKSC against the No-KV baseline on each of the four benchmarks. We will also include an analysis section verifying that these six errors account for all quality deviations observed across the 1,000 problems. These additions will more directly support the safety of the reported speedups.
revision: yes

Referee: [ASKS / CGEE] ASKS and CGEE sections: the paper does not specify how the cosine-similarity threshold for KV sharing or the entropy-stabilization criterion for early exit are chosen (fixed values, per-model tuning, or otherwise), which is load-bearing for assessing whether the proxies reliably avoid altering final answers on unseen reasoning paths.

Authors: We acknowledge that the manuscript does not explicitly describe how the cosine-similarity threshold in ASKS and the entropy-stabilization criterion in CGEE were selected. In the revised version, we will clarify that both are fixed hyperparameters determined through preliminary experiments on a held-out validation subset drawn from the benchmarks. We will report the exact values employed in our experiments and add a brief discussion of their robustness, including any sensitivity checks performed. This will enable readers to better evaluate the reliability of these proxies on unseen reasoning paths.
revision: yes

Circularity Check

0 steps flagged

No circularity: all claims are direct empirical measurements from benchmarks


The paper introduces RKSC as a training-free framework with ASKS, CGEE, and RSBCM components, then reports measured speedups (3.008x mean) and error rates (0.37%) across models and benchmarks. No equations, derivations, or self-citations appear in the provided text that reduce any reported quantity to a fitted input or prior result by construction. The central claims rest on external benchmark runs rather than internal redefinitions or self-referential proofs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described. The framework implicitly depends on similarity and entropy thresholds whose values are not stated.


